Aaron M. Cohen, MD A survey of current work in is a postdoctoral fellow in the medical informatics programme at OHSU. Dr Cohen works in biomedical the area of text mining, focusing on issues and Aaron M. Cohen and William R. Hersh applications important to Date received (in revised form): 25th October 2004 biomedical researchers. He was chairman of the W3C working group that produced Abstract version 2 of the Synchronized The volume of published biomedical research, and therefore the underlying biomedical Multimedia Integration knowledge base, is expanding at an increasing rate. Among the tools that can aid researchers in Language (SMIL 2.0). coping with this information overload are text mining and knowledge extraction. Significant William Hersh, MD progress has been made in applying text mining to named entity recognition, text classification, is Professor and Chair of the Department of Medical terminology extraction, relationship extraction and hypothesis generation. Several research Informatics & Clinical groups are constructing integrated flexible text-mining systems intended for multiple uses. The Epidemiology in the School of major challenge of biomedical text mining over the next 5–10 years is to make these systems Medicine at Oregon Health & useful to biomedical researchers. This will require enhanced access to full text, better Science University (OHSU) in Portland, Oregon. Dr Hersh’s understanding of the feature space of biomedical literature, better methods for measuring the research focuses on the usefulness of systems to users, and continued cooperation with the biomedical research development and evaluation of community to ensure that their needs are addressed. information retrieval systems for biomedical practitioners and researchers. INTRODUCTION: arising from different research specialties. BACKGROUND AND With the recent sequencing of the human Keywords: text-mining, PURPOSE genome, the addition of detailed genetic , natural The volume of published biomedical information to biomedical research makes language processing research, and therefore the underlying the situation even more complicated, biomedical knowledge base, is since genetics may play a role in almost all expanding at an increasing rate. While areas of health and disease and it is likely scientific information in general has that many connections between different been growing exponentially for several branches of medicine may be based on centuries,1 the absolute numbers specific related genomic mechanisms. to modern medicine are very The goal of biomedical research is to impressive. The MEDLINE 2004 discover knowledge and put it to practical database contains over 12.5 million use in the forms of diagnosis, prevention records, and the database is currently and treatment. Clearly with the current growing at the rate of 500,000 new rate of growth in published biomedical citations each year.2 With such research, it becomes increasingly likely explosive growth, it is extremely that important connections between challenging to keep up to date with all individual elements of biomedical of the new discoveries and theories knowledge that could lead toward even within one’s own field of practical use are not being recognised

Aaron Michael Cohen, biomedical research. because there is no individual in a position Postdoctoral Fellow, Biomedical research is divided into to make the necessary connections. Department of Medical Informatics highly specialised fields and subfields, Methods must be established to aid and Clinical Epidemiology, School of Medicine, with poor communication between researchers and physicians in making Oregon Health & Science disciplines.3 While this may be a necessary more efficient use of the existing research University, pre-condition for the complex and and helping them take this research to 3181 S.W. Sam Jackson Park Road, Portland, OR 97239–309, USA detailed research that biomedical science the next step along the path to practical requires, it also tends to narrow the application. While manual curation and Tel: +1 503 494 0046 Fax: +1 503 494 4551 perspective, impeding the establishment indexing can be an aid to researchers E-mail: [email protected] of connections between discoveries searching for appropriate literature, a

& HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 1. 57–71. MARCH 2005 5 7 Cohen and Hersh

recent study of the information content management methods to the vast of MEDLINE records by Kostoff et al.4 amount of biomedical knowledge that found a significant amount of conceptual exists in the literature as well as the free information present only in the abstract text fields of biomedical databases. field and missing from the MeSH terms. This paper surveys the state of the art in This is not surprising since the biomedical text mining over the past 18– MEDLINE indexers and the MeSH 24 months. The next section covers vocabulary, while broadly based, cannot current active areas of research, including be expected to represent all of the the specific problems that are being concepts of interest for all potential users. addressed and the approaches used. This is Clearly, the full text of biomedical followed by an examination of the current literature contains a wealth of issues and future challenges of biomedical information important to users that may text mining. not be completely captured by reviewers and curators. CURRENT AREAS OF Text mining and knowledge extraction RESEARCH The goal of biomedical are ways to aid researchers in coping with While other authors have proposed text mining is to shift information overload. Text mining is categorisations based on stages of the burden of differentiated from both information of increasing information overload retrieval (IR) and text summarisation (TS) sophistication,9 here recent work is from the researcher to the computer in that while IR and TS focus on the grouped pragmatically with separate larger units of text such as documents, categories for each distinct type of text- text mining operates at a finer level of mining task. This is because current work granularity and examines the relationships centres around several common text- between specific kinds of information mining themes. contained both within and between documents. Text mining is also Named entity recognition differentiated from full-blown natural At first glace, the task of named entity language processing (NLP) in that NLP recognition (NER) appears attempts to understand the meaning of straightforward. The goal is to identify, text as a whole, while text mining and within a collection of text, all of the knowledge extraction concentrate on instances of a name for a specific type of solving a specific problem in a specific thing: for example, all of the drug names domain identified a priori (possibly using within a collection of journal articles, or some NLP techniques in the process). For all of the names and symbols within Recognising biological example, text mining can aid database a collection of MEDLINE abstracts. entities in text allows curators by selecting articles most likely to Hansich and de Bruijn and coworkers9,10 for further extraction of 5,6 relationships and other contain information of interest, or believed that solving this problem would information by potential new treatments for migraine allow more complex text-mining tasks to identifying the key may be determined by looking for be addressed. The idea is that recognising concepts of interest pharmacological substances that are biological entities in text allows for associated with biological processes further extraction of relationships and associated with migraine.7,8 other information by identifying the key The goal of biomedical text mining is concepts of interest and allowing those therefore to allow researchers to identify concepts to be represented in some needed information more efficiently, consistent, normalised form. uncover relationships obscured by the This task has been challenging for sheer volume of available information, several reasons. First, there does not exist and in general shift the burden of a complete dictionary for most types of information overload from the researcher biological named entities, so simple text- to the computer by applying matching algorithms do not suffice. In algorithmic, statistical and data addition, the same word or phrase can

58 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 1. 57–71. MARCH 2005 Current work in biomedical text mining

refer to a different thing depending upon extending the Brill POS tagger11,16,17 to context (eg ferritin can be a biological include gene and names as a tag substance or a laboratory test). type with the system trained on 7,000 Conversely, many biological entities have hand-tagged sentences from biomedical several names (eg PTEN and MMAC1 text. AbGene then applies manually refer the same gene). Biological entities generated post-processing rules based on may also have multi-word names (eg lexical-statistical characteristics that help carotid artery), so the problem is further identify the context in which gene additionally complicated by the need to names are used and eliminate false determine name boundaries and resolve positives and negatives. The system overlap of candidate names. achieved a precision of 85.7 per cent at a Because of the potential utility and recall of 66.7 per cent. complexity of the problem, NER has In contrast to the tagging approach attracted the interest of many researchers, used by Tanabe and Wilbur, Chang et al. and there is a tremendous amount of created the GAPSCORE system,18 which published research in this topic. With the assigns a numerical score to each word large amount of genomic information within a sentence by examining the being generated by biomedical appearance, morphology and context of researchers, it should not be surprising the word and then applying a classifier Approaches to NER that in the genomics era, much of the trained on these features. Words with generally fall into three categories: lexicon- work in biomedical NER has focused on higher scores are more likely to be gene based, rules-based and recognising gene and protein names in and protein names or symbols. After statistically-based free text. training on the Yapex corpus,19 precision, The approaches generally fall into three recall and F-score were computed for categories: lexicon-based, rules-based and both the exact matches and ‘sloppy’ statistically based. Combined approaches matches (defined as a true positive if any also have been used. The output may be a part of gene name is predicted correctly), set of tags assigning a predicted type to with the system performing much better each word or phrase of interest, as in with sloppy matches (precision 74 per part-of-speech (POS) tagging,11 or as a cent, recall 81 per cent, F-measure 77 per score designating the confidence that a cent), than with exact matches (precision word or phrase is of a given type of 59 per cent, recall 50 per cent, F-measure interest. Systems are typically measured in 54 per cent). terms of precision (number of correct A number of other groups have predictions divided by total number of worked in this area. Hanisch et al. used a predictions) and recall (number of correct large dictionary of gene and protein predictions divided by number of actual names and semantically classified words named entities in the text). Precision and that tend to appear in context with recall are often combined into a single protein names, reporting a specificity of measure, either using the F-score, defined 95 per cent and sensitivity of 90 per as the harmonic mean of precision and cent.10 Zhou et al. trained a hidden recall (2PR/[P+R]),12 or by reporting the Markov model (HMM) on a set of balanced precision and recall, defined as features based on word formation (ie the point where precision and recall are capitalisation), morphology (ie prefix and equal. suffix), POS, semantic triggers (head One of the most successful rules-based nouns and verbs) and intra-document approaches to gene and protein NER in name aliases.20 They reported an overall biomedical texts has been the AbGene precision of 66.5 per cent at a recall of system of Tanabe and Wilbur.13 It has 66.6 per cent on the GENIA corpus.21 been used as the NER component in Other gene and protein NER systems extracting relationships by several other include those by Narayanaswamy et al.,22 researchers.14,15 AbGene works by Settles 23 and Mika and Rost.24

& HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 1. 57–71. MARCH 2005 5 9 Cohen and Hersh

Chen and Friedman have adapted the names. This list was purified by applying MEDLEE system to recognise phrases thematic analysis to the names, and then that correspond to phenotype information using inductive logic programming to within biomedical text.25 This system uses learn rules for differentiating gene names natural language techniques to identify from non-gene names within a theme. phenotypic phrases present in journal Finally, a simple false-positive filter was article abstracts, and recognises phrases applied that removed obviously incorrect containing words separated in the text. names such as those containing ‘http’ or This area of biological NER is much less ending in ‘tion’. Their approach yielded a well studied than recognising gene, final set of 1,145,913 gene names. protein or chemical names, and therefore Assessment of a random sample a smaller knowledge base of phenotype- determined the precision to be associated terms is available. Nevertheless, approximately 82 per cent. Comparison the investigators were able to with a gold standard gave an estimated automatically import thousands of UMLS recall of 61 per cent for exact matches and terms associated with semantic categories 88 per cent for partial matches. such as cellular body functions and The quality of this lexicon is about cellular dysfunction, as well as several equivalent to the performance of the hundred terms from the Mammalian NER systems, and the large size of the Ontology. A few hundred other terms lexicon is a definite advantage. The were added manually. In a feasibility study lexicon could be used with simple or of 300 documents, the system achieved a fuzzy matching to efficiently identify gene precision of 64.0 per cent with a recall of and protein names as a first step in future 77.1 per cent. While, as expected for a text-mining systems. However, the list new area of study, this performance is has been generated with a snapshot of lower than that of the gene and protein MEDLINE, and given the pace of NER systems, these results were found to genomic research, will soon be out of be about the same as that of the individual date if it is not updated. Also, the list was experts used to create the study’s gold built from MEDLINE and not from full standard. text articles, so it is possible that a Overall, the performance of state-of- significant number of gene and protein Overall, the performance of state-of- the-art gene and protein NER systems names exist in the literature and are not the-art gene and achieves F-scores between 75 and 85 per found in MEDLINE. protein NER systems, cent. This number is consistent with that It is a subject of current debate how achieves F-scores found by Hirschman et al. in 2002,12 and well NER must perform in order to be between 75 and 85 the results of Task 1A for the 2004 useful for text mining.9,12 If one assumes percent BioCreative workshop.15 While peak that relationship extraction requires performance does not appear to have identification of three biomedical terms increased over the past few years, (two entities and one relationship), the investigators are obtaining consistent performance of relationship extraction results using a variety of approaches on should be approximately equal to the different data sets. cube of the performance of NER. This To address this performance plateau independence assumption appears to be true and to decrease the computational burden for news article extraction. Systems contribution of NER to text mining, performing named entity extraction on Tanabe and Wilbur have used AbGene to news stories typically perform at an generate a large and high-quality gene and F-score over 90 per cent, and the F-score protein lexicon of names found in for new relations is about 75 per cent. biomedical text.26 The application of For many biomedical applications, the AbGene to the MEDLINE database has F-score performance rates of relationship resulted in an initial collection of over mining has often been found to be two million putative gene and protein approximately equal to that of biological

60 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 1. 57–71. MARCH 2005 Current work in biomedical text mining

NER, rather than the 60 per cent Yeh et al. ran a text-mining The independence expected by the independence competition as part of the Knowledge assumption does not assumption.27–32 Therefore, the Discovery in Databases (KDD) Challenge seem to hold for assumption does not seem to hold for Cup 2002.6 The task was a curation biological relations biological relations. It may be easier to problem to evaluate papers from the extract concepts in combination with the FlyBase data set and determine whether relationship between them owing to the the paper should be curated based on the increased local context that relationships presence of experimental evidence of provide. While some form of NER is Drosophila gene products. The best- useful in most text-mining tasks, the performing entry used a set of manually performance level of biological NER is constructed rules based on POS tagging, a not necessarily rate limiting for other lexicon and semantic constraints biological text-mining tasks. Nevertheless, determined by examining the training we have not reached the point of having documents.33 The system focused on standard methods of NER or updated figure captions, which were found to be lexicons for biomedical text mining useful. An F-score of 78 per cent was (whatever the asymptotic performance achieved on determining whether to level), so work must continue in this area. curate a paper based on the presence of experimental evidence. Another effective Text classification approach looked for manually chosen Text classification attempts to ‘keywords’ and computed the distance automatically determine whether a between keywords and gene names.34 document or part of a document has Two other well-performing systems used particular characteristics of interest, regular expressions to find patterns of usually based on whether the document words and then used a support vector discusses a given topic or contains a machine (SVM) to classify the papers.35 certain type of information. Typically the In related work, Donaldson et al. used information of interest is not specified an SVM trained on the words in explicitly by the users and, instead, they MEDLINE abstracts to distinguish provide a set of documents that have been abstracts containing information on found to contain the characteristics of protein–protein interactions, prior to interest (the positive training set), and curating this information into their BIND another set that does not (the negative database.36 They used the ‘bag-of-words’ training set). Text classification systems approach with an SVM classifier. A small must automatically extract the features evaluation with 100 abstracts found a that help determine positives from precision of 96 per cent with a recall of 84 negatives and apply those features to per cent. They estimated that the candidate documents using some kind of classification system would reduce the decision-making process. number of abstracts that the curators Accurate text Accurate text classification systems can needed to read by about two-thirds. classification systems be especially valuable to database curators, Another investigation in this area used can be especially who may have to review many a Probabilistic Latent Categoriser (PLC) valuable to database curators documents to find a few that contain the with Kullback–Leibler (KL) divergence kind of information they are collecting in to re-rank documents returned by their database. Because more biomedical PubMed searching for the purposes of information is being created in text form curating information into the Swiss-Prot then ever before, and because there are database.37 Evaluation showed a 25–45 more ongoing database curation efforts to per cent precision improvement, with a organise this information into coded balanced precision and recall point of databases than before, there is a strong about 70 per cent, compared with about need to find useful ways to apply text 40 per cent for the basic PubMed ranking. classification methods to biomedical text. Liu et al. performed a unique application

& HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 1. 57–71. MARCH 2005 6 1 Cohen and Hersh

of text classification on figure captions. In name synonym lists created from online a pilot study, they classified the text in databases as a basis for further text figure legends in order to find figures mining.36,40,41 However, these gene containing representations of protein databases focus on official names and interactions and signalling events.38 alternates, and are incomplete with Applying research in text classification respect to the gene names actually found to the work processes of actual in the literature.42,43 In order to create biomedical curators and annotators is just gene and protein name synonym lists beginning. The Text Retrieval representative of the names used in the Conference (TREC) 2004 Genomics literature, Yu and Agichtein44 and Track has a classification problem as one Cohen45 have investigated automatic of its tasks.39 The task is meant to means of extracting gene name mimic the process that the human synonyms from biomedical free text. Yu annotators in the Mouse Genome and Agichtein applied a combination of Informatics (MGI) system go through in four algorithms to full text journal order to find documents that contain articles. Their system combined the experimental evidence about that AbGene gene NER system, with they are annotating using Gene statistical, SVM classifier-based, Ontology (GO) codes. A full text automatic pattern-based and manual rules collection in SGML format has been algorithms. The combined system assembled, realistically reflecting the produced a recall of about 80 per cent articles that MGI annotators currently with a precision of about 9 per cent, read. In additional, the utility measure giving an overall F-measure of about 30 used to evaluate performance of the task per cent. Cohen applied an automatic aims to reflect the priorities of the MGI pattern extraction method to MEDLINE Because of the potential annotators. Because of the potential to abstracts and a numeric analysis metric to improve annotator productivity, work on improve annotator productivity, work on the resulting name co-occurrence improving biomedical on improving biomedical text network to select the best synonym text classification must classification to meet the needs of extraction patterns. While no continue for the curators and other users must continue sophisticated gene NER was used, foreseeable future for the foreseeable future. evaluation showed a precision of 23 per cent, a recall of 21 per cent and an Synonym and abbreviation F-score of 22 per cent. The system was extraction also notable for inferring synonyms based Paralleling the growth of the increase on the logical relationship between in biomedical literature is the growth in synonyms found explicitly in the text, biomedical terminology. Because many increasing recall by about 10 per cent biomedical entities have multiple names over the same system without inference. and abbreviations, it would be Other investigators have applied text- advantageous to have an automated means mining methods to extracting lists of to collect these synonyms and biomedical abbreviations and their fully abbreviations to aid users doing literature specified forms. These methods rely on searches. Furthermore, other text-mining the proximity of full forms and their tasks could be done more efficiently if all abbreviations, and the fact that either the of the synonyms and abbreviations for an full form or the abbreviation is often entity could be mapped to a single term enclosed in parentheses. The problem is representing the concept. Most of the often reduced to finding the best work in this type of extraction has alignment of the characters in the focused on uncovering gene name abbreviation to those in the full form. A synonyms and biomedical term variety of alignment and scoring abbreviations. methods have been applied to this basic Several investigators have used gene approach. Liu and Friedman used a large

62 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 1. 57–71. MARCH 2005 Current work in biomedical text mining

collection of MEDLINE abstracts to still be a large legacy of literature that uses determine abbreviations and phrases that non-official names. were statistically significantly co- located.32 They reported a precision of Relationship extraction 96.3 per cent with a recall of 88.5 per The goal of relationship extraction is to cent. Yu et al.46 and Schwartz and detect occurrences of a prespecified type Hearst28 applied manually created set of of relationship between a pair of entities pattern-matching rules to identify of given types. While the type of the abbreviations and their full form. Yu et entities is usually very specific (eg genes, al. achieved a precision of 95 per cent or drugs), the type of relationship with 70 per cent recall, while Schwartz may be very general (eg any biochemical and Hearst achieved a precision of 96 association) or very specific (eg a per cent at 82 per cent recall for a set of regulatory relationship). Several 1,000 MEDLINE abstracts mentioning approaches to extracting relations of yeast. Chang et al. trained a logistic interest have been reported in the regression model with abbreviation literature and are applicable to this work. specific features and used it to score Manually generated template-based candidate full forms,47 achieving a methods use patterns (usually in the form precision of 80 per cent with 83 per of regular expressions) generated by cent recall on the Medstract corpus.48 domain experts to extract concepts The automatic extraction of connected by a specific relation from biomedical abbreviations and their text.14 Automatic template methods corresponding definition as used within create similar templates automatically by an individual journal article is close to generalising patterns from text being a solved problem. Research surrounding concept pairs known to have systems uniformly produce high the relationship of interest.44,45 Statistical precision and recall. The next step is to methods identify relationships by looking integrate these automated extraction for concepts that are found with each capabilities into user systems. For other more often than would be predicted example, an online dictionary of medical by chance.7 Finally, NLP-based methods abbreviations could be integrated into perform a substantial amount of sentence PubMed to augment search queries. The parsing to decompose the text into a It is thought that, more general problem of resolving structure from which relationships can be grouping genes by common domain abbreviations readily extracted.31 functional relationships undefined in a given journal article is a In the current genomic era, most could aid gene much more difficult problem dependent investigation of this type has centred expression analysis and on expert knowledge of the specific database annotation around relationships between genes and field and additional, possibly subtle, proteins. It is thought that grouping genes context from the surrounding text. by functional relationships could aid gene Gene and protein name synonym expression analysis and database extraction has proven to be a more annotation.49 Several researchers have challenging problem. While an investigated the extraction of general automatically updated synonym list would relationships between genes. be of great value in augmenting literature Genes can be grouped or clustered searching and text mining, the precision based on how strongly they share words of automatic extraction systems is low in text containing their names. enough to introduce an unacceptable Raychaudhuri et al. used a measure of level of noise. However, work is being neighbour divergence to measure the undertaken to standardise the use of ‘functional coherence’ of a group of official gene and protein names and genes.49 They obtained 79 per cent symbols,43 so this problem may lessen in sensitivity at 100 per cent specificity for the future. On the other hand, there will distinguishing 19 true gene groups from

& HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 1. 57–71. MARCH 2005 6 3 Cohen and Hersh

1,900 randomly assembled groups of yeast MeKE system of Chiang and Yu used GO genes. They later extended their work to codes as a lexicon of function names, include mouse, fly, worm and yeast genes combining it with a lexicon of gene and and obtained functional gene groups with gene product names from LocusLink, and 96, 92, 82 and 45 per cent sensitivity at used a sentence alignment system to 99.9 per cent specificity.50 Glenisson et al. determine patterns associated with similarly investigated text-based gene statements about gene function. They clustering using a vector space approach then used the patterns with a Naive Bayes and the k-medoids algorithm with a classifier to extract sentences containing cosine similarity metric.51 Wren and information about gene product Garner identified related genes by function.52 analysing the cohesiveness and specificity Raychaudhuri et al. assigned GO codes of the graph structure created by the by training text classifiers to associate GO gene–gene co-occurrences in MEDLINE codes with abstracts, and then assigning to records.27 They obtained similar results to a gene the strongest maximum entropy Raychaudhuri et al. of about 97 per cent associated GO code from the abstracts in Currently, the precision specificity at 85 per cent sensitivity. which that gene appeared. Evaluation and recall obtained for Other research has concentrated using a subset of yeast genes and GO codes relationship extraction extracting specific kinds of relationships showed that the strongest predicted GO is dependent on the between genes, protein, or other code was accurate about 72 per cent of the type of relationship to 53 be extracted and biological entities. Gaizauskas et al.’s time. Pan et al.’s Dragon TF association literature corpus to be Protein Active Site Template Acquisition miner system used linear discriminate processed system (PASTA) uses type and POS analysis on terms and neural networks to tagging along with manually created create models that recognised abstracts templates and lexicons assembled from that contained information relating biological databases to extract transcription factors (TFs) with GO codes relationships between amino acid residues and diseases. Balanced sensitivity and and their function within a protein.30 specificity was about 80 per cent.54 Balanced recall and precision was Task 2 of the BioCreative 2004 approximately 82 per cent using a workshop similarly focused on extracting manually annotated corpus of MEDLINE relevant GO codes for genes from free abstracts as a gold standard. Albert et al. text.15,55 The task was to identify text that used dictionaries of protein and contained evidence for GO code interaction terms to identify tri- assignment and to predict the correct GO occurrences of two proteins and one code that should be assigned by that text. interaction within a sentence.41 Applying The task was also notable in that full text this approach to the full MEDLINE journal articles were used and the database looking for interactions between evaluation was carried out by MGI proteins and nuclear receptors they found curators, who rated how useful the system 3,308 positive interactions, giving a output was for annotation. In general, this precision of 22 per cent. McDonald et al. was a difficult task, and was rated very combined a hybrid syntactic/semantic hard by the annotators. System precision grammar in a single parsing process to ranged between 2 and 80 per cent, with extract a variety of gene pathway the average about 30 per cent. Recall was relationships.29 Evaluation using 100 not evaluated. Part of the difficulty was abstracts manually reviewed by a biologist that the systems had to get the text, the showed 61 per cent precision at 35 per gene and the GO code simultaneously all cent recall. correct, perhaps an unnecessarily high Extracting relationships between genes standard. or proteins and GO codes is a task with A number of other investigators have immediate practical potential that has applied text mining to extract novel, received much attention lately. The interesting relationships. Eskin and

64 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 1. 57–71. MARCH 2005 Current work in biomedical text mining

Agichtein combined text and sequence inference. He proposed a simple ‘A mining with an SVM combined text and influences B, and B influences C, genome sequence kernel to predict therefore A may influence C’ model for protein subcellular localisation.56 detecting instances of CSD that is Performance ranged from a precision of commonly referred to as Swanson’s ABC 87 per cent with 71 per cent recall for model.3,61 In several published papers in proteins located in the cytoplasm to a the 1980s and early 1990s, Swanson gave precision of 44 per cent with 21 per cent examples of discovering new hypotheses recall for proteins located in the by manually connecting concepts peroxisomes. Srinivasan and Wedemeyer between journal articles. In 1986, he have studied the relationship between a found a connection implying patient disease’s incidence and the countries in benefit between fish oil and Raynaud’s which it is studied.57 Kostoff used simple syndrome, two years before clinical trials Extraction of very MEDLINE querying to compute organ established that the benefit was real.8,62 In general, non-specific cancer asymmetries and found very similar another article, he traced 11 indirect relationships appears to be straightforward results to numbers in the National Cancer connections between migraine and Institutes SEER database.58 Xu et al. magnesium using summarisations of adapted MEDLEE to transform text into published articles60 that were later coded data from pathology reports to experimentally verified.63,64 facilitate a breast cancer study.59 While Swanson applied his model It is clear from the foregoing work that manually, several investigators have tried some types of relationships are simpler to to automate the process. Automated extract than others. Very general, non- hypothesis generation systems may specific relationships (eg gene groups) generate many potential hypotheses, and seem to be fairly straightforward, while therefore some method of evaluating very specific relationships that have to be these systems is necessary. One way these substantiated by the precise supporting evaluations have been done is by text (eg GO code assignment) remain attempting to recreate Swanson’s challenging. Since the value of identifying discoveries. Gordon and Lindsay were very specific relationships with probably the first to use this approach,65 accompanying supporting text is high, this followed a few years later by Weeber et work must receive continued attention. al.3 More recently, Srinivasan has used this approach to demonstrate the Hypothesis generation feasibility of her approach based on While relationship extraction focuses on MeSH terms and UMLS semantic types.66 the extraction of relationships between Another way that hypothesis discovery entities explicitly found in the text, systems are evaluated is by manually hypothesis generation attempts to reviewing the literature supporting the uncover relationships that are not present extracted hypothesis for scientific in the text but instead are inferred by the plausibility and relevance. This is a natural presence of other more explicit next step after reproducing Swanson’s relationships. The goal is to uncover discoveries. Using their ‘literature-based previously unrecognised relationships scientific discovery tool’ that examined worthy of further investigation. term co-occurrences in MEDLINE titles Current work in Practically all of the work in hypothesis and abstracts, Weeber et al. found hypothesis generation generation makes use of an idea potential new uses for thalidomide.61 makes use of originated by Swanson in the 1980s called Srinivasin et al. continued to refine their ‘complementary system, discovering implicit evidence of a structures in disjoint the ‘complementary structures in disjoint 60 literatures’ literatures’ (CSD). Swanson realised that therapeutic effect of Curcuma longa large databases of scientific literature (turmeric), on retinal diseases, Crohn’s could allow discoveries to be made by disease and spinal cord injuries.67,68 connecting concepts using logical The combined increase in scientific

& HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 1. 57–71. MARCH 2005 6 5 Cohen and Hersh

literature and genome expression data All these systems are still in the research may leave many scientists with the and development phase. At this point uneasy feeling that important discoveries evaluations tend to be brief and these are buried under the information systems have not been subjected to explosion, such that computerised tools thorough user evaluations. It remains to to help them sort through the vast be seen whether these systems will address Hypothesis discovery amount of available information.12,68 the needs of the biomedical research systems are not yet a While hypothesis discovery systems are community, but it is clear that they are a standard tool of not yet a standard tool of biologists, one step towards addressing the needs of biologists day they may be. Continued work is biomedical researchers beyond those needed to enhance these systems to served by search engines such as PubMed handle the vast amounts of different types and Google. of data that scientists currently must explore manually. Additionally, better CHALLENGES, FUTURE methods are needed to evaluate and DIRECTIONS AND compare the results of these systems so CONCLUSIONS that improvement can be documented From all of the foregoing, it is clear that and clear choices can be made. biomedical text mining has great potential. However, that potential is yet Integration frameworks unrealised. Text-mining tools are not part Several research groups are developing of the standard arsenal of the biomedical integrated text-mining frameworks researcher in the way that search engines intended to be able to address a variety of and sequence alignment tools are. The user needs. The MedScan system of major challenge for the next 5–10 years Novichkova et al. combines lexicons of text-mining work is the creation of with syntactic and semantic templates text-mining tools to provide a clear into a general-purpose text-mining benefit to these researchers, allowing system to extract relationships between them to be more productive given biomedical entities.69 Glenisson et al. increasing challenges due to information have developed TXTGate which growth. The focus must be more on performs gene-based text profiling and helping biomedical researchers to solve clustering using the information real-world problems that are inhibiting contained on multiple the pace of research and less on on-line biological databases.70 Becker et evaluations based on system output al. created PubMatrix, a tool that displays independent of meeting user needs. two-dimensional comparisons of gene Advances on several fronts are necessary Greater access to full names and functional terms based on for this to become a reality. text collections is combining the results of multiple queries First, there must be greater access to essential to progress in to PubMed.71 The BioRAT system of full text and test collections that use it. biomedical text-mining Corney et al. is another template-based Much of the scientific information research system that combines a template design contained in journal articles is not present tool with a web spider that locates and or mentioned in abstracts or MeSH retrieves full text journal articles.72 The terms.14 Text-mining research has Textpresso system of Mu¨ller et al. uses a recently been moving away from abstracts specially created ontology to flexibly and titles towards full text, but access to combine keyword and concept co- full text is still often limited by copyright occurrence searching of Caenorhabditis restrictions,39,55 preventing work from elegans (a small nematode worm) being done or reproduced. The research literature at the sentence level.73 Other community must work with publishers to generalised text-mining frameworks have make a wider range of content available been reported by Nenadic et al.74 and to text-mining applications. Chiang et al.40 Next, more work is necessary to

66 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 1. 57–71. MARCH 2005 Current work in biomedical text mining

determine what features and types of mining systems. This is an especially features are useful in addressing particular challenging problem for hypothesis text-mining tasks. The feature space generation systems, eg how does one available to text mining is large, and measure the value of an untested set of includes a huge array of feature types hypotheses? Nevertheless, robust including (but not limited to) words, evaluation measures are necessary to concepts, headings, formatting, authors, compare hypothesis generation systems, references and links. The bag-of-words determine the best approaches and approach (both stemmed and unstemmed) improve the state of the art of these has been popular for quite a long time, systems. Verifying that these systems largely because it is easily applied to text produce suggestions that biomedical from a large variety of sources. However, researchers are motivated to this approach ignores document positional experimentally test is especially important. and sectional information and may not Finally, the approach of shared result in the most discriminating feature challenge tasks with consistent evaluation set from fully marked-up text that based on biomedical domain expertise provides this information. Using concepts must continue. More progress must be as a basic unit instead of words has been made toward choosing tasks and shown to be practical75 and useful.76,77 evaluating results based on real-world Including other contexts, such as the needs. Recent examples of this type of section in which the text occurs, has also cooperation include the BioCreative 2004 been shown to help.44,78 Full text with workshop, and the TREC Genomics XML mark-up may provide many more Track, both of which used assessments Better ways are needed possible features and feature types than made by biological database curators in to assess the real-world plain text. The potential feature space of their normal workflow processes as the value of text-mining XML full text has only begun to be gold standard. systems to their intended users explored. With such a wide range of Clearly, the main theme for future possible features and feature types, progress is interdisciplinary coordination additional analytical methods are needed and cooperation. Text-mining researchers to determine the optimal feature set for a must work with each other, publishers particular application. and biomedical researchers to begin to Text-mining researchers also need to meet user needs with systems that produce better understand what measures can be consistent, measurable and verifiable used to assess value to actual users, and results. This is an exciting time in how to tailor their algorithms to meet biomedical text mining, full of promise. their needs. It is well known in Researchers must lead the coordination information retrieval that increases in effort to realise the full scientific potential precision or recall do not necessarily of biomedical text mining. correlate with user success in the searching task.79 As such, simply Annotated bibliography optimising these metrics may not result in * Papers of particular interest published within systems that meet users’ needs. The triage the period of this review. task of the TREC 2004 Genomics Track ** Papers of extreme interest published within the begins to address this with an evaluation period of this review. measure based on estimating the utility of 1. *Yeh, A. S., Hirschman, L. and Morgan, A. A. MGI’s current triage process.39 Other (2003), ‘Evaluation of text data mining for researchers have addressed this as well.36 database curation: Lessons learned from the KDD Challenge Cup’, Bioinformatics, Vol. 19 In order to design systems that deliver Suppl. 1, pp. i331–339. value to biomedical researchers, much 2. **Tanabe, L. and Wilbur, W. J. (2002), more work is needed in the area of ‘Tagging gene and protein names in biomedical creating evaluation metrics and methods text’, Bioinformatics, Vol. 18(8), pp. 1124–1132. that measure the real-world value of text- 3. *Chang, J. T., Schutze, H. and Altman, R. B.

& HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 1. 57–71. MARCH 2005 6 7 Cohen and Hersh

(2004), ‘GAPSCORE: Finding gene and 16. *Chiang, J. H. and Yu, H. C. (2003), ‘MeKE: protein names one word at a time’, Discovering the functions of gene products Bioinformatics, Vol. 20(2), pp. 216–225. from biomedical literature via sentence alignment’, Bioinformatics, Vol. 19(11), pp. 4. **Hirschman, L., Morgan, A. A. and Yeh, A. 1417–1422. S. (2002), ‘Rutabaga by any other name: Extracting biological names’, J. Biomed. 17. **Eskin, E. and Agichtein, E. (2004), Inform., Vol. 35(4), pp. 247–259. ‘Combining text mining and sequence analysis to discover protein functional regions’, in 5. **Tanabe, L. and Wilbur, W. J. (2004), ‘Proceedings of the 9th Pacific Symposium on ‘Generation of a large gene/protein lexicon by Biocomputing’, 6th–10th January, Hawaii, pp. morphological pattern analysis’, J. Bioinform. 288–299. Comput. Biol., Vol. 1(4), pp. 611–626. 18. *Swanson, D. R. (1991), ‘Complementary 6. *Schwartz, A. S. and Hearst, M. A. (2003), ‘A structures in disjoint science literatures’, in simple algorithm for identifying abbreviation ‘Proceedings of the 14th Annual International definitions in biomedical text’, in ‘Proceedings ACM SIGIR Conference on Research and of the 8th Pacific Symposium on Development in Information Retrieval’, ACM Biocomputing’, 3rd–7th January, Hawaii, pp. Press, Chicago, IL, pp. 280–289. 451–462. 19. *Weeber, M., Vos, R., Klein, H. et al. (2003), 7. *Liu, H. and Friedman, C. (2003), ‘Mining ,Generating hypotheses by discovering implicit terminological knowledge in large biomedical associations in the literature: A case report of a corpora’, in ‘Proceedings of the 8th Pacific search for new potential therapeutic uses for Symposium on Biocomputing’, 3rd–7th thalidomide’, J. Amer. Med. Inform. Assoc., Vol. January, Hawaii, pp. 415–426. 10(3), pp. 252–259. 8. *Gaizauskas, R., Demetriou, G., Artymiuk, P. J. and Willett, P. (2003), ‘Protein structures 20. *Srinivasan, P. (2004), ‘Text mining: and information extraction from biological Generating hypothesis from MEDLINE’, texts: The PASTA system’, Bioinformatics, Vol. J. Amer. Soc Inf. Sci. Technol., Vol. 55, pp. 396–413. 19(1), pp. 135–143. 9. *Friedman, C., Kra, P., Yu, H. et al. (2001), 21. *Srinivasan, P., Libbus, B. and Sehgal, A. K. ‘GENIES: A natural-language processing (2004), ‘Mining MEDLINE: Postulating a system for the extraction of molecular beneficial role for curcumin longa in retinal pathways from journal articles’, Bioinformatics, diseases’, in ‘BioLINK 2004: Linking Vol. 17, Suppl. 1, pp. S74–82. Biological Literature, Ontologies, and Databases’, Boston, MA, Association for 10. *Donaldson, I., Martin, J., de Bruijn, B. et al. Computational Linguistics, p. 33–40 (URL: (2003), ‘PreBIND and Textomy – mining the http://www.cs.brandeis.edu/~jamesp/ biomedical literature for protein–protein biolink2004/). interactions using a support vector machine’, BMC Bioinformatics, Vol. 4(1), p. 11. 22. *Corney, D. P., Buxton, B. F., Langdon, W. B. and Jones, D. T.(2004), ‘BioRAT: 11. *Dobrokhotov, P. B., Goutte, C., Veuthey, A. Extracting biological information from full- and Gaussier, E. (2003), ‘Combining NLP and length papers’, Bioinformatics (in press). probabilistic categorisation of document and term selection for Swiss-Prot medical 23. *Aronson, A. R. (2001), ‘Effective mapping of annotation’, Bioinformatics, Vol. 19, Suppl. 1, biomedical text to the UMLS Metathesaurus: pp. i91–94. the MetaMap program’, in ‘Proceedings of the AMIA Symposium’, 3rd–7th November, 12. *Oregon Health & Science University (2004), Washington, DC, pp. 17–21. ‘TREC Genomics Track Protocol, 2004’ (URL: http://medir.ohsu.edu/genomics/ 24. *Mu¨ller, H., Kenny, E. E. and Sternberg, P. 2004protocol.html, accessed 27th August, W. (2004), ‘Textpresso: An ontology-based 2004). information retrieval and extraction system for biological literature’, PLoS Biol., Vol. 2(11). 13. *The Human Genome Organisation (2003), ‘HUGO Gene Nomenclature Committee, References 2003’ (URL: http://www.gene.ucl.ac.uk/ nomenclature/, accessed 29th September, 1. Hersh, W. R. (2003), ‘Information Retrieval: 2003). A Health and Biomedical Perspective’, 2nd edn, Springer, New York. 14. *Yu, H. and Agichtein, E. (2003), ‘Extracting synonymous gene and protein terms from 2. Mitchell, J. A., Aronson, A. R., Mork, J. G. biological literature’, Bioinformatics, Vol. 19, et al. (2003), ‘Gene indexing: Characterization Suppl. 1, pp. i340–349. and analysis of NLM’s GeneRIFs’, in ‘Proceedings of the AMIA Symposium’, 15. *Chang, J. T., Schutze, H. and Altman, R. B. 8th–12th November, Washington, DC, pp. (2002), ‘Creating an online dictionary of 460–464. abbreviations from MEDLINE’, J. Amer. Med. Inform. Assoc., Vol. 9(6), pp. 612–620. 3. Weeber, M., Klein, H., Aronson, A. R. et al.

68 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 1. 57–71. MARCH 2005 Current work in biomedical text mining

(2000), ‘Text-based discovery in : without natural language processing’, in The architecture of the DAD-system’, in ‘Computational Linguistics and Intelligent ‘Proc. AMIA Symposium’, 4th–8th Text Processing, Proceedings’, Lecture Notes November, Los Angeles, CA, pp. 903–907. in Computer Science, Vol. 2588, Gelbukh, A. F., Ed, Springer, pp. 360–369. 4. Kostoff, R. N., Block, J. A., Stump, J. A. and Pfeil, K. M. (2004), ‘Information content in 17. Brill, E. and Mooney, R. J. (1997), ‘An Medline record fields’, Int. J. Med. Inf., Vol. overview of empirical natural language 73(6), pp. 515–527. processing’, AI Magazine, Vol. 18(4), pp. 13–24. 5. Hersh, W., Bhupatiraju, R. T., Oregon 18. Chang, J. T., Schutze, H. and Altman, R. B. Health & Science University (2004), ‘TREC (2004), ‘GAPSCORE: Finding gene and Genomics Track Overview, 2003’ (URL: protein names one word at a time’, http://medir.ohsu.edu/genomics/ Bioinformatics, Vol. 20(2), pp. 216–225. overview.pdf, accessed 13th September, 2004). 19. Franzen, K., Eriksson, G., Olsson, F. et al. 6. Yeh, A. S., Hirschman, L. and Morgan, A. A. (2002), ‘Protein names and how to find them’, (2003), ‘Evaluation of text data mining for Int. J. Med. Inf., Vol. 67(1–3), pp. 49–61. database curation: Lessons learned from the KDD Challenge Cup’, Bioinformatics, Vol. 19, 20. Zhou, G., Zhang, J., Su, J. et al. (2004), Suppl. 1, pp. i331–339. ‘Recognizing names in biomedical texts: A approach’, Bioinformatics, 7. Lindsay, R. K. and Gordon, M. D. (1999), Vol. 20(7), pp. 1178–1190. ‘Literature-based discovery by lexical statistics’, J. Amer. Soc. Information Sci., Vol. 50(7), pp. 21. Kim, J. D., Ohta, T., Tateisi, Y. and Tsujii, J. 574–587. (2003), ‘GENIA corpus – a semantically annotated corpus for bio-textmining’, 8. Swanson, D. R. (1990), ‘Medical literature as a Bioinformatics, Vol. 19, Suppl. 1, pp. i180–182. potential source of new knowledge’, Bull. Med. Libr. Assoc., Vol. 78(1), pp. 29–37. 22. Narayanaswamy, M., Ravikumar, K. E. and Vijay-Shanker, K. (2003), ‘A biological named 9. de Bruijn, B. and Martin, J. (2002), ‘Getting to entity recognizer’, in ‘Proceedings of the 8th the (c)ore of knowledge: Mining biomedical Pacific Symposium on Biocomputing’, 3rd– literature’, Int. J. Med. Inf., Vol. 67(1–3), pp. 7th January, Hawaii, pp. 427–38. 7–18. 23. Settles, B. (2004), ‘Biomedical named entity 10. Hanisch, D., Fluck, J., Mevissen, H. T. and recognition using conditional random fields Zimmer, R. (2003), ‘Playing biology’s name and rich feature sets’, in ‘Proceedings of the game: Identifying protein names in scientific International Joint Workshop on Natural text’, in ‘Proceedings of the 8th Pacific Language Processing in Biomedicine and its Symposium on Biocomputing’, 3rd–7th Applications (NLPBA)’, Geneva, Switzerland. January, Hawaii, pp. 403–414. 24. Mika, S. and Rost, B. (2004), ‘Protein names 11. Brill, E. (1995), ‘Transformation-based error- precisely peeled off free text’, Bioinformatics, driven learning and natural language Vol. 20, Suppl. 1, pp. I241–247. processing: A case study in part-of-speech tagging’, Comput. Linguistics, Vol. 21(4), pp. 25. Chen, L. and Friedman, C. (2004), ‘Extracting 543–565. phenotypic information from the literature via natural language processing’, in ‘Proceedings of 12. Hirschman, L., Morgan, A. A. and Yeh, A. S. the 11th World Congress on Medical (2002), ‘Rutabaga by any other name: Informatics’, IMIA, San Francisco, CA, pp. Extracting biological names’, J. Biomed. 758–762. Inform., Vol. 35(4), pp. 247–259. 26. Tanabe, L. and Wilbur, W. J. (2004), 13. Tanabe, L. and Wilbur, W. J. (2002), ‘Tagging ‘Generation of a large gene/protein lexicon by gene and protein names in biomedical text’, morphological pattern analysis’, J. Bioinform. Bioinformatics, Vol. 18(8), pp. 1124–1132. Comput. Biol., Vol. 1(4), pp. 611–626. 14. Yu, H., Hatzivassiloglou, V., Friedman, C. et 27. Wren, J. D. and Garner, H. R. (2004), ‘Shared al. (2002), ‘Automatic extraction of gene and relationship analysis: Ranking set cohesion and protein synonyms from MEDLINE and commonalities within a literature-derived journal articles’, in ‘Proceedings of the AMIA relationship network’, Bioinformatics, Vol. Symposium’, 9th–13th November, San 20(2), pp. 191–198. Antonio, TX, pp. 919–923. 28. Schwartz, A. S. and Hearst, M. A. (2003), ‘A 15. Blaschke, C. (2004), ‘Biocreative: Critical simple algorithm for identifying abbreviation Assessment for Information Extraction in definitions in biomedical text’, in ‘Proceedings Biology’, Granada, Spain (URL: http:// of the 8th Pacific Symposium on www.pdg.cnb.uam.es/Biolink/ Biocomputing’, 3rd–7th January, Hawaii, pp. Workshop_BioCreative_04/handout/ 451–462. index.html). 29. McDonald, D. M., Chen, H., Su, H. and 16. Brill, E. (2003), ‘Processing natural language Marshall, B. B. (2004), ‘Extracting gene

& HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 1. 57–71. MARCH 2005 6 9 Cohen and Hersh

pathway relations using a hybrid grammar: The receptors’, Mol. Endocrinol., Vol. 17(8), pp. Arizona relation parser’, Bioinformatics (in 1555–1567. press). 42. Editorial (2003), ‘HUGO – a UN for the 30. Gaizauskas, R., Demetriou, G., Artymiuk, P. human genome’, Nat. Genet., Vol. 34(2), pp. J. and Willett, P. (2003), ‘Protein structures 115–116. and information extraction from biological 43. The Human Genome Organisation, HUGO texts: The PASTA system’, Bioinformatics, Vol. Gene Nomenclature Committee, 2003 (URL: 19(1), pp. 135–143. http://www.gene.ucl.ac.uk/nomenclature/, 31. Friedman, C., Kra, P., Yu, H. et al. (2001), accessed 29th September, 2003). ‘GENIES: A natural-language processing 44. Yu, H. and Agichtein, E. (2003), ‘Extracting system for the extraction of molecular synonymous gene and protein terms from pathways from journal articles’, Bioinformatics, biological literature’, Bioinformatics, Vol. 19, Vol. 17, Suppl. 1, pp. S74–82. Suppl. 1, pp. i340–349. 32. Liu, H. and Friedman, C. (2003), ‘Mining 45. Cohen, A. M. (2004), ‘Using symbolic terminological knowledge in large biomedical network logical analysis as a knowledge corpora’, in ‘Proceedings of the 8th Pacific extraction method on MEDLINE abstracts’, Symposium on Biocomputing’, 3rd–7th BMC Bioinformatics 2005 (in press). January, Hawaii, pp. 415–426. 46. Yu, H., Hripcsak, G. and Friedman, C. (2002), 33. Regev, Y., Finkelstein-Landau, M. and ‘Mapping abbreviations to full forms in Feldman, R. (2002), ‘Rule-based extraction of biomedical articles’, J. Amer. Med. Inform. experimental evidence in the biomedical Assoc., Vol. 9(3), pp. 262–272. domain: The KDD Cup 2002 (task 1)’, ACM SIGKDD Explorations Newsletter, Vol. 4(2), pp. 47. Chang, J. T., Schutze, H. and Altman, R. B. 90–92. (2002), ‘Creating an online dictionary of abbreviations from MEDLINE’, J. Amer. Med. 34. Shi, M., Edwin, D. S., Menon, R. et al. Inform. Assoc., Vol. 9(6), pp. 612–620. (2002), ‘A machine learning approach for the curation of biomedical literature-KDD Cup 48. Brandeis University. Medstract Project – 2002 (task 1)’, ACM SIGKDD Explorations Initial Annotion Corpora (2001) (URL: Newsletter, Vol. 4(2), pp. 93–94. http://scylla.cs.brandeis.edu/gold- standards.html, accessed 24th June, 2003). 35. Ghanem, M. M., Guo, Y., Lodhi, H. and Zhang, Y. (2003), ‘Automatic scientific text 49. Raychaudhuri, S., Schutze, H. and Altman, R. classification using local patterns: KDD Cup B. (2002), ‘Using text analysis to identify 2002 (task 1)’, ACM SIGKDD Explorations functionally coherent gene groups’, Genome Newsletter, Vol. 4(2), pp. 95–96. Res., Vol. 12(10), pp. 1582–1590. 36. Donaldson, I., Martin, J., de Bruijn, B. et al. 50. Raychaudhuri, S. and Altman, R. B. (2003), (2003), ‘PreBIND and Textomy – mining the ‘A literature-based method for assessing the biomedical literature for protein–protein functional coherence of a gene group’, interactions using a support vector machine’, Bioinformatics, Vol. 19(3), pp. 396–401. BMC Bioinformatics, Vol. 4(1), p. 11. 51. Glenisson, P., Antal, P., Mathys, J. et al. 37. Dobrokhotov, P. B., Goutte, C., Veuthey, A. (2003), ‘Evaluation of the vector space and Gaussier, E. (2003), ‘Combining NLP and representation in text-based gene clustering’, probabilistic categorisation of document and in ‘Proceedings of the 8th Pacific Symposium term selection for Swiss-Prot medical on Biocomputing’, 3rd–7th January, Hawaii, annotation’, Bioinformatics, Vol. 19, Suppl. 1, pp. 391–402. pp. i91–94. 52. Chiang, J. H. and Yu, H. C. (2003), ‘MeKE: 38. Liu, F., Jenssen, T. K., Nygaard, V. et al. Discovering the functions of gene products (2004), ‘FigSearch: A figure legend indexing from biomedical literature via sentence and classification system’, Bioinformatics, Vol. alignment’, Bioinformatics, Vol. 19(11), pp. 20, pp. 2880–2882. 1417–1422. 39. Hersh, W.R. and Oregon Health & Science 53. Raychaudhuri, S., Chang, J. T., Sutphin, P. D. University (2004), ‘TREC Genomics Track and Altman, R. B. (2002), ‘Associating genes Protocol’ (URL: http://medir.ohsu.edu/ with codes using a maximum genomics/2004protocol.html, accessed 27th entropy analysis of biomedical literature’, August, 2004). Genome Res., Vol. 12(1), pp. 203–214. 40. Chiang, J. H., Yu, H. C. and Hsu, H. J. 54. Pan, H., Zuo, L., Choudhary, V. et al. (2004), (2004), ‘GIS: A biomedical text-mining system ‘Dragon TF Association Miner: A system for for gene information discovery’, Bioinformatics, exploring transcription factor associations Vol. 20(1), pp. 120–121. through text-mining’, Nucleic Acids Res., Vol. 32 (Web Server issue), pp. W230–234. 41. Albert, S., Gaudan, S., Knigge, H. et al. (2003), ‘Computer-assisted generation of a 55. Krallinger, M. (2004), ‘BioCreAtIvE – critical protein-interaction database for nuclear assessment of information extraction systems in

70 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 1. 57–71. MARCH 2005 Current work in biomedical text mining

biology’ (URL: http://www.pdg.cnb.uam.es/ substances and diseases’, Bioinformatics, Vol. 20, BioLINK/BioCreative.eval.html, accessed 1st Suppl. 1, pp. I290–296. September, 2004). 68. Srinivasan, P., Libbus, B. and Sehgal, A. K. 56. Eskin, E. and Agichtein, E. (2004), (2004), ‘Mining MEDLINE: Postulating a ‘Combining text mining and sequence analysis beneficial role for curcumin longa in retinal to discover protein functional regions’, in diseases’, in ‘BioLINK 2004: Linking ‘Proceedings of the 9th Pacific Symposium on Biological Literature, Ontologies, and Biocomputing’, 6th–10th January, Hawaii, pp. Databases, Boston, MA, Association for 288–99. Computational Linguistics, pp. 33–40 (URL: http:/www.cs.brandeis.edu/~jamesp/ 57. Srinivasan, P. and Wedemeyer, M. (2003), biolink2004/). ‘Mining concept profiles with the vector model or where on earth are diseases being 69. Novichkova, S., Egorov, S. and Daraselia, N. studied?’, in ‘Proceedings of the Text Mining (2003), ‘MedScan, a natural language Workshop. Third SIAM International processing engine for MEDLINE abstracts’, Conference on Data Mining’, San Francisco. Bioinformatics, Vol. 19(13), pp. 1699–1706. 58. Kostoff, R. N. (2003), ‘Bilateral asymmetry 70. Glenisson, P., Coessens, B., Van Vooren, S. prediction’, Med. Hypotheses, Vol. 61(2), pp. et al. (2004), ‘TXTGate: Profiling gene groups 265–266. with text-based information’, Genome Biol., Vol. 5(6), p. R43. 59. Xu, H., Anderson, K., Grann, V. and Friedman, C. (2004), ‘Facilitating cancer 71. Becker, K. G., Hosack, D. A., Dennis, G. Jr research using natural language processing of et al. (2003), ‘PubMatrix: A tool for multiplex pathology reports’, in ‘Proceedings of the 11th literature mining’, BMC Bioinformatics, Vol. World Congress on Medical Informatics’, 4(1), p. 61. IMIA, San Francisco, CA, pp. 565–569. 72. Corney, D. P., Buxton, B. F., Langdon, W. B. 60. Swanson, D.R. (1991), ‘Complementary and Jones, D. T. (2004), ‘BioRAT: Extracting structures in disjoint science literatures’, in biological information from full-length papers’, ‘Proceedings of the 14th Annual International Bioinformatics (in press). ACM SIGIR Conference on Research and 73. Mu¨ller, H., Kenny, E. E. and Sternberg, P. W. Development in Information Retrieval’, ACM (2004), ‘Textpresso: An ontology-based Press, Chicago, IL, pp. 280–289. information retrieval and extraction system for biological literature. PLoS Biol., Vol. 2(11). 61. Weeber, M., Vos, R., Klein, H. et al. (2003), ‘Generating hypotheses by discovering implicit 74. Nenadic, G., Spasic, I. and Ananiadou, S. associations in the literature: A case report of a (2003), ‘Terminology-driven mining of search for new potential therapeutic uses for biomedical literature’, Bioinformatics, Vol. thalidomide’, J. Amer. Med. Inform. Assoc., Vol. 19(8), pp. 938–943. 10(3), pp. 252–259. 75. Aronson, A. R. (2001), ‘Effective mapping of 62. Swanson, D. R. (1986), ‘Fish oil, Raynaud’s biomedical text to the UMLS Metathesaurus: syndrome, and undiscovered public The MetaMap program’, in ‘Proc. AMIA knowledge’, Perspect. Biol. Med., Vol. 30(1), Symp.’, 3rd–7th November, Washington, DC, pp. 7–18. pp. 17–21. 63. Ramadan, N. M., Halvorson, H., 76. Rindflesch, T. C., Hunter, L. and Aronson, Vande-Linde, A. et al. (1989), ‘Low brain A. R. (1999), ‘Mining molecular binding magnesium in migraine’, Headache, Vol. 29(9), terminology from biomedical text’, in ‘Proc. pp. 590–593. AMIA Symp.’, 6th–10th November, Washington, DC, pp. 127–131. 64. Ferrari, M. D. (1992), ‘Biochemistry of migraine’, Pathol. Biol. (Paris), Vol. 40(4), pp. 77. Majoros, W. H., Subramanian, G. M. and 287–292. Yandell, M. D. (2003), ‘Identification of key concepts in biomedical literature using a 65. Gordon, M. D. and Lindsay, R. K. (1996), modified Markov heuristic’, Bioinformatics, Vol. ‘Toward discovery support systems: A 19(3), pp. 402–407. replication, re-examination, and extension of Swanson’s work on literature-based discovery 78. Regev, Y., Finkelstein-Landau, M. and of a connection between Raynaud’s and fish Feldman, R. (2003), ‘Rule-based extraction of oil’, J. Amer. Soc. Inf. Sci., Vol. 47(2), pp. experimental evidence in the biomedical 116–128. domain: The KDD Cup 2002 (task 1)’, ACM SIGKDD Explorations Newsletter, Vol. 4(2), pp. 66. Srinivasan, P. (2004), ‘Text mining: 90–92. Generating hypothesis from MEDLINE’, J. Amer. Soc. Inf. Sci. Technol., Vol. 55, pp. 79. Hersh, W. R., Crabtree, M. K., Hickam, D. 396–413. H. et al. (2002), ‘Factors associated with success in searching MEDLINE and applying evidence 67. Srinivasan, P. and Libbus, B. (2004), ‘Mining to answer clinical questions’, J. Amer. Med. MEDLINE for implicit links between dietary Inform. Assoc., Vol. 9(3), pp. 283–293.

& HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 1. 57–71. MARCH 2005 7 1