View of Functional Genomics [12], the Generation Challenge Pro- the Proposed Methodology

BMC Bioinformatics BioMed Central Methodology article Open Access BIRI: a new approach for automatically discovering and indexing available public bioinformatics resources from the literature Guillermo de la Calle*†, Miguel García-Remesal†, Stefano Chiesa†, Diana de la Iglesia† and Victor Maojo† Address: Dept Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Campus de Montegancedo S/N, 28660 Boadilla del Monte, Madrid, Spain Email: Guillermo de la Calle* - [email protected]; Miguel García-Remesal - [email protected]; Stefano Chiesa - [email protected]; Diana de la Iglesia - [email protected]; Victor Maojo - [email protected] * Corresponding author †Equal contributors Published: 7 October 2009 Received: 12 February 2009 Accepted: 7 October 2009 BMC Bioinformatics 2009, 10:320 doi:10.1186/1471-2105-10-320 This article is available from: http://www.biomedcentral.com/1471-2105/10/320 © 2009 de la Calle et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract Background: The rapid evolution of Internet technologies and the collaborative approaches that dominate the field have stimulated the development of numerous bioinformatics resources. To address this new framework, several initiatives have tried to organize these services and resources. In this paper, we present the BioInformatics Resource Inventory (BIRI), a new approach for automatically discovering and indexing available public bioinformatics resources using information extracted from the scientific literature. The index generated can be automatically updated by adding additional manuscripts describing new resources. We have developed web services and applications to test and validate our approach. It has not been designed to replace current indexes but to extend their capabilities with richer functionalities. Results: We developed a web service to provide a set of high-level query primitives to access the index. The web service can be used by third-party web services or web-based applications. To test the web service, we created a pilot web application to access a preliminary knowledge base of resources. We tested our tool using an initial set of 400 abstracts. Almost 90% of the resources described in the abstracts were correctly classified. More than 500 descriptions of functionalities were extracted. Conclusion: These experiments suggest the feasibility of our approach for automatically discovering and indexing current and future bioinformatics resources. Given the domain- independent characteristics of this tool, it is currently being applied by the authors in other areas, such as medical nanoinformatics. BIRI is available at http://edelman.dia.fi.upm.es/biri/. Background and services. Discovering and accessing the appropriate The number of public online bioinformatics resources has bioinformatics resource for a specific research task has grown exponentially over the past few years. Bioinformat- become increasingly important, as suggested in earlier ics professionals can access and use a large number of reports [1]. resources for their research --including databases, tools Page 1 of 14 (page number not for citation purposes) BMC Bioinformatics 2009, 10:320 http://www.biomedcentral.com/1471-2105/10/320 To address this issue, various significant projects and ini- licly available databases --e.g. Entrez Gene, OMIM, GO, tiatives have been carried out, leading to several pioneer- OBO, PDB, GeneBank, Prosite, etc. -- have been success- ing indexes of bioinformatics resources that are currently fully integrated, including more than 160 million RDF available over the Internet. For instance, BioPortal [2] is a documents. web repository of biomedical ontology resources, developed at the National Center for Biomedical Ontology In this paper we present a tool to automatically create a (NCBO) [3]. This application enables collaborative devel- searchable index of bioinformatics resources from the sci- opment of biomedical ontologies. BioPortal includes a entific literature. The resources are annotated with meta- service called Open Biomedical Resources (OBR) for data regarding their functionality. Metadata are extracted annotating and indexing biomedical resources [4]. from abstracts of research papers retrieved from PubMed® Resources are annotated using concepts from a domain [20] and the ISI Web of Knowledge® [21] using pattern- ontology. The OBR service enables researchers to locate matching techniques. The generated index can be incre- resources by specifying ontology concepts. mentally updated by feeding our tool with abstracts of articles describing new resources. In the evaluation sec- Other examples of such indexes include the biomedical tion, we present and analyze some figures that compare database collection compiled by Galperin [5]--a yearly the results of our approach with other existing indexes. updated web-based list of molecular biology databases sorted in alphabetical order-- or the Bioinformatics Links Methods Directory (BLD) [6,7] --a catalogue of links to bioinfor- We extract relevant information about bioinformatics matics resources, tools and databases classified into resources from papers published in high-impact journals. eleven major categories-- where resources can be searched We focus on manuscript abstracts, which tend to con- using keyword-based queries. dense the key information. We use classic natural lan- guage processing techniques--such as, for instance, The European Bioinformatics Institute (EBI) provides a tokenizers, parsers, transition networks or part-of-speech searchable index of an alphabetically-sorted inventory of taggers--to extract this information. These techniques pro- bioinformatics resources [8]. Resources are classified vide the basic framework for extracting data from textual according to the type of service they provide--databases, resources [22]. All the techniques we have used are tools and (web) services. The index includes both internal detailed in the following sections. The outcome of this and external resources. process is an indexed knowledge base including the most relevant information about the resources. The BioMoby platform provides an annotated registry of bioinformatics web services enabling other applications Building the knowledge base to integrate and use such services [9,10]. There are various We propose a five-phase method for automatically gener- major installations of BioMoby, such as, for instance, the ating a knowledge base of bioinformatics resources using PlaNet Consortium [11], the Australian Centre for Plant research paper abstracts. Figure 1 shows an overview of Functional Genomics [12], the Generation Challenge Pro- the proposed methodology. Next we detail the five activi- gram of the Consultative Group for International Agricul- ties. tural Research [13], Genome Canada [14], or the Spanish National Institute of Bioinformatics (INB) [15]. They pro- Abstract selection and generation of abstract surrogates vide access to different bioinformatics resources depend- First we retrieve a collection of papers from well-known ing on their own interests and needs. scientific literature web sites. Each paper is then analyzed to produce a structured surrogate, including the title, A consortium composed of the seven US National Centers authors, abstract and electronic references to the paper-- for Biomedical Computing [16] has recently developed i.e. the PubMed® Identifier (PMID). The structure of the another index of bioinformatics resources called iTools surrogate and a complete example is shown within the [17,18]. Web services are used here to provide access to "Papers" section in Table 1. Such surrogates simplify the resources that are annotated according to their functional- pattern-matching process. For instance, the title of the ity. A web-based interface enables researchers to locate the paper usually contains the name of the resources. We use resources they need using advanced search and visual nav- that knowledge as a heuristic to discard unnecessary igation tools. Another initiative for organizing bioinfor- searches in other sections of the manuscripts. matics resources is the BIO2RDF system [19]. This system considers the information contained in biological data- Surrogate pre-processing bases, providing uniform access to biological data stored Once the surrogates have been generated, the titles and in public databases. Data is converted into RDF format abstracts are divided into sentences, using strong punctu- using a common reference ontology. More than 20 pub- ation marks as sentence delimiters. Then, each sentence is Page 2 of 14 (page number not for citation purposes) BMC Bioinformatics 2009, 10:320 http://www.biomedcentral.com/1471-2105/10/320 BIRIFigure knowledge 1 base automatic construction phases BIRI knowledge base automatic construction phases. We propose a five-step methodology for automatically building the knowledge base: i) abstract selection, ii) surrogate pre-processing, iii) information extraction, iv) resource classification and v) manual curation. Abstracts are extracted from scientific papers taken from the ISI Web of Knowledge®. pre-processed by a lexical analyzer. The analyzer extracts base and software papers published in

View of Functional Genomics [12], the Generation Challenge Pro- the Proposed Methodology

High-Throughput Analysis Suggests Differences in Journal False

Tracking the Popularity and Outcomes of All Biorxiv Preprints

Event Extraction on Pubmed Scale Filip Ginter1*, Jari Björne1,2, Sampo Pyysalo3 from Workshop on Advances in Bio Text Mining Ghent, Belgium

A Comparison of Feature Selection and Classification Methods in DNA

Use and Mis-Use of Supplementary Material in Science Publications Mihai Pop1* and Steven L

Viewed: 8 Were Accepted As Full Papers, 4 As Poster Large Investment of Effort Needed to Develop Sufficiently Papers and One As a Demo Paper

Mining Locus Tags in Pubmed Central to Improve Microbial Gene Annotation Chris J Stubben* and Jean F Challacombe

Analyzing the Field of Bioinformatics with the Multi-Faceted Topic Modeling Technique Go Eun Heo1, Keun Young Kang1, Min Song1* and Jeong-Hoon Lee2

Semantic Web Applications and Tools for the Life Sciences: SWAT4LS 2010

Paperbot: Open-Source Web-Based Search and Metadata Organization of Scientific Literature Patricia Maraver1,2, Rubén Armañanzas1,2, Todd A

BMC Bioinformatics Biomed Central

Knowledge Discovery of Drug Data on the Example of Adverse Reaction Prediction Pinar Yildirim1*, Ljiljana Majnarić2, Ozgur Ilyas Ekmekci1, Andreas Holzinger3