Deriving and Prioritizing Gene Lists from Pubmed
Total Page:16
File Type:pdf, Size:1020Kb
Jourquin et al. BMC Genomics 2012, 13(Suppl 8):S20 http://www.biomedcentral.com/1471-2164/13/S8/S20 RESEARCH Open Access GLAD4U: deriving and prioritizing gene lists from PubMed literature Jérôme Jourquin1,2, Dexter Duncan1, Zhiao Shi3,4, Bing Zhang1,2* From The International Conference on Intelligent Biology and Medicine (ICIBM) Nashville, TN, USA. 22-24 April 2012 Abstract Background: Answering questions such as “Which genes are related to breast cancer?” usually requires retrieving relevant publications through the PubMed search engine, reading these publications, and creating gene lists. This process is not only time-consuming, but also prone to errors. Results: We report GLAD4U (Gene List Automatically Derived For You), a new, free web-based gene retrieval and prioritization tool. GLAD4U takes advantage of existing resources of the NCBI to ensure computational efficiency. The quality of gene lists created by GLAD4U for three Gene Ontology (GO) terms and three disease terms was assessed using corresponding “gold standard” lists curated in public databases. For all queries, GLAD4U gene lists showed very high recall but low precision, leading to low F-measure. As a comparison, EBIMed’s recall was consistently lower than GLAD4U, but its precision was higher. To present the most relevant genes at the top of a list, we studied two prioritization methods based on publication count and the hypergeometric test, and compared the ranked lists and those generated by EBIMed to the gold standards. Both GLAD4U methods outperformed EBIMed for all queries based on a variety of quality metrics. Moreover, the hypergeometric method allowed for a better performance by thresholding genes with low scores. In addition, manual examination suggests that many false-positives could be explained by the incompleteness of the gold standards. The GLAD4U user interface accepts any valid queries for PubMed, and its output page displays the ranked gene list and information associated with each gene, chronologically-ordered supporting publications, along with a summary of the run and links for file export and functional enrichment and protein interaction network analysis. Conclusions: GLAD4U has a high overall recall. Although precision is generally low, the prioritization methods successfully rank truly relevant genes at the top of the lists to facilitate efficient browsing. GLAD4U is simple to use, and its interface can be found at: http://bioinfo.vanderbilt.edu/glad4u. Background The current answers to these questions are primarily The physical development and phenotype of organisms contained in the articles indexed in the MEDLINE data- can be thought of as a product of genes interacting with base. Traditionally, answering these questions requires each other and with the environment. Therefore, it is individuals to retrieve relevant publications through the common for a scientist to ask questions like “Which PubMed search engine and then to create gene lists by genes are related to breast cancer?”, “Which genes are manually extracting gene-centered information from involved in embryonic development?”, and “Which genes retrieved literature. This process is not only time-consum- are functionally related to TP53?” ing, but also prone to errors. First, it is difficult to ascertain that all relevant literature is processed. Second, it is unli- kely that all relationships in a publication will be detected. * Correspondence: [email protected] Third, individual researchers tend to extrapolate based on 1Department of Biomedical Informatics, Vanderbilt University School of Medicine, 400 Eskind Biomedical Library, 2209 Garland Avenue, Nashville, TN domain knowledge. 37232, USA Full list of author information is available at the end of the article © 2012 Jourquin et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Jourquin et al. BMC Genomics 2012, 13(Suppl 8):S20 Page 2 of 12 http://www.biomedcentral.com/1471-2164/13/S8/S20 Over the past decade, bioinformatics approaches have Results been developed to address this issue. One of the most suc- Overall quality of the retrieved gene lists cessful projects in this area is the Gene Ontology (GO) GLAD4U relies on the NCBI eSearch API to find publi- project [1]. GO produces a structured, precisely defined, cations related to a user query and on the gene-to-publi- and controlled vocabulary (i.e., GO terms) for describing cation link table to identify genes from the retrieved the roles of genes and gene products in different species. publications. We used three GO biological process terms Genes are associated with GO terms through manual (apoptosis, cell adhesion and DNA repair) and three dis- curation as well as computational inference. A researcher ease terms (hypertension, obesity and schizophrenia) as can now go to the GO website [2] to get a list of genes queries to evaluate the overall quality of the retrieved related to a GO term of interest. However, as the GO gene lists. For each query, using a corresponding gene list vocabulary only describes gene products in terms of their curated by GO or GAD/OMIM as a gold standard, we associated biological processes, cellular components and calculated the precision, recall and F-measure of the molecular functions, users are limited by questions linked retrieved gene list. As shown in Table 1, gene lists to this limited vocabulary. Moreover, processes, functions retrieved for all queries showed very high recall (0.90 or components that are unique to diseases, such as onco- ±0.03 for GO terms and 0.96±0.05 for disease terms). In genesis, are not included in GO because causing cancer is contrast to the high recall, the precision was generally not the normal function of any gene. low (0.16±0.04 for GO terms and 0.06±0.02 for disease A useful resource specifically designed for disease stu- terms), leading to low F-measures (0.27±0.05 for GO dies is the Online Mendelian Inheritance in Man (OMIM terms and 0.12±0.03 for disease terms). EBIMed’s recall [3]) project. OMIM is a comprehensive, authoritative, is consistently lower than GLAD4U (0.47±0.15 for GO and timely compendium of human genes and genetic terms and 0.44±0.11 for disease terms). However, its pre- phenotypes. It contains information on all known Men- cision is higher than GLAD4U (0.20±0.05 for GO terms delian disorders. However, information on complex dis- and 0.16±0.04 for disease terms), resulting in better eases such as cancer and diabetes is lacking in OMIM. F-measures (0.27±0.03 for GO terms and 0.23±0.04 for In addition to manual curation, text mining tools have disease terms). beendevelopedtoassistgenelistcreation[4].Asan The low precision of GLAD4U may be partially attrib- example, EBIMed [5,6] combines text mining with co- uted to the incompleteness of the annotation in GO and occurrence-based analysis to generate a prioritized list of GAD/OMIM. However, it is likely that the original gene genes for a user-provided query. Specifically, EBIMed lists include many irrelevant genes. In this case, a priori- collects MEDLINE records and available full text docu- tization step that ranks truly relevant genes at the top ments for a user-provided query, identifies protein of a list would certainly facilitate efficient browsing. names, drugs, species, or GO terms in the documents, and prioritizes genes/proteins based on the number of Performance of the prioritization methods co-occurrences of the different pairs (protein/protein, We studied the performance of two methods to prioritize protein/drug, protein/species, protein/GO term) in the the gene lists. The first, “GLAD4U Counts”, is based solely sentences of the documents in which they appear. on the number of supporting publications as commonly EBIMed and similar tools, such as FACTA [7] and SciMi- implemented in other software [10,11]. The second, ner [8], provide more flexible ways to create gene lists “GLAD4U Hypergeometric”,isproposedinthisstudy, that are not limited to certain aspects of biology. Never- which is based on the Hypergeometric test (see the Meth- theless, they usually require heavy computation, and the ods section for details). We used the above mentioned relevance of the resulted gene lists to the input queries three GO terms and three disease terms as queries to eval- has not been systematically evaluated. uate the performance of our prioritization methods. We Here, we report GLAD4U (Gene List Automatically also included the prioritized gene lists returned by Derived For You), a new web-based gene retrieval and EBIMed for comparison. prioritization tool. GLAD4U takes advantage of existing Figure 1 depicts the precision/recall curves from this resources at the National Center for Biotechnology Infor- comparative evaluation. For all queries, based on manual mation (NCBI) to ensure computational efficiency. It inspection of the curves, both GLAD4U Counts and provides a simple user interface that facilitates intuitive GLAD4U Hypergeometric outperformed EBIMed, espe- usage and interpretation of results. The quality of gene cially at the high precision range. Between the two lists created by GLAD4U is assessed using corresponding GLAD4U methods, the Hypergeometric method per- “gold standard” lists curated in GO, GAD (Genetic Asso- formed better than the Counts method for GO term ciation