The Bioc-Biogrid Corpus: Full Text Articles Anno- Tated for Curation Of
Total Page:16
File Type:pdf, Size:1020Kb
Database, 2017, 1–11 doi: 10.1093/database/baw147 Original article Original article The BioC-BioGRID corpus: full text articles anno- tated for curation of protein–protein and genetic interactions Rezarta Islamaj Dogan 1,†, Sun Kim1,†, Andrew Chatr-aryamontri2,†, Christie S. Chang3, Rose Oughtred3, Jennifer Rust3, W. John Wilbur1, Donald C. Comeau1, Kara Dolinski3 and Mike Tyers2,4 1National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD20894, USA, 2Institute for Research in Immunology and Cancer, Universite´de Montre´al, Canada Montre´al, QC H3C 3J7, 3Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA and 4Mount Sinai Hospital, The Lunenfeld-Tanenbaum Research Institute, Canada *Corresponding author: Tel: 301 435 8769; E-mail: [email protected] †These authors contributed equally to this work. Citation details: Islamaj Dogan,R., Kim,S., Chatr-Aryamontri,A. et al. The BioC-BioGRID corpus: full text articles annotated for curation of protein–protein and genetic interactions. Database (2016) Vol. 2016: article ID baw147; doi:10.1093/data- base/baw147 Received 30 July 2016; Revised 14 October 2016; Accepted 18 October 2016 Abstract A great deal of information on the molecular genetics and biochemistry of model organ- isms has been reported in the scientific literature. However, this data is typically described in free text form and is not readily amenable to computational analyses. To this end, the BioGRID database systematically curates the biomedical literature for gen- etic and protein interaction data. This data is provided in a standardized computationally tractable format and includes structured annotation of experimental evidence. BioGRID curation necessarily involves substantial human effort by expert curators who must read each publication to extract the relevant information. Computational text-mining methods offer the potential to augment and accelerate manual curation. To facilitate the develop- ment of practical text-mining strategies, a new challenge was organized in BioCreative V for the BioC task, the collaborative Biocurator Assistant Task. This was a non- competitive, cooperative task in which the participants worked together to build BioC- compatible modules into an integrated pipeline to assist BioGRID curators. As an integral part of this task, a test collection of full text articles was developed that contained both biological entity annotations (gene/protein and organism/species) and molecular inter- action annotations (protein–protein and genetic interactions (PPIs and GIs)). This collec- tion, which we call the BioC-BioGRID corpus, was annotated by four BioGRID curators over three rounds of annotation and contains 120 full text articles curated in a dataset Published by Oxford University Press 2017. This work is written by US Government employees and is in the public domain in the US. Page 1 of 11 (page number not for citation purposes) Page 2 of 11 Database, Vol. 2017, Article ID baw147 representing two major model organisms, namely budding yeast and human. The BioC- BioGRID corpus contains annotations for 6409 mentions of genes and their Entrez Gene IDs, 186 mentions of organism names and their NCBI Taxonomy IDs, 1867 mentions of PPIs and 701 annotations of PPI experimental evidence statements, 856 mentions of GIs and 399 annotations of GI evidence statements. The purpose, characteristics and pos- sible future uses of the BioC-BioGRID corpus are detailed in this report. Database URL: http://bioc.sourceforge.net/BioC-BioGRID.html Introduction The BioC task delivered an annotation interface that BioCreative (Critical Assessment of Information BioGRID curators then tested for their curation needs. The Extraction in Biology) (1–4) is a collaborative initiative to curation requirements addressed by the BioC annotation provide a common evaluation framework for monitoring interface were: (i) handling full text articles, (ii) identifying and assessing the state-of-the-art of text-mining systems molecular interactions with evidence for curation and applied to biologically relevant problems. The goal of the (iii) linking genes/proteins and organisms/species to the BioCreative challenges has been to pose tasks that will re- NCBI Entrez ID and Taxonomy ID, respectively (19). sult in systems capable of scaling for use by general biology Generally, in order for a text-mining system to be of researchers and more specialized end users such as data- genuine use to biocurators, its performance and functional- base curators. An important contribution of BioCreative ity need to be developed to optimize its usability (20). The challenges has also been the generation of shared gold usability of a text-mining system for biocurators requires standard datasets, prepared by domain experts, for the an interactive interface that allows a variable display of an- training and testing of text-mining applications. These col- notations, links to supporting evidence and the ability to lections, and the associated evaluation methods, represent edit annotations, among other features. A key feature of an important resource for continued development and im- such a system is how the text-mining results are presented provement of text-mining applications. to biocurators—i.e. what biocurators actually see in the The BioCreative V Workshop was held in September annotation tool—and whether the text describing evidence 2015 and consisted of five tasks: the Collaborative needed for curation is highlighted. Biocurator Assistant Task (BioC task) (5), the Currently, no manually curated datasets are available CHEMDNER-patents task (6), the CDR task (7), the BEL that annotate only the minimal text required by the curator task (8, 9) and the User Interactive Task (10). Our focus is to capture relevant interactions and associated experimen- the BioC task. BioC is a format for sharing text data and an- tal evidence. This is in contrast with automated text-min- notations and a minimalistic approach to interoperability ing systems, which typically over-annotate the text. for biomedical text mining (11). The first BioC task in Previous efforts for similar tasks in biomedical information BioCreative IV (12) released the BioC libraries and called extraction research have produced a corpus for PPIs as a for development of other tools and data resources that used result of the BioCreative II PPI task (21), and a corpus for BioC to facilitate interoperability. The BioC task in gene function curation via Gene Ontology (GO) annota- BioCreative V was designed to make maximal use of BioC tion as a result of the BioCreative IV GO task (22). These to promote data sharing and ease of use and reuse of soft- tasks are similar in that curators were asked to annotate ware created. The task was positioned as a collaboration ra- sentences that describe the information that was curated ther than a competition such that participating teams from the given articles. These sentences might directly de- created complementary modules that could be seamlessly scribe experimental results that provided evidence for the integrated into a system capable of assisting BioGRID cur- respective annotation, or might be summary sentences that ators. Specifically, the resulting interactive system triaged described a discovery in concise but general form. sentences from full text articles in order to identify text pas- Although the GO task asked curators to mark every occur- sages associated with protein–protein and genetic inter- rence of GO evidence text, the PPI task asked curators to actions (abbreviated PPI and GI, respectively). These find and annotate the best evidence passage. The sentences were then highlighted in the biocurator assist- BioCreative V BioC task attempted to take a more natural ant viewer (13). Nine teams, world-wide, developed one or approach by having curators mark the sentences they actu- more modules independently (13–18), integrated via BioC, ally used as the basis for each interaction annotation. For to insure the interoperability of the different systems (11). all these tasks, expert manual curation represents a Database, Vol. 2017, Article ID baw147 Page 3 of 11 laborious and time consuming activity. The corpus pre- It consists of 6409 annotated mentions of gene/protein sented here includes an important Inter-Annotator names associated with their Entrez Gene IDs, 186 anno- Agreement (IAA) component, which was not measured in tated mentions of organism/species names associated with the BioCreative II PPI task, and was reported at only 40– their NCBI Taxonomy IDs, 1867 annotated mentions of 60% for the BioCreative IV GO task. Other corpora, such protein–protein annotations, 701 annotations for descrip- as that produced for the BioCreative II.5 challenge (23), tions of their experimental methods, 856 annotated men- and the one produced for BioCreative III PPI challenge tions of GIs and 399 annotations that mark GI evidence (24) targeted the identifications of interacting pairs and ex- statements. The BioC-BioGRID corpus is a unique and perimental methods, respectively. valuable resource because: (i) it was produced by database For the BioC biocurator assistant system we aimed to curators to address their own curation needs, and (ii) it is optimize the integration of the results of individual text- the first corpus annotated for GI mentions and descriptions mining modules and display them in a curation interface of their interaction types. We believe