View of the Method to Find Incomplete Data Sets Potentially Correlated Data Types

Howe BMC Bioinformatics (2018) 19:110 https://doi.org/10.1186/s12859-018-2121-6 METHODOLOGY ARTICLE Open Access A statistical approach to identify, monitor, and manage incomplete curated data sets Douglas G. Howe Abstract Background: Many biological knowledge bases gather data through expert curation of published literature. High data volume, selective partial curation, delays in access, and publication of data prior to the ability to curate it can result in incomplete curation of published data. Knowing which data sets are incomplete and how incomplete they are remains a challenge. Awareness that a data set may be incomplete is important for proper interpretation, to avoiding flawed hypothesis generation, and can justify further exploration of published literature for additional relevant data. Computational methods to assess data set completeness are needed. One such method is presented here. Results: In this work, a multivariate linear regression model was used to identify genes in the Zebrafish Information Network (ZFIN) Database having incomplete curated gene expression data sets. Starting with 36,655 gene records from ZFIN, data aggregation, cleansing, and filtering reduced the set to 9870 gene records suitable for training and testing the model to predict the number of expression experiments per gene. Feature engineering and selection identified the following predictive variables: the number of journal publications; the number of journal publications already attributed for gene expression annotation; the percent of journal publications already attributed for expression data; the gene symbol; and the number of transgenic constructs associated with each gene. Twenty-five percent of the gene records (2483 genes) were used to train the model. The remaining 7387 genes were used to test the model. One hundred and twenty-two and 165 of the 7387 tested genes were identified as missing expression annotations based on their residuals being outside the model lower or upper 95% confidence interval respectively. The model had precision of 0. 97 and recall of 0.71 at the negative 95% confidence interval and precision of 0.76 and recall of 0.73 at the positive 95%confidenceinterval. Conclusions: This method can be used to identify data sets that are incompletely curated, as demonstrated using the gene expression data set from ZFIN. This information can help both database resources and data consumers gauge when it may be useful to look further for published data to augment the existing expertly curated information. Keywords: Zebrafish, Danio rerio, Gene expression, Machine learning, Curation Background shared and combined, increasing the impact that incom- In recent years, the biological sciences have benefited plete or incorrect data may have on downstream data immensely from new technologies and methods in both consumers. Although assessing how complete or correct biological research and computer sciences. Together a large data set may be remains a challenge, examples these advances have produced a surge of new data. have been reported. Examples include computational Biological research now relies heavily upon expertly cu- methods for identifying data updates and artifacts that rated database resources for rapid assessment of current may be of interest to downstream data consumers [1], knowledge on many topics. Management, organization, machine learning methods to identify incorrectly standardization, quality control, and crosslinking of data classified G-protein coupled receptors [2], and to im- are among the important tasks these resources provide. prove the quality of large data sets prior to quantitative It is commonplace today for these data to be widely structure-activity relationship modeling [3]. The completeness and quality of curated nanomaterial data has also been explored [4]. Correspondence: [email protected] The Institute of Neuroscience, University of Oregon, Eugene, OR, USA © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Howe BMC Bioinformatics (2018) 19:110 Page 2 of 12 What does it mean for a data set to be “complete” or Incompletely curated data sets also result when new “incomplete”? Data can be incomplete in two ways: data types are published prior to database resources missing values for variables, or missing entire records having the ability to curate those data. Curation of gene which could be included in a data set. Handling missing expression data commenced at ZFIN in 2005 [7]. Curat- variable values in statistical analyses is a complex topic ing papers from earlier years, known as “back curation”, outside the scope of this article. In the context of this is something that many curation teams don’t have work, “complete” means all currently published data of a resources to support. Gene expression data published specific type is present in the data set with no mis- earlier than 2005 may only be curated at ZFIN if they sing values for any variables. In this study, data from were brought forward as part of an ongoing project or the ZFIN Database has been used to find genes that topic focused curation effort in subsequent years. have an incomplete gene expression data set, genes Why is it important to know if a data set includes all for which there exist published but not yet curated relevant published data? This knowledge can help data- gene expression data. base resources focus expert curation effort where it is There are several reasons data repositories may not in- needed. Likewise, if a researcher is aware that a data set clude all relevant published data, including high data may be missing records, they may look further for volume, selective partial curation, delays in data access, additional relevant published data to complete the data and release of data prior to the ability to curate it. High set. Having knowledge of all the published data helps to data volume can result in the need for prioritization of avoid wasted time, money, and effort repeating work the incoming data stream. For example, ZFIN is the already done by others, and also helps to avoid flawed central data repository for expertly curated genetic and hypothesis generation based on incomplete data. genomic data generated using the zebrafish (Danio rerio) In recent years, natural language processing (NLP) and as a model system [5]. One major data input to ZFIN is machine learning methods have been widely used in the the published scientific literature. A search of PubMed field of genetics and genomics on tasks such as predic- for all zebrafish literature shows that this corpus has tion of intron/exon structure, protein binding sites, gene consistently increased in volume by 10% every year since expression, gene interactions, and gene function [8]. In 1996 resulting in a greater than 10-fold increase in the addition, model organism databases have used NLP and number of publications processed by ZFIN in 2016 machine learning methods for over a decade to manage (2865 publications) compared to 1996 . Such increases and automate processing of the increasing volume of necessitate prioritization to focus effort on the data publications that must be identified, prioritized, indexed, deemed most valuable by the research community. As a and curated [9–13]. These methods are applied to the result, curation of some publications is delayed or pre- incoming literature stream, prior to curation. Machine vented all together. ZFIN currently includes curated data learning methods can also have utility after curation in from approximately 25% of the incoming literature maintaining the quality and completeness of curated within 6 months of publication. data sets. The aim of this study was to provide a statis- Data sets can also be incomplete relative to what has tical approach to identify curated data sets that may be been published if publications are curated for selective incomplete relative to what has been published. The data types. Publications that are not fully curated when ZFIN gene expression data set was the use case for this they initially enter a database may later be partially study. Researchers and data management teams alike curated during projects focusing on specific topics. For can use the output of this method to guide resource example, the gene functions were curated from all the allocation, decision making, and interpretation of data publications associated with genes involved in kidney sets with insight into whether additional data may be development [6]. In such cases, publications may get available to augment an expertly curated data set. functional data, but no other data types, curated. Delayed data access also contributes to curated data Results sets being incomplete. There is significant variation in ZFIN gene expression annotations how soon the full text of a publication may be available. Gene expression annotations in ZFIN are assembled Some journals have embargo periods which restrict pub- using a tripartite modular structure composed of: 1. The lication access to those with personal or institutional Expression Experiment; 2. The Figure number and subscriptions. Delayed access to the full text of publica- Developmental Stage; and 3. the Anatomical Structure tions slows data entry into data repositories, such as (Fig. 1). These modules are then combined to create ZFIN, which require the full text to curate. ZFIN each complete gene expression annotation. In this study, currently obtains full text for approximately 50%, 80%, the count of gene expression experiments per gene was and 90% of the zebrafish literature within 6 months, used as a key metric in the statistical model. The 1 year, and 3 years of publication respectively.

View of the Method to Find Incomplete Data Sets Potentially Correlated Data Types

Original Article Text Mining in the Biocuration Workflow: Applications for Literature Curation at Wormbase, Dictybase and TAIR

Biocuration 2016 - Posters

Biocuration - Mapping Resources and Needs [Version 2; Peer Review: 2 Approved]

Improving the Gene Ontology Resource to Facilitate More Informative Analysis and Interpretation of Alzheimer’S Disease Data

Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases

ISB Newsletter

Ontology: Tool for Broad Spectrum Knowledge Integration Barry Smith

Integration of Proteomics Data Into Uniprotkb

Biocuration at the Saccharomyces Genome Database

Perspective on Literature Curation

OM2017 Camerafinal

Building Curation Filters for BCO and Adapting the BCO Framework for Biocuration