Howe BMC (2018) 19:110 https://doi.org/10.1186/s12859-018-2121-6

METHODOLOGY ARTICLE A statistical approach to identify, monitor, and manage incomplete curated data sets Douglas G. Howe

Abstract Background: Many biological knowledge bases gather data through expert curation of published literature. High data volume, selective partial curation, delays in access, and publication of data prior to the ability to curate it can result in incomplete curation of published data. Knowing which data sets are incomplete and how incomplete they are remains a challenge. Awareness that a data set may be incomplete is important for proper interpretation, to avoiding flawed hypothesis generation, and can justify further exploration of published literature for additional relevant data. Computational methods to assess data set completeness are needed. One such method is presented here. Results: In this work, a multivariate linear regression model was used to identify in the Information Network (ZFIN) having incomplete curated expression data sets. Starting with 36,655 gene records from ZFIN, data aggregation, cleansing, and filtering reduced the set to 9870 gene records suitable for training and testing the model to predict the number of expression experiments per gene. Feature engineering and selection identified the following predictive variables: the number of journal publications; the number of journal publications already attributed for gene expression annotation; the percent of journal publications already attributed for expression data; the gene symbol; and the number of transgenic constructs associated with each gene. Twenty-five percent of the gene records (2483 genes) were used to train the model. The remaining 7387 genes were used to test the model. One hundred and twenty-two and 165 of the 7387 tested genes were identified as missing expression annotations based on their residuals being outside the model lower or upper 95% confidence interval respectively. The model had precision of 0. 97 and recall of 0.71 at the negative 95% confidence interval and precision of 0.76 and recall of 0.73 at the positive 95%confidenceinterval. Conclusions: This method can be used to identify data sets that are incompletely curated, as demonstrated using the gene expression data set from ZFIN. This information can help both database resources and data consumers gauge when it may be useful to look further for published data to augment the existing expertly curated information. Keywords: Zebrafish, Danio rerio, Gene expression, Machine learning, Curation

Background shared and combined, increasing the impact that incom- In recent years, the biological sciences have benefited plete or incorrect data may have on downstream data immensely from new technologies and methods in both consumers. Although assessing how complete or correct biological research and computer sciences. Together a large data set may be remains a challenge, examples these advances have produced a surge of new data. have been reported. Examples include computational Biological research now relies heavily upon expertly cu- methods for identifying data updates and artifacts that rated database resources for rapid assessment of current may be of interest to downstream data consumers [1], knowledge on many topics. Management, organization, machine learning methods to identify incorrectly standardization, quality control, and crosslinking of data classified G- coupled receptors [2], and to im- are among the important tasks these resources provide. prove the quality of large data sets prior to quantitative It is commonplace today for these data to be widely structure-activity relationship modeling [3]. The completeness and quality of curated nanomaterial data has also been explored [4]. Correspondence: [email protected] The Institute of Neuroscience, University of Oregon, Eugene, OR, USA

© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Howe BMC Bioinformatics (2018) 19:110 Page 2 of 12

What does it mean for a data set to be “complete” or Incompletely curated data sets also result when new “incomplete”? Data can be incomplete in two ways: data types are published prior to database resources missing values for variables, or missing entire records having the ability to curate those data. Curation of gene which could be included in a data set. Handling missing expression data commenced at ZFIN in 2005 [7]. Curat- variable values in statistical analyses is a complex topic ing papers from earlier years, known as “back curation”, outside the scope of this article. In the context of this is something that many curation teams don’t have work, “complete” means all currently published data of a resources to support. Gene expression data published specific type is present in the data set with no mis- earlier than 2005 may only be curated at ZFIN if they sing values for any variables. In this study, data from were brought forward as part of an ongoing project or the ZFIN Database has been used to find genes that topic focused curation effort in subsequent years. have an incomplete gene expression data set, genes Why is it important to know if a data set includes all for which there exist published but not yet curated relevant published data? This knowledge can help data- gene expression data. base resources focus expert curation effort where it is There are several reasons data repositories may not in- needed. Likewise, if a researcher is aware that a data set clude all relevant published data, including high data may be missing records, they may look further for volume, selective partial curation, delays in data access, additional relevant published data to complete the data and release of data prior to the ability to curate it. High set. Having knowledge of all the published data helps to data volume can result in the need for prioritization of avoid wasted time, money, and effort repeating work the incoming data stream. For example, ZFIN is the already done by others, and also helps to avoid flawed central data repository for expertly curated genetic and hypothesis generation based on incomplete data. genomic data generated using the zebrafish (Danio rerio) In recent years, natural language processing (NLP) and as a model system [5]. One major data input to ZFIN is machine learning methods have been widely used in the the published scientific literature. A search of PubMed field of genetics and on tasks such as predic- for all zebrafish literature shows that this corpus has tion of intron/exon structure, protein binding sites, gene consistently increased in volume by 10% every year since expression, gene interactions, and gene function [8]. In 1996 resulting in a greater than 10-fold increase in the addition, have used NLP and number of publications processed by ZFIN in 2016 machine learning methods for over a decade to manage (2865 publications) compared to 1996 . Such increases and automate processing of the increasing volume of necessitate prioritization to focus effort on the data publications that must be identified, prioritized, indexed, deemed most valuable by the research community. As a and curated [9–13]. These methods are applied to the result, curation of some publications is delayed or pre- incoming literature stream, prior to curation. Machine vented all together. ZFIN currently includes curated data learning methods can also have utility after curation in from approximately 25% of the incoming literature maintaining the quality and completeness of curated within 6 months of publication. data sets. The aim of this study was to provide a statis- Data sets can also be incomplete relative to what has tical approach to identify curated data sets that may be been published if publications are curated for selective incomplete relative to what has been published. The data types. Publications that are not fully curated when ZFIN gene expression data set was the use case for this they initially enter a database may later be partially study. Researchers and data management teams alike curated during projects focusing on specific topics. For can use the output of this method to guide resource example, the gene functions were curated from all the allocation, decision making, and interpretation of data publications associated with genes involved in kidney sets with insight into whether additional data may be development [6]. In such cases, publications may get available to augment an expertly curated data set. functional data, but no other data types, curated. Delayed data access also contributes to curated data Results sets being incomplete. There is significant variation in ZFIN gene expression annotations how soon the full text of a publication may be available. Gene expression annotations in ZFIN are assembled Some journals have embargo periods which restrict pub- using a tripartite modular structure composed of: 1. The lication access to those with personal or institutional Expression Experiment; 2. The Figure number and subscriptions. Delayed access to the full text of publica- Developmental Stage; and 3. the Anatomical Structure tions slows data entry into data repositories, such as (Fig. 1). These modules are then combined to create ZFIN, which require the full text to curate. ZFIN each complete gene expression annotation. In this study, currently obtains full text for approximately 50%, 80%, the count of gene expression experiments per gene was and 90% of the zebrafish literature within 6 months, used as a key metric in the statistical model. The 1 year, and 3 years of publication respectively. primary reason for this was largely a practical Howe BMC Bioinformatics (2018) 19:110 Page 3 of 12

Fig. 1 The structure of a gene expression annotation at ZFIN. Gene expression annotations at ZFIN are built up from three primary groupings of data: The expression experiment, the figure/developmental stage, and the anatomical structure. Each group of data contains specific pieces of information about the observed expression pattern consideration. Due to the way the data are structured, gene and number of expression experiments per gene the simplest approach was to count expression experi- suggested that these variables could be the foundation of ments per gene rather than to get a full count of a linear model to predict how many expression experi- complete gene expression annotations per gene. ments a gene should have. The higher correlation coeffi- cient observed between expression experiments and Descriptive analysis journal publications curated for expression data (Fig. 3b) Three data files were combined to build the predictive rather than total journal publications per gene (Fig. 3a) model. The MachineLearningReport.txt file included one reveals that there are journal publications associated row of data for each of the 36,655 gene records found in with genes for reasons other than gene expression, and ZFIN at the time the file was generated. The GenePubli- that those are reducing the accuracy of the linear regres- cation.txt file included 125,871 records describing which sion. For example, the locations of the tp53 and mitfa, publications are attributed to which genes and what type genes in Fig. 3a indicate that those genes have few of publications they are. ZFIN includes many publication expression experiments associated with them relative to types, but only journal publications were included in this the number of journal publications with which they are study because they are the source of the gene expression associated. Accounting for the additional reasons for as- annotations being modeled. The ConstructComponents. sociating a publication with a gene would strengthen the txt file included 12,738 records describing transgenic linear regression model. Additional variables were tested constructs and their components. Each construct that that could account for publications being associated with was related to a gene in this file was counted towards genes, including the number of annota- the number of constructs associated with a gene. There tions and their associated journal publications, the were 76 genes which had nine or more associated con- number of annotations and their associated structs, greater than 1.5 times the interquartile range for journal publications, and the number of transgenic con- this data set. Summary statistics for these files are shown structs each gene was associated with. The model was in Fig. 2. trained and tested including these data, then the Azure Machine Learning (AML) Permutation Feature Import- Feature selection ance module was used to evaluate their predictive value. Strong linear correlations were observed between the Neither the phenotype annotation data nor the Gene number of expression experiments per gene and the Ontology annotation data provided any value towards number of journal publications per gene (R2 = 0.82) or prediction of the number of gene expression experi- the number of journal publications curated for ments per gene, so these were dropped from the model. expression data per gene (R2 = 0.98; Fig. 3a and b The number of transgenic constructs and the gene respectively). symbol did have modest predictive value, so these were Of the 36,655 total gene records in this study, 12,851 left in the model. The predictive value of the construct had at least 1 expression experiment and averaged 6.8 count is attributable to the 779 genes associated with (Std. Dev. = 32) expression experiments. These strong one or more transgenic constructs, the maximum being linear relationships between journal publications per 330 constructs associated with the gene hsp70l. The Howe BMC Bioinformatics (2018) 19:110 Page 4 of 12

Fig. 2 Descriptive statistics for the data used in model training and testing. Descriptive statistics are shown for the three data files used as input for training and testing the model: GenePublication.txt (top), ConstructComponents.txt (middle), and MachineLearningReport.txt (bottom) promoter of this gene has been used for nearly 20 years AML Permutation Feature Importance module is shown to drive inducible expression of transgenes with heat in Table 1. shock [14]. The addition of a feature for associated con- struct count per gene would make the model more ac- Regression modeling curate for those 779 genes to which this situation was Once the model variables were established, the input applicable and hence improve model performance over- data set was split 25%/75% for model training and all. The final list of features included in the model and testing respectively of a linear regression model using their relative predictive value score as reported by the theAzureMachineLearningStudio.Themodelwas Howe BMC Bioinformatics (2018) 19:110 Page 5 of 12

Fig. 3 Correlation between the number of expression experiments and Publications. a The correlation between the number of expression experiments and the total number of journal publications per gene. b The correlation between the number of expression experiments and the number of journal publications having curated expression experiments per gene

trained using the count of expression experiments as and f(x) is the model predicted number of expression the label and set up to minimize the Root Mean experiments per gene. A scatter plot of y vs. f(x) pro- Square Error (RMSE). The result of the predictive duces a strong linear correlation (R2 = 0.95; Fig. 4a). model run on the test data set was examined using The frequency histogram of these residuals reveals a the AML Evaluate Model module, which reported single mode centered very close to 0, suggesting that model performance results shown in Table 2.The the major variables for predicting expression high coefficient of determination (> 0.95) indicated experiment number per gene had been accounted for that the model was a good predictor of the number in this model (Fig. 4b). Genes with residuals that fell of expression experiments per gene. outside the 95% confidence interval (CI) of the model, calculated as two times the RMSE, were Residual analysis predicted to be missing published expression The purpose for making this model was to locate genes annotations. Of the 7387 genes in the test set 122 in the ZFIN database that have incompletely curated and 165 genes had negative or positive residuals gene expression data sets relative to what has been pub- respectively that were greater than twice the RMSE of lished. The RMSE of the model output when run on the the model and thus outside the model 95% test data set was 3.368747 (Table 2). confidence interval. Those genes were identified by Residuals were calculated as y-f(x) where y is the this method as being significantly unlikely to have the actual number of expression experiments per gene complete set of gene expression experiments found in their associated journal publications. Table 1 Variables selected for model training and their predictive value score. Table 2 Results of model testing Variable Score Measure Value Journal publications with gene expression data 18.196293 Mean Absolute Error 1.582637 Percent of journal publications with expression data 0.043752 Root Mean Squared Error 3.368747 Gene symbol 0.000215 Relative Absolute Error 0.233714 Construct count − 0.000447 Relative Squared Error 0.047532 Total journal publication count −0.000692 Coefficient of Determination 0.952468 Howe BMC Bioinformatics (2018) 19:110 Page 6 of 12

Fig. 4 Actual vs. predicted expression experiment count per gene. a) Model predicted expression experiment count for the test data set was plotted against the actual expression experiment count per gene. A strong linear correlation was observed (R2 = 0.95), indicating that the model was accurate at predicting the number of expression experiments per gene. b) A histogram of expression experiment count residuals (actual number – predicted number) showed a single mode centered close to 0. Green and red bars are counts of genes inside or outside the 95% confidence interval respectively

Model validation type, and timing of curation projects involving older lit- The model predictions were tested by manually examin- erature; the number of other data types being curated ing journal publications associated with randomly concurrently; the volume of literature being managed; selected genes from inside and outside the upper and how curation priorities were set, etc. There are too many lower 95% CI. One hundred genes having negative resid- variables to draw any conclusions from this, other than uals were evaluated on each side of the lower 95% CI. that starting testing with the earliest publications was Fifty genes having positive residuals were evaluated on unlikely to have accelerated the testing process in this each side of the upper 95% CI. For each gene, publica- case. Genes were scored as having unannotated expres- tions that had not already been curated for gene expres- sion data as soon as one journal publication was found sion data were examined in chronological order starting with unannotated expression data for that gene. Genes with the earliest publication. The rationale for starting were scored as having a complete expression data set if testing with the earliest publications was that publica- all journal publications associated with the gene were tions from before 2005, when ZFIN started curating examined and uncurated expression data for that gene gene expression data, could be more likely to contain was not found. Each of the manually validated genes uncurated gene expression data. If that proved true, having a negative residual or positive residual was plot- starting testing with the earliest publications could ac- ted on a scatter plot of actual expression experiment celerate testing by identifying uncurated expression data count versus the number of expression experiments pre- in the older publications early in the testing process. dicted to be missing (the residual) (Fig. 5b and d respect- After completing data validation, the count of genes ha- ively). A line was drawn across the charts at 6.73 ving their earliest uncurated gene expression data per predicted missing expression experiments, which was year was plotted. The data did not support the premise the 95% confidence interval (2× RMSE) for the model. that earlier papers would be more likely to have uncu- That line was used to separate genes predicted to be rated expression data. Instead, a relatively random dis- missing expression experiments (above the line) from tribution was observed of earliest publication dates for those predicted to not be missing expression experi- the papers that had unannotated gene expression data ments (below the line). Red dots and green dots indicate (Additional file 1: Figure S1). These data are a complex genes that were or were not found to be missing gene function of multiple variables which change over time, expression annotations respectively during the valid- including how many curators were working; the number, ation. For genes having negative residuals, the model Howe BMC Bioinformatics (2018) 19:110 Page 7 of 12

Fig. 5 Quantification of model results. A confusion matrix of results from manual evaluation of model predictions for genes having negative (a) or positive (c) residuals. Columns show model predictions, rows show actual data status after manual validation. The actual expression experiment count was plotted against the predicted number of missing expression experiments per gene for each of the manually validated genes around the lower (b) and upper (d) 95% CIs. The horizontal line indicates the 95% confidence interval set at two times the root mean square error of the model. Genes above that line are predicted to be missing gene expression annotation, while genes below the line are predicted to not be missing gene expression annotations. Red dots are genes that were confirmed to be missing gene expression annotations. Green dots are genes that were confirmed to not be missing gene expression annotations identified genes with published, but unannotated, ex- of resource utilization, delayed data access, or publica- pression data with a precision of 0.97 and recall of 0.71 tion of data that pre-dates its ability to be stored in a (Fig. 5a). For genes having positive residuals, the model knowledge base. Efficient methods for identifying areas had a precision of 0.76 and recall of 0.73 (Fig. 5c). It is where data have been published but not yet curated are clear from this result that this method had high preci- important for curators of data resources and users of sion for finding genes in ZFIN that had published but those data resources alike. In this manuscript, the ZFIN yet to be curated gene expression data. This result also gene expression data set was used as a test case to de- showed that the more expression experiments a gene velop such a method. This method should be broadly had, the more likely it was to also be missing expression applicable to any data set of sufficient size, as long as data. Additionally, there was a trend indicating that the the proper predictive features can be identified. In the higher the volume of existing expression experiments, case of the ZFIN gene expression data, which has been the higher the number of predicted missing expression captured from published literature by expert curators experiments was (Fig. 5b and d). since 2005, the number of journal publications associ- ated with a gene was an extremely good predictor of Discussion how many gene expression experiments a gene should Expertly curated resources contain have. This resulted in a simple linear model comprised highly accurate data [15]. Sometimes accuracy comes at of five variables. When the model was initially tested, the expense of being comprehensive due to prioritization genes associated with transgenic constructs were being Howe BMC Bioinformatics (2018) 19:110 Page 8 of 12

reported with high significance as missing gene expres- as it is not always possible to know which variables, vari- sion data, when in fact they were not. In some cases, able derivatives, or variable interactions may be useful. genes associated with transgenic constructs had many In some cases data transformation, normalization, or dozens of publications associated with them which had computed values, such as the number of days since a no gene expression data for that gene. Perhaps the pro- record was last edited, may hold predictive value. There moter of the gene was used in the construct for example, are many feature engineering techniques outside the as is the case for the hsp70l gene. If that transgenic line scope of this paper which may be helpful in preparing was widely published, many publications ended up being data from other data sets. Once the data are assembled, associated with the gene because of the construct, even the model training and testing process provides mea- if there were no gene expression data for that gene in sures of model accuracy and predictive power of each those publications. This led to the identification of the variable. The best model will typically be the simplest number of transgenic constructs per gene as an import- model that still accurately represents the data. Variables ant variable in the model for those specific genes that that exhibit low predictive power should be dropped and were associated with constructs. the model re-trained and tested. This feature engineer- The process used here to identify incomplete curated ing, training and testing process is repeated iteraitively gene expression data sets at ZFIN should be until model performance is acceptable and further re- generalizable to other data sets, as summarized in Fig. 6. moval of variables degrades model performance. A linear One key to extending this method to other data types is model was a good fit for the data examined in this work, feature engineering, the use of domain knowledge to but other data sets may be better fit using other types of identify and craft variables having predictive value models. The residual histogram provides one way to towards the desired variable. To find those predictive evaluate whether significant variables remain un- variables, data exploration and domain knowledge must accounted for. Models with adequate variable represen- first be applied to create a list of variables that may have tation have a bell shaped residual histogram centered predictive power. It is good to be inclusive at this point around zero. If the histogram shape has “shoulders” or appears multi-modal, this is an indication that one or more variables are yet to be accounted for in the model, suggesting that the further feature selection and engi- neering could improve the model. At ZFIN, every incoming zebrafish publication is asso- ciated with the genes discussed, even though not all the publications are fully curated. Hence, in ZFIN, the complete available literature across all genes is well rep- resented, and thus the volume of published literature about a gene has a positive correlation with the amount of published data which exists for a gene. However, not all biological knowledgebases gather data using the same strategies. The method described here may not work as well for datasets that have more heterogenous represen- tation of the published literature or other key variable. For example, a database which is populated with data by searching the literature for information about a specific record (gene, protein, etc.) may have deep representation of existing literature on the subset of records which have been researched and shallow representation of existing literature on other records. Heterogeneity of literature coverage of this type would detract from the predictive value of pure literature counts as were used for the ZFIN example. In such cases, other types of predictive variables would need to be identified through data ex- ploration and feature engineering. These may include things such as the number of days since the last record update, number of data types associated, the presence of publications in specific journals, and presence of other Fig. 6 Generalized view of the method to find incomplete data sets potentially correlated data types. In some cases, it may Howe BMC Bioinformatics (2018) 19:110 Page 9 of 12

be helpful to bring in additional data from external approach to monitor data set completeness. It is sources that can be linked to the data being examined. concluded that this method could be used to identify in- For example, UniProt records may not be associated complete data sets of any type curated from published with the complete literature about a protein or the literature, assuming proper predictive variables can be associated gene. UniProt data for zebrafish identified to build an accurate model. could be combined with the literature set from ZFIN for each related gene. This may increase the predictive value Methods of the count of publications for identifying protein re- Method overview cords in UniProt that are missing a piece of data of The method described here uses three data sets from interest. Creative variable engineering will always be a the Zebrafish Information Network as input to a linear critical step in successful application of this method. regression model to predict the number of gene expres- The method described here produces a binary classifi- sion experiments per gene. Figure 7 provides a flow cation of genes that are predicted to be or not to be chart of the steps taken from data input through model missing expression data based on the residual values output. being inside or outside the 95% CI of the model. A binary classification model makes sense for this problem. Data files Unlike a binary classification, regression models result in Three data files were combined to build the predictive a real number prediction of the label, in this case the model. All three are provided as supplementary files to number of gene expression experiments per gene. The this manuscript. The MachineLearningReport.txt file regression model has the added possibility of providing a (Additional file 2) is a custom report consisting of one quantitative metric whose magnitude may correlate with row per gene in the ZFIN database, generated on Nov. the level of incompleteness of the data set. Confirmation 29, 2016. Data columns included the ZDB-GENE ID, of that possibility will require significant effort which gene symbol, gene name, count of gene expression ex- should be the subject of future work. periments, count of journal publications attributed for This method can provide curators with a list of genes gene expression annotations, count of Gene Ontology having published gene expression data that is yet to be annotations, and count of journal publications attributed curated. Therefore, the high precision outcome is im- for Gene Ontology annotations. The columns related to portant as it ensures that curators spend time reviewing the Gene Ontology had no value for predicting the publications for genes that are missing data. The model number of gene expression experiments, so they were resulted in a recall/sensitivity of 0.71 and 0.73 at the excluded from further analysis. lower and upper 95% CI, meaning 71% and 73% of the The GenePublication.txt (Additional file 3) and genes that were confirmed to be missing gene expression ConstructComponents.txt (Additional file 4) files are data were identified. From the perspective of a data cur- generated daily at ZFIN and made available via the ZFIN ator, modest recall is acceptable for this method because downloads page (https://zfin.org/downloads). The Gene- subsequent rounds of model training and testing could Publication.txt file was obtained on Nov. 30, 2016. The be executed to iteratively refine and complete the data columns were gene symbol, ZDB-GENE ID, ZDB-PUB set. Genes that were not identified as missing data in the ID, publication type, and PubMed ID when available. initial round of training and testing would eventually be The ConstructComponents.txt file was obtained on Dec. identified in subsequent cycles of training, testing, and 19, 2016 and included columns for the ZFIN construct data updating. From the perspective of a data consumer, ID, construct name, construct type, related gene ZDB- it would be beneficial to correctly identify as many genes GENE ID, related gene symbol, related gene type, a rela- as possible which have incomplete gene expression data tionship between the gene and the construct, and two sets. If future work finds that the magnitude of the ontology term IDs from the sequence ontology [16]to residuals correlates well with the amount of missing ex- specify the type of construct and the type of related pression data, then the residual itself could be provided marker. For this study, the only data used was a count of to downstream data consumers as a metric of data set constructs related to each gene, which was computed completeness for every gene included in the test set. from the ConstructComponents.txt file. Machine learning methods are having significant im- pact upon many areas of our experience as . As Data preparation and modeling the field of data science has matured, these methods Manipulations of input data files, feature selection and have become powerful tools for analysis, interpretation, engineering, model building, training, evaluation, model and utilization of the increasingly large and interrelated selection, and final model scoring were all done using data sets available today including numeric, free text, modules provided in Microsoft Azure Machine Learning and image data. This work provides a machine learning Studio (https://studio.azureml.net) using a free Howe BMC Bioinformatics (2018) 19:110 Page 10 of 12

Fig. 7 A summary of the method used to model the number of expression experiments per gene in the ZFIN gene expression data set

workspace level account. Features per gene used to train symbols in Fig. 3. The resulting gene set used as in- and test the linear regression model included the gene put for model training and testing included 9870 symbol, the number of journal publications attributed genes. Any null numeric values generated in the data for gene expression, the number of gene expression ex- during file joining were set to 0 using the AML Clean periments (the label), total number of journal publica- Missing Data module, and no duplicate rows were tions, the percentage of journal publications with present. A stratified split keyed on the expression ex- curated expression data, and the number of transgenic periment count was used in the Split Data module to constructs associated with each gene. select 25% of the genes (2483 genes) for training the The set of all gene records in the ZFIN database model and 75% (7387 genes) for scoring the model. (36,655 genes as of Nov. 29, 2016) was filtered to ex- The Linear Regression, Train Model, and Score Model clude genes that were unlikely to be useful in this modules were used to train and score the model. The analysis including withdrawn genes, microRNA genes, Linear Regression module used the following parame- genes with a colon in the name (typically not yet ters: Solution method: ordinary least squares; L2 studied), genes with symbols starting with “unm_” regularization weight: 10; Include intercept term: un- (typically not yet studied), and genes with no associ- checked; Allow unknown categorical levels: checked; ated journal publications as determined by data from Random number seed: 112. Model performance was the GenePublication.txt file. Genes with more than assessed using the Azure Machine Learning Evaluate 200 existing expression experiments were also ex- Model module. The trained model was used to pre- cluded because they are already heavily annotated for dict the number of expression experiments for the gene expression, many were found to be anatomical 7387 genes that were not used in model training. The marker genes of less interest for the purposes of this resulting prediction was appended as a new column work (eg. egr2b), and their heavy annotation may give to the input data set. them undesirable leverage that could negatively affect model performance for genes of interest which may Analysis and data visualizations have few annotations. Those excluded genes having Model results, including the input data plus the pre- more than 200 expression experiments have red dicted number of expression experiments, for the 7387 Howe BMC Bioinformatics (2018) 19:110 Page 11 of 12

genes were exported from Azure Machine Learning Additional files Studio as a tab delimited file and imported into Microsoft Excel for Mac v16.27 for data validation and Additional file 1: Figure S1. The count of genes having their earliest analyses (Additional file 5). Residuals were calculated as uncurated gene expression data published in each year. (PNG 7782 kb) the actual expression experiment count minus the num- Additional file 2: The MachineLearningReport.txt file. (TXT 2317 kb) ber of expression experiments predicted by the model. Additional file 3: The GenePublication.txt file. (TXT 22006 kb) The 95% confidence interval of the model, computed as Additional file 4: The ConstructComponents.txt file. (TXT 1807 kb) Additional file 5: The Excel file which includes model results and data 2 times the root mean squared error (RMSE), was used analysis results. (XLSX 8346 kb) to establish significance of the residuals. Genes with re- siduals outside or inside the 95% confidence interval Acknowledgements were then considered as being predicted to be missing I would like to thank Monte Westerfield for his support of this work, Sierra or not missing expression annotation respectively. One Taylor Moxon for generating the MachineLearningReport.txt file, and the rest hundred genes inside and outside the negative 95%CI of the ZFIN team for their consistent dedication to producing and maintaining a high-quality resource at ZFIN. were randomly selected for manual testing by sorting the genes in the Excel spread sheet based on a randomly Funding generated number column and copying the first genes This work was supported by grant U41 HG002659 from the National Human Genome Research Institute of the National Institutes of Health (https:// from each set into a new Excel sheet. To remain blinded www.nih.gov). The funders had no role in study design, data collection and during the evaluation step, those genes were randomized analysis, decision to publish, or preparation of the manuscript. again as a set by sorting based on a randomly generated number column. That gene selection process was re- Availability of data and materials The datasets supporting the conclusions of this article are included within peated for 50 genes inside and outside the positive 95% the article and its additional files. CI. A manual evaluation was then done for each jour- nal publication not already curated for expression Authors’ contributions DH created the study design, and conducted the data collection, analysis, data that was associated with each of the selected validation, and writing of the manuscript. The author read and approved the genes. The publications for each gene were sorted final manuscript. oldest to newest based on publication date and were then evaluated in order, starting with the oldest Ethics approval and consent to participate Not applicable. publications. Publication assessment for each gene continued until either all the publications were exam- Consent for publication ined for a gene or a publication with missing expres- Not applicable. sion data for that gene was identified, whichever Competing interests came first. The result was recorded along with the The author declares that he has no competing interests. assessment date and the ZDB-PUB ID for the publication that was missing the expression data, if one was found. Received: 5 July 2017 Accepted: 21 March 2018 The results of this data validation was used to produce a confusion matrix describing model precision and recall References around the upper or lower 95% CI. 1. Alqasab M, Suzanne M, Embury S, FMS d. Amplifying data curation efforts to improve the quality of life science data. Int. J. Data Curation. 2017;12:1–12. Publication records in ZFIN each have a unique 2. Shkurin A. Vellido a. Using random forests for assistance in the curation ZDB-PUB ID, for example ZDB-PUB-161203-17. The of G-protein coupled receptor databases. Biomed. Eng. Online. England. first six digits indicate the date the record was cre- 2017;16:75. 3. Mansouri K, Grulke CM, Richard AM, Judson RS, Williams AJ. An automated ated in YYMMDD format. Those data were parsed curation procedure for addressing chemical errors and inconsistencies in out of the list of IDs for publications that were re- public datasets used in QSAR modelling. SAR QSAR Environ Res England. corded as containing uncurated gene expression data. 2016;27:939–65. 4. Marchese Robinson RL, Lynch I, Peijnenburg W, Rumble J, Klaessig F, The year component was then used to group those Marquardt C, et al. How should the completeness and quality of curated data to get a count of the number of genes per year nanomaterial data be evaluated? Nanoscale. England. 2016;8:9919–43. that were found to have uncurated gene expression 5. Howe DG, Bradford YM, Eagle A, Fashena D, Frazer K, Kalita P, et al. The Zebrafish Model Organism Database: new support for human disease data. Even though it was the year of publication entry models, mutation details, gene expression and searching. into ZFIN that was being counted, only the first Nucleic Acids Res. 2017;45:D758–68. http://www.ncbi.nlm.nih.gov/pubmed/ paper encountered with uncurated expression data 27899582. 6. Alam-Faruque Y, Hill DP, Dimmer EC, Harris MA, Foulger RE, Tweedie S, et al. was recorded per gene, so the count is equal to the Representing kidney development using the gene ontology. PLoS One. number of genes in the sample having uncurated 2014;9:e99864. https://www.ncbi.nlm.nih.gov/pubmed/24941002. expression data from each year. 7. Ruzicka L, Bradford YM, Frazer K, Howe DG, Paddock H, Ramachandran S, et al. ZFIN, The zebrafish model organism database: Updates and new Data visualizations were created using both Excel and directions. Genesis. 2015;53:498–509. http://www.ncbi.nlm.nih.gov/pubmed/ Tableau Desktop Professional Edition v10.1.4. 26097180. Howe BMC Bioinformatics (2018) 19:110 Page 12 of 12

8. Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nat. Rev. genet. 2015;16:321–332. http://www.ncbi.nlm.nih.gov/ pubmed/25948244. 9. Müller H-M, Kenny EE, Sternberg PW. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS biol. 2004;2:e309. http://www.ncbi.nlm.nih.gov/pubmed/15383839. 10. Chen D, Müller H-M, Sternberg PW. Automatic document classification of biological literature. BMC bioinformatics. 2006;7:370. http://www.ncbi.nlm. nih.gov/pubmed/16893465. 11. Van Auken K, Fey P, Berardini TZ, Dodson R, Cooper L, Li D, et al. in the biocuration workflow: applications for literature curation at WormBase, dictyBase and TAIR. Database (Oxford). 2012;2012:bas040. http:// www.ncbi.nlm.nih.gov/pubmed/23160413. 12. Fang R, Schindelman G, Van Auken K, Fernandes J, Chen W, Wang X, et al. Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics. 2012;13:16. http://www.ncbi.nlm. nih.gov/pubmed/22280404. 13. Jiang X, Ringwald M, Blake J, Shatkay H. Effective biomedical document classification for identifying publications relevant to the mouse gene expression database (GXD). Database (Oxford). 2017;2017. http://www.ncbi. nlm.nih.gov/pubmed/28365740. 14. Adám A, Bártfai R, Lele Z, Krone PH, Orbán L. Heat-inducible expression of a reporter gene detected by transient assay in zebrafish. Exp. cell res. 2000; 256:282–290. http://www.ncbi.nlm.nih.gov/pubmed/10739675. 15. Keseler IM, Skrzypek M, Weerasinghe D, Chen AY, Fulcher C, Li G-W, et al. Curation accuracy of model organism databases. Database (Oxford). 2014; 2014. http://www.ncbi.nlm.nih.gov/pubmed/24923819. 16. Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, et al. The sequence ontology: a tool for the unification of genome annotations. Genome Biol. 2005;6:R44.

Submit your next manuscript to BioMed Central and we will help you at every step:

• We accept pre-submission inquiries • Our selector tool helps you to find the most relevant journal • We provide round the clock customer support • Convenient online submission • Thorough peer review • Inclusion in PubMed and all major indexing services • Maximum visibility for your research

Submit your manuscript at www.biomedcentral.com/submit