Building Curation Filters for BCO and Adapting the BCO Framework for Biocuration

by Ngozi Onyegbula

M.B.B.S. in Medicine and Surgery, July 1995, Nnamdi Azikiwe University

A Thesis submitted to

The Faculty of The Columbian College of Arts and Sciences of The George Washington University in partial fulfillment of the requirements for the degree of Master of Science

August 31, 2020

Thesis directed by

Jonathon Keeney, Assistant Research Professor of Biochemistry and Molecular Medicine

© Copyright 2020 by Ngozi Onyegbula. All rights reserved.

Dedication

This work is dedicated to my dear husband, Dr. Festus Onyegbula; my lovely children, Uzoma Onyegbula, Ugonna Onyegbula, and Chigozie Onyegbula; my amazing mother, Rose Nwadiugwu; my soul sister, Amaka Nwadiugwu; and my precious nephew, Ebube Onokpite. I thank God for all of you and for your unwavering, unconditional love and support throughout the duration of my program.

Acknowledgments

I wish to acknowledge the following people for their unconditional support:

• Jonathon Keeney, PhD, my thesis adviser, always available to guide me.

• Anelia Horvath, PhD, my reader, who was there for me despite my late request.

• Jack Vanderhoek, PhD, my program director.

• Raja Mazumder, PhD, my course director.

• Hayley Dingerdissen, for your advice.

• Amanda Bell, special thanks for all the support and encouragement.

• The BioCompute Lab team, especially Janisha Patel, for your advice.

• The HIVE Lab team.

Abstract

Building Curation Filters for BCO and Adapting the BCO Framework for Biocuration

Biomedical research is becoming more interdisciplinary, and data sets are becoming increasingly large and complex. Many current research methods (omics-type analyses) generate large quantities of data, and the internet generally makes access easy. However, ensuring quality and standardization of data usually requires more effort. BioCompute is a standardized method of communicating bioinformatic workflow and analysis pipeline information. An instance of the BioCompute standard is known as a BioCompute Object (BCO). Biocuration involves adding value to biomedical data by the processes of standardization, quality control, and information transfer (also known as data annotation). It enhances data interoperability and consistency, and is critical in translating biomedical data into scientific discovery. This research work is focused on creating curation filters for BCOs and adapting the BCO framework for biocuration. This study examines the feasibility of creating curation filters, which will help to ensure that BCOs are of high quality. Also, since the process of biocuration can be very detailed and highly complex, with multiple steps, adapting a BCO framework for biocuration will ensure standardization of the biocuration process, making BioCompute a strong fit for biocuration. To carry out this research, multiple databases and knowledgebases that undergo curation were studied in detail, including Bgee, UniProt, and SGD, among others. Finally, based on the knowledge gathered from the process of biocuration, filters and steps for curation were applied to some existing BCOs. In this process, already-built BCOs in the OncoMX and BioCompute databases were curated, with the aim of producing high-quality BCOs. Based on compliance with a set of criteria and completeness of the domains, the BCOs were ranked into three levels: gold, silver, and bronze. The results showed that none of the curated BCOs met the gold standard: the OncoMX BCOs were at the bronze level, while those in the BCO database were at the silver level. Suggestions were also made on ways to adapt the current BCO framework to biocuration. However, more work needs to be done to create a BCO that is specific to biocuration.

Table of Contents

Dedication
Acknowledgments
Abstract of Thesis
Table of Contents
List of Figures
List of Tables
List of Abbreviations
Chapter 1: Introduction
Chapter 2: Background Information and Review of Knowledgebases
Chapter 3: Methods
Chapter 4: Data and Results
Chapter 5: Discussion, Conclusion, and Recommendations
References

List of Figures

Figure 1: BCO Domains
Figure 2: Curation Process Employed

List of Tables

Table 1: Table of Some Visited Curated Resources in Life Sciences
Table 2: Table of Criteria for the 3 Tiers of BCO
Table 3: Summarized Table of Results of BCO DB Curation
Table 4: Summarized Table of Results of OncoMX BCO Curation

List of Abbreviations

BCO – BioCompute Object
EMBL-EBI – European Molecular Biology Laboratory-European Bioinformatics Institute
etc. – et cetera
etag – Entity Tag
FAIR – Findable, Accessible, Interoperable, and Reusable
FDA – Food and Drug Administration
GWU – George Washington University
HTS – High-Throughput Sequencing
IEEE – Institute of Electrical and Electronics Engineers
JSON – JavaScript Object Notation
MGI – Mouse Genome Database
MPS – Massively Parallel Sequencing
NGS – Next-Generation Sequencing
RGD – Rat Genome Database
SGD – Saccharomyces Genome Database
uri – Uniform Resource Identifier
url – Uniform Resource Locator
ZFIN – The Zebrafish Information Network

Chapter 1: Introduction

Since the turn of this century, much scientific discovery has been achieved through data analysis. Biological data management has become essential; it involves the acquisition, validation, storage, protection, and processing of such data.

Biomedical research is becoming more interdisciplinary, and data sets are becoming increasingly large and complex. Many current research methods (omics-type analyses) generate large quantities of data, and the internet generally makes access easy. However, ensuring quality and standardization of data is often convoluted and requires more effort [1]. There is a need for a tool that explicitly indicates where the data came from and which version was accessed, along with the access time and date. For biological data to serve its intended purpose, it has to be properly managed, and to get the best out of it, it should adhere to the FAIR guiding principles, which aim at making data Findable, Accessible, Interoperable, and Reusable [14].

Bioinformatics is a growing field which involves the science of collecting and analyzing such complex biological data. Currently, many of the bioinformatics tools used in biological data management are written in specialized programming languages, making them hard to use for the great number of people who would otherwise benefit from such data.

There is thus a great need to standardize the way in which these biological data analysis steps are communicated. Curation guidelines should be fully documented, versioned, and made available to users. This will ensure reproducibility and standardization of the curation process, aiding curation consistency, transparency, and utility. Curation guidelines should also describe the extent and method of documenting data provenance (i.e., where data is coming from and which transformations it has undergone) and of attributing steps in the curation process to particular agents (i.e., which person or process made which change). As recording every single step of curation is impractical, because different curators may go through different routes to achieve the same final result, it is worth identifying key decisions in the curation workflow and recording in sufficient detail how such decisions are reached [1]. A possible solution to this great need is building curation-specific filters and adapting the BioCompute Object (BCO).

BioCompute

BioCompute is a standardized method of communicating bioinformatic workflow and analysis pipeline information. An instance of the BioCompute standard is known as a BioCompute Object (BCO). Ideally, a BCO should contain all of the necessary information to repeat and communicate an entire pipeline from FASTQ to result, and includes additional metadata to identify provenance and usage [11]. The BioCompute Object (BCO) Project is a community-driven initiative, originally developed to harmonize a framework for standardizing and sharing computations and analyses generated from high-throughput sequencing (HTS, also referred to as next-generation sequencing (NGS) or massively parallel sequencing (MPS)) [3]. The BCO project has been standardized as IEEE 2791-2020, as of January 30, 2020, and the project files are maintained in an open-source repository [11, 15]. It was originally developed to transmit next-generation sequencing (NGS) analyses to the United States Food and Drug Administration (FDA) for regulatory purposes. The George Washington University (GWU) and the FDA have published a BioCompute Object Specification Document for research and clinical trial use, which details a new framework for communicating high-throughput sequencing (HTS) computations and data analyses, known as BioCompute Objects (BCOs) [3]. This mechanism was created to standardize NGS pipelines so that they adhere to the FAIR data principles, making data findable, accessible, interoperable, and reusable.

In creating a BCO, the entire workflow of the experiment or process is captured and presented in categorized domains. The BioCompute standard brings clarity to an analysis, making it easy to understand and reproduce. Since there are many different platforms, scripts, and tools for genomic data analysis, there is a need to standardize the method of communicating the steps used in such analyses, and the BCO does exactly that. This makes the BioCompute standard a powerful tool for the standardization of genomic data analyses.

BioCompute Objects are thus used for the communication and harmonization of bioinformatics protocols. BioCompute creates a standard for communicating genomic analysis workflows. It acts as an envelope for the entire pipeline and is written in JSON (JavaScript Object Notation), a lightweight data-interchange format that is easy for humans to read and write without any knowledge of programming. JSON has the additional advantage of being both machine- and human-readable and writable. Although the BCO was originally created to transmit NGS analyses, it has since grown to accommodate most bioinformatics analyses, which was possible because of some adaptations made to the original schema. Since the BCO was easily adaptable to other uses, there is reason to explore its adaptability to the process of biocuration.

A BioCompute Object has the following domains, as shown in Figure 1:

1. Provenance domain: defines the history, version, and status of the BCO.
2. Usability domain: a free-text domain that lets a researcher describe the purpose, scope, and any other relevant details; it also improves searchability.
3. Extension domain: a user-defined domain that allows one to add additional structured information that is not in the BCO schema.
4. Description domain: fields for describing things like the pipeline steps, keywords, and software used.
5. Execution domain: contains fields describing the environment in which an analysis was run.
6. Input/output domain: represents the list of global input and output files created by the computational workflow; each analysis generates files, and this domain records the path to each of them.
7. Parametric domain: non-default parameters customizing the computational flow.
8. Error domain: defines the empirical and algorithmic limits and any error sources of the BCO.

It is worth mentioning that five of the above eight domains of the base schema (the provenance, usability, description, execution, and input/output domains) are required, while three (the extension, parametric, and error domains) are optional. The descriptions of the BCO domains are shown in Figure 1.

Figure 1: BCO Domains
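Because a BCO is serialized as JSON, the domain layout above maps directly onto a JSON document. Below is a minimal sketch of that skeleton, written in Python and dumped as JSON; the field names follow the IEEE 2791-2020 base schema, but every value is a placeholder rather than content from a real BCO, and the spec_version URI should be verified against the published schema.

```python
import json

# Minimal sketch of a BCO's top-level structure. Field names follow the
# IEEE 2791-2020 base schema; all values below are placeholders.
bco_skeleton = {
    "object_id": "https://example.org/BCO_000001",  # placeholder identifier
    "spec_version": "https://w3id.org/ieee/ieee-2791-schema/2791object.json",
    "etag": "0" * 64,  # stand-in for the hash that protects object integrity
    # Required domains
    "provenance_domain": {
        "name": "Example pipeline",
        "version": "1.0",
        "created": "2020-06-01T00:00:00Z",
        "modified": "2020-06-01T00:00:00Z",
        "contributors": [],
        "license": "CC-BY-4.0",
    },
    "usability_domain": ["Free-text statement of purpose and scope."],
    "description_domain": {"keywords": [], "pipeline_steps": []},
    "execution_domain": {
        "script": [],
        "script_driver": "shell",
        "software_prerequisites": [],
        "external_data_endpoints": [],
        "environment_variables": {},
    },
    "io_domain": {"input_subdomain": [], "output_subdomain": []},
    # Optional domains
    "extension_domain": [],
    "parametric_domain": [],
    "error_domain": {"empirical_error": {}, "algorithmic_error": {}},
}

print(json.dumps(bco_skeleton, indent=2))
```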

Biocuration

It is of particular importance that researchers are able to take advantage of the vast amounts of publicly available datasets to reproduce and validate results, or to investigate new research questions; hence the importance of biocurating data for further analytical use. Also, some of the shortcomings in data utility, reusability, and quality can be addressed by biocuration, which is the process of identifying, organizing, annotating, standardizing, and enriching biological data [2].

According to the International Society for Biocuration, biocuration involves the translation and integration of information relevant to biology into a database or resource that enables integration of the scientific literature as well as large data sets [16]. Accurate and comprehensive representation of biological knowledge, as well as easy access to this data for working scientists and a basis for computational analysis, are primary goals of biocuration.

Biocuration involves adding value to biomedical data by the processes of standardization, quality control, and information transfer (also known as data annotation). It enhances data interoperability and consistency, and is critical in translating biomedical data into scientific discovery [4]. Many renowned knowledgebases and databases employ the process of biocuration to maintain their information. When the Saccharomyces Genome Database (SGD) was established in 1993, it served to collect, organize, and make easily accessible the rapidly growing knowledge about Saccharomyces cerevisiae, also known as brewer's yeast, baker's yeast, or informally as yeast [13].

During the yeast genome-sequencing project, it became apparent that the evolving flood of information required the creation of a specialized database, and this expert knowledgebase was built through the participation of scholarly biocurators [12].

Biocuration, like other scientific fields, is a necessary and continually evolving field of science.

Data biocuration differs from data analysis. Data analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis aims at arriving at a conclusion, while data biocuration aims at standardization and quality control, which may heavily influence downstream conclusions. Hence, there is a need for a standardized way of communicating biocuration workflows, and the BCO would be a great fit for the biocuration process.

Currently, there is no official qualification for becoming a biocurator, so different groups train people for the job based on the needed skills, usually drawing on candidates with a biology or programming background. This creates considerable challenges in meeting the aims of biocuration. Since biocuration aims at standardization, creating specific filters and building a BCO of the biocuration process will help to capture and standardize all the steps involved. Just like the genomic pipelines for which the BCO was originally created, the workflow of the biocuration process could be captured and built into the standard domains of a BCO, with some possible modifications. Using the BCO framework for biocuration would have the advantage of creating the needed standardization and reproducibility of work, irrespective of who does the job, thereby minimizing the possible errors associated with the biocuration process.

Chapter 2: Background Information and Review of Knowledgebases

Since the advent of NGS, the cost of high-throughput sequencing (HTS) has been in considerable decline [5], which is generally a positive development but carries an added burden. This reduced cost has resulted in a proliferation of generated data, and a corresponding need for analysis and biocuration of such data. Bioinformatics tools that aid the analysis of data are continuously being developed. However, prior to analysis, most biological data undergo detailed curation. There are also virtually no standardized, industry-accepted metadata schemas for reporting the computational objects that record the parameters used for computations together with the results of those computations. Much of the time, it may be impossible to reproduce the results of a previously performed computation due to a lack of information on parameters, versions, arguments, conditions, and procedures of application launch [9]. The process of biocuration can be achieved either manually by biocurators or automatically through the use of specific software. There is therefore a need to standardize the process of biocuration in order to ease communication of the processes employed. It has been established that BioCompute enables clear communication of "what was done" and "why it was done" by tracking provenance and documenting processes in a standard format, irrespective of the platform, the programming language, or even the tool used [6].

UniProt (the Universal Protein Resource) is a comprehensive resource for protein sequence and annotation data. A central activity of the UniProt Consortium is the biocuration of the UniProt Knowledgebase (UniProtKB) [7]. As a way to respond to the overwhelming amount of data it receives, UniProtKB consists of two sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL: UniProtKB/Swiss-Prot data biocuration is done manually, while that of UniProtKB/TrEMBL is done automatically [7, 8]. The depth of curation undergone by a renowned protein resource like UniProt demonstrates the need for biocuration, and in turn the need for a standardized way of presenting curated information or data. Curation can be 'entity-based', where the curation team prioritizes papers for a certain class of entity [10]. The Saccharomyces Genome Database (SGD) biocurators employ both manual curation of publications and specialized curation tools developed by SGD for updating their data [12]. This research work thus explores the possibility of creating curation-specific filters and adapting the BCO framework to biocuration. It was made possible using the knowledge obtained from applying curation filters to the OncoMX BCOs [19] and the BCO database [20], as well as knowledge gained from multiple databases and knowledgebases that employ the process of biocuration.

Chapter 3: Methods

To carry out this project, multiple knowledgebase and database websites that curate data were visited. The initial process also involved extensive literature mining, including searching PubMed for keywords such as curation, annotation, biocuration, and quality control. Multiple papers were read and analyzed to obtain more information and insight on data curation.

Figure 2 shows the curation process that was employed.

Figure 2: Curation Process Employed

Starting with Bgee, a database for normal gene expression in multiple animal species [17], one of the datasets of interest was selected. In the first case, human expression from RNA-Seq calls was selected, and the process of generating the dataset was studied. Next, a search on GitHub revealed how the human RNA-Seq present and absent gene expression calls are generated: Bgee uses an R package known as "BgeeCall" for automatic RNA-Seq present/absent gene expression call generation [18]. Further searching showed that UniProt employs both automatic and manual curation for its data [7]. Similarly, the search revealed that SGD employs manual extraction and the use of specialized tools for its data curation [13]. These detailed searches revealed that different workflows are employed by different databases and knowledgebases in curating their data. After repeating this process for multiple other databases and knowledgebases that utilize the process of biocuration, certain steps and filters were developed for BCO curation [see Table 1].

Finally, based on the knowledge gathered from the process of biocuration, filters and steps for curation were applied to some existing BCOs. In this process, already-built BCOs in the OncoMX and BioCompute databases were curated, based on the criteria developed after extensive review of other curated databases, with the aim of producing high-quality BCOs.

The curation process employed includes the following steps:

1. Extensively search the available literature for details of the contents of each BCO domain, to verify the information built into the domain.
2. Select each of the ten BCOs from the BCO database and the thirty BCOs from OncoMX, then critically scrutinize and curate their contents for validity of input by checking the links.
3. Analyze every domain of each BCO, re-run all the "uri" links and pipelines (when possible) to check the accuracy and functionality of the presented links, and test links to make sure that they are not broken (a sketch of such a link check appears after this list).
4. Check for any typographical errors, omissions, or incorrect key-value pairs.
5. Extract and organize the data from each BCO that need to be edited or corrected.
6. Write up a comprehensive document of these findings.
7. Make suggestions on areas of the BCOs that need to be edited.
8. Draw up a possible BCO schema for biocuration.
9. Suggest possible additional domains, revisions of existing fields/domains for the biocuration BCO, or filters for specific specialties of the biocuration BCO.
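Several of these steps, particularly the link testing in step 3, lend themselves to partial automation. The following is a minimal sketch, assuming each BCO is available as a local JSON file; the function name is illustrative, and a production check would also need to handle rate limits and authentication-gated endpoints.

```python
import json
import urllib.error
import urllib.request

def find_broken_links(bco_path):
    """Report every 'uri' value in a BCO JSON file that fails to resolve."""
    with open(bco_path) as fh:
        bco = json.load(fh)

    broken = []

    def walk(node):
        # Recurse through nested dicts and lists, testing each 'uri' string.
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "uri" and isinstance(value, str):
                    try:
                        urllib.request.urlopen(value, timeout=10)
                    except (urllib.error.URLError, ValueError) as exc:
                        broken.append((value, str(exc)))
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(bco)
    return broken

# Hypothetical usage:
# for uri, reason in find_broken_links("ONCOMXDS000003.json"):
#     print(f"BROKEN: {uri} ({reason})")
```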

Table 1: Table of Some Visited Curated Resources in Life Sciences

Name | Website
RGD – Rat Genome Database | https://rgd.mcw.edu/
EMBL-EBI – European Molecular Biology Laboratory-European Bioinformatics Institute | https://www.ebi.ac.uk/
UniProt – Universal Protein Resource | https://www.uniprot.org/
MGI – Mouse Genome Database | http://www.informatics.jax.org/
ZFIN – The Zebrafish Information Network | https://zfin.org/
Wormbase | https://wormbase.org/
SGD – Saccharomyces Genome Database | https://www.yeastgenome.org/

[7, 13, 21, 22, 23, 24, 25]

Chapter 4: Data and Results

Based on the results obtained by curating the ten BCOs from the BCO editor 3.0.2 database and the thirty BCOs from the OncoMX database, further insight was gained into the steps for the biocuration of data. Based on the attempt to create a biocuration BCO using the Bgee and SGD biocuration processes as test-case examples, it was also possible to suggest a schema modification specific to creating a BCO of the biocuration process. It was observed that almost none of the curated BCOs met all of the schema requirements needed to be compliant with the set standard. Based on these results, it was suggested that BCOs be further categorized into three tiers (gold, silver, and bronze) [see Table 2], based on the completeness of the BCO.

Table 2: Table of Criteria for the 3 Tiers of BCO

Criterion | Gold | Silver | Bronze
Required fields | All required fields are completed | All required fields are completed | Incomplete required fields
Typos/text errors | No typos or text errors | 1-2 typos or text errors | 3 or more typos or text errors
Links | All links are functional and link names are complete | Links are functional but link names are incomplete | Some links are not functional or link names are incomplete
Metadata | Contains complete metadata | Missing some metadata | Missing some metadata
Error domain | Error domain is completed | Error domain remains optional | Error domain remains optional

Using the criteria outlined in Table 2, it was then possible to further analyze the findings from the curation of the BCO database and the OncoMX database. The results of these findings are summarized in Tables 3 and 4.
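For repeatability, the Table 2 criteria can also be expressed as a simple decision rule. The sketch below is illustrative only: the boolean inputs and the typo count are assumed to come from a prior manual or automated curation pass, and the parameter names are hypothetical.

```python
def assign_tier(required_fields_complete, typo_count, all_links_functional,
                link_names_complete, metadata_complete, error_domain_complete):
    """Assign a gold/silver/bronze tier following the criteria in Table 2."""
    if (required_fields_complete and typo_count == 0 and all_links_functional
            and link_names_complete and metadata_complete
            and error_domain_complete):
        return "Gold"
    if required_fields_complete and typo_count <= 2 and all_links_functional:
        return "Silver"  # tolerates incomplete link names and missing metadata
    return "Bronze"

# Example: complete fields, one typo, working links with incomplete names.
print(assign_tier(True, 1, True, False, False, False))  # -> Silver
```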

Table 3: Summarized Table of Results of BCO DB Curation

BCO Name | All required fields completely filled out (Y/N) | Conforms to required standard (Y/N) | What does not conform | Tiered level of BCO
HCV1a ledipasvir resistance SNP detection | Y | N | "spec_version"; typo and incomplete file name | Silver
HCV1a ledipasvir resistance SNP detection | Y | N | "spec_version"; incomplete file name | Silver
R Safety Assessment Algorithm for Aluminum in Infant Vaccines | Y | N | "spec_version", "object_id"; typos | Silver
Modeling the Myogenic Response to Increased Intraluminal Pressure | Y | N | "object_id", "spec_version" | Silver
Detecting Antimicrobial Resistant Genes in Human Gut Microbiome Sample | Y | N | "object_id", "spec_version" | Silver

Table 4: Summarized Table of Results of OncoMX BCO Curation

BCO Name | All required fields completely filled out (Y/N) | Conforms to required standard (Y/N) | What does not conform | Tiered level of BCO
ONCOMXDS000003 FDA breast biomarkers | N (execution domain has no input) | N | "spec_version" | Bronze
ONCOMXDS000012 Genes normally expressed in human tissues (Bgee) | N | N | "spec_version"; name typo under provenance domain | Bronze
ONCOMXDS000022 Cancer differentially expressed genes in tissue (BioXpress v4.0 full data dump) | N | N | "spec_version"; no license information under provenance domain | Bronze
ONCOMXDS000023 Genes normally expressed in human tissues (Customized) | N | N | "spec_version"; no name, affiliation, or license information under provenance domain | Bronze
ONCOMXDS000024 Cancer differentially expressed miRNAs | N | N | "spec_version"; no affiliation or license information under provenance domain | Bronze
ONCOMXDS000032 Cancer differentially expressed genes in TCGA studies (BioXpress v4.0 full data dump) | N | N | "spec_version"; no license information under provenance domain | Bronze
ONCOMXDS000036 Differentially expressed marker genes in cancer cells | N | N | "spec_version"; no creator name or affiliation information under provenance domain | Bronze
ONCOMXDS000037 Metadata for differential gene expression analysis from tumor samples | N | N | "spec_version"; no affiliation information for the creator under provenance domain | Bronze

Note: In addition to the errors noted above, all 30 BCOs in the OncoMX DB lacked the correct "spec_version"; hence they are not IEEE compliant. All of the BCOs also had no entry in the execution domain, which is a required domain. This places all of them in the bronze tier of the BCO categorization.

Chapter 5: Discussion, Conclusion, and Recommendations

Discussion:

The main purpose of the biocuration of the BCOs was to create high-quality BCOs with the following qualities (an automated check of some of these criteria is sketched after the list):

Criteria for selecting a high-quality BCO

1. Conforms to IEEE 2791-2020.
2. Has the current standard metadata fields: "etag", "object_id", and "spec_version".
3. All required domains are completed.
4. Links are not broken; all "uri" and "url" values are functional.
5. The BCO is devoid of grammatical and spelling errors.
6. Includes a very descriptive usability domain, so that nothing is unclear.
7. Includes an error domain, for the gold-standard level.
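Criteria 2 and 3 can be checked mechanically before a human reviews the remaining items. A minimal sketch follows; the metadata field and domain names are taken from the IEEE 2791-2020 base schema, and the input file name is hypothetical.

```python
import json

REQUIRED_METADATA = ["etag", "object_id", "spec_version"]
REQUIRED_DOMAINS = ["provenance_domain", "usability_domain",
                    "description_domain", "execution_domain", "io_domain"]

def missing_requirements(bco):
    """List the required metadata fields and domains that are absent or empty."""
    return [f for f in REQUIRED_METADATA + REQUIRED_DOMAINS if not bco.get(f)]

# Hypothetical usage against a locally saved BCO:
with open("example_bco.json") as fh:
    bco = json.load(fh)
print(missing_requirements(bco) or "All required fields and domains present")
```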

The process of curating some existing BCOs and creating a biocuration BCO was both challenging and informative. The process of biocuration was noted to vary depending on the biological subject being curated, as well as on the database or knowledgebase providing the curation. This means that adapting the BCO framework for biocuration may not be as straightforward as creating BCOs for NGS pipelines. The purpose of this research was to work out possible filters for high-quality BCOs and to suggest a BCO framework that could be adapted specifically to the biocuration process.

Based on completeness and conformity to the required standards, and using the developed filters, it is suggested that BCOs be further classified into three levels or tiers (gold, silver, and bronze), using the criteria set out in Table 2.

The aim is always to achieve a BCO at the gold level; however, from the results in Tables 3 and 4, it can be seen that it was not possible to attain this level. Most of the BCOs in the BCO database, as seen in Table 3, were at the silver level, while those of the OncoMX database, as seen in Table 4, were at the bronze level.

Conclusion and Recommendations:

In conclusion, based on this research's findings, curation filters could be applied to further group BCOs into tiers based on completeness, and the current BCO framework could be modified and adapted for biocuration needs. The following recommendations are suggested:

1. Create tiered levels for the different BCOs, based on the set criteria needed to achieve an acceptable BCO.

2. For a biocuration BCO, the description domain could be made an optional domain, instead of being a required domain as in the BCO base schema (a sketch of this change follows). This would help to accommodate the aspects of the biocuration process that do not have defined pipelines or software, because much of the biocuration process appears to be entity specific.
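Expressed against the IEEE 2791-2020 base schema, recommendation 2 amounts to removing "description_domain" from the schema's list of required properties. The sketch below assumes the base schema's required properties are as listed; verify against the published schema before adopting it.

```python
# Required top-level properties of the base schema (per IEEE 2791-2020;
# verify against the published schema before relying on this list).
base_required = [
    "object_id", "spec_version", "etag",
    "provenance_domain", "usability_domain",
    "description_domain", "execution_domain", "io_domain",
]

# Biocuration profile: description_domain becomes optional.
biocuration_required = [p for p in base_required if p != "description_domain"]

biocuration_schema_fragment = {
    "$comment": "Proposed biocuration BCO profile",
    "required": biocuration_required,
}
print(biocuration_schema_fragment)
```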

References

1. Tang, A., Pichler, K., Füllgrabe, A., Lomax, J., Malone, J., Munoz-Torres, M., Vasant, D., Williams, E., & Haendel, M. Ten quick tips for biocuration. PLoS Computational Biology. 2019;15(5):e1006906. Published 2019 May 2. doi:10.1371/journal.pcbi.1006906

2. International Society for Biocuration. Biocuration: Distilling data into knowledge. PLoS Biology (2018) 16(4):e2002846. https://doi.org/10.1371/journal.pbio.2002846. Accessed 10 June 2020.

3. Kassam, Z. (2017). BioCompute specifications to advance genomic data analysis. European Pharmaceutical Review. Accessed from www.europeanpharmaceuticalreview.com/news/67524/biocompute-genomic-data/. Accessed 25 May 2020.

4. Zhang, Z., Zuo, W., & Luo, J. Bringing biocuration to China. Genomics, Proteomics & Bioinformatics 12(4). Published August 2014, 152-155. Elsevier. https://www.sciencedirect.com/science/article/pii/S1672022914000771. Accessed 25 May 2020.

5. Schwarze, Katharina et al. "The complete costs of genome sequencing: a microcosting study in cancer and rare diseases from a single center in the United Kingdom." Genetics in Medicine: official journal of the American College of Medical Genetics vol. 22,1 (2020): 85-94. doi:10.1038/s41436-019-0618-7

6. Simonyan, V., Goecks, J., & Mazumder, R. Biocompute Objects-A Step towards Evaluation and Validation of Biomedical Scientific Computations. PDA Journal of Pharmaceutical Science and Technology, 71(2), 136–146. (2017). https://doi.org/10.5731/pdajpst.2016.006734

7. UniProt Consortium. Biocuration in UniProt. UniProt in manual curation. (2018). Accessed from https://www.uniprot.org/help/biocuration.

8. UniProt Consortium (2014). Activities at the Universal Protein Resource (UniProt). Nucleic Acids Research, 42(Database issue), D191–D198. https://doi.org/10.1093/nar/gkt1140.

9. Simonyan, V., Goecks, J., & Mazumder, R. Biocompute Objects-A Step towards Evaluation and Validation of Biomedical Scientific Computations. PDA Journal of Pharmaceutical Science and Technology, 71(2), 136–146. (2017). https://doi.org/10.5731/pdajpst.2016.006734.

10. Hirschman, L., Burns, G. A., Krallinger, M., Arighi, C., Cohen, K. B., Valencia, A., Wu, C. H., Chatr-Aryamontri, A., Dowell, K. G., Huala, E., Lourenço, A., Nash, R., Veuthey, A. L., Wiegers, T., & Winter, A. G. (2012). Text mining for the biocuration workflow. Database: The Journal of Biological Databases and Curation, 2012, bas020. https://doi.org/10.1093/database/bas020.

11. BioCompute Analysis (2018). BioCompute objects. Retrieved from https://biocomputeobject.org/

12. Skrzypek, M. S., & Nash, R. S. (2015). Biocuration at the Saccharomyces Genome Database. Genesis (New York, N.Y.: 2000), 53(8), 450–457. https://doi.org/10.1002/dvg.22862

13. Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y, Juvik G, Roe T, Schroeder M, Weng S, Botstein D. SGD: Saccharomyces Genome Database. Nucleic Acids Res. 1998 Jan 1;26(1):73-9.

14. Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., ... Mons, B. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://www.nature.com/articles/sdata201618

15. Institute of Electrical and Electronics Engineers. IEEE 2791 schema. Accessed 28 June 2020 from https://opensource.ieee.org/2791-object/ieee-2791-schema/

16. International Society for Biocuration. Mission of the International Society for Biocuration. Accessed 24 June 2020 from https://www.biocuration.org/about/

17. Bgee Gene Expression Database. About Bgee: retrieve and compare gene expression patterns in multiple animal species. Accessed 26 June 2020 from https://bgee.org/?page=about

18. Bgee Gene Expression Database. BgeeCall, an R package for automatic RNA-Seq present/absent gene expression calls generation. Accessed 28 June 2020 from https://github.com/BgeeDB/BgeeCall

19. Dingerdissen, H. M., Bastian, F., Vijay-Shanker, K., Robinson-Rechavi, M., Bell, A., Gogate, N., Gupta, S., Holmes, E., Kahsay, R., Keeney, J., Kincaid, H., King, C. H., Liu, D., Crichton, D. J., & Mazumder, R. OncoMX: A Knowledgebase for Exploring Cancer Biomarkers in the Context of Related Cancer and Healthy Data. JCO Clinical Cancer Informatics, 4, 210–220. (2020). https://doi.org/10.1200/CCI.19.00117

20. BioCompute Portal. BioCompute Object database. Accessed 26 June 2020 from https://portal.aws.biochemistry.gwu.edu/dashboard

21. Rat Genome Database (RGD). Analysis and visualization: Rat Genome Database. Accessed 20 June 2020 from https://rgd.mcw.edu/

22. European Molecular Biology Laboratory-European Bioinformatics Institute. EMBL-EBI: the home for big data in biology. Accessed 15 June 2020 from https://www.ebi.ac.uk/

23. Mouse Genome Database (MGD). Mouse Genome Informatics. Accessed from http://www.informatics.jax.org/

24. Zebrafish Information Network (ZFIN). Database of genetic and genomic data for the zebrafish. Accessed 22 June 2020 from https://zfin.org/

25. National Human Genome Research Institute. Explore worm biology. Facilitating insights into nematode biology. Accessed 20 June 2020 from https://wormbase.org/