Building Curation Filters for BCO and Adapting the BCO Framework for Biocuration
Building Curation Filters for BCO and Adapting the BCO Framework for Biocuration

by Ngozi Onyegbula
M.B.B.S. in Medicine and Surgery, July 1995, Nnamdi Azikiwe University

A Thesis submitted to The Faculty of The Columbian College of Arts and Sciences of The George Washington University in partial fulfillment of the requirements for the degree of Master of Science.

August 31, 2020

Thesis directed by Jonathon Keeney, Assistant Research Professor of Biochemistry and Molecular Medicine

© Copyright 2020 by Ngozi Onyegbula. All rights reserved.

Dedication

This work is dedicated to my dear husband, Dr. Festus Onyegbula; my lovely children, Uzoma Onyegbula, Ugonna Onyegbula, and Chigozie Onyegbula; my amazing mother, Rose Nwadiugwu; my soul sister, Amaka Nwadiugwu; and my precious nephew, Ebube Onokpite. I thank God for all of you and your unwavering, unconditional love and support throughout the duration of my program.

Acknowledgments

I wish to acknowledge the following people for their unconditional support:
• Jonathon Keeney, PhD, my thesis adviser, always available to guide me.
• Anelia Horvath, PhD, my reader, who was there for me despite the late request.
• Jack Vanderhoek, PhD, my program director.
• Raja Mazumder, PhD, my course director.
• Hayley Dingerdissen, for your advice.
• Amanda Bell, special thanks for all the support and encouragement.
• The BioCompute Lab team, especially Janisha Patel, for your advice.
• The HIVE Lab team.

Abstract

Building Curation Filters for BCO and Adapting the BCO Framework for Biocuration

Biomedical research is becoming more interdisciplinary, and data sets are becoming increasingly large and complex. Many current methods in research (omics-type analyses) generate large quantities of data, and the internet generally makes access easy. However, ensuring quality and standardization of data usually requires more effort. BioCompute is a standardized method of communicating bioinformatics workflow information or an analysis pipeline. An instance of the BioCompute standard is known as a BioCompute Object (BCO). Biocuration involves adding value to biomedical data through the processes of standardization, quality control, and information transfer (also known as data annotation). It enhances data interoperability and consistency, and it is critical in translating biomedical data into scientific discovery. This research work is focused on creating curation filters for BCOs and adapting the BCO framework for biocuration. This study looks into the feasibility of creating curation filters, which will help to ensure that BCOs are of high quality. Also, because biocuration can be a very detailed and highly complex process with multiple steps, adapting a BCO framework for biocuration will ensure standardization of the biocuration process, making BioCompute a perfect fit for biocuration. To carry out this research, multiple databases and knowledgebases that undergo curation were studied in detail, including Bgee, UniProt, and SGD, among others. Finally, based on the knowledge gathered about the process of biocuration, filters and steps for curation were applied to some existing BCOs. In this process, already built BCOs in the OncoMX and BioCompute databases were curated with the aim of producing high-quality BCOs. Based on compliance with a set of criteria and the completeness of their domains, the BCOs were ranked into three levels: gold, silver, and bronze. The results showed that none of the existing BCOs that were curated met the gold standard.
The OncoMX BCOs were at the bronze level, while those in the BCO DB were at the silver level. Suggestions were also made on ways to adapt the current BCO framework to biocuration. However, more work needs to be done to create a BCO that is specific to biocuration.

Table of Contents

Dedication ..... iii
Acknowledgments ..... iv
Abstract of Thesis ..... v
Table of Contents ..... vii
List of Figures ..... viii
List of Tables ..... ix
List of Abbreviations ..... x
Chapter 1: Introduction ..... 1
Chapter 2: Background Information and Review of Knowledgebases ..... 9
Chapter 3: Methods ..... 11
Chapter 4: Data and Results ..... 16
Chapter 5: Discussion, Conclusion, and Recommendations ..... 21
References ..... 23

List of Figures

Figure 1 ..... 5
Figure 2 ..... 12

List of Tables

Table 1 ..... 15
Table 2 ..... 17
Table 3 ..... 18
Table 4 ..... 19

List of Abbreviations

BCO  BioCompute Object
EMBL-EBI  European Molecular Biology Laboratory-European Bioinformatics Institute
etc.  et cetera
etag  Entity Tag
FAIR  Findable, Accessible, Interoperable, and Reusable
FDA  Food and Drug Administration
GWU  George Washington University
HTS  High-Throughput Sequencing
IEEE  Institute of Electrical and Electronics Engineers
JSON  JavaScript Object Notation
MGI  Mouse Genome Informatics
MPS  Massively Parallel Sequencing
NGS  Next-Generation Sequencing
RGD  Rat Genome Database
SGD  Saccharomyces Genome Database
uri  Uniform Resource Identifier
url  Uniform Resource Locator
ZFIN  The Zebrafish Information Network

Chapter 1: Introduction

Since the turn of this century, a great deal of discovery has been achieved through data analysis. Biological data management has become essential and involves the acquisition, validation, storage, protection, and processing of such data. Biomedical research is becoming more interdisciplinary, and data sets are becoming increasingly large and complex.
Many current methods in research (omics-type analyses) generate large quantities of data, and the internet generally makes access easy. However, ensuring quality and standardization of data is often convoluted and requires more effort [1]. There is a need for a tool that explicitly indicates where the data came from and which version was accessed, along with the access time and date.

For biological data to serve its intended purpose, it has to be properly managed. To get the best out of biological data, it should adhere to the FAIR guiding principles, which aim to make data findable, accessible, interoperable, and reusable for its users [14]. Bioinformatics is a growing field concerned with collecting and analyzing such complex biological data. Currently, many of the bioinformatics tools used in biological data management are written in specialized programming languages, making them hard to use for the many people who would otherwise benefit from such data. There is thus a great need to standardize the way in which these biological data analysis steps are communicated.

Curation guidelines should be fully documented, versioned, and made available to users. This will ensure reproducibility and standardization of the curation process, thus aiding curation consistency, transparency, and utility. Curation guidelines should also describe the extent and method of documenting data provenance (i.e., where data comes from and which transformations it has undergone) and of attributing steps in the curation process to particular agents (i.e., which person or process made which change). Because recording every single step of curation is impractical, as different curators may take different routes to the same final result, it is worth identifying key decisions in the curation workflow and recording in sufficient detail how those decisions are reached [1]. A possible solution to this need is building curation-specific filters and adapting the BioCompute Object (BCO).

BioCompute

BioCompute is a standardized method of communicating bioinformatics workflow information and analysis pipelines. An instance of the BioCompute standard is known as a BioCompute Object (BCO). Ideally, a BCO should contain all of the information necessary to repeat and communicate an entire pipeline from FASTQ to result, and it includes additional metadata to identify provenance and usage [11]. A minimal illustrative sketch of how a curation filter might inspect such an object is given below.

The BioCompute Object (BCO) Project
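To make the preceding description concrete, the following is a minimal sketch, in Python, of how a curation filter might inspect a BCO (stored as JSON) for domain completeness and assign a gold/silver/bronze tier of the kind described in the abstract. The top-level domain names are taken from the public IEEE 2791-2020 BioCompute specification, the file name is hypothetical, and the tier thresholds are purely illustrative assumptions rather than the actual criteria used in this thesis.

```python
# Illustrative sketch only. The domain names follow the IEEE 2791-2020
# BioCompute specification; the tier thresholds below are hypothetical and
# are NOT the criteria used in this thesis.
import json

REQUIRED_DOMAINS = [
    "provenance_domain", "usage_domain", "description_domain",
    "execution_domain", "io_domain",
]
OPTIONAL_DOMAINS = ["parametric_domain", "error_domain", "extension_domain"]


def rank_bco(path: str) -> str:
    """Return a rough 'gold'/'silver'/'bronze' tier for a BCO JSON file."""
    with open(path) as handle:
        bco = json.load(handle)

    # A domain counts as present only if the key exists and is non-empty.
    missing_required = [d for d in REQUIRED_DOMAINS if not bco.get(d)]
    present_optional = [d for d in OPTIONAL_DOMAINS if bco.get(d)]

    if not missing_required and len(present_optional) == len(OPTIONAL_DOMAINS):
        return "gold"      # every domain filled in
    if not missing_required:
        return "silver"    # all required domains filled in
    return "bronze"        # at least one required domain missing or empty


if __name__ == "__main__":
    # "example_bco.json" is a hypothetical file name used only for illustration.
    print(rank_bco("example_bco.json"))
```

A structural check like this addresses only the completeness side of curation; judging whether the content of each domain actually complies with curation criteria would still require additional, largely manual, review.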