Glygen As a Case Study by Jeet Vora BS In

Infrastructure for Data Collection and Integration for Biomedical Knowledgebases – GlyGen as a Case Study

by Jeet Vora

B.S. in Microbiology, April 2012, University of Pune
M.S. in Microbiology, April 2014, Savitribai Phule Pune University

A Thesis submitted to The Faculty of The Columbian College of Arts and Sciences of The George Washington University in partial fulfillment of the requirements for the degree of Master of Science

May 20, 2018

Thesis directed by Raja Mazumder, Associate Professor of Biochemistry and Molecular Biology

© Copyright 2018 by Jeet Vora. All rights reserved.

Acknowledgments

The author expresses his sincere gratitude to his mentor, Dr. Raja Mazumder, for providing valuable time, guidance and suggestions throughout the project. The author also wishes to thank Dr. Nagarajan Pattabiraman for his comments and suggestions on the thesis, Dr. Robel Kahsay for his help in understanding the semantic web technologies and IT infrastructure, Hayley Dingerdissen for her continuous support and help in familiarizing glycobiology concepts, Dacian Recce-Stremtan for providing details of GlyGen servers, Rahi Navelkar and Reza Mousavi for helping in creating datasets and Amanraj Singh for his help in using Cytoscape. The author also expresses sincere gratitude to Charles Hadley King, Amanda Bell, and all other members of the lab for their support and encouragement. The author wishes to acknowledge the members of the GlyGen project for their contribution of data and the technical discussions of this project.

The GlyGen Project is funded by the National Institutes of Health Common Fund Glycoscience Fund program (NIH Award #U01GM125267-01; PI: William York & Raja Mazumder).

Abstract of Thesis

Infrastructure for Data Collection and Integration for Biomedical Knowledgebases – GlyGen as a Case Study

The ongoing acceleration in the use of omics technologies is generating petabytes of data that has resulted in the development of several knowledgebases and tools.
Even though a vast amount of knowledge is present in these resources, much of it is redundant, heterogeneous, and scattered. Despite the availability of a considerable number of resources, answering a specific scientific question still requires manual literature searches and manual collection of data from multiple resources. Hence, there is a growing need for the collection and integration of such biomedical data. Progress in biomedical data integration has been slow because the data in the knowledgebases are in different file formats, use multiple identifiers for the same entity, lack machine-readable schemas, are exposed through ill-defined Application Programming Interfaces (APIs), and are subject to data licensing issues. The biomedical community is making substantial progress by implementing new infrastructure technologies and standards, comprising Semantic Web technologies, common formats, and global linked identifiers, for data collection, integration, and retrieval.

The field of glycomics is generating data at a fast pace from high-throughput projects, and as a result, many tools and databases have been developed for the glycoinformatics community. Even in the glycobiology domain, the relevant data are scattered across various databases and knowledgebases, giving rise to the need for a global and comprehensive glycobiology knowledgebase. GlyGen, a glycoinformatics knowledgebase that is a free, extendable, and multidisciplinary resource for glycobiology, aims to address these needs. GlyGen includes data and knowledge related to glycobiology, comprising glycans, genes, proteins, diseases, expression, and mutation. For GlyGen, data are being collected and integrated from various data resources according to a predefined data model derived from use cases developed with input from more than 50 scientists.
In the initial phase of the GlyGen project, we collected and integrated mouse and human data from resources such as UniProt, PDB, PubChem, GlyTouCan, and UniCarbKB, as well as from other individual data generators, following a workflow that incorporated semantic web technologies and standards. We created 74 datasets and categorized them as protein centric, proteoform centric, or glycan centric based on the content of the data. A detailed readme for each dataset was created based on the BioCompute Object specification document. A dataset collection viewer page was developed to view the dataset collection, and dataset networks were created using Cytoscape to show the relationships and linking of data within the dataset categories. The datasets in CSV (Comma Separated Value) format were later rdfized using an RDF (Resource Description Framework) model based on existing RDF models. The rdfized data was stored in the GlyGen triplestore in the form of triples. The GlyGen triplestore was made accessible through web service APIs and SPARQL queries. To collect, integrate, and retrieve data, a high-availability cluster server configuration comprising three servers with preinstalled software was used. For GlyGen data, we chose the Creative Commons Attribution 4.0 International (CC BY 4.0) license, and for source code, we chose the GNU General Public License v3. A private GitHub repository was created for sharing source code with the public.

As it is challenging to retrieve a list of glycoside hydrolases (GHs) from a single resource, we developed a workflow that retrieved the GHs from UniProtKB and validated the entries through QuickGO, the Carbohydrate-Active Enzymes database (CAZy), and Pfam. The workflow retrieved a list of 83 GHs classified by GH families.
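The CSV-to-RDF step described above can be illustrated with a minimal sketch. It is not the GlyGen pipeline: the namespace `http://example.org/glygen/`, the `entry/` and `vocab/` path segments, the column names, and the sample identifiers are all hypothetical placeholders, and the actual GlyGen RDF model is not reproduced here. The sketch only shows the general idea of turning each non-empty CSV cell into one subject-predicate-object triple, serialized as N-Triples using the Python standard library.

```python
import csv
import io
from typing import List

# Hypothetical namespace, used for illustration only (not a GlyGen URI).
BASE = "http://example.org/glygen/"


def escape_literal(value: str) -> str:
    """Escape a string so it can be written as an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def rdfize(csv_text: str, id_column: str) -> List[str]:
    """Turn every non-empty cell of a CSV table into one N-Triples line.

    The value in `id_column` becomes the subject, the column header becomes
    the predicate, and the cell value becomes a plain string literal.
    """
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}entry/{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue  # skip the identifier itself and empty cells
            predicate = f"<{BASE}vocab/{column}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


# Placeholder rows; the identifiers and names are made up.
sample = "id,name,species\nX1,abc,Homo sapiens\nX2,,Mus musculus\n"
for line in rdfize(sample, "id"):
    print(line)
```

Lines in this form could then be loaded into a triplestore and queried with SPARQL. A production model would typically type its literals, reuse predicates from existing ontologies, and link identifiers across resources rather than minting one predicate per column.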
When GlyGen is fully developed and functional, we will make it available to the public at www.glygen.org.

Data integration is a challenging task that requires meticulous planning, creative approaches to tackling issues, and concentrated effort in implementing solutions. We firmly believe that the rest of the biomedical community will take inspiration from GlyGen to collect and integrate data for a biomedical domain from diverse resources.

Contents

Acknowledgments
Abstract of Thesis
List of Figures
List of Tables
List of Abbreviations
Introduction
Literature Review
Methods
Results
Discussion
References

List of Figures

Figure 1: Data collection and integration workflow for GlyGen
Figure 2: GlyGen high availability cluster server configuration
Figure 3: GlyGen dataset collection viewer
Figure 4: Protein centric network
Figure 5: Proteoform centric network
Figure 6: Glycan centric network
Figure 7: BioCompute Object example representation
Figure 8: Detailed dataset readme for human glycoside hydrolases

List of Tables

Table 1: Information on licenses that cover the biomedical resources
Table 2: List of human glycosidases
Table 3: List of datasets in GlyGen

List of Abbreviations

API – Application Programming Interface
BCO – BioCompute Object
CAZy – Carbohydrate-Active Enzymes
CPL – Common Programming Language
CSV – Comma Separated Value
EMBL-EBI – The European Molecular Biology Laboratory - European Bioinformatics Institute
FAQ – Frequently Asked Questions
GB – Gigabyte
GHs – Glycoside Hydrolases
GO – Gene Ontology
GTs – Glycosyltransferases
HTML – HyperText Markup Language
JSON – JavaScript Object Notation
NCBI – The National Center for Biotechnology Information
NFS – Network File System
OBO – Open Biological and Biomedical Ontology
OWL – Web Ontology Language
RDF – Resource Description Framework
RDFS – Resource Description Framework Schema
SPARQL – SPARQL Protocol and RDF Query Language
