
Infrastructure for Data Collection and Integration for Biomedical Knowledgebases – GlyGen as a Case Study

by Jeet Vora

B.S. in Microbiology, April 2012, University of Pune

M.S. in Microbiology, April 2014, Savitribai Phule Pune University

A Thesis submitted to

The Faculty of The Columbian College of Arts and Sciences of The George Washington University in partial fulfillment of the requirements for the degree of Master of Science

May 20, 2018

Thesis directed by

Raja Mazumder, Associate Professor of Biochemistry and Molecular Biology

© Copyright 2018 by Jeet Vora. All rights reserved.


Acknowledgments

The author expresses his sincere gratitude to his mentor, Dr. Raja Mazumder, for providing valuable time, guidance, and suggestions throughout the project. The author also wishes to thank Dr. Nagarajan Pattabiraman for his comments and suggestions on the thesis, Dr. Robel Kahsay for his help in understanding semantic web technologies and IT infrastructure, Hayley Dingerdissen for her continuous support and help in becoming familiar with glycobiology concepts, Dacian Recce-Stremtan for providing details of the GlyGen servers, Rahi Navelkar and Reza Mousavi for helping to create datasets, and Amanraj Singh for his help in using Cytoscape. The author also expresses sincere gratitude to Charles Hadley King, Amanda Bell, and all other members of the lab for their support and encouragement.

The author wishes to acknowledge the members of the GlyGen project for their contribution of data and the technical discussions of this project.

The GlyGen project is funded by the National Institutes of Health Common Fund Glycoscience Fund program (NIH Award #U01GM125267-01; PIs: William York and Raja Mazumder).


Abstract of Thesis

Infrastructure for Data Collection and Integration for Biomedical Knowledgebases – GlyGen as a Case Study

The accelerating use of omics technologies is generating petabytes of data, which has led to the development of several knowledgebases and tools. Although a vast amount of knowledge is present in these resources, much of it is redundant, heterogeneous, and scattered. Despite the availability of a considerable number of resources, answering a specific scientific question still requires manual literature searches and manual collection of data from multiple resources. Hence, there is a growing need for the collection and integration of such biomedical data. Progress in biomedical data integration has been slow because the data in the knowledgebases are in different file formats, have multiple identifiers for the same entity, lack machine-readable schemas, are served through ill-defined Application Programming Interfaces (APIs), and are subject to data licensing issues. The biomedical community is making substantial progress by implementing new infrastructure technologies and standards, comprising Semantic Web technologies, common formats, and global linked identifiers, for data collection, integration, and retrieval.

The field of glycomics is generating data at a fast pace from high-throughput projects, and as a result, many tools and databases have been developed for the community. Even in the glycobiology domain, the relevant data is scattered across various databases and knowledgebases, giving rise to the need for a global and comprehensive glycobiology knowledgebase. GlyGen, a glycoinformatics knowledgebase that is a free, extendable, and multidisciplinary resource for glycobiology, aims to address these needs. GlyGen includes data and knowledge related to glycobiology, which comprises , , , diseases, expression, and mutation. For GlyGen, data are being collected and integrated from various data resources based on a pre-defined data model derived from use cases developed from the input of more than 50 scientists.

In the initial phase of the GlyGen project, we collected and integrated mouse and human data from resources such as UniProt, PDB, PubChem, GlyTouCan, and UniCarbKB, and from other individual data generators, using a workflow that incorporates semantic web technologies and standards. We created 74 datasets and categorized them as protein centric, proteoform centric, and glycan centric based on their content. A detailed readme for each dataset was created based on the BioCompute Object specification document. A dataset collection viewer page was developed to view the dataset collection, and dataset networks were created using Cytoscape to understand the relationships and linking of data within the dataset categories. The datasets in CSV (Comma Separated Value) format were later rdfized using an RDF (Resource Description Framework) model based on existing RDF models. The rdfized data was stored in the GlyGen triplestore in the form of triples. The GlyGen triplestore was made accessible through web service APIs and SPARQL queries. To collect, integrate, and retrieve data, a high availability cluster server configuration comprising three servers with preinstalled software was used. For GlyGen data, we chose the Creative Commons Attribution 4.0 International (CC BY 4.0) license, and for source code, we chose the GNU General Public License v3. A private GitHub repository was created for the source code, which will later be shared with the public. Because it is challenging to retrieve a list of glycoside hydrolases (GHs) from a single resource, we developed a workflow that retrieved the GHs from UniProtKB and validated the entries through QuickGO, the Carbohydrate-Active Enzymes database (CAZy), and Pfam. The workflow retrieved a list of 83 GHs classified by GH families. When GlyGen is fully developed and functional, we will make it available to the public at www.glygen.org.

Data integration is a challenging task that requires meticulous planning, creative approaches to tackling issues, and concentrated effort in implementing solutions. We firmly believe that the rest of the biomedical community will take inspiration from GlyGen to collect and integrate data for a biomedical domain from diverse resources.


Contents

Acknowledgments

Abstract of Thesis

List of Figures

List of Tables

List of Abbreviations

Introduction

Literature Review

Methods

Results

Discussion

References


List of Figures

Figure 1. Data collection and integration workflow for GlyGen

Figure 2. GlyGen high availability cluster server configuration

Figure 3. GlyGen dataset collection viewer

Figure 4. Protein centric network

Figure 5. Proteoform centric network

Figure 6. Glycan centric network

Figure 7. BioCompute Object example representation

Figure 8. Detailed dataset readme for human glycoside hydrolases


List of Tables

Table 1. Information on licenses that cover the biomedical resources

Table 2. List of human glycosidases

Table 3. List of datasets in GlyGen


List of Abbreviations

API – Application programming interface

BCO – BioCompute Object

CAZy – Carbohydrate-Active Enzymes

CPL – Collection Programming Language

CSV – Comma Separated Value

EMBL-EBI – The European Molecular Biology Laboratory - European Bioinformatics Institute

FAQ – Frequently Asked Questions

GB – Gigabyte

GHs – Glycoside Hydrolases

GO – Gene Ontology

GTs – Glycosyltransferases

HTML – HyperText Markup Language

JSON – JavaScript Object Notation

NCBI – The National Center for Biotechnology Information

NFS – Network File System

OBO – Open Biological and Biomedical Ontologies

OWL – Web Ontology Language

RDF – Resource Description Framework

RDFS – Resource Description Framework Schema

SPARQL – SPARQL Protocol and RDF Query Language

SWLS – Semantic Web for the Life Sciences


TB – Terabyte

TSV – Tab Separated Value

URI – Uniform Resource Identifier

W3C – World Wide Web Consortium

XML – Extensible Markup Language


Introduction

Data integration has become a crucial and daunting undertaking in the biomedical field due to the exponential surge in data driven by cheaper omics technologies. The biomedical field is generating petabytes of data with a data doubling rate of seven months¹, an achievement that was impossible a decade ago. However, we are far from fully understanding the data and successfully transforming it into biomedical knowledge. Since the publication of the first “Atlas of Protein Sequence and Structure”² in the form of a book, the number of biomedical knowledgebases has steadily increased to 1,621 in 2015³. Although a staggering amount of curated knowledge is present in these resources, the data and knowledge are redundant, scattered, and in various formats across these diverse resources. There is still a need for manual literature research or, in some cases, manual collection of data from multiple diverse resources to answer a specific scientific question. The difficulty of accessing heterogeneous knowledgebases and the inability to merge datasets with one another often impede research and dampen researchers' enthusiasm. Such situations raise serious questions for the biomedical community, such as “Is there a need for thousands of databases?” or “Is there a possibility of merging and integrating data from various databases into one or a few databases for a specific biomedical domain?” There may not be straightforward answers to such questions, but there is indeed a possibility of maintaining one or a few databases in a specific biomedical domain. Data collection and integration are thus required to realize that possibility. However, there are critical challenges to data collection, integration, and retrieval in the biomedical field that impede the exchange and dissemination of data and knowledge among the existing resources. The challenges include heterogeneous biomedical data formats, multiple identifiers pointing to a single entity, non-standardized and non-machine-readable schemas, ill-defined APIs, and licenses that do not permit sharing, distributing, and remixing data. Biomedical communities know the value of data integration and have been rapidly making progress by implementing new infrastructure technologies in the form of shared data formats and linked global identifiers, and by adopting Semantic Web⁴ standards developed by the World Wide Web Consortium (W3C) (https://www.w3.org/standards/semanticweb/).

Semantic Web technology uses universal standards developed by the World Wide Web Consortium international community to interlink data on the web. Semantic web technologies enable smooth data sharing and exchange at a global scale in a machine-friendly form. The main standards are the Resource Description Framework⁵ (RDF), Resource Description Framework Schema⁶ (RDFS), SPARQL⁷ (SPARQL Protocol and RDF Query Language), and OWL⁸,⁹ (Web Ontology Language). RDF is a graph-based data model for storing data on the semantic web. Data in RDF format is stored as triples, each consisting of a subject, a predicate, and an object, where the predicate denotes the relationship between the subject and the object. The subject and object are the nodes of the graph and, along with the predicate, are represented by Uniform Resource Identifiers (URIs) that identify resources uniquely. RDFS is the data modeling language for describing RDF data. It provides the basic language for representing RDF vocabularies, on which web ontology languages such as OWL build. OWL is an ontology language that extends RDF and RDFS. It is a logic-based language that represents rich and complex knowledge about things, groups of things, and the relationships between them, and it facilitates machine readability. SPARQL is the semantic query language designed to query and retrieve data stored in an RDF file or exposed through a SPARQL endpoint.
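To make the triple model concrete, the following is a minimal sketch in plain Python. It is not an RDF library or a SPARQL engine; it only stores triples as tuples and answers one pattern with wildcards. The `EX` namespace, the predicate names, and the `match` helper are illustrative assumptions and are not taken from GlyGen or from any W3C vocabulary.

```python
# Toy illustration of RDF-style triples; not a real RDF store or SPARQL engine.
# All URIs below are hypothetical examples.
EX = "http://example.org/"

triples = {
    (EX + "protein/P1", EX + "hasGlycosylationSite", EX + "site/S1"),
    (EX + "protein/P1", EX + "foundIn", EX + "species/human"),
    (EX + "protein/P2", EX + "foundIn", EX + "species/mouse"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    )


# Roughly analogous to a SPARQL pattern of the form: ?s ex:foundIn <human>
human_proteins = [
    s for (s, _, _) in match(triples, predicate=EX + "foundIn", obj=EX + "species/human")
]
print(human_proteins)
```

A real triplestore adds indexing, typed literals, and the full SPARQL grammar; the sketch only shows how a query reduces to matching a subject, predicate, and object pattern against stored triples.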

In this study, we present the infrastructure developed for data collection and integration for the GlyGen project. GlyGen (gly-glycobiology, gen-information) is a free, extendable, and cross-disciplinary glycoinformatics knowledgebase for glycoscience research. GlyGen aims to address the data integration challenges in the glycobiology domain by implementing standards and ontology-based semantic technologies. The goals of the GlyGen project are to 1) integrate glycobiology-related knowledge and data from various resources and data generators, 2) create a user-friendly, intuitive web interface to browse and search for glycobiology knowledge, and 3) develop new glycoinformatics infrastructure and tools that will provide a systems-level understanding of glycobiology knowledge even to researchers who are not experts in glycobiology. GlyGen’s comprehensive data integration framework is designed to provide unprecedented support for complex queries spanning diverse data types relevant to glycobiology, extending its scope beyond the mapping of glycan data to genes and proteins. GlyGen is implementing a data warehousing approach in which data is first collected from various resources, cleaned and reshaped based on a pre-determined data model derived from several use cases, and then systematically integrated and made available to users or public resources in standard formats.


Literature Review

A PubMed¹⁰ search for publications on data integration was performed on May 3, 2018, using the advanced search option “Search data integration[Title] Sort by Best Match”, which returned 447 publications. A review of some of these publications made it evident that the challenges in data integration, and the need for it in the biomedical domain, have existed since the 1990s. While some publications have elucidated current challenges¹¹,¹² in data integration, others have discussed the strategies¹³, technologies¹⁴, methods¹⁵,¹⁶, and tools¹⁷ developed to alleviate challenges associated with data collection, integration, and retrieval in the domain.

Goble and Stevens¹¹ highlighted the reasons for successes and failures in past data integration efforts and suggested the use of shared identifiers and Semantic Web technologies to address data integration challenges. They stressed that biomedical data integration needs to use existing semantics, schemas, and standards that are free from biases. They drew attention to the situation in which similar data is found in multiple, replicating, and overlapping resources, which has led to “a loose federation of bio-nations.” Good and Wilkinson¹³, in “The Semantic Web for the Life Sciences (SWLS),” discussed the use of semantic web⁴ technologies such as OWL, RDF, and SPARQL for naming, representing, describing, and accessing biomedical data. They expressed concern that bioinformatics groups that should have taken the initiative to fully accept and implement SWLS for the community had instead shown “semantic creep—timid, piecemeal and ad hoc adoption of parts of standards by these groups.” An overview by Lapatas and colleagues¹² provided background on data integration from a computer science viewpoint by illustrating six common schemata used in data and knowledge integration in biology. They identified data heterogeneity as a major barrier to data integration. Chung and Wong¹⁷ recognized the importance of data retrieval and integration nearly two decades ago and developed “Kleisli,” a tool for broad-scale data integration using the Collection Programming Language (CPL). The Kleisli system supported complex queries across multiple heterogeneous resources and made data transformation, manipulation, retrieval, and integration smooth and efficient. KaBOB¹⁶ (the Knowledge Base of Biomedicine) was another system, which used 14 widely used Open Biological and Biomedical Ontologies¹⁸ (OBO) and a semantic web framework to integrate biomedical data from 18 resources. The Bio2RDF¹⁴ project built a database mashup system and made documents from several knowledgebases available in RDF format using rdfizer programs. The mashup of biomedical data was built with a three-step approach in which Bio2RDF successfully applied Semantic Web⁴ technologies and common ontologies. Gligorijevic and Przulj¹⁵ surveyed and compared different methods of data integration for heterogeneous data. They proposed that non-negative matrix factorization-based approaches would become a favorable data integration strategy, given their superiority in handling heterogeneous data with high accuracy.

It is estimated that more than 50 percent of the proteins in undergo glycosylation as a posttranslational modification, making glycosylation a critical enzymatic modification in the . The need to study and understand the importance of glycans and of the glycosylation of lipids and proteins, which play an essential role in the physiological well-being of humans, has resulted in the emergence of glycobiology as a separate discipline. Researchers around the globe are showing deep interest in studying glycans for their structural, functional, and metabolic roles¹⁹ to fully understand the biological system. The rise of glycobiology has resulted in the development of several glycoinformatics databases and tools, but to coherently understand the complete picture of an organism's glycome, integration of the growing number of resources, datasets, and tools is required. A list of glycoinformatics resources and tools is well documented in Chapter 52 of Essentials of Glycobiology [Internet], 3rd edition²⁰. The glycoinformatics community is developing ontologies using the RDF framework that could connect remote glycoinformatics databases and facilitate data exchange, retrieval, cross-linking, and integration²¹⁻²³. By integrating multiple glycan structure databases into a single central portal²⁴,²⁵, the community has begun the data integration process, and GlyGen will propel their efforts further.

Glycoside hydrolases (GHs), also called glycosidases or glycosyl hydrolases, are enzymes that catalyze the hydrolysis of glycosidic linkages. Genome projects have revealed that 1–3% of an organism's genes are typically devoted to carbohydrate hydrolysis. Glycoside hydrolases are classified under EC 3.2.1 and into families based on their amino acid sequence similarities²⁶. Common glycoside hydrolases include , , , , , , , and and are involved in degradation of glycans, act as intermediates that are used as substrates by Glycosyltransferases (GTs) and are a significant component of which is involved in phagocytosis, autophagy, and receptor-mediated endocytosis. Defects in GHs cause defects in and degradation that further cause fatal genetic disorders and diseases²⁷. Thus, GHs are an area of active research, but little information is available on them. It is difficult to obtain a comprehensive list of GHs from the current resources. GlyGen is working on such use cases to make the required data available. In this study, we have attempted to retrieve the list of GHs. The workflow for retrieving GHs will be refined, and a final, comprehensive, and validated GH list will be made available to users.


Methods

Data Integration Workflow

For GlyGen, the data collection and integration workflow was designed to encapsulate and implement the current semantic standards for data collection, transformation, visualization, storage, and retrieval. Additional inputs from database engineers and developers were implemented in the design. The workflow for data collection and integration is shown in Figure 1.

Data Collection

GlyGen-specific multidisciplinary data was collected from partnering resources, collaborators, and data generators. The data was collected on the basis of a pre-defined data model derived from 114 distinct use cases grouped into six categories. The use cases were gathered from the input of more than 100 multidisciplinary investigators impacted by glycobiology. The resources included UniProtKB²⁸, Protein Data Bank²⁹ (www.rcsb.org), RefSeq³⁰, GlyTouCan²⁵, UniCarbKB³¹, PubChem³², and Protein Ontology (PRO)³³. Before data collection, data exchange formats were formalized with the data sources, and data for two species, viz. mouse and human, were collected and integrated in the initial phase of the project. The collected data was categorized mainly into protein, proteoform, and glycan centric data. To avoid data sharing and data remixing issues arising from licenses, prior agreements were made with the data resources to allow sharing and modification of data with proper attribution.

Table 1 shows the licenses that cover these biomedical resources.


Data Processing

The collected data was in heterogeneous file formats such as CSV, TSV, and RDF/XML, so the files were first converted to a standard CSV file format using Python scripts. The RDF/XML triples were first converted into N-Triples format using Rapper (librdf.org/raptor/rapper.html). Data were then extracted from the N-Triples files into CSV files through SPARQL queries. Once the CSV files were created, they were given self-descriptive filenames. Henceforth, the CSV files are referred to as datasets. In the subsequent step, a Python script was used to map identifiers belonging to the protein and proteoform centric data to UniProtKB²⁸ canonical accessions, and identifiers belonging to the glycan centric data to GlyTouCan²⁵ accessions. This script also filtered out identifiers that did not belong to the human or mouse species. Another Python script was run to compute unique value statistics for each column of the CSV files.
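The processing steps described above (format conversion, identifier mapping, species filtering, and per-column unique value counts) can be sketched with the Python standard library. This is a minimal illustration under stated assumptions and not the GlyGen scripts themselves: the column names, the source identifiers, the `ACC000x` accessions, and the mapping table are all hypothetical.

```python
import csv
import io

# Hypothetical input: a tab-separated export from a source resource.
TSV_TEXT = (
    "source_id\tspecies\tsite\n"
    "SRC_A\thuman\t12\n"
    "SRC_B\tmouse\t40\n"
    "SRC_C\trat\t7\n"
    "SRC_A\thuman\t88\n"
)

# Hypothetical mapping from source identifiers to canonical accessions.
ID_TO_ACCESSION = {"SRC_A": "ACC0001", "SRC_B": "ACC0002", "SRC_C": "ACC0003"}
ALLOWED_SPECIES = {"human", "mouse"}


def tsv_to_rows(text):
    """Parse tab-separated text into a list of dicts keyed by column header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def map_and_filter(rows):
    """Keep human and mouse rows and add the mapped canonical accession."""
    kept = []
    for row in rows:
        accession = ID_TO_ACCESSION.get(row["source_id"])
        if accession is None or row["species"] not in ALLOWED_SPECIES:
            continue
        kept.append({"accession": accession, **row})
    return kept


def to_csv(rows):
    """Serialize non-empty rows to comma-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def unique_value_counts(rows):
    """Count the distinct values found in each column."""
    return {column: len({row[column] for row in rows}) for column in rows[0]}


dataset = map_and_filter(tsv_to_rows(TSV_TEXT))
print(to_csv(dataset))
print(unique_value_counts(dataset))
```

Keeping each step as a separate function mirrors the order of the workflow, so that a different source format only requires replacing the parsing function.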

Dataset Readme

Once a dataset was created, an object ID was assigned to it, and a detailed readme for the dataset was written based on the BioCompute Object³⁴ (BCO) specification (https://osf.io/r6s4u/).

Dataset Viewer

A dataset collection viewer page was created for browsing the datasets. Its design follows the user-friendly conventions of e-commerce web pages, with search and filter capabilities. Additional functionalities, such as downloading and previewing a dataset and reading and commenting on its readme, were added to the page.


Dataset Network Visualization

To create a network visualization of the datasets in each data category, Cytoscape³⁵ was used. Once the network was created, it was saved as a JSON file and embedded in the HTML page of the dataset viewer.
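As an illustration of what such a network file can contain, the sketch below builds a small nodes-and-edges structure in the style of the Cytoscape.js elements JSON and serializes it with the standard `json` module. The dataset names, the `shared_column` attribute, and the `build_network` helper are assumptions made for the example; the file exported by Cytoscape for GlyGen may carry additional fields such as layout positions and styles.

```python
import json

# Hypothetical datasets linked by the identifier column they share.
links = [
    ("protein_dataset_a", "protein_dataset_b", "accession"),
    ("protein_dataset_a", "protein_dataset_c", "accession"),
]


def build_network(edges):
    """Build a nodes/edges structure in the style of Cytoscape.js JSON."""
    node_ids = sorted({name for source, target, _ in edges for name in (source, target)})
    return {
        "elements": {
            "nodes": [{"data": {"id": node_id}} for node_id in node_ids],
            "edges": [
                {
                    "data": {
                        "id": f"{source}--{target}",
                        "source": source,
                        "target": target,
                        "shared_column": column,  # illustrative attribute
                    }
                }
                for source, target, column in edges
            ],
        }
    }


network = build_network(links)
network_json = json.dumps(network, indent=2)
print(network_json)
```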

RDFization of Data, Storage, and Access

For rdfization of the GlyGen data, an RDF model was created based on existing RDF models such as GlycoCoO²² RDF and UniProtKB²⁸ RDF. The data were extracted from the datasets for rdfization and stored in the form of triples in the GlyGen triplestore, as per the RDF model.
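The rdfization step can be pictured as turning each CSV row into one or more triples. The following is a minimal sketch that emits N-Triples lines using only the standard library. The `BASE` namespace, the use of column names as predicates, and the sample rows are assumptions for illustration; the GlyGen RDF model itself is not reproduced here.

```python
import csv
import io

# Hypothetical namespace; the predicates are simply the CSV column names.
BASE = "http://example.org/glygen/"

CSV_TEXT = "accession,gene_name\nACC0001,GENE1\nACC0002,GENE2\n"


def escape_literal(value):
    """Escape backslashes, quotes, and newlines for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def row_to_ntriples(row, id_column):
    """Turn one CSV row into N-Triples lines, one per non-identifier column."""
    subject = f"<{BASE}protein/{row[id_column]}>"
    lines = []
    for column, value in row.items():
        if column == id_column:
            continue
        lines.append(f'{subject} <{BASE}{column}> "{escape_literal(value)}" .')
    return lines


ntriples = [
    line
    for row in csv.DictReader(io.StringIO(CSV_TEXT))
    for line in row_to_ntriples(row, "accession")
]
print("\n".join(ntriples))
```

In practice the predicates would come from the chosen RDF model rather than from column names, and the resulting file would be bulk-loaded into the triplestore.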

Hardware and Software infrastructure for data collection and integration

Two server configurations, viz. a stand-alone server configuration and a high availability cluster server configuration, were evaluated for data collection, integration, storage, and retrieval. Of the two, the high availability cluster server configuration was selected based on its benefits, efficiency, and performance. The high availability cluster server configuration used for GlyGen comprised three servers, of which two were compute nodes. The compute nodes were VMware vSphere Hypervisor virtual machines. The third server was the Network File System (NFS) server for data storage. The total storage and memory capacities of the server setup were 21 TB and 198 GB, respectively. The servers ran the Linux CentOS 7 operating system and were preinstalled with software such as the Apache web server, MySQL for database management, Virtuoso for triplestore management, and Flask for web service APIs. The servers were connected through a networking switch and protected by a firewall. The hardware and software setup selected for GlyGen is sufficient to store billions of triples and actively serve the GlyGen front-end API and other public APIs. The setup can provide thousands of users seamless and uninterrupted access to the GlyGen interface. The server setup is shown in Figure 2.

License for GlyGen Data and Source code

Open licenses such as the Open Data Commons Open Database License (https://opendatacommons.org/licenses/odbl/), the Creative Commons Attribution 4.0 International License (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/), and the GNU General Public License v3 (https://www.gnu.org/licenses/gpl-3.0.en.html) were reviewed for GlyGen data and source code.

GitHub Repository for GlyGen’s source code

The scripts and source code for the GlyGen knowledgebase were committed to the GlyGen GitHub account and will be made available to the public at a later stage, ensuring that the developed resources can be shared.

Workflow for creating a list of human glycoside hydrolases

Using the UniProt advanced search infrastructure, three searches were performed (UniProtKB/Swiss-Prot 03/02/2018, accessed 03/15/2018) to obtain the list of all reviewed human glycosidase enzymes. The queries were “keyword:Glycosidase [KW-0326]” AND reviewed:yes AND organism: “Homo sapiens (Human) [9606]”, “glycosyl hydrolase” AND reviewed:yes AND organism: “Homo sapiens (Human) [9606]”, and “glycoside hydrolase” AND reviewed:yes AND organism: “Homo sapiens (Human) [9606]”. Three searches were performed because glycosidases are also known by other names, such as glycoside hydrolase and glycosyl hydrolase. For each entry, the UniProtKB accession, protein name, and gene name (primary) were retrieved. The protein entries retrieved from these three queries were reviewed for the Gene Ontology³⁶ (GO) molecular function annotation term GO:0016798 (hydrolase activity, acting on glycosyl bonds) using QuickGO³⁷ (03/09/2018, accessed 3/15/18). The entries that did not have the molecular function annotation term GO:0016798 were filtered out. In a subsequent step, the list of UniProtKB entries belonging to the human glycoside hydrolase (GH) families in CAZy³⁸ (Carbohydrate-Active Enzymes) was manually retrieved from the CAZy database (03/09/2018, accessed). The UniProtKB accessions retrieved from the CAZy database that mapped identically to accessions retrieved from the queries were assigned the corresponding GH families. In the next step, the accessions from the queries were analyzed for GH domain family annotations directly in Pfam³⁹ (03/15/2018, accessed) and were assigned the GH family domain. Entries were filtered out if they did not belong to a GH family and did not have a GH family domain assigned. After the accessions from the three queries were filtered based on the above manual validation, they were merged to give a comprehensive list of glycosidases.
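The validation and merging logic can be expressed with set operations. The sketch below is illustrative only: the accessions, annotations, and family labels are hypothetical, and the validation in this study was performed manually. It also assumes one particular reading of the filtering rule, namely that an entry is removed only when it lacks the GO term, a CAZy GH family, and a Pfam GH family domain.

```python
# Hypothetical accessions and annotations, for illustration only.
query_results = {
    "keyword": {"ACC1", "ACC2", "ACC5"},
    "glycosyl hydrolase": {"ACC1", "ACC2", "ACC3", "ACC4"},
    "glycoside hydrolase": {"ACC2", "ACC3", "ACC6"},
}
has_go_term = {"ACC1", "ACC2", "ACC3"}        # annotated with GO:0016798
cazy_family = {"ACC1": "GH1", "ACC4": "GH2"}  # illustrative CAZy family labels
pfam_domain = {"ACC2": "GH3", "ACC3": "GH4"}  # illustrative Pfam GH domain labels


def passes_validation(accession):
    """Assumed rule: keep an entry supported by at least one evidence source."""
    return (
        accession in has_go_term
        or accession in cazy_family
        or accession in pfam_domain
    )


# Filter each query result separately, then merge the surviving accessions.
validated = {
    name: {acc for acc in accessions if passes_validation(acc)}
    for name, accessions in query_results.items()
}
merged = sorted(set().union(*validated.values()))

# Prefer the CAZy family label and fall back to the Pfam domain label.
families = {acc: cazy_family.get(acc) or pfam_domain.get(acc) for acc in merged}

print(merged)
print(families)
```

Taking the union of the validated sets removes duplicates across queries, which is why the merged list can be no longer than the sum of the individual lists and may coincide with the largest one.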


Results

The workflow design for data collection and integration in GlyGen knowledgebase reflects the implementation of semantic web standards and technologies.

Data Collection and Data Processing

From the files received from the resources, 74 datasets were created. Of these, 32 datasets were categorized under the protein centric data category, of which 19 belonged to the human species and twelve to the mouse species. A further 22 datasets were categorized under the proteoform centric data category, of which 19 belonged to the human species and three to the mouse species. The remaining 20 datasets were categorized under the glycan centric data category, with 10 datasets each for mouse and human. The list of datasets is shown in Table 3.

Dataset Readme

The BioCompute Object³⁴ framework was developed for standardizing and harmonizing high-throughput sequencing (HTS) computations and data formats. The BioCompute Object framework ensures interoperability, streamlined communication, and simplification of bioinformatics protocols. Similarly, the dataset readme serves the purpose of a readme while also capturing meticulous details of the dataset. The details of the dataset are distributed across five domains in the readme. The identification and provenance domain provides the history of the dataset. The usability domain provides a specific description of the dataset based on the use case. The description domain provides brief information on the content of the dataset, keywords, and the steps to create the dataset. The execution domain provides a link to the script that can be executed automatically to create the dataset, as well as the software requirements for running the script. Currently, automatic creation of datasets is in a developmental phase but will be functional in the near future. The input and output domain has an input sub-domain that provides information about the input files and input file types required to create the dataset, along with the location of the input files for retrieval. The output sub-domain provides information on the file and file type created from the input files by following the steps listed in the description domain. It also provides the location of the output file, a description of the file column headers, and the unique value statistics computed by the Python script during dataset processing. An example BCO is shown in Figure 7, and the readme for human glycosidases is shown in Figure 8.
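The five-domain layout of a dataset readme can be pictured as a nested record. The sketch below is a hypothetical example: the key names, the object ID, the file names, and the `missing_domains` helper are assumptions for illustration and do not reproduce the BioCompute Object schema or an actual GlyGen readme.

```python
import json

# Hypothetical readme record; keys and values are illustrative only.
readme = {
    "object_id": "EXAMPLE_000001",
    "identification_and_provenance": {
        "name": "example dataset",
        "version": "1.0",
        "created": "2018-01-01",
    },
    "usability": "Describes which use case the dataset supports.",
    "description": {
        "keywords": ["glycosidase", "human"],
        "steps": ["collect", "convert to CSV", "map identifiers"],
    },
    "execution": {"script": "path/to/script.py", "software": ["python"]},
    "input_and_output": {
        "input": [{"file": "input.tsv", "type": "TSV"}],
        "output": [
            {"file": "dataset.csv", "type": "CSV", "unique_values": {"accession": 2}}
        ],
    },
}

REQUIRED_DOMAINS = (
    "identification_and_provenance",
    "usability",
    "description",
    "execution",
    "input_and_output",
)


def missing_domains(record):
    """List the required domains that are absent from a readme record."""
    return [domain for domain in REQUIRED_DOMAINS if domain not in record]


print(json.dumps(readme, indent=2))
print(missing_domains(readme))
```

A completeness check of this kind is one simple way to confirm that every dataset readme carries all five domains before it is published.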

Dataset collection viewer

The datasets created from the collected data are displayed on the dataset viewer page. The page allows users to search and filter datasets by species, data category, and file format. It also allows users to download a dataset, preview it, read its readme, and comment on the readme. The dataset viewer page points to an FAQ page that contains questions on data, dataset categories, the GlyGen license, and the readme. The GlyGen dataset collection viewer page is shown in Figure 3.
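The search and filter behavior of the viewer can be illustrated with a small faceted filter. This is a sketch under stated assumptions and not the code behind the GlyGen page: the catalog entries, the field names, and the `filter_datasets` function are hypothetical.

```python
# Hypothetical catalog entries; names do not correspond to actual GlyGen datasets.
CATALOG = [
    {"name": "human_protein_example", "species": "human", "category": "protein", "file_format": "csv"},
    {"name": "mouse_protein_example", "species": "mouse", "category": "protein", "file_format": "csv"},
    {"name": "human_glycan_example", "species": "human", "category": "glycan", "file_format": "csv"},
]


def filter_datasets(datasets, text="", species=None, category=None, file_format=None):
    """Return datasets whose name contains `text` and that match every given facet."""
    needle = text.lower()
    return [
        entry
        for entry in datasets
        if needle in entry["name"].lower()
        and (species is None or entry["species"] == species)
        and (category is None or entry["category"] == category)
        and (file_format is None or entry["file_format"] == file_format)
    ]


human_protein = filter_datasets(CATALOG, species="human", category="protein")
print([entry["name"] for entry in human_protein])
```

Leaving a facet as `None` means that it does not restrict the result, so free-text search and any combination of filters can be applied together.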

Dataset Network Visualization

With the help of Cytoscape, three networks were created, one for the datasets in each of the three data categories. The networks helped in understanding the relationships and linking of the data. The networks, uploaded in JSON format, provide an intuitive experience. The networks of the three data categories are shown in Figure 4 (protein centric), Figure 5 (proteoform centric), and Figure 6 (glycan centric).

RDFization of Data, Storage, and Access

The GlyGen triplestore was developed so that users can retrieve data through web service APIs and by querying the GlyGen SPARQL endpoint.
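Querying a SPARQL endpoint over HTTP typically means sending the query text as the `query` parameter of a request, as described in the SPARQL 1.1 Protocol. The sketch below only builds such a request URL and does not contact any server. The endpoint address, the `ex:` prefix, and the `gene_name` predicate are placeholders; the actual GlyGen endpoint and vocabulary may differ.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint URL and vocabulary, used only for illustration.
ENDPOINT = "https://example.org/sparql"

QUERY = """
PREFIX ex: <http://example.org/glygen/>
SELECT ?protein ?gene
WHERE { ?protein ex:gene_name ?gene }
LIMIT 10
"""


def build_request_url(endpoint, query):
    """Encode a SPARQL query as the `query` parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_request_url(ENDPOINT, QUERY)

# Round-trip check: decoding the URL recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
print(decoded == QUERY.strip())
```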

Licenses for GlyGen data and source code

The Creative Commons Attribution 4.0 International (CC BY 4.0) license was chosen for GlyGen data, and the GNU General Public License v3 was chosen for GlyGen source code. The CC BY 4.0 license allows users to copy, share, distribute, and remix data. It also allows commercial use of data, but it requires proper attribution, i.e., giving proper credit to the data source, providing a link to the license, and indicating whether any changes were made to the original data. Similarly, the GNU General Public License v3 allows users to share, change, and use the source code for any purpose. Once the source code is stable, GlyGen will make it publicly available, thereby creating an environment that allows the free and open use of knowledge and resources.

Human glycoside hydrolase list

The query “keyword:Glycosidase [KW-0326]” AND reviewed:yes AND organism: “Homo sapiens (Human) [9606]” retrieved a total of 76 entries. In the manual validation step, 11 entries had neither the GO term GO:0016798, nor membership in any GH family in CAZy, nor GH family domains in Pfam; these accessions were removed, leaving 65 GH entries. The query “glycosyl hydrolase” AND reviewed:yes AND organism: “Homo sapiens (Human) [9606]” retrieved 84 entries, of which only one failed the manual validation criteria, leaving 83 GH entries. Similarly, the query “glycoside hydrolase” AND reviewed:yes AND organism: “Homo sapiens (Human) [9606]” retrieved 82 entries, of which four failed the manual validation criteria and were filtered out, leaving 78 GH accessions. The filtered accessions from the three queries were consolidated to obtain a comprehensive list of GHs. After consolidation, 83 GH entries were obtained, identical to the 83 GH entries from the “glycosyl hydrolase” query. This might not be the best way to retrieve a list of human GHs, and we are working on refining our workflow to provide a precompiled and validated list of GHs to GlyGen users. The list of human GHs is provided in Table 2.
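The validation and consolidation logic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure actually used: the validation was performed manually, the record field names (`go_terms`, `cazy_gh_family`, `pfam_gh_domains`) are hypothetical, and the accessions and annotation values are toy placeholders rather than real query results.

```python
GH_GO_TERM = "GO:0016798"  # GO term checked in the validation step


def passes_validation(entry):
    """Keep an entry if at least one of the three lines of evidence is present."""
    return (
        GH_GO_TERM in entry["go_terms"]
        or bool(entry["cazy_gh_family"])
        or bool(entry["pfam_gh_domains"])
    )


def filter_accessions(entries):
    """Return the accessions of entries that pass validation, in input order."""
    return [e["accession"] for e in entries if passes_validation(e)]


def consolidate(*accession_lists):
    """Union several accession lists, dropping duplicates, keeping first-seen order."""
    merged = {}
    for accessions in accession_lists:
        for acc in accessions:
            merged.setdefault(acc, None)
    return list(merged)


def record(acc, go=(), cazy="", pfam=()):
    """Build a toy record; all field names here are hypothetical."""
    return {
        "accession": acc,
        "go_terms": set(go),
        "cazy_gh_family": cazy,
        "pfam_gh_domains": list(pfam),
    }


# Toy stand-ins for the three query results (placeholder accessions).
keyword_query = [record("ACC001", go=[GH_GO_TERM]), record("ACC002")]
glycosyl_query = [record("ACC001", cazy="GH13"), record("ACC003", pfam=["PF_EXAMPLE"])]
glycoside_query = [record("ACC003", cazy="GH13")]

filtered = [filter_accessions(q) for q in (keyword_query, glycosyl_query, glycoside_query)]
consolidated = consolidate(*filtered)

# Check whether the consolidated list equals one query's filtered list.
matches_glycosyl = set(consolidated) == set(filtered[1])
print(consolidated, matches_glycosyl)
```

Comparing sets rather than counts matters here: two lists of 83 accessions are only identical if they contain the same members, which is what the final comparison checks.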

Discussion

Data integration is not a problem that can be solved in a fortnight, but if efforts to implement the new data integration infrastructure technologies are concentrated and widespread, we might see their effect in the biomedical field. Connecting knowledgebases and cross-linking data will further facilitate data curation and the transformation of data into knowledge. Through the GlyGen knowledgebase, we are implementing semantic web standards and technologies to collect and integrate glycobiology data from multidisciplinary resources. The integrated glycobiology data will be exchanged with NCBI and EMBL-EBI in supported formats, which will make data and knowledge more comprehensively accessible and add value to the existing data to improve knowledge discovery in basic and translational glycobiology. GlyGen’s robust data model is designed to reflect the needs of the glycobiology and biomedical communities. The project will also include evidence tagging of data, development of new standards and ontologies, and data mining through the interface, which will ultimately enable data sharing, dissemination, and exchange. At this initial stage, during my project, GlyGen is focusing on collecting and integrating data effectively and comprehensively to become a more exhaustive and extensive glycobiology knowledgebase. Once GlyGen is fully developed and functional, it will be open and available to the public at www.glygen.org.

Figures

Figure 1. Data collection and integration workflow for GlyGen

Step 1. Collection of data from various sources and data generators. Step 2. Conversion of collected data files into CSV format. Step 3. Filtering and mapping of identifiers to either UniProtKB accessions or GlyTouCan accessions. Step 4. Counting unique values in the CSV columns of each dataset. Step 5. RDFization of data from CSV files and storage of GlyGen RDF data in the GlyGen triplestore. Step 6. Providing API and SPARQL query access to the GlyGen triplestore.
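Steps 4 and 5 of this workflow can be illustrated with a short Python sketch that counts unique values per CSV column and then emits one N-Triples statement per non-key cell. The column names, the base URI `https://example.org/`, and the use of column names as predicates are assumptions made for illustration; the actual GlyGen RDF model and tooling are not specified here.

```python
import csv
import io

# Toy dataset; the column names are hypothetical.
SAMPLE_CSV = """uniprotkb_acc,gene_name
P06280,GLA
P04745,AMY1A
P04745,AMY1B
"""


def unique_counts(csv_text):
    """Step 4: number of distinct values in each CSV column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = rows[0].keys() if rows else []
    return {col: len({row[col] for row in rows}) for col in columns}


def to_ntriples(csv_text, key_column, base="https://example.org/"):
    """Step 5: one triple per non-key cell, with the key column as subject.

    Literal values are not escaped, so this only suits simple strings.
    """
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{base}{row[key_column]}>"
        for col, value in row.items():
            if col != key_column:
                triples.append(f'{subject} <{base}{col}> "{value}" .')
    return triples


print(unique_counts(SAMPLE_CSV))
for triple in to_ntriples(SAMPLE_CSV, "uniprotkb_acc"):
    print(triple)
```

The unique-value counts give a quick sanity check on each dataset (for example, how many distinct accessions it covers) before the rows are converted to triples and loaded into a triplestore.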

Figure 2: GlyGen high availability cluster server configuration

The high availability cluster server configuration for GlyGen, protected by a firewall. Two servers are compute nodes and one is an NFS server; they are connected by a network switch. Two virtual servers run on compute node 1, whereas one virtual server runs on compute node 2.

Figure 3: GlyGen dataset collection viewer

The GlyGen dataset collection viewer page showing the datasets in three dataset categories. The page also shows filter and search options along with links to network schema and FAQs.

Figure 4: Protein centric network

Protein centric view, created using Cytoscape, of the datasets in the protein centric data category. The network shows the column headers of these datasets linked to the primary key, which is the UniProtKB canonical accession.

Figure 5: Proteoform centric network

Proteoform centric view, created using Cytoscape, of the datasets in the proteoform centric data category. The network shows the column headers of these datasets linked to the primary key, which is the UniProtKB canonical accession.

Figure 6: Glycan centric network

Glycan centric view, created using Cytoscape, of the datasets in the glycan centric data category. The network shows the column headers of these datasets linked to the primary key, which is the GlyTouCan accession.

Figure 7: BioCompute Object example representation

Compressed version of an example BioCompute Object (BCO) represented as JSON text with the domains highlighted. This example BCO describes a bioinformatics pipeline for the detection of EGFR [hgnc:3236] gene mutations in human [taxonomy:9606] non-small cell lung carcinoma [doid:3908] patients. The dataset readme for GlyGen is similar to a BCO and is based on the BCO specification.

Figure 8: Detailed dataset readme for human glycoside hydrolases

Tables

Table 1. Information on licenses that cover the biomedical resources

| Resource Name | License information | Web link to license/policy in the resource | Date Accessed |
|---|---|---|---|
| UniProtKB | CC BY-ND 3.0¹ | https://www.uniprot.org/help/license | 05/06/2018 |
| UniCarbKB | CC BY-NC-ND 3.0² | http://www.unicarbkb.org/about | 05/06/2018 |
| GlyTouCan | CC BY 4.0³ | https://glytoucan.org/ | 05/06/2018 |
| PRO | CC BY 4.0 | | 05/06/2018 |
| RefSeq | NA⁴ | https://www.ncbi.nlm.nih.gov/home/about/policies/ | 05/06/2018 |
| PubChem | NA | https://www.ncbi.nlm.nih.gov/home/about/policies/ | 05/06/2018 |
| PDB | NA | https://www.rcsb.org/pdb/static.do?p=general_information/about_pdb/policies_references.html | 05/06/2018 |

¹ CC BY-ND 3.0: Creative Commons Attribution-NoDerivs 3.0 Unported

² CC BY-NC-ND 3.0: Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported

³ CC BY 4.0: Creative Commons Attribution 4.0 International

⁴ NA: Not applicable

Table 2. List of human glycosidases

| No | UniProtKB accession | Protein names | GH Family (CAZy) | Gene name (primary) |
|---|---|---|---|---|
| 1 | Q04446 | 1,4-alpha-glucan-branching (EC 2.4.1.18) (Brancher enzyme) (-branching enzyme) | GH13 | GBE1 |
| 2 | P08195 | 4F2 cell-surface antigen heavy chain (4F2hc) (4F2 heavy chain antigen) (Lymphocyte activation antigen 4F2 subunit) (Solute carrier family 3 member 2) (CD antigen CD98) | GH13 | SLC3A2 |
| 3 | Q9BZP6 | Acidic mammalian chitinase (AMCase) (EC 3.2.1.14) (Lung-specific protein TSA1902) | GH18 | CHIA |
| 4 | P04745 | Alpha-amylase 1 (EC 3.2.1.1) (1,4-alpha-D-glucan glucanohydrolase 1) (Salivary alpha-amylase) | GH13 | AMY1A; AMY1B; AMY1C |
| 5 | P19961 | Alpha-amylase 2B (EC 3.2.1.1) (1,4-alpha-D-glucan glucanohydrolase 2B) (Carcinoid alpha-amylase) | GH13 | AMY2B |
| 6 | P06280 | Alpha-galactosidase A (EC 3.2.1.22) (Alpha-D-galactosidase A) (Alpha-D-galactoside galactohydrolase) (Melibiase) (Agalsidase) | GH27 | GLA |
| 7 | P35475 | Alpha-L- (EC 3.2.1.76) | GH39 | IDUA |
| 8 | Q16706 | Alpha- 2 (EC 3.2.1.114) (Golgi alpha-mannosidase II) (AMan II) (Man II) (Mannosidase alpha class 2A member 1) (Mannosyl-oligosaccharide 1,3-1,6-alpha-mannosidase) | GH38 | MAN2A1 |
| 9 | Q9NTJ4 | Alpha-mannosidase 2C1 (EC 3.2.1.24) (Alpha mannosidase 6A8B) (Alpha-D-mannoside mannohydrolase) (Mannosidase alpha class 2C member 1) | GH38 | MAN2C1 |
| 10 | P49641 | Alpha-mannosidase 2x (EC 3.2.1.114) (Alpha-mannosidase IIx) (Man IIx) (Mannosidase alpha class 2A member 2) (Mannosyl-oligosaccharide 1,3-1,6-alpha-mannosidase) | GH38 | MAN2A2 |
| 11 | P17050 | Alpha-N-acetylgalactosaminidase (EC 3.2.1.49) (Alpha-galactosidase B) | GH27 | NAGA |
| 12 | P54802 | Alpha-N-acetylglucosaminidase (EC 3.2.1.50) (N-acetyl-alpha-glucosaminidase) (NAG) [Cleaved into: Alpha-N-acetylglucosaminidase 82 kDa form; Alpha-N-acetylglucosaminidase 77 kDa form] | GH89 | NAGLU |
| 13 | Q68DE3 | Basic helix-loop-helix domain-containing protein USF3 (Upstream transcription factor 3) | GH47 | USF3 |
| 14 | P16278 | Beta-galactosidase (EC 3.2.1.23) (Acid beta-galactosidase) (Lactase) (Elastin receptor 1) | GH35 | GLB1 |
| 15 | Q6UWU2 | Beta-galactosidase-1-like protein (EC 3.2.1.-) | GH35 | GLB1L |
| 16 | Q8IW92 | Beta-galactosidase-1-like protein 2 (EC 3.2.1.-) | GH35 | GLB1L2 |
| 17 | Q8NCI6 | Beta-galactosidase-1-like protein 3 (EC 3.2.1.-) | GH35 | GLB1L3 |
| 18 | P08236 | Beta-glucuronidase (EC 3.2.1.31) (Beta-G1) | GH2 | GUSB |
| 19 | P06865 | Beta- subunit alpha (EC 3.2.1.52) (Beta-N-acetylhexosaminidase subunit alpha) (Hexosaminidase subunit A) (N-acetyl-beta-glucosaminidase subunit alpha) | GH20 | HEXA |
| 20 | P07686 | Beta-hexosaminidase subunit beta (EC 3.2.1.52) (Beta-N-acetylhexosaminidase subunit beta) (Hexosaminidase subunit B) (Cervical cancer proto-oncogene 7 protein) (HCC-7) (N-acetyl-beta-glucosaminidase subunit beta) [Cleaved into: Beta-hexosaminidase subunit beta chain B; Beta-hexosaminidase subunit beta chain A] | GH20 | HEXB |
| 21 | Q86Z14 | Beta- (BKL) (BetaKlotho) (Klotho beta-like protein) | GH1 | KLB |
| 22 | O00462 | Beta-mannosidase (EC 3.2.1.25) (Lysosomal beta A mannosidase) (Mannanase) (Mannase) | GH2 | MANBA |
| 23 | P36222 | Chitinase-3-like protein 1 (39 kDa synovial protein) (Cartilage glycoprotein 39) (CGP-39) (GP-39) (hCGP-39) (YKL-40) | GH18 | CHI3L1 |
| 24 | Q15782 | Chitinase-3-like protein 2 (Chondrocyte protein 39) (YKL-39) | GH18 | CHI3L2 |
| 25 | Q13231 | Chitotriosidase-1 (EC 3.2.1.14) (Chitinase-1) | GH18 | CHIT1 |
| 26 | Q9H227 | Cytosolic beta-glucosidase (EC 3.2.1.21) (Cytosolic beta-glucosidase-like protein 1) | GH1 | GBA3 |
| 27 | Q8NFI3 | Cytosolic endo-beta-N-acetylglucosaminidase (ENGase) (EC 3.2.1.96) | GH85 | ENGASE |
| 28 | Q01459 | Di-N-acetylchitobiase (EC 3.2.1.-) | GH18 | CTBS |
| 29 | Q9UKM7 | mannosyl-oligosaccharide 1,2-alpha-mannosidase (EC 3.2.1.113) (ER alpha-1,2-mannosidase) (ER mannosidase 1) (ERMan1) (Man9GlcNAc2-specific-processing alpha-mannosidase) (Mannosidase alpha class 1B member 1) | GH47 | MAN1B1 |
| 30 | Q9Y2E5 | Epididymis-specific alpha-mannosidase (EC 3.2.1.24) (Mannosidase alpha class 2B member 2) | GH38 | MAN2B2 |
| 31 | Q92611 | ER degradation-enhancing alpha-mannosidase-like protein 1 | GH47 | EDEM1 |
| 32 | Q9BV94 | ER degradation-enhancing alpha-mannosidase-like protein 2 | GH47 | EDEM2 |
| 33 | Q9BZQ6 | ER degradation-enhancing alpha-mannosidase-like protein 3 (EC 3.2.1.113) (Alpha-1,2-mannosidase EDEM3) | GH47 | EDEM3 |
| 34 | P54803 | Galactocerebrosidase (GALCERase) (EC 3.2.1.46) (Galactocerebroside beta-galactosidase) () ( beta-galactosidase) | GH59 | GALC |
| 35 | P04062 | (EC 3.2.1.45) (Acid beta-glucosidase) (Alglucerase) (Beta-) (Beta-GC) (D-glucosyl-N-acylsphingosine glucohydrolase) (Imiglucerase) | GH30 | GBA |
| 36 | P35573 | Glycogen debranching enzyme (Glycogen debrancher) [Includes: 4-alpha-glucanotransferase (EC 2.4.1.25) (Oligo-1,4-1,4-glucantransferase); Amylo-alpha-1,6-glucosidase (Amylo-1,6-glucosidase) (EC 3.2.1.33) (Dextrin 6-alpha-D-glucosidase)] | GH133 | AGL |
| 37 | Q5SRI9 | Glycoprotein endo-alpha-1,2-mannosidase (Endo-alpha mannosidase) (Endomannosidase) (hEndo) (EC 3.2.1.130) (Mandaselin) | GH99 | MANEA |
| 38 | Q5VSG8 | Glycoprotein endo-alpha-1,2-mannosidase-like protein (EC 3.2.1.-) | GH99 | MANEAL |
| 39 | Q9Y251 | (EC 3.2.1.166) (Endo-glucoronidase) (Heparanase-1) (Hpa1) [Cleaved into: Heparanase 8 kDa subunit; Heparanase 50 kDa subunit] | GH79 | HPSE |
| 40 | Q8WVB3 | Hexosaminidase D (EC 3.2.1.52) (Beta-N-acetylhexosaminidase) (Beta-hexosaminidase D) (Hexosaminidase domain-containing protein) (N-acetyl-beta-galactosaminidase) | GH20 | HEXDC |
| 41 | P38567 | Hyaluronidase PH-20 (Hyal-PH20) (EC 3.2.1.35) (Hyaluronoglucosaminidase PH-20) (Sperm adhesion molecule 1) (Sperm surface protein PH-20) | GH56 | SPAM1 |
| 42 | Q12794 | Hyaluronidase-1 (Hyal-1) (EC 3.2.1.35) (Hyaluronoglucosaminidase-1) (Lung carcinoma protein 1) (LuCa-1) | GH56 | HYAL1 |
| 43 | Q12891 | Hyaluronidase-2 (Hyal-2) (EC 3.2.1.35) (Hyaluronoglucosaminidase-2) (Lung carcinoma protein 2) (LuCa-2) | GH56 | HYAL2 |
| 44 | O43820 | Hyaluronidase-3 (Hyal-3) (EC 3.2.1.35) (Hyaluronoglucosaminidase-3) (Lung carcinoma protein 3) (LuCa-3) | GH56 | HYAL3 |
| 45 | Q2M3T9 | Hyaluronidase-4 (Hyal-4) (EC 3.2.1.35) (Chondroitin sulfate endo-beta-N-acetylgalactosaminidase) (Chondroitin sulfate hydrolase) (CSHY) (Hyaluronoglucosaminidase-4) | GH56 | HYAL4 |
| 46 | Q8WWQ2 | Inactive heparanase-2 (Hpa2) | | HPSE2 |
| 47 | Q9UEF7 | Klotho (EC 3.2.1.31) [Cleaved into: Klotho peptide] | GH1 | KL |
| 48 | Q6UWM7 | Lactase-like protein (Klotho/lactase-phlorizin hydrolase-related protein) | GH1 | LCTL |
| 49 | P09848 | Lactase-phlorizin hydrolase (Lactase- ) [Includes: Lactase (EC 3.2.1.108); Phlorizin hydrolase (EC 3.2.1.62)] | GH1 | LCT |
| 50 | P10253 | Lysosomal alpha-glucosidase (EC 3.2.1.20) (Acid maltase) (Aglucosidase alfa) [Cleaved into: 76 kDa lysosomal alpha-glucosidase; 70 kDa lysosomal alpha-glucosidase] | GH31 | GAA |
| 51 | O00754 | Lysosomal alpha-mannosidase (Laman) (EC 3.2.1.24) (Lysosomal acid alpha-mannosidase) (Mannosidase alpha class 2B member 1) (Mannosidase alpha-B) [Cleaved into: Lysosomal alpha-mannosidase A peptide; Lysosomal alpha-mannosidase B peptide; Lysosomal alpha-mannosidase C peptide; Lysosomal alpha-mannosidase D peptide; Lysosomal alpha-mannosidase E peptide] | GH38 | MAN2B1 |
| 52 | P61626 | Lysozyme C (EC 3.2.1.17) (1,4-beta-N-acetylmuramidase C) | GH22 | LYZ |
| 53 | Q8N1E2 | Lysozyme g-like protein 1 (EC 3.2.1.-) | GH23 | LYG1 |
| 54 | Q86SG7 | Lysozyme g-like protein 2 (EC 3.2.1.-) | GH23 | LYG2 |
| 55 | Q6UWQ5 | Lysozyme-like protein 1 (EC 3.2.1.17) | GH22 | LYZL1 |
| 56 | Q7Z4W2 | Lysozyme-like protein 2 (Lysozyme-2) (EC 3.2.1.17) | GH22 | LYZL2 |
| 57 | Q96KX0 | Lysozyme-like protein 4 (Lysozyme-4) | GH22 | LYZL4 |
| 58 | O75951 | Lysozyme-like protein 6 (EC 3.2.1.17) | GH22 | LYZL6 |
| 59 | O43451 | Maltase-glucoamylase, intestinal [Includes: Maltase (EC 3.2.1.20) (Alpha-glucosidase); Glucoamylase (EC 3.2.1.3) (Glucan 1,4-alpha-glucosidase)] | GH31 | MGAM |
| 60 | P33908 | Mannosyl-oligosaccharide 1,2-alpha-mannosidase IA (EC 3.2.1.113) (Man(9)-alpha-mannosidase) (Man9-mannosidase) (Mannosidase alpha class 1A member 1) (Processing alpha-1,2-mannosidase IA) (Alpha-1,2-mannosidase IA) | GH4 | MAN1A1 |
| 61 | O60476 | Mannosyl-oligosaccharide 1,2-alpha-mannosidase IB (EC 3.2.1.113) (Mannosidase alpha class 1A member 2) (Processing alpha-1,2-mannosidase IB) (Alpha-1,2-mannosidase IB) | GH47 | MAN1A2 |
| 62 | Q9NR34 | Mannosyl-oligosaccharide 1,2-alpha-mannosidase IC (EC 3.2.1.113) (HMIC) (Mannosidase alpha class 1C member 1) (Processing alpha-1,2-mannosidase IC) (Alpha-1,2-mannosidase IC) | GH47 | MAN1C1 |
| 63 | Q13724 | Mannosyl-oligosaccharide glucosidase (EC 3.2.1.106) (Processing A-glucosidase I) | GH63 | MOGS |
| 64 | Q6NSJ0 | Myogenesis-regulating glycosidase (EC 3.2.1.-) (Uncharacterized family 31 glucosidase KIAA1161) | GH31 | MYORG |
| 65 | Q14697 | Neutral alpha-glucosidase AB (EC 3.2.1.84) (Alpha-glucosidase 2) (Glucosidase II subunit alpha) | GH31 | GANAB |
| 66 | Q8TET4 | Neutral alpha-glucosidase C (EC 3.2.1.20) | GH31 | GANC |
| 67 | Q07837 | Neutral and basic amino acid transport protein rBAT (NBAT) (D2h) (Solute carrier family 3 member 1) (b(0,+)-type amino acid transport protein) | GH13 | SLC3A1 |
| 68 | Q9HCG7 | Non-lysosomal glucosylceramidase (NLGase) (EC 3.2.1.45) (Beta-glucocerebrosidase 2) (Beta-glucosidase 2) (Glucosylceramidase 2) | GH116 | GBA2 |
| 69 | Q12889 | Oviduct-specific glycoprotein (Estrogen-dependent oviduct protein) (Mucin-9) (Oviductal glycoprotein) (Oviductin) | GH18 | OVGP1 |
| 70 | P04746 | Pancreatic alpha-amylase (PA) (EC 3.2.1.1) (1,4-alpha-D-glucan glucanohydrolase) | GH13 | AMY2A |
| 71 | Q9BTY2 | Plasma alpha-L- (EC 3.2.1.51) (Alpha-L-fucoside fucohydrolase 2) (Alpha-L-fucosidase 2) | GH29 | FUCA2 |
| 72 | Q2M2H8 | Probable maltase-glucoamylase 2 (Maltase-glucoamylase (alpha-glucosidase) pseudogene) [Includes: Glucoamylase (EC 3.2.1.3) (Glucan 1,4-alpha-glucosidase)] | GH31 | MGAM2 |
| 73 | O60502 | Protein O-GlcNAcase (OGA) (EC 3.2.1.169) (Beta-N-acetylglucosaminidase) (Beta-N-acetylhexosaminidase) (Beta-hexosaminidase) (Meningioma-expressed antigen 5) (N-acetyl-beta-D-glucosaminidase) (N-acetyl-beta-glucosaminidase) (Nuclear cytoplasmic O-GlcNAcase and acetyltransferase) (NCOAT) | GH84 | MGEA5 |
| 74 | Q32M88 | Protein-glucosylgalactosylhydroxylysine glucosidase (EC 3.2.1.107) (Acid -like protein 1) | GH65 | PGGHG |
| 75 | Q99519 | Sialidase-1 (EC 3.2.1.18) (Acetylneuraminyl hydrolase) (G9 sialidase) (Lysosomal sialidase) (N-acetyl-alpha-neuraminidase 1) | GH33 | NEU1 |
| 76 | Q9Y3R4 | Sialidase-2 (EC 3.2.1.18) (Cytosolic sialidase) (N-acetyl-alpha-neuraminidase 2) | GH33 | NEU2 |
| 77 | Q9UQ49 | Sialidase-3 (EC 3.2.1.18) ( sialidase) (Membrane sialidase) (N-acetyl-alpha-neuraminidase 3) | GH33 | NEU3 |
| 78 | Q8WWR8 | Sialidase-4 (EC 3.2.1.18) (N-acetyl-alpha-neuraminidase 4) | GH33 | NEU4 |
| 79 | Q8IXA5 | Sperm acrosome membrane-associated protein 3 (Cancer/testis antigen 54) (CT54) (Lysozyme-like acrosomal sperm-specific secretory protein ALLP-17) (Lysozyme-like protein 3) (Sperm lysozyme-like protein 1) (Sperm protein reactive with antisperm antibodies) (Sperm protein reactive with ASA) [Cleaved into: Sperm acrosome membrane-associated protein 3, membrane form; Sperm acrosome membrane-associated protein 3, processed form] | GH22 | SPACA3 |
| 80 | Q96QH8 | Sperm acrosome-associated protein 5 (EC 3.2.1.17) (Lysozyme-like protein 5) (Sperm-specific lysozyme-like protein X) (SLLP-X) | GH22 | SPACA5; SPACA5B |
| 81 | P14410 | Sucrase-isomaltase, intestinal [Cleaved into: Sucrase (EC 3.2.1.48); Isomaltase (EC 3.2.1.10)] | GH31 | SI |
| 82 | P04066 | Tissue alpha-L-fucosidase (EC 3.2.1.51) (Alpha-L-fucosidase I) (Alpha-L-fucoside fucohydrolase 1) (Alpha-L-fucosidase 1) | GH29 | FUCA1 |
| 83 | O43280 | Trehalase (EC 3.2.1.28) (Alpha,alpha-trehalase) (Alpha,alpha- glucohydrolase) | GH37 | TREH |

Table 3. List of datasets in GlyGen

| No | Title of Dataset | Short description of dataset | Species | Dataset category |
|---|---|---|---|---|
| 1 | Human Proteome Accessions | 21,538 canonical isoform accessions mapped to 72,299 isoforms. | Human | Protein centric |
| 2 | Human Proteome Canonical Sequences | 21,538 canonical protein sequences. | Human | Protein centric |
| 3 | Protein Annotations (UniProtKB) | Protein annotations rdf file | Human | Protein centric |
| 4 | Accessions | 277 UniProtKB accessions for glycosyltransferases with cross-references. | Human | Protein centric |
| 5 | Protein Centric ID Mapping | 21,538 UniProt Accessions mapped to RefSeq, PDB, BioMuta, STRING, ChEMBL... | Human | Protein centric |
| 6 | Structure Accessions | 6,235 UniProt Accessions mapped to PMID and PDB ID's with best resolution. | Human | Protein centric |
| 7 | PubChem ID Mapping | 123 UniProt Accessions mapped to PDB and PubChem. | Human | Protein centric |
| 8 | Pathway Accessions | 21,538 UniProt Accessions mapped to KEGG and Reactome. | Human | Protein centric |
| 9 | Human Cancer Genes | 710 human cancer genes mapped to UniProtKB accessions | Human | Protein centric |
| 10 | RefSeq Mapping | 78,796 UniProt Accessions mapped to RefSeq | Human | Protein centric |
| 11 | Isoform Ensembl Mapping | 93,786 UniProt Accessions mapped to Ensembl peptide id | Human | Protein centric |
| 12 | Protein Annotations (RefSeq) | Protein annotations rdf file | Human | Protein centric |
| 13 | Protein Annotations (PDB) | annotations file | Human | Protein centric |
| 14 | Protein Function | 20,267 UniProtKB accessions with protein name and function. | Human | Protein centric |
| 15 | Congenital Disorders | 44 UniProtKB accessions of proteins associated with congenital disorders of glycosylation. | Human | Protein centric |
| 16 | Protein Recommended Name | 20,266 UniProtKB accessions with recommended name. | Human | Protein centric |
| 17 | Protein Alternative Name | 14,369 UniProtKB accessions with protein alternative name. | Human | Protein centric |
| 18 | Protein Information | 21,538 UniProtKB accessions with protein information such as length, mass… | Human | Protein centric |
| 19 | | 21,534 UniProtKB accessions with gene ontology terms | Human | Protein centric |
| 20 | Protein Centric ID Mapping | 25,399 UniProt Accessions mapped to RefSeq, PDB, BioMuta, STRING, ChEMBL... | Mouse | Protein centric |
| 21 | Mouse Proteome Accessions | 25,490 Canonical accessions mapped to 35,263 isoforms | Mouse | Protein centric |
| 22 | Mouse Proteome Canonical Sequences | 25,490 protein sequences. | Mouse | Protein centric |
| 23 | Protein Annotations (UniProtKB) | Protein annotations rdf file | Mouse | Protein centric |
| 24 | Protein Annotations (RefSeq) | Protein annotations xml file | Mouse | Protein centric |
| 25 | Protein Annotations (PDB) | Protein structure annotations file for PDB | Mouse | Protein centric |
| 26 | Protein Function | 18,436 UniProtKB accessions with protein name and function. | Mouse | Protein centric |
| 27 | Protein Recommended Name | 18,419 UniProtKB accessions with recommended name. | Mouse | Protein centric |
| 28 | Glycosyltransferases | 211 UniProtKB accessions for glycosyltransferases with cross-references. | Mouse | Protein centric |
| 29 | Protein Alternative Name | 11,319 UniProtKB accessions with protein alternative name. | Mouse | Protein centric |
| 30 | Protein Information | 25,490 UniProtKB accessions with protein information such as mass, length | Mouse | Protein centric |
| 31 | Gene Ontology | 21,479 UniProtKB accessions with gene ontology information. | Mouse | Protein centric |
| 32 | All Glycosylation Sites | 4,535 UniProtKB accessions with all associated glycosylation sites and evidence. | Human | Proteoform centric |
| 33 | Glycosylation Sites (UniCarbKB) | 62 UniProtKB accessions with associated glycosylation sites mapped to UniCarbKB | Human | Proteoform centric |
| 34 | Glycosylation Sites (GlyTouCan, UniCarbKB) | 51 UniProtKB accessions mapped to GlyTouCan and UniCarbKB. | Human | Proteoform centric |
| 35 | N-Linked Glycosylated Proteins in PDB | 403 UniProt Accessions mapped to PDB with glycosylated amino acid positions. | Human | Proteoform centric |
| 36 | O-Linked Glycosylated Proteins in PDB | 38 UniProt Accessions mapped to PDB with glycosylated amino acid positions. | Human | Proteoform centric |
| 37 | Loss of NLGs (germline variants) | 10,585 germline variants that results in the loss of N-linked glycosylation sequons (NLGs). | Human | Proteoform centric |
| 38 | Gain of NLGs (germline variants) | 12,665 germline variants that creates N-linked glycosylation sequons (NLGs). | Human | Proteoform centric |
| 39 | Gain of NLGs (somatic variants, only) | 7,345 somatic variants(only) that creates N-linked glycosylation sequons (NLGs). | Human | Proteoform centric |
| 40 | Gain of NLGs (somatic variants, all) | 7,345 somatic variants (all) that creates N-linked glycosylation sequons (NLGs). | Human | Proteoform centric |
| 41 | NLGs (Predicted) | 14,039 N-linked glycosylation sequons(NLGs) predicted by NetNGlyc. | Human | Proteoform centric |
| 42 | NLGs (Experimental) | 14,039 N-linked glycosylation sequons(NLGs) determined experimentally. | Human | Proteoform centric |
| 43 | Loss of NLGs (somatic variants, only) | 5197 somatic variants(only) that results in loss of N-linked glycosylation sequons (NLGs). | Human | Proteoform centric |
| 44 | Loss of NLGs (somatic variants, all) | 5,886 somatic variants(all) that results in loss of N-linked glycosylation sequons (NLGs). | Human | Proteoform centric |
| 45 | Gain of NLGs (multiple cancers, somatic variants) | 40 somatic variants that creates N-linked glycosylation sequons(NLGs) in different types of cancer. | Human | Proteoform centric |
| 46 | Loss of NLGs (multiple cancers, somatic variants) | 12 somatic variants that results in loss of N-linked glycosylation sequons (NLGs) in different types of cancer. | Human | Proteoform centric |
| 47 | Human Proteome All Isoform Sequences | 93,786 protein sequences. | Human | Proteoform centric |
| 48 | | 10,541 Multiple sequence alignment | Human | Proteoform centric |
| 49 | Loss of NLGs (ovarian cancer) | 250 somatic variants that results in loss of N-linked glycosylation sequons (NLGs). | Human | Proteoform centric |
| 50 | Gain of NLGs (ovarian cancer) | 290 somatic variants that results in loss of N-linked glycosylation sequons (NLGs). | Human | Proteoform centric |
| 51 | Mouse Proteome All Isoforms Sequences | 60,717 protein sequences. | Mouse | Proteoform centric |
| 52 | Sequence alignment | 4,814 Multiple sequence alignments | Mouse | Proteoform centric |
| 53 | All Glycosylation Sites | 3,806 UniProtKB accessions with associated glycosylation sites. | Mouse | Proteoform centric |
| 54 | Human Glycome Accessions | 5,239 GlyTouCan accessions mapped to GlycomeDB, GlycO, PubChem, UniCarbKB, Taxonomy, UniProtKB | Human | Glycan centric |
| 55 | Glycan Sequences | 5,239 Glycan sequences present in IUPAC, WURCS, GlycoCT format. | Human | Glycan centric |
| 56 | Enzyme Mapping | 2,658 GlyTouCan accessions mapped to GeneID (Enzyme) | Human | Glycan centric |
| 57 | Glycan Properties | 5,239 GlyTouCan accessions with information on GlyTouCan Type, mass, base composition and topology | Human | Glycan centric |
| 58 | Glycan Classification | 5,239 GlyTouCan accessions with glycan type and sub-type. | Human | Glycan centric |
| 59 | Glycan Images | 5,238 glycan images. | Human | Glycan centric |
| 60 | Glycan Annotations (GlyTouCan) | Glycan annotations rdf file | Human | Glycan centric |
| 61 | Glycan Annotations (PubChem - Compound) | Glycan annotations file | Human | Glycan centric |
| 62 | Glycan Annotations (PubChem - Substance) | Glycan annotations file | Human | Glycan centric |
| 63 | Glycan Motifs | 5,329 GlyTouCan accessions with motif information. | Human | Glycan centric |
| 64 | Mouse Glycome Accessions | 2,406 GlyTouCan accessions mapped to GlycomeDB, GlycO, PubChem, UniCarbKB, Taxonomy, UniProtKB. | Mouse | Glycan centric |
| 65 | Glycan Sequences | 2,406 Glycan sequences present in IUPAC, WURCS, GlycoCT format | Mouse | Glycan centric |
| 66 | Enzyme Mapping | 2,050 GlyTouCan accessions mapped to RefSeq (Enzyme). | Mouse | Glycan centric |
| 67 | Glycan Properties | 2,406 GlyTouCan accessions with information on GlyTouCan Type, mass, base composition and topology | Mouse | Glycan centric |
| 68 | Glycan Classification | 2,406 GlyTouCan accessions with glycan type and sub-type. | Mouse | Glycan centric |
| 69 | Glycan Images | 2,405 glycan images. | Mouse | Glycan centric |
| 70 | Glycan Annotations (GlyTouCan) | Glycan annotations rdf file | Mouse | Glycan centric |
| 71 | Glycan Annotations (PubChem - Compound) | Glycan annotations file | Mouse | Glycan centric |
| 72 | Glycan Annotations (PubChem - Substance) | Glycan annotations file | Mouse | Glycan centric |
| 73 | Glycan Motifs | 2,406 GlyTouCan accessions with motif information. | Mouse | Glycan centric |
| 74 | Human Mouse Orthologs | 18,484 human genes mapped to find 19,396 mouse orthologs. | Both | Protein centric |
