Databases and Bioinformatic Tools for Glycobiology and Glycoproteomics

International Journal of Molecular Sciences Review Databases and Bioinformatic Tools for Glycobiology and Glycoproteomics Xing Li, Zhijue Xu, Xiaokun Hong, Yan Zhang * and Xia Zou * Key Laboratory of Systems Biomedicine (Ministry of Education), Shanghai Center for Systems Biomedicine, Shanghai Jiao Tong University, Shanghai 200240, China; [email protected] (X.L.); [email protected] (Z.X.); [email protected] (X.H.) * Correspondence: [email protected] (Y.Z.); [email protected] (X.Z.); Tel.: +86-21-34206778 (Y.Z.); +86-21-34206966 (X.Z.) Received: 31 July 2020; Accepted: 11 September 2020; Published: 14 September 2020 Abstract: Glycosylation plays critical roles in various biological processes and is closely related to diseases. Deciphering the glycocode in diverse cells and tissues offers opportunities to develop new disease biomarkers and more effective recombinant therapeutics. In the past few decades, with the development of glycobiology, glycomics, and glycoproteomics technologies, a large amount of glycoscience data has been generated. Subsequently, a number of glycobiology databases covering glycan structure, the glycosylation sites, the protein scaffolds, and related glycogenes have been developed to store, analyze, and integrate these data. However, these databases and tools are not well known or widely used by the public, including clinicians and other researchers who are not in the field of glycobiology, but are interested in glycoproteins. In this study, the representative databases of glycan structure, glycoprotein, glycan–protein interactions, glycogenes, and the newly developed bioinformatic tools and integrated portal for glycoproteomics are reviewed. We hope this overview could assist readers in searching for information on glycoproteins of interest, and promote further clinical application of glycobiology. Keywords: glycosylation; glycoinformatics; glycan; glycoprotein; glycogene; lectin 1. Introduction Glycosylation is known as one of the most ubiquitous and important post-translational modifications (PTMs) in nature. It is estimated that more than 50% of mammal proteins and 90% of secreted and membrane proteins are glycosylated [1]. Glycosylation controls and defines a wide range of biological events, including cellular signaling, ligand–receptor interactions, cell–cell communication, pathogen–host recognition, immunological responses, and, consequently, it is involved in many human diseases, including inflammation and cancer [2]. Furthermore, the importance of glycosylation is highlighted by the fact that it can influence the therapeutic properties of many therapeutic proteins [3]. Therefore, deciphering the glycocode in diverse cells and tissues offers opportunities to develop new disease biomarkers and more effective recombinant therapeutics. However, studies of the glycome are more complicated compared with those of the genome and proteome due to the diversity and branched nature of glycans. The complexity is further increased by the fact that glycan expression on a single protein is subject to both macroheterogeneity (variable occupancy of different glycosylation sites) and microheterogeneity (variable distribution of different glycans attached to a single glycosylation site) [4]. In addition, due to the different expression of glycogenes such as glycosyltransferase and glucosidase in different cells and tissues, the glycans attached to the same protein are cell-, tissue-, organism-, and physiological state-dependent. Furthermore, the same glycan structure on different proteins may have different functions. Therefore, the interpretation of Int. J. Mol. Sci. 2020, 21, 6727; doi:10.3390/ijms21186727 www.mdpi.com/journal/ijms Int. J. Mol. Sci. 2020, 21, 6727 2 of 29 the glycosylation sites, the attached glycan structure, the protein scaffolds, and related glycogenes are crucialInt. for J. Mol. understanding Sci. 2020, 21, x FOR the PEER biological REVIEW function of glycosylation. 2 of 28 In past decades, with improvements in analytical methods including high performance separation the interpretation of the glycosylation sites, the attached glycan structure, the protein scaffolds, and techniques and mass spectrometry (MS)-based analysis techniques [5,6], the qualitative and quantitative related glycogenes are crucial for understanding the biological function of glycosylation. data of glycans, glycosites, glycopeptides, and glycoproteins have increased tremendously. In addition, In past decades, with improvements in analytical methods including high performance extensiveseparation application techniques of lectin and mass chips, spectrometry glycan microarrays, (MS)-based and analysis glycogene techniques chips [5 greatly,6], the qualitative increased the data ofand glycan–protein quantitative data interactions of glycans, and glycosites, the glycogenome glycopeptides, [7–9 ].and Consequently, glycoproteins the have development increased of databasestremendously. and informatics In addition, tools extensive to store, application retrieve, integrate, of lectin chips, and interpret glycan microarrays, these data and is one glycogene of the most activechips fields greatly in glycobiology. increased the data of glycan–protein interactions and the glycogenome [7–9]. TheConsequently, first publicly the development available glycan of databases structural and database,informatics thetools Complex to store, retrieve, Carbohydrate integrate Structural, and Databaseinterpret (CCSD)—also these data is known one of asthe the most CarbBank—was active fields in glycobiology. developed in the late 1980s [10,11]. Inspired by this work,The more first and publicly more databases available haveglycan been structural developed, database, such the as the Complex GLYCOSCIENCE.de Carbohydrate Structural database [12] Database (CCSD)—also known as the CarbBank—was developed in the late 1980s [10,11]. Inspired and the KEGG (Kyoto Encyclopedia of Genes and Genomes) GLYCAN database [13]. In the early by this work, more and more databases have been developed, such as the GLYCOSCIENCE.de 2000s, the Consortium for Functional Glycomics (CFG) project began with a comprehensive database database [12] and the KEGG (Kyoto Encyclopedia of Genes and Genomes) GLYCAN database [13]. containingIn the information early 2000s, on the glycan Consortium structures, for glycan-bindingFunctional Glycomics proteins (CFG) (GBPs), project glycosyltransferases, began with a and glycan-relatedcomprehensive genedatabase knockout containing mice information [14], while on the glycan Japan structures, Consortium glycan for-binding Glycobiology proteins and Glycotechnology(GBPs), glycosyltransferase DataBase (JCGGDB)s, andcollected glycan-related information gene onknockout mass spectrometry mice [14], while analysis, theglycogenes, Japan and theConsortium affinity constants for Glycobiology between and lectins Glycotechnology and pyridylaminated DataBase glycans (JCGGDB) [15– collected17]. In recent information years, variouson glycoinformaticmass spectrometry databases analysis, and tools glycogenes, have been and developed. the affinity Of note,constants significant between progress lectins hasand been madepyridylaminated lately through aglycans new generation [15–17]. In ofrecent centralized years, various and integrated glycoinformatic resources, databases including and tools GlyGen have [18], Glycomics@ExPASybeen developed. [ 19Of], note, and GlyCosmossignificant progress [20]. In has addition been made to these lately comprehensive through a new resources, generation a varietyof centralized and integrated resources, including GlyGen [18], Glycomics@ExPASy [19], and of databases on glycan structure, glycogenes, glycoproteins, and a series of bioinformatic tools based GlyCosmos [20]. In addition to these comprehensive resources, a variety of databases on glycan on these databases have been developed all over the world, which also promote the development structure, glycogenes, glycoproteins, and a series of bioinformatic tools based on these databases of glycobiology.have been developed However, all generally over the speaking,world, which these also databases promote andthe development tools have not of beenglycobiology. well known and widelyHowever, used generally by the speaking, public, including these databases clinicians and tools and otherhave not researchers been well known who are and not widely in the used field of glycobiology,by the public, but areincluding interested clinicians in glycoproteins. and other researchers who are not in the field of glycobiology, but Asare glycoproteins interested in glycoproteins. have become the promising candidates of disease markers and therapeutic targets, we mainlyAs focus glycoproteins on providing have anbecome overview the promising of main featurescandidates and of functionalitiesdisease markers of and the therapeutic representative databasestargets, of glycanwe mainly structure, focus on glycoproteins, providing an glycan–protein overview of main interactions, features and and functionalities glycogenes inof humansthe and mammaliansrepresentative (Figure databases1). In of addition, glycan westructure, summarized glycoproteins, some newly glycan developed–protein interactions, bioinformatic and tools glycogenes in humans and mammalians (Figure 1). In addition, we summarized some newly and integrated resources for glycoproteomics with the goal of making these bioinformatics resources developed bioinformatic tools and integrated resources for glycoproteomics with the goal of making more widely known to the public,

Databases and Bioinformatic Tools for Glycobiology and Glycoproteomics

Zebrafish Disease Models to Study the Pathogenesis of Inherited Manganese Transporter Defects and Provide A

Glycomics Goes Visual and Interactive

Unicarbkb: Building a Standardised and Scalable Informatics Platform for Glycosciences Research

The ELIXIR Core Data Resources: Fundamental Infrastructure for The

Hierarchical Classification of Gene Ontology Terms Using the Gostruct

Bioinformatics Study of Lectins: New Classification and Prediction In

Toolboxes for a Standardised and Systematic Study of Glycans

Gene Ontology Analysis with Cytoscape

To Find Information About Arabidopsis Genes Leonore Reiser1, Shabari

Use and Misuse of the Gene Ontology Annotations

Pathogenicity and Selective Constraint on Variation Near Splice Sites

Viroinformatics Investigation of B-Cell Epitope Conserved Region in SARS