Thesis
An interactive and integrative view of glycobiology
ALOCCI, Davide
Abstract
Glycosylation is one of the most abundant post-translational modifications (PTM) and the primary cause of microheterogeneity in proteins (glycoforms). It consists of the addition of a sugar moiety on the surface of a protein, and it has been suggested that over 50% of mammalian cellular proteins are typically glycosylated. Despite its importance, glycobiology is still lacking concrete bioinformatics support which, in the last two decades, has boosted other major ’omics’ like proteomics, transcriptomics and genomics. Additionally, the inherent complexity of carbohydrates and the use of many orthogonal experimental techniques to elucidate a structure have slowed down the annotation of novel glycans leaving glycobiology fragmented internally. This thesis aims at the creation of a collection of integrative, explorative and knowledge-based tools to reconstitute the puzzle of biological evidence produced by different glycomics experiments and, at the same time, fill the gap with other ’omics’.
Reference
ALOCCI, Davide. An interactive and integrative view of glycobiology. Thèse de doctorat : Univ. Genève, 2018, no. Sc. 5280
URN : urn:nbn:ch:unige-1122950 DOI : 10.13097/archive-ouverte/unige:112295
Available at: http://archive-ouverte.unige.ch/unige:112295
Disclaimer: layout of this document may differ from the published version.
1 / 1 UNIVERSITEDEGENÈVE FACULTE DES SCIENCES
Département d’informatique Dr. Frédérique Lisacek
An Interactive and Integrative view of Glycobiology
THÈSE
Présentée à la Faculté des sciences de l’Université de Genève pour obtenir le grade de Docteur ès sciences, mention bioinformatique
Par Davide Alocci de Orbetello (Italy)
Thèse N ◦ 5280
GENÈVE 2018
iii
Résumé en Français
La glycosylation est l’une des modificationes post- traductionnelles (PTM) les plus abondantes ainsi que la première cause de la microhétérogénéité des proteines (gly- coformes). Elle consiste en l’adjonction d’une molécule de sucre (glycane) à la sur- face d’une protéine, d’autre part, il a été suggéré que plus de la moitié des protéines cellulaires des mammifères sont glycosylées .
Malgré son importance, la glycobiologie reste isolée des autres domaines de la bi- ologie et les ressources bioinformatiques nécessaires à son essor sont aussi éparpil- lées, peu standardisées et rarement intégrées. Pourtant au cours des deux dernières décénnies, la bioinformatique et l’intégration de données ont joué un rôle crucial dans les autres disciplines ‘ -omiques’ comme le protéomique, transcriptomique et génomique . De plus la complexité intrinsèque des glycanes ainsi que le recours à une large panoplie de techniques expérimentales orthogonales pour élucider leurs structures, ralentissent la découverte de nouvelles glycanes et contribuent à la frag- mentiation interne de la glycobiologie.
Ce travail de recherche vise à la création d’un ensemble d’outils pour reconstituer le puzzle des différentes sources de données expérimentales produites en glycomique aussi qu’à rétablir la liaison avec les autres ressources ’ -omiques ’ . Afin de car- actériser l’objectif propre de chaque outil nous les avons classés en trois catégories: intégratif, exploratoire et basé sur la connaissance.
Les outils de nature integrative ont été conçus pour combiner des informations que l’on trouve dans différentes bases de données et ne peuvent pas être intégrées par un simple hyper-lien. Cette catégorie est dominée par deux outil principaux: GlyS3 et EpitopeXtractor. Le premier recherche des sous-structure de glycanes, le deuxième définit les épitopes de glycanes contenus dans un ou plusieurs glycanes complets sur la base d’un ensemble de référence.
Les outils de nature exploratoire permettent à l’utilisateur d’ analyser des données à travers une approche graphique et interactive. Le premier logiciel de cette catégorie est PepSweetener qui facilite l’annotation manuelle des glycopeptides. Le deuxième est Gynsight qui, grâce à une interface web dédiée, permet de visualiser et comparer iv des profils de glycanes. Enfin, nous avons aussi repris la ressource Glydin’ qui per- met de naviguer dans un réseau de glyco-épitopes pour mettre en valeur la parenté éventuelle de ligands dans des données d’interactions sucre-protéine.
La dernière categorie concerne les outils basés sur la connaissance. La plateforme centrale appelée GlyConnect est un tableau de bord intégré pour l’analyse de don- nées glycomiques et glycoprotéomiques.
En conclusion, tous les outils présentés dans cette thèse font partie de l’initiative Glycomics@ExPASy qui vise à diffuser l’utilisation de la bioinformatique en glycobi- ologie et à mettre l’accent sur la relation entre la glycobiologie et la bioinformatique à partir des protéines. v
Abstract
Glycosylation is one of the most abundant post-translational modifications (PTM) and the primary cause of microheterogeneity in proteins (glycoforms). It consists of the addition of a sugar moiety on the surface of a protein, and it has been suggested that over 50% of mammalian cellular proteins are typically glycosylated.
Despite its importance, glycobiology is still lacking concrete bioinformatics support which, in the last two decades, has boosted other major ’omics’ like proteomics, tran- scriptomics and genomics. Additionally, the inherent complexity of carbohydrates and the use of many orthogonal experimental techniques to elucidate a structure have slowed down the annotation of novel glycans leaving glycobiology fragmented internally. This thesis aims at the creation of a collection of tools to reconstitute the puzzle of biological evidence produced by different glycomics experiments and, at the same time, fill the gap with other ’omics’. To characterise the aim of each tool, we have created three different categories: integrative, explorative and knowledge- based.
Integrative tools have been designed to combine specific pieces of information which are stored in different databases and cannot be integrated by a simple connection. The category is populated by two different tools: GlyS3 and EpitopeXtractor. The first performs glycan substructure search whereas the second defines which glycan epitopes are contained in one or more glycans based on a curated collection. Ex- plorative tools enable the user to perform data analysis with a graphical and inter- active approach, speeding up the analysis process. The first tool of this category is PepSweetener which facilitates the manual annotation of glycopeptides. Then, we have Glynsight which proposes a way to visualise and compare glycan profiles with a dedicated web-interface. To conclude, we mention Glydin’, a network visu- alisation of almost 600 glycan epitopes where connections exemplify substructure relationships. The last category regards the Knowledge-based tools and is filled by GlyConnect, an integrated dashboard for glycomics and glycoproteomics analysis. All tools presented in this thesis are part of the Glycomics@ExPASy initiative which aims at spreading the use of bioinformatics in glycobiology and stressing the relation between glycobiology and protein-oriented bioinformatics. vi vii
Acknowledgements
Firstly, I would like to express my sincere gratitude to my advisor Dr Frede´rique´ Lisacek for the continuous support of my PhD study and related research, for her patience and motivation. She always supported my ideas promoting the exploration of new technologies.
Besides my advisor, I would like to thank the rest of my thesis committee: Prof. Bastien Chopard, Prof. Amos Bairoch, and Prof. Serge Perez, for their insightful comments and encouragement.
I would like to thank my fellow labmates for the stimulating discussions during coffee breaks and for all the fun we have had in the last four years.
Additionally, I would like to thank the Swiss Institute of Bioinformatics and all the partners of the Gastric Glyco Explorer that have given me the possibility to reach the end of this PhD.
In the end, I would like to thank my family and my friend for supporting me spiri- tually throughout writing this thesis and my life in general.
ix
Contents
Résumé en Français iii
Abstractv
Acknowledgements vii
Contents ix
1 Introduction1
1.1 Bioinformatics...... 4 1.2 Data Integration...... 5 1.2.1 Data discovery...... 6 1.2.2 Data exploitation...... 6 1.2.3 Data provenance...... 7 1.3 Data integration strategies...... 8 1.3.1 Service oriented architecture...... 9 1.3.2 Link integration...... 9 1.3.3 Data warehousing...... 10 1.3.4 View integration...... 11 1.3.5 Model-driven service oriented architecture...... 12 1.3.6 Integration applications...... 12 1.3.7 Workflows...... 13 1.3.8 Mashups...... 14 1.3.9 The Semantic Web and RDF...... 15 1.4 Data integration in bioinformatics...... 16 1.4.1 Standards...... 16 1.4.2 Ontologies...... 17 x
1.4.3 Formats and reporting guidelines...... 19 1.4.4 Identifiers...... 20 1.4.5 Visualisation...... 20 1.5 Glycomics...... 21 1.6 Data integration in glycomics...... 25 1.6.1 Glycan formats...... 25 1.6.2 Glycan graphical representations...... 27 1.6.3 Glycan composition format...... 28 1.6.4 Identifiers...... 28 1.6.5 Reporting Guidelines...... 29 1.6.6 Ontologies...... 30 1.6.7 Visualisation...... 31 1.7 Web development...... 31 1.7.1 Web Components...... 33 1.8 Objectives and Thesis Overview...... 34
2 Glycomics@ExPASy 37
2.1 Overview...... 37 2.2 Concluding remarks...... 51
3 Integrative tools 53
3.1 Overview...... 53 3.2 Concluding remarks...... 93
4 Explorative tools 95
4.1 Overview...... 95 4.2 Concluding remarks...... 121
5 Knowledge-based tools 123
5.1 Overview...... 123 5.2 Concluding remarks...... 162
6 Discussion 163
6.1 Achievements...... 163 xi
6.1.1 Reassemble the glycoprotein puzzle...... 163 6.1.2 Bridging glycoproteins and carbohydrate-binding proteins... 166 6.1.3 Providing tools for data analysis in glycomics...... 170 6.1.4 Standards and community feedbacks...... 173 6.2 Technical discussion...... 174 6.2.1 Service Oriented Architecture...... 174 6.2.2 Resource description framework...... 176 6.2.3 Web components...... 178
7 Conclusion 183
7.1 GlyConnect...... 183 7.1.1 Data provenance...... 183 7.1.2 Semi-automated curation pipeline...... 184 7.1.3 Cross-references...... 184 7.2 GlyS3...... 185 7.2.1 Ontology refinement...... 185 7.3 PepSweetener...... 185 7.3.1 Dataset automatic update...... 185 7.4 Glynsight...... 186 7.4.1 Multiple dataset analysis...... 186 7.5 Glydin’...... 187 7.5.1 Data enhancement and interface improvements...... 187 7.6 Final thoughts...... 187
A Other Co-author papers 189
B Technical Notes 207
Bibliography 217
1
Chapter 1
Introduction
In the last two years, humanity has created more data that in its whole history (Munevar, 2017). In particular, life sciences have seen an explosion of new data, due to the spreading of high throughput technologies. For instance, recent advances in single-cell sequencing and cellular imaging have generated a considerable amount of new high dimensional data (Patwardhan, 2017), (Leinonen et al., 2011). According to Cook et al., the European Bioinformatics Institute (EBI) has doubled the data stor- age in less than two years reaching 120 petabytes (PB) in 2017 (Cook et al., 2018). To put this number in perspective, at QCon San Francisco 2016, the Netflix data ware- house was storing contents for 60PB and in 2014 Facebook was generating 4 PB of new data per day.
Figure 1.1 shows the growth of the EBI data storage divided into different data types: sequence, mass spectrometry and microarray. Since the scale on the Y-axis is loga- rithmic, the growth of all data types remains exponential with a possibility to over- come an exabyte of data in 2022 (Cook et al., 2018).
The increasing availability and growth rate of biological and biomedical information provides an opportunity for personalised medicine. However, collecting informa- tion is not enough to understand and control the biological mechanisms underlying a living system. In the "Data integration in the era of omics: current and future chal- lenges", Gomez-Cabrero et al. describe research in life science as a two-step work- flow: 1) Identify the components that make up a living system 2) Understand the interactions between all components that make the system alive (Gomez-Cabrero et al., 2014). Collecting more and more data, at a reduced cost, is a way to catalogue the 2 Chapter 1. Introduction
FIGURE 1.1: Data growth at European Bioinformatics Institute (EMBL-EBI) divided by type: nucleotide sequence (orange), mass spectrometry (blue) and microarray (brown). Courtesy of “The Eu- ropean Bioinformatics Institute in 2017: data coordination and integration” (Cook et al., 2018). elements of life and fulfil the first step of the workflow. Nevertheless, to accomplish the second step, data sources must be integrated under mathematical and relational models which can describe the relationships among components. Beside the pro- duction of biological data which is delegated to experimental methods, all the other milestones like data collection, storage, integration and analysis are going under the umbrella of bioinformatics. Bioinformaticians are tackling a wide range of problems from data production to data analysis building methods and software applications to process data and ultimately understand biological systems.
The actual situation is depicted in the following figure (Figure 1.2) where proteomics, transcriptomics and genomics are at the core of biology and medicine whereas lipidomics, metabolomics and glycomics seldom are taken into account. Bioinformatics played an essential role in the integration of genomics, proteomics and transcriptomics (Manzoni et al., 2018). The development of an integrated approach to assembly and annotate newly sequenced genomes from transcriptome and proteome data is just one example of the importance of bioinformatics (Prasad et al., 2017). How- ever, other issues like the extent of proteoforms (Aebersold et al., 2018) or that of Chapter 1. Introduction 3
FIGURE 1.2: A prospect of the integration of ‘omics’ research in Bi- ology and Medicine. Courtesy of “Databases and Associated Tools for Glycomics and Glycoproteomics” (Lisacek et al., 2016). the metabolome still hamper data integration, leaving the puzzle incomplete. Fur- thermore, several other -omics fields such as lipidomics or glycomics are remote in this picture mainly because of technical and knowledge gaps. This is particularly the case in glycomics that remains internally fragmented and misses solid bioinfor- matics support. Only the integration of all the ’omics’ in Figure 1.2 will lead to an in-silico simulation on a living cell.
As mentioned above, the identification of proteoforms (each individual molecular form of an expressed protein) has become a topic of concern for understanding bi- ological processes further. A broad range of sources of biological variation alters the amino acid sequence that results from the conceptual translation of a nucleotide sequence. Among these, post-translational modifications are prevalent, and in par- ticular, glycosylation, which is considered as the most common, concerning putative sites (Khoury, Baliban, and Floudas, 2011). It consists of the adduction of a carbo- hydrate to selected residues. Details of this process are provided in the glycomics section. 4 Chapter 1. Introduction
This thesis focuses on the means to boost the usage of bioinformatics in glycobi- ology that encompasses glycomics and, as a side effect, integrating glycosylation data with other more advanced “omics”, making a step towards system biology as hinted in Figure 1.2. In this context, several web tools have been developed, and they will be presented in the thesis.
We first illustrate some fundamental principles of bioinformatics about data inte- gration. Then, we introduce glycomics and explain our approach for integrating glycomics data. Finally, we present the basic principles of web development which we have applied.
1.1 Bioinformatics
The application of computational techniques to organise and analyse information on biological macromolecules is defined as bioinformatics (Luscombe, Greenbaum, and Gerstein, 2001). This relatively new discipline applies the principles of information technology and computer science to biological data. Computers can indeed handle a large quantity of data and perform large-scale analysis in a limited amount of time. The aim of bioinformatics is generally two-fold. In the first place, bioinformaticians are organising the information in a way that can be retrieved quickly. Thus, they are usually connected with the creation of databases like UniProt (Bateman et al., 2017) which stores all the information about known proteins. The second aim is to generate software applications which act as support in the data analysis. One example is constituted by all the software that support genome assembly and the comparison between sequenced genomes.
In this thesis, we have touched both aims of bioinformatics developing three differ- ent class of software applications: integrative, knowledge-based, explorative tools. The first and the second class of application are directly connected with the first aim of bioinformatics whereas explorative tools are supporting data analysis which is related to the second aim. However, all class of tools are dealing with the problem of data integration and data visualisation. Thus, in the following paragraph, we 1.2. Data Integration 5 will present the problem of data integration with a centre on bioinformatics. Data visualisation will be introduced as part of the data integration process.
1.2 Data Integration
Data integration is one of the main challenges in bioinformatics. Although the term "data integration" often appears in research, especially in life science, a consolidated definition is still missing. We report here several definitions of "data integration" which will help in understanding the central concept and how it evolved.
In the beginning, the "data integration" problem was restricted to database research. Ziegler et al. describe the integration problem as the aim at "combining selected sys- tems so that they form a unified new whole and give users the illusion of interacting with one single information system" (Ziegler and Dittrich, 2004). A higher level definition comes from Leser and Naumann which defines "data integration" as "a redundancy- free representation of information from a collection of data sources with overlapping con- tents" (Leser and Naumann, 2007). More recently, Gomez-Cabrero et al. describe "data integration as the use of multiple sources of information (or data) to provide a better understanding of a system/situation/association/etc" (Gomez-Cabrero et al., 2014).
Out of these three options, the last definition is the most general and complete. The authors emphasised the use of several sources to discover new insights that are hid- den when looking at a single source. We propose to adopt this definition of "data integration" and consider the two main challenges associated with it: 1) finding rel- evant data sources (data discovery); 2) using collected data to produce new findings (data exploitation) (Weidman and Arrison, 2010). Additionally, we tackle the prob- lem of data provenance, which arises once data sources are integrated and allow to trace the origin of each piece of information. 6 Chapter 1. Introduction
1.2.1 Data discovery
The data discovery challenge entails the search of relevant data sources for describ- ing a system. Nowadays, finding data sources for a specific problem is easy. How- ever, finding a relevant source for a system of interest is becoming more and more challenging. The ease of publishing data through the Web has contributed to an explosion of online resources and publicly accessible datasets. Furthermore, the pic- ture becomes more and more fragmented as each sub-discipline provides its data representation. In biology, the problem is not only arising from aiming to com- bine different omics but also to reconcile information even within the same field. For example in glycomics, several different formats are available to describe glycan structures. We list here the main three:
• IUPAC format which is regulated by the International Union of Pure and Ap- plied Chemistry
• GlycoCT format which has been developed under the EUROCarbDB project
• WURSC format which is developed in Japan and used for the structure repos- itory GlyTouCan.
The creation of more and more heterogeneous data sources leads to what has been called "A loose federation of bio-nation" (Goble and Stevens, 2008). In this con- text, the use of standards and community guidelines, which are going to be tackled in "Data integration in bioinformatics" paragraph, is the only way to stop the data fragmentation smoothing the way towards an efficient data discovery.
1.2.2 Data exploitation
Data exploitation is the process of exploring a set of data sources to provide new in- sights. Before starting the data exploitation, scientists must have a complete overview of each dataset including the units of measure and more subtle aspects such as en- vironmental conditions, equipment calibrations, preprocessing algorithms, etc (Wei- dman and Arrison, 2010). Having an in-depth knowledge of the data is the only 1.2. Data Integration 7 way to avoid the phenomenon of "garbage-in garbage-out" that could affect the data exploitation outcome.
Once each data set has been fully clarified, researchers can focus on developing methodologies to analyse the data. Methodologies can vary according to the dif- ferent data types. In this regards, we distinguish between "similar" and "heteroge- neous" types. According to Hamid et al. we consider "similar type" when they are produced by the same underlying source, for example, they are all gene expression data sets. If multiple data sources, like gene expression and protein quantification data sets, are taken into account, we refer to data as "heterogeneous type" (Hamid et al., 2009). Although researchers are developing more and more hybrid method- ologies for data integration, a set of general techniques will be presented in the next paragraph.
The creation of data explorative tools is the last but not the least step of the data ex- ploitation process. The development of user-friendly interfaces to navigate the out- come of data exploitation can decree the success of the whole process. Although sci- entist are usually putting most of their attention on methodologies and algorithms, the design of custom visualisation can be time-consuming especially due to data heterogeneity and data volume.
1.2.3 Data provenance
With the integration of more and more data sources, understanding the derivation of each piece of information becomes an issue. This is referred to as the data prove- nance issue in the literature. As in the case of data integration, data provenance has a different definition according to the field. In the context of database systems, Bune- man et al. describe it as "the description of the origins of a piece of data and the process by which it arrived in a Database" (Buneman, Khanna, and Wang-Chiew, 2001). However, data provenance is not only related to the data but also to the processes that led to the creation of the data. For this reason, Greenwood et al. (Greenwood et al., 2003) define data provenance as part of metadata which records “the process of biological 8 Chapter 1. Introduction experiments for e-Science, the purpose and results of experiments as well as annotations and notes about experiments by scientists”.
For our purposes, we characterise data provenance as the record of the origin and all transformations that every dataset has undergone. It is quite an central ques- tion since the corresponding solutions enable users to trace back to the original data. Additionally, this concern for tracing information back provides the means to un- derstand the different workflows that have been applied to integrate each dataset. For example, if a user is interested in removing all datasets that are produced from mouse, using the data provenance metadata, he/she can exclude these datasets and use the rest of information to perform new analyses (Masseroli et al., 2014). Data provenance is also essential in the assertion of the quality of a dataset, especially in fields where information is treated manually and the quality can vary according to the curator. In this thesis, data integration issues have prevailed and, only towards the end as we started considering high throughput data, the problem of data prove- nance became apparent. Consequently, we discuss this topic in the conclusion of the thesis.
1.3 Data integration strategies
In the last 20 years, software developers have explored a wide variety of method- ologies for data integration. Each of this methodology has a technology of reference and uses one or more touch-points. A touch-point is the logical connection between different data sources which make integration possible, for example, data values, names, ontology terms, keywords, etc. In this paragraph we list a few popular ap- proaches that have been identified by Goble et al. These methods show a different level of integration which ranges from light solutions to heavyweight mechanisms (Figure 1.3) 1.3. Data integration strategies 9
FIGURE 1.3: Schema of the different data integration strategies placed according to their level of integration. From left to right we go from heavyweight to lightweight integration level.
1.3.1 Service oriented architecture
Service oriented architecture (SOA) is a way to integrate different data sources which can be accessed using a programmatic interface. The beauty of SOA resides in its loose coupling among different resources. The implementation of each resource is entirely masked behind its interface. As far as the interface remains the same, each resource provider can change and expand its resource without causing problems with the rest of the architecture. Usually, SOA relies on technologies like Simple Object Access Protocol (SOAP) and Representational State Transfer (REST) which allow the creation of web services (Dudhe and Sherekar, 2014). Due to its simplicity, contrapose to the SOAP verbosity, REST emerged as a de facto standard for service design in Web 2.0 applications (Battle and Benson, 2008).
Data sources can be accessed through programmatic interfaces which eradicate the problem of user simulation and screen-scraping. However, the poor stability of web services and the lack of documentation represent the major issues for SOA which leads to the impossibility of using the corresponding data.
1.3.2 Link integration
Link integration is a technique where an entry from a data source is directly cross- linked to another entry from a different data source. Because many data sources are websites and subsequently, entries are web pages, link integration can be renamed 10 Chapter 1. Introduction
FIGURE 1.4: The cross-reference section of a UniProt entry which rep- resents an example of link integration. hyperlink integration. Users can navigate among different data sources using the hyperlinks present in each web page. The Uniprot "Cross-references" section, avail- able in each entry, represents an example of link integration. In this section, a list of links connects the entry to sequence databases, 3D structure database, etc (Fig- ure 1.4). Link integration is broadly used in bioinformatics and can be seen as a lite integration technique. This method relies on service provider agreement, and it is vulnerable to name clash, updates and ambiguous cases (Goble and Stevens, 2008).
1.3.3 Data warehousing
The data warehousing technique has its root in many companies. All the data pro- duced within an organisation is extracted, transformed and loaded (ETL) into a new general data model (figure 1.5). In this combined shape, data can be analysed to pro- vide useful strategic information for business intelligence (Ponniah, 2004). Data is stored and queried as a monolithic, integrated resource which relies on a predefined schema. In contrast to the previous methods, data warehousing represents a heavy- weight mechanism for data integration. Due to the initial high costs of implement- ing a data warehouse and the fixed model, which can hardly change with time, this 1.3. Data integration strategies 11
FIGURE 1.5: Data warehouse model of BI360 by Solver (https://www.solverglobal.com/it-it/products/data-warehouse) technique failed to last in life science applications. This method is well-suited for companies which have the control over data production but becomes particularly unsafe when data is produced by third parties who potentially and unpredictably change their model at any time. When one or more data sources cannot be syn- chronized with the data warehouse, the only solution is to redesign the underlying data model from scratch, which is costly. A recent example of a data warehouse in bioinformatics is Geminivirus.org (Silva et al., 2017).
1.3.4 View integration
View integration is based on the same concept as data warehousing without pro- viding a monolithic integrated resource. In this methodology, data is kept within the sources that are integrated on fly to provide a fresh integrated view. Users have the illusion of querying a unique resource, but, in the background, data is pulled from the several sources using ad-hoc drivers (Halevy, 2001). The mediated schema of the view is defined at the beginning like in data warehousing, but drivers can 12 Chapter 1. Introduction be adapted to support changes in the data sources. However, drivers tend to grow with time making the maintenance more and more complicated. Additionally, the overall performance can be an issue since, in the view integration, the query speed is limited by the slowest source of information. TAMBIS (Stevens et al., 2000) can be considered the first example of view integration in bioinformatics. This software ap- plication was able to perform tasks using several data sources and analytical tools. TAMBIS used a model of knowledge built by a list of concepts and their relation- ships. However, the tool is not maintained anymore.
1.3.5 Model-driven service oriented architecture
The model-driven service oriented architecture represents a hybrid technique which combines SOA and the integration view. Usually, this methodology is used by no- table projects which are able to define a common data model for a particular context. In this way, data producers can join the infrastructure only if they are fully compliant with the predefined model. One example of model-driven SOA is caCORE, a soft- ware infrastructure which allows the creation of "interoperable biomedical information systems" (Komatsoulis et al., 2008). Using caCORE, scientists can access (syntactic interoperability) as well as understand (semantic interoperability) the data once it has been retrieved. caCORE has been used to create the cancer Biomedical Informat- ics Grid (caBIG) which consist of a data grid system where scientist can publish and exchange cancer-related data. In 2012, caBIG was followed by the National Cancer Informatics Program.
1.3.6 Integration applications
Integration applications are special tools designed to integrate data in a single ap- plication domain. Contrary to view integration and data warehousing, integration applications are far from being general integration systems. Software developers tailor the application according to the needs of a specific sub-field. In this way, the application is well suited to a specific application domain, but it cannot be trans- posed in another field. Due to their custom implementation, integration applications 1.3. Data integration strategies 13 are usually a mix of several data integration methodologies. An excellent example of this technique is EnsEMBL (http://www.ensembl.org/) genome database project (Zerbino et al., 2018). EnsEMBL is a genome browser built on top of an open source framework which can store, analyse as well as visualise large genomes. It provides access to an automatic annotation of the human genome sequence which is based on several external data sources.
1.3.7 Workflows
FIGURE 1.6: An example of Apache Taverna workflow for pro- teomics taken from http://ms-utils.org/Taverna/. In particular, this workflow identifies peptides from mass spectrometry data using X!Tandem and validates assignments with PeptideProphet.
In data integration, a workflow describes the set of transformations which are ap- plied to the data. A workflow can be built using a set of custom scripts or taking advantage of one of the workflow management systems available. When software like Knime (Fillbrunn et al., 2017), Apache Taverna (Oinn et al., 2004) or Galaxy 14 Chapter 1. Introduction
(Afgan et al., 2016) are used, researchers can perform in silico experiments without becoming neither a software developer nor a scripting language expert (Karim et al., 2017).
Contrary to data warehouse and integration view, the integration process is defined by a series of transformations which are publicly exposed. Figure 1.6 describes a workflow for the identification of peptides in mass spectrometry data. Light blue boxes are file inputs whereas brown boxes are running scripts like PeptideProphet. Workflows can cope with unreliable data sources, and they can be adapted to face changes in data production. However, they are not the solution to every problem since the design of a workflow can be hard and its quality is strictly bounded to the data sources they integrate.
1.3.8 Mashups
FIGURE 1.7: An example of a mashup extracted from “A Secure Proxy- Based Cross-Domain Communication for Web Mashups” (Hsiao et al., 2011). The housing data feed is integrated with Google Maps to show the position of each entry.
All the methodologies proposed, except for link integration and workflows, require robust database and programming expertise. For some techniques, changing the data model or adding new data sources represent significant issues. For this reason, 1.3. Data integration strategies 15 with the start of the Web 2.0, mashups have emerged. A Mashup is a web page or a web application where a collection of data sources are combined to create a new service. To create a mashup, we identify possible data sources that can be integrated to produce a novel functionality. Figure 1.7 shows a mashup done by integrating Craigslist apartment and housing listings (on the right) onto Google Maps.
Mashups provide a lite integration which is closed to aggregation more than in- tegration. However, they produce useful lightweight applications which can be quickly built in a short amount of time with limited expertise. Mashup tools like Pipes (https://www.pipes.digital/) use a graphical workflow editor to bridge web resources easily. Data visualisation tools like Google Maps and Google Earth offer the possibility to display and combine georeferenced data on this principle as well.
1.3.9 The Semantic Web and RDF
The semantic web, even defined as the web of data, "provides a common framework that allows data to be shared and reused across application, enterprise, and community bound- aries" (W3C Semantic Web Activity Homepage). It is based on the Resource Description Framework (RDF), which consists of a standard model for data interchange on the web. The graph-based structure, provided by RDF, is responsive to change and sup- port semantic descriptions of data. All data in RDF are in the form of triples, i.e., statements always composed by a subject, a predicate and an object. Each of these resources is identified by a Uniform Resource Identifier (URI) or a blank node. The latter identifies anonymous resources which can only be subjects or objects. Refer- ring data with the same URI across several data sources, when these are semantically equal, allows the creation of touch points. As mentioned earlier, touch points are the connection of multiple knowledge graphs and allow the creation of integrated re- sources. To store data in an RDF endpoint, it is necessary to draft an ontology which can be described using the RDF Schema (RDFS) or the Web Ontology Language (OWL). Additionally, ontologies are helpful to understand the graph structure of re- sources and to link resources which have shared information. Once data are stored in the triple store, SPARQL, a SQL-like query language, is used to retrieve data from each resource. Using federated queries, SPARQL interacts and retrieves data from 16 Chapter 1. Introduction multiple connected data sources. The logic is masked to the user who interacts only with one single interface. To conclude, publishing data using RDF allows connect- ing different data sources that have one or multiple touch-points. The ultimate goal of the semantic web is to have a single web of connected endpoints which can be queried as a single resource. Cmapper, for example, is a web application that allows navigating all EBI RDF endpoint at once using the gene name as touch point (Shoaib, Ansari, and Ahn, 2017).
1.4 Data integration in bioinformatics
All data integration techniques presented in the previous paragraphs need touch points to be implemented. In bioinformatics, a diversity of efforts have been carried out to provide standards and, as a matter of fact, touch points across data sources. In this paragraph, we identify some areas of study which are crucial to enforce stan- dardisation and encourage data integration.
1.4.1 Standards
In life science, where data can be represented in many ways, widely adopted stan- dards provide the only ground for data exchange and data integration. To show why standards are so relevant, we take the example of the amino acid naming con- vention. The 20 amino acids have a standard name that is recognised worldwide. Additionally, each amino acid has a one and three letter codes that are used by biol- ogists around the globe. If a European biologist talks at a conference in Asia about the S amino acid, everyone in the audience understands it is about Serine.
Nowadays, lots of initiatives for developing standards are arising. We took a list of the most famous from a paper published by Lapatas et al. which is available in Table I (Lapatas et al., 2015). We will not explore each of these initiatives, but we provide URLs for more information. 1.4. Data integration in bioinformatics 17
To conclude, we want to strive the importance of standards for data sharing. Stan- dards facilitate data re-use, limiting the work needed to integrate different data sources and the waste of potential datasets.
1.4.2 Ontologies
In the last twenty years, several ontologies have been created in the biological and biomedical fields (Bard and Rhee, 2004) (Hoehndorf, Schofield, and Gkoutos, 2015) (Kelso, Hoehndorf, and Prüfer, 2010). In philosophy, an ontology describes “what exists”, whereas, in life science, it represents what exists in a specific context, for ex- ample, diseases (Turk, 2006). To be more precise, an ontology is defined as a collec- tion of concepts and relationships used to characterise an area of concern. To consol- idate and coordinate the rapid spread of ontologies, in 2001, Ashburner and Lewis established The Open Biomedical Ontology (OBO) consortium. The OBO ontologies form the basis of OBO Foundry, "a collaborative experiment based on the voluntary ac- ceptance by its participants of an evolving set of principles that extend those of the original
OBO" (Smith et al., 2007). As stated in its website, the OBO Foundry “is a collective of ontology developers that are committed to collaboration and adherence to shared princi- ples”, and its mission “is to develop a family of interoperable ontologies that are both logi- cally well-formed and scientifically accurate”. OBO contains ten ontologies (June 2018) which are member ontologies like Gene Ontology (GO) and more than a hundred candidate ontologies. To become member, a candidate ontology has to be developed using OBO’s shared principles and validated by OBO members (Quesada-Martínez et al., 2017). The growing use of OBO ontologies allows connecting more and more datasets using techniques like the semantic web. This enhances data integration and fosters the creation of the web of data. 18 Chapter 1. Introduction
Acronym Name Goal URL PMID Establish a set of princi- ples for ontology devel- The Open Bi- opment to create a suite ological and OBO of orthogonal interop- www.obofoundry.org 17989687 Biomedical erable reference ontolo- Ontologies gies in the biomedical domain Establish standards to Clinical data support the acquisition, interchange exchange, submission CDISC www.cdisc.org 23833735 standards and archive of clinical consortium research data and meta- data Human Defines community Proteome standards for data HUPO- Organisation- representation in pro- www.psidev.info 16901219 PSI Proteomics teomics to facilitate Standards data comparison, ex- Initiative change and verification Create interoperable Global Al- approaches to cat- liance for alyze projects that will GAGH genomicsandhealth.org 24896853 Genomics help unlock the great and Health potential of genomic data Coordinate the devel- Computational opment of the vari- COMBINE Modeling in ous community stan- co.mbine.org 25759811 Biology dards and formats for computational models Define community- agreed reporting stan- dards, which provided Metabolomics a clear description of msi- MSI Standards 17687353 the biological system workgroups.sourceforge.net Initiative studied and all compo- nents of metabolomics studies Builds the social and technical bridges that Research RDA enable open sharing of rd-alliance.org Data Alliance data across multiple scientific disciplines
TABLE 1.1: List of data standard initiatives. Courtesy of “Data inte- gration in biological research: an overview” (Lapatas et al., 2015). 1.4. Data integration in bioinformatics 19
1.4.3 Formats and reporting guidelines
As data is increasingly generated by high throughput techniques, computers, equipped with appropriate software, are required to store and analyse the information pro- duced. In this scenario, data formats play a critical role as they provide instruc- tions to store data in a file. However, the scarcity of well-designed data formats gives birth to many different standards that hamper data exchange and data inte- gration. Therefore, bioinformaticians are forced to build converters, spending more time to clean data than to analyse them. Currently, to store Next Generation Se- quence (NGS) data, there are six common file formats: FASTQ, FASTA, SAM/BAM, GFF/GTF, BED, and VCF (Zhang, 2016). To ease and foster data sharing, converters between different data format have been developed in genomics, proteomics, etc. In this context, a good example is provided by the PRoteomics IDEntifications (PRIDE) project (Vizcaíno et al., 2016). PRIDE is a centralised repository for proteomics data which includes protein and peptide identification as well as post-translational mod- ifications. Since mass spectrometry data can be encoded with several formats, the PRIDE development team have developed PRIDE Converter. This tool converts a large amount of mass spectrometry encoding formats into a valid PRIDE XML ready for submission to the PRIDE repository (Barsnes et al., 2009). With the grad- ual maturation of "omics", a huge step forward has been made by adopting clear guidelines to describe and depositing datasets. In this context, the first example of a concrete guideline is represented by The Minimum Information About a Microarray Experiment (MIAME) (Brazma et al., 2001). MIAME defines "the minimum infor- mation required to ensure that microarray data can be easily interpreted and that results derived from its analysis can be independently verified". The use of MIAME facilitates the creation of public repositories as well as the development of data analysis tools. Following the example of MIAME, in 2007, Taylor et al. published The minimum information about a proteomics experiment (MIAPE) (Taylor et al., 2007). Nowa- days, proteomics guidelines are defined by HUPO Proteomics Standards Initiative (Hermjakob, 2006) which additionally proposes data formats and controlled vocab- ularies (http://www.psidev.info). In general, these guidelines focus on defining the content and the structure of necessary information to describe a specific experiment. 20 Chapter 1. Introduction
Although they do not provide any technical solution for storing data, some of them suggest standard file formats. To conclude, the use of minimum information guide- lines together with suggested data formats enhance the data integration progress and the reusability of datasets.
1.4.4 Identifiers
An identifier is a short list of characters which identifies a data entry. For example, UniProt (Bateman et al., 2017) is using accession numbers, i.e. stable identifiers, to identify entries. When two or more entries are merged, all the accession numbers are kept. In this case, one is the "Primary (citable) accession number" whereas the others become "Secondary accession numbers". To avoid any source of uncertainty, it is not possible that one accession number refers to multiple proteins. In life science, the information is spread across multiple databases, and each of them has developed its identifiers. This leads to a multitude of identifiers to describe the same biologi- cal concept. However, to facilitate data integration, databases have cross-referenced their entries with external resources (See chapter1 - Link integration). In UniProt, for example, each protein has a cross-reference section which contains all external identifiers related to the same protein (3D structures, protein family, genome an- notation, etc)(Figure 1.4). If all life science databases used the same identifier to characterise a biological concept, data integration would be facilitated. However, the use of identifiers from established databases, like UniProt or GenBank, in re- search papers is already a sign of progress. In genomics, for example, most journals oblige researchers to deposit newly obtained DNA and amino acid sequences to a public sequence repository (DDBJ/ENA/Genbank - INSDC) as part of the publica- tion process (Benson et al., 2018). Although this is happening in advanced fields like genomics and proteomics, disciplines like glycomics are lagging behind.
1.4.5 Visualisation
In biology, data visualisation is an essential part of the research process. Scientists have always relied on different visualisation means to communicate experimental 1.5. Glycomics 21 results. Some domains of biology like phylogeny (Allende, Sohn, and Little, 2015) and pathway analysis (Tanabe and Kanehisa, 2012) have created specific visualisa- tions that, nowadays, are considered a standard (i.e. phylogenic trees). The spread- ing of high throughput technologies has complicated the panorama. The increasing quantity of data and the integration of heterogeneous information have created new challenges for visualisation experts. For example, the rise of the next generation se- quencing and the resulting availability of genome data has prompted the need for new custom visualisations to show sequence alignments, expression patterns or en- tire genomes (Gaitatzes et al., 2018). Data visualisation is also becoming a crucial resource in the integration of multiple resources. General purpose tools are avail- able to overlay data from different data sources. An example is Cytoscape (Shannon et al., 2003), an open source software platform for visualising interaction networks and biological pathways. Cytoscape gives the possibility to overlay networks with gene expression profiles, annotation and other quantitative and qualitative data.
1.5 Glycomics
Carbohydrates are a class of molecules utterly crucial for the assembly of complex multicellular organisms which requires interactions among cells. Beside the media- tor role in cell-cell, cell-matrix and cell-molecule interactions, glycans play an essen- tial function in host-pathogen interaction. This is not surprising since all types of cell in nature are covered by a glycocalyx, a sugar shell that could reach the thickness of 100 nm at the apical border of some epithelial cell (figure 1.8) (Le Pendu, Nyström, and Ruvoën-Clouet, 2014).
Cell membranes are built by several molecules, many of which are carrying a set of carbohydrates (glycoconjugates). Each of these sugars consists of a single (monosac- charides) or multiple (oligosaccharide) units. In this thesis, we follow the rule set by the Essential of Glycobiology, which uses the term "glycan" to "refer to any form of mono-, oligo-, polysaccharide, either free or covalently attached to another molecule".
Glycans are linear or branched molecules where each building block is linked to the 22 Chapter 1. Introduction
FIGURE 1.8: The Bacterium Bacillus subtilis. The glycocalyx is the “grass” which covers the cellular membrane. Courtesy of Allon Weiner, The Weizmann Institute of Science, Rehovot, Israel. 2006. next using a glycosidic bond. The units which compose glycans are called monosac- charides, and they are identified as units since they cannot be hydrolysed to a sim- pler form (Varki et al., 2010). A list of all monosaccharides available is present in figure 1.9; however each biological system can only use a subset of those.
Depending on the stereochemistry of the two building blocks, the glycosidic linkage is defined as α-linkage (same stereochemistry) or β-linkage (different stereochem- istry). A variation on the linkage type can have a significant impact on structural properties and biological functions of oligosaccharides with the same composition. This is evident for starch and cellulose which share the same composition but have a different type of linkages (α1-4 for starch, β1-4 for cellulose). Besides the linkage type, building blocks have multiple attaching points leading to an extensive collec- tion of possible glycan structural configurations. Elucidating glycan structures is crucial to determine whether they can be recognized by a glycan-binding protein (GBP). These proteins are scanning glycan structures searching for a specific region, defined as glycan determinant. We can imagine this process as the key and lock mechanism introduced by Emil Fischer in 1894 (Cummings, 2009). Blood group antigens (A, B and O) are the first example of glycan determinants, and they give an 1.5. Glycomics 23
FIGURE 1.9: The SNFG encoding table for monosaccharide. Each row represents a class of monosaccharide which are represented by one specific shape (in white). Each column, instead, identifies a type of monosaccharide. For example, glucose and all its modifications are coloured in blue. This table contains all monosaccharide found in Na- ture and has been extracted from “Symbol Nomenclature for Graphical Representation of Glycans” (Varki et al., 2015). idea of the importance of glycobiology (Figure 1.10).
Glycans can freely stand in body fluids or be attached to other molecules. The com- bination of a single or multiple glycan(s) with a molecule is named glycoconjugate. In particular, when sugars are attached to proteins, we define the whole as glycopro- teins whereas the combination of carbohydrates and lipids are called glycolipids. In this thesis, we focus on glycans and glycoproteins as a start of a work which will eventually evolve towards glycolipids and proteoglycans.
Branched glycans attached to proteins are divided into two main groups: N-linked and O-linked. In the former group, glycans are attached to an asparagine which has to be located in a specific consensus sequence, usually Asn-X-Ser or Asn-X- Thr (in rare cases Asn-X-Cys). In the latter group instead, sugars are attached to serine, threonine, tyrosine, hydroxylysine or hydroxyproline (Lauc et al., 2016). N- linked glycans have a common pentasaccharide core (three Mannose and two N- Acetylglucosamine), and they can be classified into three groups: high-mannose, 24 Chapter 1. Introduction
FIGURE 1.10: The AB0 group antigens represented in SNFG format. complex and hybrid (Figure 1.11 N-glycan). On the contrary, O-linked glycans have eight types of cores which start with an N-Acetylgalactosamine (Figure 1.11 O-glycan).
The collection of all glycan structures expressed in a specific biological system is de- fined as the glycome (Packer et al., 2008). In analogy with genomics and proteomics, the systematic study of the glycome is called glycomics. Differing from proteins, gly- cans cannot be directly predicted from a DNA template. Also, technical difficulties to correctly determine their complex and branched structures are the reasons why gly- comics is lagging far behind other "omics". A typical glycomics experiment releases and catalogues all the glycan structures present in a cell, tissue, etc. using techniques like mass spectrometry. In a more sophisticated version of this experiment, glycans are analysed while still attached to trypsin-digested peptides (glycoproteomics ex- periments). In addition to technical difficulties in elucidating complete structures, glycans are studied separately from their environment losing all the information about the actual landscape (Varki et al., 2010). Although qualitative techniques are improving, the quantitative aspect remains somewhat untouched, since quantitative experiments are still bound to composition more than structure elucidation. 1.6. Data integration in glycomics 25
FIGURE 1.11: The N-glycan section shows the three types of N-linked glycans: high-mannose, complex and hybrid. The O-glycan section, instead, presents the eight type of cores available. All structures fol- lows the SNFG standard explained in figure 1.9
1.6 Data integration in glycomics
In this paragraph, we are going to explore what has been done in glycobiology to ease data integration within glycomics and with other "omics".
1.6.1 Glycan formats
Glycans are inherently more complex than nucleic acids and proteins, so defining a format to store the molecular information correctly is not a trivial problem. The complexity of the glycans resides in their branched structure and the collection of building blocks available. In contrast with proteins and nucleic acids which are made of respectively 4 and 20 building blocks, glycans can be built with up to 72 different monosaccharides (figure 1.9). Additionally, information about monosac- charide anomericity, residues modification and substitution, glycosidic linkages and possible structure ambiguities must be taken into account. Without commenting 26 Chapter 1. Introduction
FIGURE 1.12: An O-Glycan (Glyconnect ID 2641 and GlyTouCan ac- cession number G01614ZM) encoded with seven different glycan for- mat. (A) LINUCS sequence format as used in GLYCOSCIENCES.de. (B) BCSDB sequence encoding. (C) CarbBank sequence format. (D) GlycoCT sequence format as used in GlycomeDB and UniCarbDB. (E) KCF format used in the KEGG database. (F) LinearCode R as used in the CFG database. (G) WURCS format used in GlyTouCan. Adapted from "Toolboxes for a standardised and systematic study of gly- cans" (Campbell et al., 2014). on the different nomenclatures available to represent each monosaccharide, we will focus here on formats to encode a glycan structure into a file. In figure 1.12 we show an O-glycan (Glyconnect ID 2641) encoded in several different formats (Camp- bell et al., 2014). Glycan encoding methods can be grouped in three sets according to the technique used to store the data. The first group reduces glycan tree-like structures into linear sequences using strict rules for sorting branches. This group contains Web3 Unique Representation of Carbohydrate Structures (WURCS) (Mat- subara et al., 2017), the Bacterial Carbohydrates Structure Database sequence format
(Toukach, 2011), LinearCode R (Banin et al., 2002), LINUCS (Bohne-Lang et al., 2001) and CarbBank multilinear representation (Doubet and Albersheim, 1992).
The second group, instead, uses connective tables. Here, we found GlycoCT (Herget et al., 2008) and KFC (Aoki et al., 2004). The third group relies on XML encoding and contains CabosML (Kikuchi et al., 2005) and GLYDE (Sahoo et al., 2005) which are not present in figure 1.12. The use of several formats has made data integration across different sources almost impossible. To tackle this problem, bioinformati- cians have developed a collection of tools to parse and translate glycan structures 1.6. Data integration in glycomics 27 across different encoding systems. Four main projects have been focus on data in- tegration and tool development: RINGS (Akune et al., 2010), GLYCOSCIENCES.DE (Lütteke et al., 2006), EUROCarbDB (Lieth et al., 2011) and GlycomeDB (Ranzinger et al., 2008) which now has been replaced by GlyTouCan (Tiemeyer et al., 2017). In the recent years, different databases like Glyco3D (Pérez et al., 2015; Pérez et al., 2016), MatrixDB (Launay et al., 2015), UniLectin and the Glycomics@ExPASy initia- tive have introduced GlycoCT as a standard encoding format.
1.6.2 Glycan graphical representations
FIGURE 1.13: Graphic representations of the O-glycan in figure 1.12. (A) Chemical representation. (B) Oxford cartoon representation. (C) SNFG representation. The legend shows the difference encoding of monosaccharides between Oxford and SNFG. Adapted from "Tool- boxes for a standardised and systematic study of glycans" (Campbell et al., 2014).
Data formats presented in the previous paragraph are essential for storing infor- mation and exchanging data across software applications but are not suitable for humans. Consequently, glycobiologists have proposed different graphical represen- tations where symbols or chemical structures replace monosaccharides. In Figure 1.13, we collect a set of depictions of the O-glycan presented in figure 1.12. Each of these representations has some peculiarities. The chemical depiction (Figure 1.13A) is used by researchers that are interested in glycan synthesis or use NMR to elucidate 28 Chapter 1. Introduction glycan structures. The Oxford nomenclature (Harvey et al., 2011) (Figure 1.13 B) de- fines linkages using angles and encodes monosaccharide anomericity with dashed or solid lines.
The nomenclature used by the first version of "Essential of Glycobiology" encodes monosaccharides using shapes and colour being the most user-friendly (Figure 1.13 C). With the publication of the third edition of "Essential of glycobiology" Varki et al. have made an effort to define a standard nomenclature called the Symbol Nomen- clature for Glycans (SNFG) which should replace previous graphical representations (Varki et al., 2015) (Figure 1.13 C).
1.6.3 Glycan composition format
In many cases, experimental techniques can only distinguish glycan compositions, losing the information about the 2D structure. A glycan composition, indeed, reports the amounts of different monosaccharides present in the glycan without saying any- thing about their position in the space. An example is H2N1S2 which refers to a glycan with 2 Hex, 1 HexNAc and 2 NeuAc. Sometimes, in addition to the composi- tion, it is possible to extract information about the number of antennas, the presence of a bisected HexNAc and the amount of galactose. In this case, researcher uses a nomenclature called Oxford which allows encoding of these additional data with the composition. This nomenclature should not be confused with the Oxford graphical representation for glycan structure described in the previous paragraph. In Oxford nomenclature, A2G2S2 refers to a glycan which has 2 antennas, 2 galactose and 2 NeuAc resulting in a composition of 5 Hex, 4 HexNAc and 2 NeuAc. However, Ox- ford nomenclature works only with N-glycans and there are many dialects which are currently in use.
1.6.4 Identifiers
The community acceptance of unique identifiers for genes and proteins has facil- itated data integration and data exchange boosting the research in genomics and proteomics. At the same time, some initiatives have taken place to fill this gap in 1.6. Data integration in glycomics 29 glycomics. In 1989, the Complex Carbohydrates Structure Database, even known as CarbBank, (Doubet et al., 1989) was the first attempt to establish unique iden- tifiers for glycan structures. However, in the late 1990s, CarbBank project ceased due to the end of funding. Consequently, different initiatives like CFG and KEGG have tried to continue CarbBank mission leading to a multiplicity of identifiers for each glycan structure. From 2013, however, researchers in glycomics brought back the need for a glycan structure registry which would ease data sharing and in- crease data integration among different platforms. Following this pressing wish, in 2015, Aoki-Kinoshita et al. (Aoki-Kinoshita et al., 2016) unveiled GlyTouCan, a central registry for glycan structures. GlyTouCan allows users to deposit glycan structures in exchange for a unique identifier. Since different glycomics techniques cannot fully elucidate glycan structures, GlyTouCan accepts submissions of incom- plete structures. Recently, the glycomics community has strongly endorsed the work of Aoki-Kinoshita to secure the stability of the registry which has become a crucial resource in the glycomics panorama. Every scientist in glycomics is encouraged to submit glycan structures to the registry and use the unique identifiers in reports and manuscripts (Tiemeyer et al., 2017).
1.6.5 Reporting Guidelines
The innate complexity of glycan structures, often, needs orthogonal experimental techniques to be elucidated. In this scenario, each experiment solves only a part of the puzzle, and only the integration of multiple data source leads to an accurate an- notation. Due to this fact, experimental guidelines are necessary to publish datasets that can be easily interpreted, evaluated and reproduced by other glycoscientists. From 2011, experts in the field of glycobiology, glycoanalytics and glycoinformatics have been working together, under the patronage of the Beilstein Institute, to define the Minimum information Required for A glycomics Experiment (MIRAGE) (York et al., 2014). This initiative follows the more popular initiatives MIAME and MI- APE which we have already discussed. At the time of writing, MIRAGE already has published guidelines for sample preparation (Struwe et al., 2016), mass spectrome- try analysis (Kolarich et al., 2013) and glycan microarray analysis (Liu et al., 2017b). 30 Chapter 1. Introduction
Additionally, a guideline for liquid chromatography analysis is in preparation.
1.6.6 Ontologies
The combination and integration of multiple experimental and knowledge-based sources are essential to define the role of glycans. However, data are spread across different databases which act as "disconnected islands" (Lütteke, 2008). In this situ- ation, ontologies provide a painless way to interconnect resources within glycomics and with other "omics". As a first attempt, Sahoo et al. generated GlycO, "a glyco- proteomics domain ontology for modelling the structure and functions of glycans, enzymes and pathways" (Sahoo et al., 2006). GlycO has a strong focus on the biosynthesis of complex glycan structures and their relationships with proteins, enzymes and other biochemical entities.
FIGURE 1.14: The diagram of the core classes of the GlycoRDF on- tology. Grey and white boxes respectively identify classes and sub- classes. Courtesy of “GlycoRDF: an ontology to standardize glycomics data in RDF” (Ranzinger et al., 2015).
More recently, Ranziger et al. developed GlycoRDF (Ranzinger et al., 2015). Con- trary to GlycO, GlycoRDF has been designed with the precise goal of integrating all the information available in glycomics resources limiting the development of multi- ple RDF dialects. A detailed diagram of GlycoRDF is available in figure 1.14. 1.7. Web development 31
1.6.7 Visualisation
Despite its importance, data visualisation is still a challenge in glycomics. In the last decade, some initiatives have pushed the development of visual tools to im- prove some aspects of glycan identification and quantification. Glycoviewer (Joshi et al., 2010) is the first example of data visualisation tool which allows glycoscien- tists to visualise, summarise and compare different glycomes. GlycomeAtlas (Kon- ishi and Aoki-Kinoshita, 2012), provides an interactive interface for exploring data produced by the Consortium of Functional Glycomics (CFG). To conclude, as stated in its website, GlycoDomainViewer (Joshi et al., 2018) is a visual "integrative tool for glycoproteomics that enables global analysis of the interplay between protein sequences, glycosites, types of glycosylation, and local protein fold / domain and other PTM context". GlycoDomainViewer integrates experimental data as well as knowledge data sources presenting the most extensive collection of information to explore the possible effect of glycosylation on a protein. Despite the availability of these and more visual tools, the majority of glycoscientists are still using general purpose applications like Excel to publish experimental results. Therefore, results are hardcoded in figures or text and data is stored in tables which populate the sup- plemental material section. Additionally, integrative tools like GlycoDomainViewer, although very useful, are usually developed taking into account the needs of a spe- cific research group limiting the possibility of reaching out to the entire community.
1.7 Web development
As said in the previous paragraphs the integration of multiple data source is critical to have a complete overview of a living system. Then, once data are correctly com- bined, it is essential to develop software applications which allow the exploration of the integrated data sources and the design new hypotheses. Thus, besides data inte- gration, software development represents another critical step towards an in-silico simulation of a living cell. 32 Chapter 1. Introduction
Software development has many other aspects to be taken into account. One central question that comes along with the start of a new project is usually “Should I de- velop a desktop tool or a web application?”. The answer to this question is crucial as the development process can be entirely different. The two main advantages of desktop applications are security and performance. Data are manipulated within the computer which can be detached from the network. Since most of the security issues come from the internet, desktop applications are more secure than web tools. Also, the use of machine-optimised code and the locality of the data increase the performance of desktop applications.
In the last years, the situation has changed and promoted the development of web applications over the desktop ones. Security issues are still there, but the spreading of technologies like HTTPS and SSL have provided a new security shield against fraudulent actions. On the performance side, the growing availability of cloud com- puting services, at a reduced cost, has given web applications a virtually endless computing power. For example, software like Hadoop, built for handling Big Data, can be plugged in a SOA and piloted via a web interface. In this scenario, the user- friendliness and possibility of updating contents remotely, introduced by web apps, are critical elements for developing a community-wide application that can be used by anyone even the less tech-savvy user. Another advantage of web tools is the pos- sibility of using them regardless of the location (they only need an internet connec- tion) and the platform which can range from a smartphone to a personal computer.
Web development is a broader term which implies the construction of a website. From the start of the Web in 1991 (History of the World Wide Web - Wikipedia) websites have evolved from a collection of basic HTML pages to sophisticated applications built on top of multiple frameworks. A framework is an high-level solution which allows the reuse of software pieces. It also ensures the quality of the code and guar- antees the same behaviours across multiple browsers and multiple platforms (Salas- Zárate et al., 2015). The use of web framework has speeded up the development of new websites which, nowadays, fulfil high-quality standards. There are plenty of web frameworks on the market, therefore, choosing the right one has become a 1.7. Web development 33 hard task. Web developers spend more and more time benchmarking different solu- tions since making a wrong choice would lead to a waste of energy and time. Each framework requires a period of study to learn its specific syntax and logic. Further- more, some frameworks have short life leaving the developer without support and updates. In this ecosystem of frameworks which expands every day, we are going to focus on web components being one of the principal technologies used for devel- oping the web tools.
1.7.1 Web Components
To fully understand the power of web components we need to introduce the con- cepts of HTML and DOM. HTML stays for HyperText Markup Language and is the standard programming language for building web pages. Together with Cascading Style Sheets (CSS) and javascript form the cornerstone technologies behind the web (HTML - Wikipedia). The DOM, which stays for Document Object Model, is an appli- cation programming interface (API) which converts HTML, XHTML and XML into a tree structure where each node is an object and represent a part of the web page.
Web components are reusable interface widgets which are built with web technolo- gies. In other words, web components directives allow developers to create new cus- tom, reusable HTML tag in web pages and web apps. All the custom components built following the web components directives operate across modern browsers, and work in synergy with any JavaScript library or framework that support HTML (web- components.org - Discuss & share web components). Web components are standardised by the W3C, the international standard organisation for the World Wide Web, and they have been added to the specification of HTML and DOM. Web components they consist of four different technologies:
• Custom elements: A collection of JavaScript APIs that allow the user to define new custom HTML tags and their behaviour.
• Shadow DOM: A collection of JavaScript APIs to attach a private and encapsu- lated DOM to a custom element. Using shadow dom the code of each compo- nent become private and cannot be accessed from other ones. This technology 34 Chapter 1. Introduction
removes the possibility of collision between different parts of the document.
• HTML templates: It introduce the custom tags and
• HTML Imports: It is the mechanism to import new custom components into plain HTML pages. HTML imports allow developers to encapsulate the logic and the interface of a web component into a separated file and import it where it will be used.
Developers are free to create "vanilla" web components which are built only using the standard by W3C. A second option is to use libraries which guide the develop- ment process and refine parts of the standard that are not mature. Google Polymer, Bosonic, SkateJS, X-Tag are some of the available libraries which help to create a web component ecosystem. Within the Glycomic@ExPASy initiative, we decided to use Google Polymer as one of the most stable and continuously updated libraries on the market. Also, Google Polymer developers are working together with W3C to enlarge and improve the web component standard. This lead to a reduction of the Polymer library each time the standard upgrade one of its technology. In the long run, Polymer components will be entirely built with the standard, and the Polymer library will slowly disappear.
1.8 Objectives and Thesis Overview
Although bioinformatics was already introduced in glycobiology, it was often de- signed for specific experimental methods without interconnecting several pieces of evidence to support a general fact. The main objective of this thesis is to introduce bioinformatics as the glue between the different pieces that compose glycomics and, as a side effect, connect glycomics with the other ’omics’. To meet this goal, we have developed an ecosystem of connected software applications (Figure 1.15) which is described in the four chapters of the thesis. 1.8. Objectives and Thesis Overview 35
The first chapter gives an overview of the global effort to introduce bioinformatics in glycobiology. This passes through the definition of Glycomics@ExPASy, an umbrella which collects all tools developed in this thesis as well as others. Glycomics@ExPASy supports glycobiology by building and maintaining different classes of tools, in par- ticular: integrative, explorative and knowledge-base tools.
The second chapter focuses on integrative tools which allow breaking some barri- ers to the integration of multiple glycomics pieces. Here we present GlyS3 and Epi- topeXtractor. The former verifies if a structure or a group of structures contain a spe- cific subregion. So, GlyS3 can define which of the glycan structures are potentially recognised by GBPs discovering possible protein-protein interactions mediated by glycans. EpitopeXtractor, instead, does the opposite. It searches for glycan epitopes present in a specific structure based on a curated list of over 500 glycan epitopes retrieved from the literature or databases.
FIGURE 1.15: This figure represents the Glycomics@ExPASy ecosys- tem. Blue arrows (tool lines) represent the interactions between soft- ware applications via an integrative tool. Instead, red arrows (data lines) show the flow of data among different tools. Software applica- tions are divided into three groups according to the colour of the con- tainer: integrative (blue), explorative (green) and knowledge-based (red) tools. Grey containers identify libraries and accessory tools.
In the third chapter, we describe three explorative tools: Glynsight, Glydin’ and 36 Chapter 1. Introduction
PepSweetener. The development process of these tools has mainly focused on data visualisation providing a new way of showing information to speed up data analy- sis. Glynsight provides the means for visualising and comparing quantitative gly- can profiles. Quantitative data are still rare in glycomics and Glynsight is the first tool that allows their manipulation and comparison. Since these data are carrying only glycan composition information, Glysight integrates GlyConnect data to sug- gest putative structures from compositions. Additionally, it is connected to Glydin’, the epitope network, using EpitopeXtractor. Glydin’ was the master project of Marie Ghraichy and completed in 2016 (https://archive-ouverte.unige.ch/unige:86523). It proposes a new way of describing interactions among different glycan epitopes based on their structures. Data is shown in a network which can be easily explored through a web interface that was improved and extended during this PhD. To con- clude we introduce PepSweetener, a web-based visualisation tool designed to facil- itate the manual annotation of intact glycopeptides from MS data regardless of the instrument that produced these data. Even in this case finding an appropriate visu- alisation of the data in connection with the design of a rapid data structure has been the focus of our work.
In the fourth chapter, we present Glyconnect, an integrative web dashboard to sup- port the study of glycomics and glycoproteomics. Glyconnect is our “swiss army knife” to explore the connections between glycans, proteins, tissues, taxonomies and diseases. The information stored in Glyconnect is shared with most of our tools. Furthermore, tools like EpitopeXtractor and GlycoSiteAlign are directly integrated into Glyconnect to expand its capabilities. The search part of Glyconnect has been shaped with attention on aggregating data through ad hoc visualisation which helps scientists generating new hypotheses. The browser part, instead, displays a list of entries to speed up and ease the search for specific information. 37
Chapter 2
Glycomics@ExPASy
2.1 Overview
In this chapter, we present the Glycomics@ExPASy initiative. This action has led to the creation of a glycomics tab on ExPASy, the bioinformatics resource portal of SIB Swiss Institute of Bioinformatics.
In 2016, the Proteome Informatics Group has reached a critical mass of tools and databases specific to glycomics. Therefore, together with an international network of glycoscientists, they decided to start the Glycomics@ExPASy initiative. All glycoin- formatics tools developed at SIB and a selection of prominent glycomics resources were listed in a new tab, labelled glycomics, on the ExPASy server. This action aimed to spread the use of bioinformatics in glycomics and fill the gap between glycomics and other "omics". Thus, all tools developed at SIB are designed to be used not only by glycoscientists but also by protein scientists, possibly medical doctors, etc. Glycomics@ExPASy proposes a mixture of tools (exclusively web-based) and cu- rated databases which will increasingly be interconnected to drive scientists in their data analysis and hypothesis building. At the moment, we cover glycome profiling, glyco-epitope mapping, protein glyco-site characterisation and mass spectrometry data analysis for glycan or glycopeptide identification. tapraid4/zjw-macp/zjw-macp/zjw01118/zjw5814-18a xppws Sϭ3 22/8/18 6:35 4/Color Figure(s) F1–5 ARTNO: RA118.000799 los Research
Glycomics@ExPASy: Bridging the Gap*□S
Julien Mariethoz‡§, Davide Alocci‡§, Alessandra Gastaldello‡§, Oliver Horlacher‡ Elisabeth Gasteiger¶, Miguel Rojas-Maciasʈ, Niclas G. Karlssonʈ, Nicolle H. Packer**‡‡, and Fre´de´ rique Lisacek‡§§§¶¶
Glycomics@ExPASy (https://www.expasy.org/glycomics) cell surface picture as well as the enzymes that are needed to is the glycomics tab of ExPASy, the server of SIB Swiss synthesize or trim the attached glycans. The first challenge in Institute of Bioinformatics. It was created in 2016 to cen- collecting this data is the wide range of experimental tech- tralize web-based glycoinformatics resources developed niques used to analyze glycans and to elucidate their biolog- within an international network of glycoscientists. The ical roles. Mass Spectrometry (2) and Nuclear Magnetic Res- hosted collection currently includes mainly databases and onance (3) methods are commonly used to solve glycan tools created and maintained at SIB but also links to a range of reference resources popular in the glycomics structures released from proteins. High or Ultra-High Perform- community. The philosophy of our toolbox is that it should ance Liquid Chromatography (4), and Capillary Gel Electro- be {glycoscientist AND protein scientist}–friendly with the phoresis with Laser-Induced Fluorescence (5) experiments aim of (1) popularizing the use of bioinformatics in glyco- are used for high-throughput separation of released and la- biology and (2) emphasizing the relationship between gly- beled glycan structures for determination of their expression. cobiology and protein-oriented bioinformatics resources. Molecular Dynamics (6), Isothermal Titration Calorimetry or The scarcity of data bridging these two disciplines led us Surface Plasmon Resonance (7) are key techniques to track to design tools as interactive as possible based on data- glycan-protein interactions, as well as glycan and protein/ base connectivity to facilitate data exploration and sup- lectin arrays (8, 9). However, all these approaches are used in port hypothesis building. Glycomics@ExPASy was de- signed, and is developed, with a long-term vision in close rather low throughput to date. Mass spectrometric proteo- collaboration with glycoscientists to meet as closely as mics approaches are recently being used to determine glycan possible the growing needs of the community for compositions at specific sites on proteins (glycopeptide iden- glycoinformatics. Molecular & Cellular Proteomics 17: tification as reviewed in (10)) at higher throughput but at AQ: A 1–13, 2018. DOI: 10.1074/mcp.RA118.000799. present there is a modest amount of data in glycomics in contrast to most other -omics. After assessing the spread of the data, the next major
AQ:B-G Glycoscience is gaining recognition as an important com- obstacle lies in linking together this data that is usually ac- ponent of life science, as emphasized in two recently pub- quired independently. As it is, most glycan structures have lished roadmaps issued by the American National Research been solved after being cleaved off their natural support Council in 2012 (1) and by the European Science Foundation whereas protein glycosylation sites are identified after fully (ESF) GlycoForum (http://ibcarb.com/wp-content/uploads/A- removing or partly trimming down the attached glycans or roadmap-for-Glycoscience-in-Europe.pdf) in 2015. Both ref- with only the monosaccharide composition determined. As a erences point to the same need for organizing access to result, key information on the glycoconjugate is lost. Further- glycan-related data that is absent in current bioinformatics more, results relative to glycan-binding are in most cases resources. obtained with free glycans. The correlation between glycan Glycosylation is the most common protein post-transla- structures and glycoproteins can be sometimes extracted tional modification yet its role is far from being understood. manually by literature searches but this data is limited and Glycans, proteins to which they are attached (glycoproteins) spread over many publications and is recorded in a range of and proteins to which they bind (lectins or carbohydrate- different formats and representations. The limited usage of binding proteins) are the main molecular actors in this overall existing standards for encoding and representing glycans
From the ‡Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland; §Computer Science Department, University of Geneva, Geneva, Switzerland; ¶Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland; ʈGlyco Inflam- matory Group, Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, Sahlgrenska Academy, University of Gothen- burg, Gothenburg, Sweden; **Institute for Glycomics, Gold Coast Campus, Griffith University, Southport, QLD, Australia; ‡‡Biomolecular Discovery & Design Research Centre, Macquarie University, North Ryde, NSW, Australia; §§Section of Biology, University of Geneva, Geneva, Switzerland Received April 12, 2018, and in revised form, July 15, 2018 Published, MCP Papers in Press, August 10, 2018, DOI 10.1074/mcp.RA118.000799
Molecular & Cellular Proteomics 17.? 1 © 2018 Mariethoz et al. Published under exclusive license by The American Society for Biochemistry and Molecular Biology, Inc. • tapraid4/zjw-macp/zjw-macp/zjw01118/zjw5814-18a xppws Sϭ3 22/8/18 6:35 4/Color Figure(s) F1–5 ARTNO: RA118.000799 Glycomics@ExPASy
makes the extraction and collation of information labor inten- support for GLYCOSCIENCES.de and reaching the natural sive and time consuming (11). In the end, glycan, glycoprotein end of the CFG project has resulted in freezing further devel- and glycan-binding data has accumulated but at different opment and creating acute maintenance problems for the paces and with poor interrelatedness, when obviously all corresponding developed portals. these parameters need to be connected for understanding of In this context, the important focus of our bioinformatics the functional role of glycoproteins. and glycoscience collaborative initiative was to guarantee These requirements highlight the need for piecing the dis- on-line resource longevity and try to circumvent issues arising parate information together and building corresponding ana- from short-term or restricted funding. We therefore selected a lytical tools to facilitate data acquisition and interpretation well-established bioinformatics portal, ExPASy (22) that has (12). The recent rise of glycoproteomics analytical methods been hosting the leading proteomics Swiss-Prot knowledge- (10) and the constant development of higher throughput and base, then UniProt, for over twenty years. We have gradually quantitative glycomics (13) is bringing to the fore the need for populated ExPASy with developed glyco-related web-based dependable glycoinformatics resources. A range of tools is resources and a glycomics tab was created where this new now available for processing glycan structure analytical data collection, that we have called Glycomics@ExPASy, is now mostly from mass spectrometry (14). Spectral annotation is growing. We have recently broadened our range by including likely to improve when comprehensive structural databases a selection of reference web-based resources in glycoscience become universally available. This evolution is being influ- that have been developed elsewhere, thereby increasing the enced by several recent moves toward facilitating information coverage of the many aspects of glycobiology and the use- sharing. First, glycan structural data collection can now be fulness of the portal. This also matches the spirit of ExPASy, given unique identifiers that are supported by the wider com- which has traditionally catalogued a mix of in-house and munity (15). Second, an agreement on the simplified repre- external resources in other fields, such as proteomics. sentation of carbohydrates with the Symbol Nomenclature for We initially appraised the requirements of an efficient gly- Glycans (SNFG) has been reached (16). Third, guidelines for coinformatics toolbox to support research in glycomics and recording experimental metadata are being defined through glycoproteomics and identified some of the gaps between the MIRAGE initiative (17, 18). This momentum needs to per- glycomics and other -omics. We then set ourselves the goal of sist to bring glycoinformatics to a more mature stage com- filling those gaps. Early tasks involved defining and/or select- patible with larger scale glycomics and glycoproteomics stud- ing data formats and ontologies, structuring data in new da- ies. We recently reviewed the current status of the field (19). tabases (23, 24) and implementing new tools (25–27). The The range of existing, as well as future, glycoinformatics philosophy of our initiative is to be {glycobiologist AND protein resources warrants dedicated portals as initiated many years scientist}–friendly with the aim of facilitating (1) the use of ago by GLYCOSCIENCES.de (20) and by the Consortium for bioinformatics in glycoscience and (2) the relation between Fn1 Functional Glycomics (CFG)1. Each proposed specialized glycobiology and protein-oriented bioinformatics resources. toolboxes. Although the former gathered databases and soft- The scarcity of this bridging data led us to design tools as ware that were strongly anchored in the chemistry of glycans interactively as possible based on implementing database and glycoproteins, the latter offered comprehensive glycan- connectivity to facilitate data exploration and support hypoth- binding data. Later, RINGS (21) was launched; it hosts a esis building. series of tools based on machine learning and tree mining to Many glycomics and glycoproteomics experiments tend to classify and align glycan structures as well as utilities for generate results in the form of lists of results; such as lists of translating glycans structures between different encoding for- glycan compositions or glycan structures, lists of glyco- mats. However, in all these initiatives, many of the corre- epitopes or lists of glycoproteins. The Glycomics@ExPASy sponding web interfaces are rather cryptic outside the glyco- toolbox is organized to move away from displaying lists and to biology community. Furthermore, the interruption of financial provide web interfaces highlighting relationships between the molecular entities involved. For example, shared substruc- 1 The abbreviations used are: CFG, Consortium for Functional Gly- tures as glycoepitopes (e.g. Lewis a as part of Lewis b) are not comics; LC, liquid chromatography; H/UPLC, high or ultra-high per- apparent in a list whereas they can be highlighted if shown as formance liquid chromatography; NMR, nuclear magnetic resonance; the products of an enzymatic pathway and as components of CGE-LIF, capillary gel electrophoresis with laser-induced fluores- cence; MD, molecular dynamics; ITC, isothermal titration calorimetry; a bigger glycan structure. Our technical options aim at imple- SPR, surface plasmon resonance; GUI, graphical user interface; API, menting modular, interoperable and reusable applications to application programming interface; SNFG, symbol nomenclature for rationalize and speed up development. In addition, the cov- glycan; GBP, glycan binding protein; SOA, service oriented architec- erage of our resources is designed to reflect the situation at ture; ESF, European Science Foundation; RDF, resource description the cell surface where glyconjugates and glycan-binding pro- framework; NCBI, National Centre for Biotechnology Information; MIRAGE, minimum information required for a glycomics experiment; teins functionally interact. Glycomics@ExPASy has been CBS, Center for Biological Sequence Analysis; HGI, Human Glyco- listed as a reference on the NCBI glycan page that was proteomics Initiative. associated with the recent third edition of Essentials of Gly-
2 Molecular & Cellular Proteomics 17.? tapraid4/zjw-macp/zjw-macp/zjw01118/zjw5814-18a xppws Sϭ3 22/8/18 6:35 4/Color Figure(s) F1–5 ARTNO: RA118.000799 Glycomics@ExPASy
C O L O R
FIG.1.The glycomics tab of the ExPASy website. This figure captures the Glycomics tab of ExPASy website. Glycomics resources are divided in two sections: Databases on the left and Tools on the right. In addition, resources developed by SIB are identified with the SIB logo whereas a gray icon precedes external tools. This screenshot reflects the content as of July 2018. The range of databases and tools is destined AQ: L to expand and this image is likely to differ in the years to come.
cobiology, as highlighted in (28) but only described as a URL. need for translators (11). To minimize the encoding and decoding The present article details for the first time the growing con- phases in Glycomics@ExPASy tools, we chose GlycoCT (30) as our tent of the present interactive tool collection destined and global glycan encoding format. GlycoCT is widely used in the most recent databases and software. All the web interfaces developed for designed to expand as glycoinformatics grows. The glycoin- SIB tools, accept GlycoCT as input for glycan structures. All SIB formatics resource described here emphasizes the potential databases (UniCarb-DB, GlyConnect and SugarBindDB) store glycan of database and tool combination for integrating data and structures in GlycoCT and IUPAC. However, GlycoCT is not ideally building new hypotheses. suited for efficient structure similarity search or motif extraction. For these purposes the Resource Description Framework (RDF) scheme MATERIALS AND METHODS is recognized as a preferable option (11, 12). We implemented an RDF data model as the basis of GlyS3 (26), the generic substructure As in any other section/tab of ExPASy, e.g. proteomics, two lists search tool in Glycomics@ExPASy. For example, binding motifs of describe the collection, one for databases and one for software tools F1,2 SugarBindDB are cross-referenced to full structures of GlyConnect as shown in the screenshot of Fig. 1. The purpose of each item of the using GlyS3. list is briefly summarized after its name. Our current glycomics and Structure Representation—The 3rd edition of Essentials of Glyco- glycoproteomics selection is exclusively composed of databases and web-interfaced tools to the exception of MzJava, a software package biology, the reference manual in the field, highly recommends the used in several of our applications. It was designed for a broad usage SNFG defined in (16) for representing glycan structures. All the tools in proteomics and glycomics spectral data management (29). MzJava in Glycomics@ExPASy are SNFG-compatible. SNFG is also set as the was in fact primarily listed in the proteomics tab of ExPASy where it default representation in all SIB databases in which visualization in the was popularized. Its occurrence in Glycomics@ExPASy simply high- Oxford nomenclature (31) and text-IUPAC condensed is optional. lights the potential for applications in either glycomics or proteomics. Cross-references—All the databases developed in Glycomics@ We now describe the principles guiding the development of ExPASy describe species with the taxonomy of the National Centre Glycomics@ExPASy, as well as the technical choices we made for for Biotechnology Information (NCBI) (32) and provide a UniProtKB in-house resources. (33) accession number for each annotated glycoprotein. Furthermore, Formats and Standards—We strive to use the same nomenclature, to facilitate the connection with other glyco-resources and to be format and reference sources to encode and describe glyco-related compatible with multiple encoding standards, glycan structures are information in our resources as detailed below. associated with unique identifiers of GlyTouCan (15), the international Structure Encoding—As explained in (11), several naming schemes glycan structure repository. When glycan structures are imported into have been proposed to describe and represent each monosaccha- one of our data sources, we manually check if they are already ride. The diversity of encoding of monosaccharides is in most cases present in the repository, otherwise we proceed with the registration. summarized in MonosaccharideDB (20). Assuming monosaccharides The unique identifier is useful to easily connect tools. In addition, are clearly defined, the next step is to describe and represent the GlyTouCan automatically generates cartoons and linear representa- tree-like structure of glycans. Several attempts have been made to tion in diverse formats. encode structures in a linear format that is computationally advanta- Controled Vocabularies and Ontologies—Although glycan struc- geous. Such variety of glycan encoding formats has prompted the tures are the central elements in the resources of Glycomics@
Molecular & Cellular Proteomics 17.? 3 tapraid4/zjw-macp/zjw-macp/zjw01118/zjw5814-18a xppws Sϭ3 22/8/18 6:35 4/Color Figure(s) F1–5 ARTNO: RA118.000799 Glycomics@ExPASy
C O L O R
FIG.2.The essential entities described in the resources of Glycomics@ExPASy. One of the main purposes of Glycomics@ExPASy is the integration of the tools in the collection shown in Fig. 1. To that end, GlyConnect is used as a dashboard for using in-house and cross-referenced resources. It is designed to ease navigation between entities named in red: protein, peptide, site, glycan, composition and ligand. A protein, a glycan composition or a glycan structure (through its structural properties) is an entry point in the database. Then structures, AQ: M peptides and sites can be listed and compared and possible correlations brought out.
ExPASy, a substantial amount of biological data can be associated bases and their extent to a minimum, the glyco-epitope collection is with each glycan structure. To ease cross-linking of the information made accessible only through applications. As detailed in (37) it and ensure its quality, we use existing ontologies and controlled results from the compilation of four independent sources (Glyco vocabularies whenever possible. More specifically, tissue information EpitopeDB, SugarBindDB, Glyco3D (38) and the literature) and the is systematically tagged with Uberon (34) and Brenda (35) identifiers glycoepitopes were all translated into the GlycoCT format. Glycan (now operational in GlyConnect and soon extended to SugarBindDB array binding data were downloaded from the Consortium for Func- and UniCarb-DB). Disease associations in SugarBindDB and GlyCon- tional Glycomics (CFG) database (http://www.functionalglycomics. nect match appropriate terms from Disease ontology (36). org/glycomics/publicdata/) to reinforce the lectin recognition de- Experimental Evidence—The Minimum Information Required for A scribed in SugarBindDB. This data can of course still be queried Glycomics Experiment (MIRAGE) guidelines have been issued (17) but directly from the CFG website. The Database table (Table I) summarizes to date, are not implemented widely. UniCarb-DB, the glycomic mass the data that is fully hosted at SIB. Note that the Unilectin joint project spectrometry (MS) database and repository (23), is accumulating (https://www.unilectin.eu) aiming at classifying and predicting lectins as information in the first MIRAGE-compliant MS-based database (man- well as enhance Lectin3D (38) was included in 2018. Data is presently uscript in preparation). hosted at CERMAV (https://www.cermav.cnrs.fr). Technical Options for Databases Hosted on ExPASy—The All curated data within the Glycomics@ExPASy initiative are stored Glycomics@ExPASy backend is built on top of a family of curated in relational databases, which form the knowledge base of our web- T1 databases (Table I). Three (SugarBindDB, UniCarb-DB and Gly sites and tools. The technology stack is similar in our three databases Connect) can be directly queried on-line whereas the remainder serve and from a user perspective, the “look and feel” is harmonized. The as sources for dedicated tools. To keep both the number of data- databases are powered with a very recent version of PostgreSQL
4 Molecular & Cellular Proteomics 17.? tapraid4/zjw-macp/zjw-macp/zjw01118/zjw5814-18a xppws Sϭ3 22/8/18 6:35 4/Color Figure(s) F1–5 ARTNO: RA118.000799 Glycomics@ExPASy
TABLE I Database table List of databases with data hosted at SIB. Note that Unilectin is developed in collaboration with SIB but data is hosted at CERMAV (https://www.cermav.cnrs.fr). Status Name Purpose Ref On-line direct web access SugarBindDB Host-pathogen interactions (24) UniCarb-DB Experimental MS data (23) GlyConnect Glycoconjugate data (extension of Ϫ GlycoSuiteDB (48)) Only accessed through tool usage Glycoepitopes Protein-binding substructures Ϫ CFG data Glycan array experiments Ϫ
(version 9.5) in a dedicated web application for each. These web (now Bioinformatic unit at Technical University of Denmark) applications are built with the Java technology and the Play! Frame- have stood the test of time and portal changes. They pre- work (version 2.5). These options facilitate the development of corre- ceded the current effort described here to provide access to sponding Graphical User Interfaces (GUIs) and a REST Application Programming Interface (API) for machine communication. glyco-resources from ExPASy. This “seed” selection illus- With the power of the latest version of the database engine, an trates our willingness to mix in-house with externally devel- aggregation of the glycan structures, protein recognition and bindings oped quality informatic tools in the collection. It also empha- has led to storing data and annotation in a centralized repository. As sizes the continuity of our endeavor because two authors of a result, a communication Application Programming Interface (API) GlycoMod (46) are authors of the present article. Note that and GlyConnect acting as a dashboard provide the users with a broader view and the means to query and navigate through the GlycoSuiteDB (48) was also briefly part of the proteomics different data sets. Database information is encapsulated in a series collection between 2009 and 2013. of tables, which cannot be accessed from outside SIB. Open access Glycomics@ExPASy is developed in close collaboration to the data will however be shortly granted via an RDF end point. The with several glycoscience and glycoproteomics groups ac- end point comes with a new ontology, inspired from GlycoRDF (39) knowledged at the end of this article. As such it is destined to that helps users understanding relationships across different molec- ular entities. Using the ontology, any computer or user can integrate expand the range of databases and tools it contains. Glycomics@ExPASy data with external resources that provide a pub- In-house Data Collection—Data stored in the listed glyco- lic RDF end point and have an entity in common. The use of RDF and databases is in most cases, manually curated. For the SIB- dedicated ontology makes Glycomics@ExPASy compliant with FAIR labeled databases, curation is jointly undertaken with glyco- Guiding Principles for scientific data management (40). Technical Options for In-house Tool Development—The logic be- scientist partners who supply expert information assessment hind all Glycomics@ExPASy tools has been shaped using a Service that defines the filtering and subsequent annotation proce- Oriented Architecture (SOA) (41). We implemented each specific task dures. However, the expected increase of data production as a service, accessible with a common Application Programming has prompted tagging the stored information as “reviewed” or Interface (API), which can be combined to create more complex tools. “unreviewed” depending on the level of data curation. The Our willingness to extend principles of modularity and reusability to st the graphical user interface (GUI) prompted the search for a modular 21 century is being characterized by a major speed gap framework, which would allow the development of separated GUI between data production and its curation. All -omics fields are components as building blocks in the deployment of new tools. To undergoing a data deluge to produce the current Big Data reach our goal, we followed the Web Component standard created by challenge. Thus, at this stage “reviewed” glycoprotein infor- W3C and Google. It offers an easy composition and reuse of GUI mation is based on the curated literature based GlycoSuiteDB components. As a proof of concept, the framework named Google Polymer was used to redesign the interface of GlycoDigest (previously (48) and further updating work performed by Dr Robyn Peter- only available as download (25)). Then it was used more extensively son in the Packer research group. “Reviewed” glycan mass to build SwissMassAbaccus, Glynsight (37), PepSweetener (42), spectrometry data is annotated by the expert Karlsson group EpitopeXtractor (37) and the Search section of GlyConnect. In the in UniCarb-DB. The “unreviewed” tag usually involves minimal long run, this approach saves time because a new tool can be created quality check and currently only involved the integration of MS with readily available services and GUI building blocks. In addition, the ability to quickly build prototypes based on ideas from the wet lab large-scale glycoproteomics data. facilitates collaboration between developers and users. The glycan data and information stored in our databases reflects the actual trends in the literature. High-throughput RESULTS glycopeptide studies as illustrated in the latest published ExPASy, now a mature 24-year-old portal accessible to the works (49, 50) are overwhelmingly focused on human N-gly- Life Science community (43), has mainly hosted proteomics cosylation. These large data sets are now included in GlyCon- tools (44) until 2012 when it was extended to host other nect with an “unreviewed” status. Despite this subsequent -omics resources (22). CAZy (45), GlycanMass and GlycoMod and temporary bias toward human N-glycans reflecting data (46) as well as the neural net glyco-site predictor series (47) of availability, GlyConnect content is diverse and cross-species. the former Center for Biological Sequence Analysis (CBS) It does cover a significant number of O-glycans. In contrast,
Molecular & Cellular Proteomics 17.? 5 tapraid4/zjw-macp/zjw-macp/zjw01118/zjw5814-18a xppws Sϭ3 22/8/18 6:35 4/Color Figure(s) F1–5 ARTNO: RA118.000799 Glycomics@ExPASy
TABLE II Dedicated tool table This table lists all the tools developed by SIB. For each tool the table shows a short description followed by its reference. In addition, we report the software language that has been used and if the tool is accessed through a web interface written with Google Polymer web components. Web Name Purpose Ref Language component GlycoMod Predicts glycan structures from mass data (46) No Perl GlycoDigest Simulates exoglycosidase digestion (25) Yes Javascript SwissMassAbacus Glycopeptide mass calculator Ϫ Yes Javascript GlycoSiteAlign Aligns glycosite regions depending on attached glycan (27) No Python PepSweetener Predicts intact glycopeptides (peptide ϩ glycan composition) from mass data (42) Yes Javascript Glycoforest Partial de novo sequencing of glycans from MS/MS data (60) No Java Glynsight Displays interactively glycan profile changes on a single protein in multiple (37) Yes Javascript conditions
UniCarb-DB is biased toward O-glycan spectra. A greater because they are cross-referenced in our resources (this is integration of these two databases is planned in the near the case for example of GlyTouCan or Glyco3D), or are future. GlyConnect also contains a few entries of C-linked planned to be in the near future. Our latest addition to the glycans which are much rarer in the literature as well as Database section is Glycopedia that provides a source of structures of glycans not attached to proteins such as milk basic to advanced knowledge of glycoscience. Glycopedia oligosaccharides. collects e-chapters and compiles various sources to suggest UniLectin was launched in 2018 and currently contains an relevant readings. It is consulted by naïve or expert users and extended version of the Lectin3D part of the Glyco3D collec- as such is increasingly referred to in and outside the field of tion (38). It is dedicated to describing lectins and their glycan glycoscience. ligands initially based on information stored in the PDB (Pro- We include resources upon suggestion or request. Note tein Data Bank). Another (yet unpublished) aspect of the Uni- that as a rule, applying to most sections of ExPASy, to be Lectin project is to predict lectin domains and folds in unchar- included in the portal, the resources must be web-based, acterized amino acid sequences. regularly updated and improved. With these criteria met, we Selection of External Data Sources—Because of the grow- are open to grow the collection with those qualifying re- ing interest in glycomics, several glycoscience groups around sources. Our contact form on the ExPASy website is suited for the world are releasing new or updated versions of glycoin- suggesting a tool or a database to be assessed and added. formatics tools and databases. Glycomics@ExPASy is set to The inclusion process is ongoing. integrate strictly web-based, quality recognized and sup- Dashboard—We briefly introduce here GlyConnect, our ported resources for glycomics. Tools and databases devel- new dashboard released in December 2017 and unpublished oped by our group at SIB are preceded by the SIB logo so far. GlyConnect is designed to monitor, integrate and whereas external resources have an “external link” gray icon. facilitate the interpretation of collected glycomics and glyco- The inclusion of MatrixDB (51) and CSDB (52) in the data- proteomics data. It is the central platform of Glycomics@ base list of Glycomics@ExPASy offers access to curated gly- ExPASy that will boost the usefulness of on-line services by cosaminoglycans (GAGs) and bacterial and fungal glycan in- tightly integrating tools and databases. The essential entities formation respectively. The registration in the GlyTouCan that are stored or processed in GlyConnect, are shown in repository of all glycan structures recorded in our own and in Fig. 2. partner databases greatly facilitates integrative efforts. This Integrative Tools—Database associated tools can be task has required substantial preliminary work on data stand- grouped into two categories, either dedicated to solve a spe- ardization and was carried out for GAGs in collaboration with cific question or integrative to be used in several applications. MatrixDB developers (manuscript resubmitted after minor The collection is depicted in this way in Table II and III. Most T2,3,AQ:H revision). dedicated tools are described in detail in the cited references. The duplication of the link to Glycomics@ExPASy of the For example, integrative tools are: renowned CAZy database of glycosylation enzymes from the • Substructure search, GlyS3 is used in linking Sugar- ExPASy proteomics tab, where it has been listed for many BindDB to GlyConnect, in the construction of the Glydin’ years, improves the set of relevant databases by covering the map and in some queries of GlycoSiteAlign, knowledge of glycan synthesis and degradation. • EpitopeXtractor is integrated in Glynsight and GlyConnect, ZSI External resources are listed in supplemental Table S1.In • GlycanBuilder, developed within the EuroCarb project the vast majority of cases, they have been selected either (53) is used as a query tool in the databases GlycoDigest
6 Molecular & Cellular Proteomics 17.? tapraid4/zjw-macp/zjw-macp/zjw01118/zjw5814-18a xppws Sϭ3 22/8/18 6:35 4/Color Figure(s) F1–5 ARTNO: RA118.000799 Glycomics@ExPASy
TABLE III Integrative tool table This table lists all the integrative tools developed by SIB. For each tool the table shows a short description followed by its reference. In addition, we report the software language that has been used and if the tool is accessed through a web interface written with Google Polymer web components. Rows written in grey correspond to software developed outside SIB. Web Name Purpose Ref Language component GlycanBuilder Graphic interface for drawing glycan (operational but slow) (53) No Java SugarSketcher Client-side graphic interface for drawing glycan (prototype) Ϫ Yes Javascript Glydin’ Displays interactively a map of glycoepitopes and shared substructures (37) No Python GlyS3 Glycan substructure search (26) Yes Java EpitopeExtractor Outputs known glycan determinants from glycan structures (37) Yes Javascript LiteMol Displays interactively 3D models of glycoproteins (54) No Javascript
and GlyS3. It will be soon replaced by SugarSketcher, of glycoproteomics MS data and avoiding their dependence a new web component currently being prototyped. The on a set workflow or type of instrument (13). Note that an beta version is nonetheless accessible on the server international assessment of the latter tools available for the (https://glycoproteome.expasy.org/sugarsketcher) and interpretation of complex glycopeptide MS/MS data is under- awaits feedback from users, way through a call launched in 2018 within the HUPO Hu- • LiteMol (54) was also not developed in our group but is man Glycoproteomics Initiative (HGI: https://www.hupo.org/ used across our resources for visualizing 3D structures when available in the PDB (55). LiteMol was selected as Glycoproteomics-(B/D-GPP)) for testing the performance of the main 3D visualizer in the PDBe, the European version the currently available glycoproteomics MS software with of PDB World. benchmark data sets. The outcome of this study will guide the presentation of the Glycomics@ExPASy toolbox toward a Hypothesis Building—Glycomics@ExPASy attempts to more informative and didactic section on MS-based glyco- support the formulation of hypothese in the broad range of proteomics data analysis tools. Given that we collect exclu- biological functions where glycans are involved. We offer a sively web-based resources rules out the inclusion of efficient toolbox to navigate, explore and correlate data. As an exam- but only stand-alone software. One such web-based pro- ple, using GlyS3, SugarBindDB, GlyConnect and the respec- tive cross-links of these databases to UniProt, a user can seek teomics viewer caters for the visualization of glycoproteomics to establish the consistency of interactions taking place at the MS data, namely MSViewer, integrated in the Protein Pro- cell surface. In the following, three possible use cases are spector software suite that uses a specific glycosite database brought to the reader. to enhance intact glycopeptide identification (58). Protein Pro- From MS to Glycoprotein Features—Our tool set is de- spector features in the proteomics tab of the ExPASy portal as signed to match the expected boost of glycoproteomics (gly- it is currently recognized as a protein identification platform. can composition at specific sites on complex mixtures of In the best cases, glycoproteomics experiments also gen- glycoproteins) data that is currently just reaching high erate relative quantitative data (see (59) for example). In such throughput level (56, 57). An example on how to integrate favorable, though presently rare, data, a set of site-specific some of our dedicated tools for extracting glycoprotein fea- glycan compositions can be associated with levels of abun- F3 tures from MS data is shown in Fig. 3. Predominant precursor dance, which can then be processed by Glynsight that will masses in the MS spectra can be input into PepSweetener create glycan profiles for each glycoprotein in the experiment (42). This software supports the manual annotation of intact (37). A file containing a list of glycan compositions present on glycopeptides, using custom web visualization regardless of a protein, with the respective quantification, can be uploaded the instrument that produced the data. Results are displayed and processed by Glynsight. This tool produces a custom on an interactive heat-map chart featuring the combined visualization that highlights up/down-regulated glycan com- mass contributions of theoretical (usually tryptic) peptides and positions (either site-specific or global) among diverse pro- attached glycan compositions. The variations in tile colors teins or on the same protein under different conditions, for correspond to ppm deviations from the query precursor mass. example healthy/cancer. Fig. 3 highlights the situation where Annotation can be refined through glycan composition filter- two proteins are successively processed with PepSweetener ing, sorting by mass and tolerance, and checking MS2 data and assumes quantitative data are provided for each identi- consistency via an in silico peptide fragmentation diagram fied glycan composition. The Glynsight interface can display a (in-house fragmentation tool common with that of UniCarb- glycan expression profile for each protein as well as the DB). PepSweetener is mainly designed as a complement or differential profile between two proteins. Furthermore, Glyn- extension to software being developed for automatic analysis sight integrates with GlyConnect, such that any glycan com-
Molecular & Cellular Proteomics 17.? 7 tapraid4/zjw-macp/zjw-macp/zjw01118/zjw5814-18a xppws Sϭ3 22/8/18 6:35 4/Color Figure(s) F1–5 ARTNO: RA118.000799 Glycomics@ExPASy
C O L O R
FIG.3.From Mass Spectrometry data to glycoprotein profile. A representative scenario of the possible combination of PepSweetener and Glynsight in order to support the manual annotation of MS1 mass spectra of intact N-glycopeptides and integrate quantitative information when available. Users can process MS1 Spectra using PepSweetener to identify all the possible N-glycan compositions on a single human protein. Intact glycopeptide masses are broken into the respective contributions of the peptide and the glycan masses. Compositions in PepSweetener ZSI are in the detailed format shown in supplemental Fig. S1. Then, when quantitative data on each composition is available, Glynsight can be used to identify specific glycosylation patterns. The procedure can be repeated with a second protein and Glynsight will automatically generate the differential analysis of glycan profiles on the proteins. The integration with Glyconnect leads to displaying the potential glycan structures known to match the differentially expressed monosaccharide compositions.
position is connected to all the putative glycan structures that The GlyConnect search tool will retrieve the possible related have already been reported in the literature. In the end, after glycan structures and the proteins that have been reported to determining a pool of interesting compositions from MS data, have these compositions/structures attached and stored in scientists can leverage knowledge and tools in Glycomics@ the databases. Results are displayed in a form of a conceptual ExPASy to build new hypotheses and design the next round map where the compositions sit in the middle and connect of experiments. glycan structures and associated glycoproteins, respectively Exploring Glycoprotein Features—Current global glycome on the right and the left sides (Fig. 4 part 1). This visualization F4 profiling experiments generate one or more set(s) of glycan is well suited for understanding the potential relations be- compositions and/or structures with their respective expres- tween proteins and glycans. Selected glycan structures sion on a protein, in a tissue or in a cell. Tools and databases can be further explored through activating the integrated in Glycomics@ExPASy can be combined to explore distinctive EpitopeXtractor function (Fig. 4 part 2a). This tool extracts all glycan features that characterize glycoproteins as shown in glycoepitopes that form a part of the selected structures, Fig. 3. In this case, the entry point of the workflow is Gly- based on our curated set of glycoepitopes (see Material & Connect to which a list of glycan compositions is submitted. Methods section). To further explore the results, Epitope
8 Molecular & Cellular Proteomics 17.? tapraid4/zjw-macp/zjw-macp/zjw01118/zjw5814-18a xppws Sϭ3 22/8/18 6:35 4/Color Figure(s) F1–5 ARTNO: RA118.000799 Glycomics@ExPASy
C O L O R
FIG.4. From composition to glycoprotein features. An interactive way of extracting glycoprotein features from glycan compositions combining published data and ad hoc tools (1). A list of compositions is input in GlyConnect, which retrieves all the proteins reported as having these compositions attached to them (on the left) and reported glycan structures corresponding to this composition (on the right) annotated in this knowledge base. Glycan structures can be further processed to extract contained glycan epitopes using EpitopeXtractor (2a). Glycoepitope results can be mapped on Glydin’, an interactive epitope network (3a). Glydin’ aggregates glycan epitopes from four different sources (databases and literature reviews) and provides links to the original information. When epitopes are taken from SugarBindDB, further information on the pathogens can be browsed (4a). Xtractor is integrated with Glydin’ (37) (Fig. 4 part 3a), the specific glycan structures to support the user in narrowing epitope network viewer so that extracted glycoepitopes can down questions of molecular interaction and selecting a sub- be mapped onto the network to highlight clusters of shared set of relevant glycans. The structural properties of these structures as well as potential outliers. Glydin’ provides two interacting glycan structures can be further exploited. In both types of network: substructure based and enzyme based. In GlyConnect and GlycoSiteAlign (27) glycan structures are as- the substructure case two nodes are connected if one gly- sociated with descriptive features of glycan type such as coepitope is a substructure of the other whereas in the en- “sialylated”, “core-fucosylated”, etc (see supplemental Table ZSI zyme case, two nodes are connected as the result of a gly- S2 for the complete set of properties). In parallel to consider- cosyltransferase adding a monosaccharide to the other. Each ing the set of glycan structures matching the initial list of node/glycoepitope, is linked to the database or publication compositions as the input of EpitopeXtractor, the composi- source(s) from which it was extracted (see (37) for details). tions can also be submitted to GlycoSiteAlign. This tool se- When glycoepitopes are contained in SugarBindDB (Fig. 4 lectively aligns amino acid sequences surrounding glycosyla- part 4a), additional information includes structures potentially tion sites (by default, 20 positions on each side of the associated with diseases. glycosylated residue) depending on structural properties of This stepwise combination of tools draws a possible path the known glycan attached to the site. In other words, this is from a list of glycan compositions to the binding properties of an alternative mean to characterize the glyco-site sequence
Molecular & Cellular Proteomics 17.? 9 tapraid4/zjw-macp/zjw-macp/zjw01118/zjw5814-18a xppws Sϭ3 22/8/18 6:35 4/Color Figure(s) F1–5 ARTNO: RA118.000799 Glycomics@ExPASy
C O L O R
FIG.5. Glycan mediated protein-protein interactions. This figure shows how a new hypothesis on glycan mediated protein-protein interaction can be built using published data in Glyconnect and SugarBindDB: A, In this scenario glycan binding protein (GBP) is selected in SugarBind. The information on the glycan ligand recognized by the GBP is used to perform a substructure search on all the structures in Glyconnect with the GlyS3 glycan substructure search tool. The structures identified by GlyS3 are used in Glyconnect to create a list of target proteins that can interact with the initial GBP. In the example of the blood group B antigen triose, there are 35 full structure types in GlyConnect that contain this glycoepitope (B) In this scenario a glycoprotein in GlyConnect is selected with its list of associated glycan structures. Glycans
10 Molecular & Cellular Proteomics 17.? tapraid4/zjw-macp/zjw-macp/zjw01118/zjw5814-18a xppws Sϭ3 22/8/18 6:35 4/Color Figure(s) F1–5 ARTNO: RA118.000799 Glycomics@ExPASy
environment with the prospect of identifying the most con- DISCUSSION strained amino acids. Our project is part of a wider initiative and is intended as a Glycan-mediated Protein-Protein Interactions—Using an- major step toward interconnecting isolated efforts within the other combination of tools and databases in Glycomics@ broad and interdisciplinary field of glycobiology. As such it is ExPASy, potential correlations between a glycan binding pro- designed to boost progress in glycoscience research. At this tein (GBP) of a pathogen a host glycoprotein and a glycan stage, we have a focus on human glycans and related pro- F5 structure (Fig. 5B) can be made. teins reflecting the current main research published in the In the first scenario (Fig. 5A), the starting point is a gly- literature. Cell surfaces, plasma and serum are the most stud- coepitope recognized by a specific GBP (Fig. 5A), a bacterial ied systems displaying glycosylation changes in many dis- lectin described in SugarBindDB. This is illustrated with the eases. However, there is overwhelming evidence of similar blood group B antigen triose. A binding event in this database changes on other glycoproteins, components of the immune is always formed by a pair composed of a GBP/lectin and a system, saliva and the protective mucus of the gastrointesti- glycoepitope part of a glycan present on the host surface. nal tract, among others. Many of these glycosylation changes Whenever possible, further information of a GBP/lectin is are biologically significant and connected with the molecular available via cross-reference to UniProt. The glycoepitope dysfunctions involved in disease initiation and are evident can be used as an input of the GlyS3 substructure search tool early in the disease progress. Being key mediators in tumor to match the full structures stored in GlyConnect that contain progression events, glycoproteins influence features of tumor this specific ligand. The list of glycan structures retrieved by cells such as proliferation, invasion, angiogenesis, and me- GlyS3 can be explored in GlyConnect that reports relation- tastasis. Besides, glycans and glycoconjugates play essential ships between glycans and glycoproteins. In this example 35 roles in the dynamic interplay between host and pathogens. matches to full structures are detailed. They are spread Our technical choices detailed in the Material & Methods across twenty O-linked, nine N-linked glycans and six struc- section were deliberately made to ensure the modularity, in- tures reported as non-attached to proteins. The shortlisted teroperability, reusability and user-friendliness of our applica- glycoproteins can be considered as candidates for interacting tions on the ExPASy portal. In the long run, this approach with the initial GBP/lectin. The cross-references from Gly- saves time because a new tool can be created with readily Connect to UniProt closes the loop for protein information. In available services and Graphical User Interface (GUI) building a nutshell, starting with a GBP/lectin can lead to a selection of putative glycoprotein interaction partners. blocks. To give a concrete example, SwissMassAbacus was The second scenario (Fig. 5B) starts from a glycan structure prototyped in 1 day, assembling available pieces and quickly in GlyConnect and relies on its reported relationships with deployed within a week. In addition, the ability to build quick glycoproteins. The figure shows an example of a reviewed prototypes based on ideas from the wet lab facilitates the N-linked glycan structure. GlyConnect also offers the option collaboration between developers and users. The innovative of running EpitopeXtractor to generate a selection of gly- component of this approach is not only related with time but coepitopes contained in this starting glycan. Leveraging the has an impact on work sharing. The modularity enforced by binding data in SugarBindDB, the obtained glycoepitopes can the framework allows assigning precise and delimited tasks be associated with a collection of GBPs/lectins that recognize that can be undertaken by software developers with no prior one or more of these glycoepitopes. In the end, the workflow involvement in the project. Web Components and SOA are allows the selection of GBPs that could possibly interact with key to the fast delivery of ad hoc applications to our end users the glycoproteins on which the starting glycan has been re- mixing pre-built GUI components with a solid API. Mainte- ported to be attached. Cross-references of both glycopro- nance is minimal and easily transferred to new staff. Finally, teins in GlyConnect and GBPs/lectins in SugarBindDB to our tight network with glycoscientists contributes to biocura- UniProt can be used to further rationalize potential interacting tion and the definition of tool requirements and is creating a partners. critical mass of users and the best conditions for anticipating In the end, combinations of tools and databases in the bioinformatics needs of the community. This privileged Glycomics@ExPASy can be used to generate a selection of position combined with rationally chosen technology seems putative glycan mediated protein-protein interactions. These our best chance to guarantee long term production, quality, can be considered for further experiments. usefulness and maintenance.
are processed with EpitopeXtractor to single out all the glycan epitopes contained. The Glydin’ interactive map of structurally related glycoepitopes helps visualizing the potential common substructures in the complete set of glycoepitopes. Then, extracted epitopes are used in SugarBind to identify all the reported GBPs that can possibly interact with the initial glycoprotein. In this example, the VP1 capsid protein of the Norwalk virus is known to bind the blood group B antigen triose. Note that protein structures shown above UniProt and those shown above GlyConnect are not related to the example but simply illustrating the difference in the information that is stored on the unglycosylated AQ: N protein in contrast with the stored information on intact glycoproteins.
Molecular & Cellular Proteomics 17.? 11 tapraid4/zjw-macp/zjw-macp/zjw01118/zjw5814-18a xppws Sϭ3 22/8/18 6:35 4/Color Figure(s) F1–5 ARTNO: RA118.000799 Glycomics@ExPASy
We are aware that the scenarios we have proposed to 2. Wuhrer, M. (2013) Glycomics using mass spectrometry. Glycoconj. J. 30, combine several tools and rely on curated data are still limited 11–22 3. Lundborg, M., and Widmalm, G. (2011) Structural analysis of glycans by by the minimal content of available databases. As an exam- NMR chemical shift prediction. Anal. Chem. 83, 1514–1517 ple, we are collecting data from published glycoproteomics 4. Adamczyk, B., Sto¨ ckmann, H., O’Flaherty, R., Karlsson, N. G., and Rudd, data sets including the increasing O-GlcNAc data. We antici- P. M. (2017) in High-Throughput Glycomics and Glycoproteomics, eds Lauc G, Wuhrer M (Springer New York, New York, NY), pp 97–108 pate this shortcoming of limited data to lessen with time with 5. Ruhaak, L. R., Hennig, R., Huhn, C., Borowiak, M., Dolhain, R. J. E. M., converging efforts to produce more glycoproteomics data and Deelder, A. M., Rapp, E., and Wuhrer, M. (2010) Optimized workflow for improved experimental technologies. Glycomics@ExPASy has preparation of APTS-labeled N-glycans allowing high-throughput analy- sis of human plasma glycomes using 48-channel multiplexed CGE-LIF. been designed to accommodate this easily. J. Proteome Res. 9, 6655–6664 6. Grant, O. C., and Woods, R. J. (2014) Recent advances in employing CONCLUSION molecular modelling to determine the specificity of glycan-binding pro- teins. Curr. Opin. Struct. Biol. 28, 47–55 Glycomics@ExPASy is destined to become an essential 7. Cecioni, S., Praly, J.-P., Matthews, S. E., Wimmerova´ , M., Imberty, A., and and valuable reference portal for glycoscience research. Of- Vidal, S. (2012) Rational design and synthesis of optimized glycoclusters fering bioinformatics resources for glycomics and glycopro- for multivalent lectin-carbohydrate interactions: influence of the linker arm. Chem. - Eur. J. 18, 6250–6263 teomics on a server that has built its long-lasting reputation on 8. Pilobello, K. T., Krishnamoorthy, L., Slawek, D., and Mahal, L. K. (2005) bioinformatics for proteomics and more recently for other Development of a lectin microarray for the rapid analysis of protein common -omics is an effective means of pulling glycoscience glycopatterns. ChemBioChem 6, 985–989 9. Heimburg-Molinaro, J., Song, X., Smith, D. F., and Cummings, R. D. (2011) out of its isolation by boosting visibility and cross disciplinary in Curr. Protoc. Protein Sci., eds Coligan JE, Dunn BM, Speicher DW, research focus. Glycomics@ExPASy was designed and is Wingfield PT (John Wiley & Sons, Inc., Hoboken, NJ, U.S.A.), p developed with a long-term vision and in collaboration with 12.10.1–12.10.29 10. Thaysen-Andersen, M., Packer, N. H., and Schulz, B. L. (2016) Maturing glycoscientists to meet as closely as possible the needs of the glycoproteomics technologies provide unique structural insights into the community in glycoinformatics. This vision in the first instance N-glycoproteome and its regulation in health and disease. Mol. Cell encompasses reaching out to protein scientists who consider Proteomics 15, 1773–1790 11. Campbell, M. P., Ranzinger, R., Lu¨ tteke, T., Mariethoz, J., Hayes, C. A., glycomics a too complicated topic to include in protein char- Zhang, J., Akune, Y., Aoki-Kinoshita, K. F., Damerell, D., Carta, G., York, acterization studies. Next steps will involve the inclusion of W. S., Haslam, S. M., Narimatsu, H., Rudd, P. M., Karlsson, N. G., data on other biologically important glycoconjugate data such Packer, N. H., and Lisacek, F. (2014) Toolboxes for a standardised and systematic study of glycans. BMC Bioinformatics 15, S9 as glycolipids and proteoglycans and their connection with 12. Campbell, M. P., Aoki-Kinoshita, K. F., Lisacek, F., York, W. S., and Packer, other -omics data. N. H. (2015) in Essentials of Glycobiology, eds Varki A, Cummings RD, Esko JD, Stanley P, Hart GW, Aebi M, Darvill AG, Kinoshita T, Packer NH, Acknowledgments—We thank Serge Perez, Anne Imberty, Sylvie Prestegard JH, Schnaar RL, Seeberger PH (Cold Spring Harbor Labora- Ricard-Blum, Bernard Henrissat, Gordan Lauc, Manfred Wuhrer, Dan- tory Press, Cold Spring Harbor (NY)). 3rd Ed. iel Kolarich, Natasha Zachara, Pauline Rudd, Kiyoko Aoki-Kinoshita, 13. Walsh, I., O’Flaherty, R., and Rudd, P. M. (2017) Bioinformatics applications Gavin Davey, Sriram Neelamegham and Elaine Mullen for their support to aid high-throughput glycan profiling. Perspect. Sci. 11, 31–39 and cooperation in promoting this effort. We also thank Franc¸ ois Bon- 14. Hu, H., Khatri, K., and Zaia, J. (2016) Algorithms and design strategies towards automated glycoproteomics analysis: algorithms and design nardel, Matthew Campbell, Leonardo Castorina, Renaud Costa, Marcin AQ: J strategies. Mass Spectrom. Rev. 36, 475–498 Domagalski, Marie Ghraichy, Lou Gotz, Nicolas Hory and Josefina 15. Tiemeyer, M., Aoki, K., Paulson, J., Cummings, R. D., York, W. S., Karlsson, Lascano Maillard for their contribution to some of the hosted resources. N. G., Lisacek, F., Packer, N. H., Campbell, M. P., Aoki, N. P., Fujita, A., Matsubara, M., Shinmachi, D., Tsuchiya, S., Yamada, I., Pierce, M., Ran- * This work is supported by the Swiss Federal Government through zinger, R., Narimatsu, H., and Aoki-Kinoshita, K. F. (2017) GlyTouCan: an the State Secretariat for Education, Research and Innovation (SERI). accessible glycan structure repository. Glycobiology 27, 915–919 AQ: K The ExPASy portal is maintained by the web team of the Swiss Institute 16. Varki, A., Cummings, R. D., Aebi, M., Packer, N. H., Seeberger, P. H., Esko, of Bioinformatics and hosted at the Vital-IT Competency Center. J. D., Stanley, P., Hart, G., Darvill, A., Kinoshita, T., Prestegard, J. J., Part of the tools described above were developed with the support Schnaar, R. L., Freeze, H. H., Marth, J. D., Bertozzi, C. R., Etzler, M. E., of the European Union (https://ec.europa.eu/research/fp7/) FP7 In- Frank, M., Vliegenthart, J. F., Lu¨ tteke, T., Perez, S., Bolton, E., Rudd, P., novative Training Network [ITN 316929] and the Swiss National Paulson, J., Kanehisa, M., Toukach, P., Aoki-Kinoshita, K. F., Dell, A., Science Foundation [SNSF 31003A_141215; http://www.snf.ch/]. Narimatsu, H., York, W., Taniguchi, N., and Kornfeld, S. (2015) Symbol ZSI nomenclature for graphical representations of glycans. Glycobiology 25, □S This article contains supplemental material. 1323–1324 ¶¶ To whom correspondence should be addressed: SIB Swiss Insti- 17. York, W. S., Agravat, S., Aoki-Kinoshita, K. F., McBride, R., Campbell, M. P., tute of Bioinformatics, Proteome Informatics Group, Route de Drize 7, Costello, C. E., Dell, A., Feizi, T., Haslam, S. M., Karlsson, N., Khoo, K.-H., AQ: I 1227 Geneva, Switzerland. E-mail: [email protected]. Kolarich, D., Liu, Y., Novotny, M., Packer, N. H., Paulson, J. C., Rapp, E., Author contributions: J.M., D.A., O.H., N.G.K., N.P., and F.L. de- Ranzinger, R., Rudd, P. M., Smith, D. F., Struwe, W. B., Tiemeyer, M., signed research; J.M., D.A., A.G., O.H., E.G., and M.R.-M. performed Wells, L., Zaia, J., and Kettner, C. (2014) MIRAGE: The minimum informa- research; D.A. and F.L. wrote the paper. tion required for a glycomics experiment. Glycobiology 24, 402–406 18. Kolarich, D., Rapp, E., Struwe, W. B., Haslam, S. M., Zaia, J., McBride, R., Agravat, S., Campbell, M. P., Kato, M., Ranzinger, R., Kettner, C., and York, REFERENCES W. S. (2013) The minimum information required for a glycomics experiment 1. National Research Council (US) Committee on Assessing the Importance (MIRAGE) project: improving the standards for reporting mass-spectrom- and Impact of Glycomics and Glycosciences. (2012) Transforming Gly- etry-based glycoanalytic Data. Mol. Cell. Proteomics 12, 991–995 coscience: A Roadmap for the Future (National Academies Press (US), 19. Lisacek, F., Mariethoz, J., Alocci, D., Rudd, P. M., Abrahams, J. L., Camp- Washington (DC)) bell, M. P., Packer, N. H., Ståhle, J., Widmalm, G., Mullen, E., Adamczyk,
12 Molecular & Cellular Proteomics 17.? tapraid4/zjw-macp/zjw-macp/zjw01118/zjw5814-18a xppws Sϭ3 22/8/18 6:35 4/Color Figure(s) F1–5 ARTNO: RA118.000799 Glycomics@ExPASy
B., Rojas-Macias, M. A., Jin, C., and Karlsson, N. G. (2017) in High- O., Edmunds, S., Evelo, C. T., Finkers, R., Gonzalez-Beltran, A., Gray, Throughput Glycomics and Glycoproteomics, eds Lauc G, Wuhrer M A. J. G., Groth, P., Goble, C., Grethe, J. S., Heringa, J., ’t Hoen, P. A., (Springer New York, New York, NY), pp 235–264 Hooft, R., Kuhn, T., Kok, R., Kok, J., Lusher, S. J., Martone, M. E., Mons, 20. Lutteke, T. (2006) GLYCOSCIENCES.de: an Internet portal to support gly- A., Packer, A. L., Persson, B., Rocca-Serra, P., Roos, M., van Schaik, R., comics and glycobiology research. Glycobiology 16, 71R–81R Sansone, S.-A., Schultes, E., Sengstag, T., Slater, T., Strawn, G., Swertz, 21. Akune, Y., Hosoda, M., Kaiya, S., Shinmachi, D., and Aoki-Kinoshita, K. F. M. A., Thompson, M., van der Lei, J., van Mulligen, E., Velterop, J., (2010) The RINGS resource for glycome informatics analysis and data Waagmeester, A., Wittenburg, P., Wolstencroft, K., Zhao, J., and Mons, mining on the Web. OMICS J. Integr. Biol. 14, 475–486 B. (2016) The FAIR Guiding Principles for scientific data management 22. Artimo, P., Jonnalagedda, M., Arnold, K., Baratin, D., Csardi, G., de Castro, and stewardship. Sci. Data 3, 160018 E., Duvaud, S., Flegel, V., Fortier, A., Gasteiger, E., Grosdidier, A., Her- 41. Laskey, K. B., and Laskey, K. (2009) Service oriented architecture: Service nandez, C., Ioannidis, V., Kuznetsov, D., Liechti, R., Moretti, S., Mo- oriented architecture. Wiley Interdiscip. Rev. Comput. Stat. 1, 101–105 staguir, K., Redaschi, N., Rossier, G., Xenarios, I., and Stockinger, H. 42. Domagalski, M. J., Alocci, D., Almeida, A., Kolarich, D., and Lisacek, F. (2012) ExPASy: SIB bioinformatics resource portal. Nucleic Acids Res. (2017) PepSweetener: A Web-based tool to support manual annotation 40, W597–W603 of intact glycopeptide MS spectra. PROTEOMICS - Clin. Appl. 1700069 23. Hayes, C. A., Karlsson, N. G., Struwe, W. B., Lisacek, F., Rudd, P. M., 43. Appel, R. D., Bairoch, A., and Hochstrasser, D. F. (1994) A new generation Packer, N. H., and Campbell, M. P. (2011) UniCarb-DB: a database of information retrieval tools for biologists: the example of the ExPASy resource for glycomic discovery. Bioinforma. Oxf. Engl. 27, 1343–1344 WWW server. Trends Biochem. Sci. 19, 258–260 24. Mariethoz, J., Khatib, K., Alocci, D., Campbell, M. P., Karlsson, N. G., 44. Gasteiger, E., Gattiker, A., Hoogland, C., Ivanyi, I., Appel, R. D., and Packer, N. H., Mullen, E. H., and Lisacek, F. (2016) SugarBindDB, a Bairoch, A. (2003) ExPASy: The proteomics server for in-depth protein resource of glycan-mediated host–pathogen interactions. Nucleic Acids knowledge and analysis. Nucleic Acids Res. 31, 3784–3788 Res. 44, D1243–D1250 45. Lombard, V., Golaconda Ramulu, H., Drula, E., Coutinho, P. M., and Hen- 25. Gotz, L., Abrahams, J. L., Mariethoz, J., Rudd, P. M., Karlsson, N. G., rissat, B. (2014) The carbohydrate-active enzymes database (CAZy) in Packer, N. H., Campbell, M. P., and Lisacek, F. (2014) GlycoDigest: a tool 2013. Nucleic Acids Res. 42, D490–D495 for the targeted use of exoglycosidase digestions in glycan structure 46. Cooper, C. A., Gasteiger, E., and Packer, N. H. (2001) GlycoMod–a soft- determination. Bioinformatics 30, 3131–3133 ware tool for determining glycosylation compositions from mass spec- 26. Alocci, D., Mariethoz, J., Horlacher, O., Bolleman, J. T., Campbell, M. P., trometric data. Proteomics 1, 340–349 and Lisacek, F. (2015) Property Graph vs RDF Triple Store: A comparison 47. Blom, N., Sicheritz-Ponte´ n, T., Gupta, R., Gammeltoft, S., and Brunak, S. on glycan substructure search. PLOS ONE 10, e0144578 (2004) Prediction of post-translational glycosylation and phosphorylation 27. Gastaldello, A., Alocci, D., Baeriswyl, J.-L., Mariethoz, J., and Lisacek, F. of proteins from the amino acid sequence. Proteomics 4, 1633–1649 (2016) GlycoSiteAlign: glycosite alignment based on glycan structure. J. 48. Cooper, C. A., Joshi, H. J., Harrison, M. J., Wilkins, M. R., and Packer, N. H. Proteome Res. 15, 3916–3928 (2003) GlycoSuiteDB: a curated relational database of glycoprotein gly- 28. Varki, A. (2017) New and updated glycoscience-related resources at NCBI. can structures and their biological sources. 2003 update. Nucleic Acids Glycobiology 27, 993–993 Res. 31, 511–513 29. Horlacher, O., Nikitin, F., Alocci, D., Mariethoz, J., Mu¨ ller, M., and Lisacek, 49. Bollineni, R. C., Koehler, C. J., Gislefoss, R. E., Anonsen, J. H., and Thiede, F. (2015) MzJava: An open source library for mass spectrometry data B. (2018) Large-scale intact glycopeptide identification by Mascot data- processing. J. Proteomics 129, 63–70 base search. Sci. Rep. 8 30. Herget, S., Ranzinger, R., Maass, K., and Lieth, C.-W. (2008) v. d. GlycoCT—a 50. Hu, Y., Shah, P., Clark, D. J., Ao, M., and Zhang, H. (2018) Reanalysis of unifying sequence format for carbohydrates. Carbohydr. Res. 343, global proteomic and phosphoproteomic data identified a large number 2162–2171 of glycopeptides. Anal. Chem. 90, 8065–8071 31. Harvey, D. J., Merry, A. H., Royle, L., Campbell, M. P., Dwek, R. A., and 51. Launay, G., Salza, R., Multedo, D., Thierry-Mieg, N., and Ricard-Blum, S. Rudd, P. M. (2009) Proposal for a standard system for drawing structural (2015) MatrixDB, the extracellular matrix interaction database: updated diagrams of N- and O-linked carbohydrates and related compounds. content, a new navigator and expanded functionalities. Nucleic Acids Proteomics 9, 3796–3801 Res. 43, D321–D327 32. Federhen, S. (2012) The NCBI Taxonomy database. Nucleic Acids Res. 40, 52. Toukach, P. V., and Egorova, K. S. (2016) Carbohydrate structure database D136–D143 merged from bacterial, archaeal, plant and fungal parts. Nucleic Acids 33. UniProt Consortium, T. (2018) UniProt: the universal protein knowledge- Res. 44, D1229–D1236 base. Nucleic Acids Res. 46, 2699–2699 53. Ceroni, A., Dell, A., and Haslam, S. M. (2007) The GlycanBuilder: a fast, 34. Mungall, C. J., Torniai, C., Gkoutos, G. V., Lewis, S. E., and Haendel, M. A. intuitive and flexible software tool for building and displaying glycan (2012) Uberon, an integrative multi-species anatomy ontology. Genome structures. Source Code Biol. Med. 2, 3 Biol. 13, R5 54. Sehnal, D., Deshpande, M., Varˇeková, R. S., Mir, S., Berka, K., Midlik, A., 35. Gremse, M., Chang, A., Schomburg, I., Grote, A., Scheer, M., Ebeling, C., Pravda, L., Velankar, S., and Koîca, J. (2017) LiteMol suite: interactive and Schomburg, D. (2011) The BRENDA Tissue Ontology (BTO): the first web-based visualization of large-scale macromolecular structure data. all-integrating ontology of all organisms for enzyme sources. Nucleic Nat. Methods 14, 1121–1122 Acids Res. 39, D507–D513 55. Burley, S. K., Berman, H. M., Kleywegt, G. J., Markley, J. L., Nakamura, H., 36. Schriml, L. M., Arze, C., Nadendla, S., Chang, Y.-W. W., Mazaitis, M., Felix, and Velankar, S. (2017) in Protein Crystallography, eds Wlodawer A, V., Feng, G., and Kibbe, W. A. (2012) Disease Ontology: a backbone for Dauter Z, Jaskolski M (Springer New York, New York, NY), pp 627–641 disease semantic integration. Nucleic Acids Res. 40, D940–D946 56. Lee, L. Y., Moh, E. S. X., Parker, B. L., Bern, M., Packer, N. H., and 37. Alocci, D., Ghraichy, M., Barletta, E., Gastaldello, A., Mariethoz, J., and Thaysen-Andersen, M. (2016) Toward automated N -glycopeptide iden- Lisacek, F. (2018) Understanding the glycome: an interactive view of tification in glycoproteomics. J. Proteome Res. 15, 3904–3915 glycosylation from glycocompositions to glycoepitopes. Glycobiology 57. Liu, G., Cheng, K., Lo, C. Y., Li, J., Qu, J., and Neelamegham, S. (2017) A 28, 349–362 comprehensive, open-source platform for mass spectrometry-based 38. Pe´ rez, S., Sarkar, A., Rivet, A., Breton, C., and Imberty, A. (2015) in glycoproteomics data analysis. Mol. Cell Proteomics 16, 2032–2047 Glycoinformatics, eds Lu¨ tteke T, Frank M (Springer New York, New York, 58. Chalkley, R. J., and Baker, P. R. (2017) Use of a glycosylation site database NY), pp 241–258 to improve glycopeptide identification from complex mixtures. Anal. 39. Ranzinger, R., Aoki-Kinoshita, K. F., Campbell, M. P., Kawano, S., Lutteke, T., Bioanal. Chem. 409, 571–577 Okuda, S., Shinmachi, D., Shikanai, T., Sawaki, H., Toukach, P., Mat- 59. Wuhrer, M., Stam, J. C., van de Geijn, F. E., Koeleman, C. A. M., Verrips, subara, M., Yamada, I., and Narimatsu, H. (2015) GlycoRDF: an ontology to C. T., Dolhain, R. J. E. M., Hokke, C. H., and Deelder, A. M. (2007) standardize glycomics data in RDF. Bioinformatics 31, 919–925 Glycosylation profiling of immunoglobulin G (IgG) subclasses from hu- 40. Wilkinson, M. D., Dumontier, M., Aalbersberg Ij Appleton, J. G., Axton, M., man serum. PROTEOMICS 7, 4070–4081 Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, 60. Horlacher, O., Jin, C., Alocci, D., Mariethoz, J., Mu¨ ller, M., Karlsson, N. G., P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, and Lisacek, F. (2017) Glycoforest 1.0. Anal. Chem. 89, 10932–10940
Molecular & Cellular Proteomics 17.? 13 2.2. Concluding remarks 51
2.2 Concluding remarks
The Glycomics@ExPASy initiative aggregates all prominent glycomics web-based resources and boosts the development of new tools to fulfil the needs of glycosci- entists. As a consequence, it facilitates the spread of bioinformatics within glyco- biology and attempts to make glycoscience easier to understand for scientists who belong to a different discipline like proteomics, genomics and so on.
53
Chapter 3
Integrative tools
3.1 Overview
This chapter focuses on integrative tools, which are designed to connect several gly- comics pieces. In particular, we describe two different tools: GlyS3 and EpitopeX- tractor. The former uses the RDF technology to search, among a collection of glycans, all structures that contain a specific subregion defined by the user. The latter, instead, verifies which glycan epitopes are present in one or more glycan structures based on a curated list of almost 600 ligands. EpitopeXtractor uses GlyS3 core features and, as well as GlyS3, is developed using RDF. To give an in-depth description of the logic behind GlyS3, we introduce here the paper: “Property Graph vs RDF Triple Store: A comparison on Glycan Substructure Search”. This work describes the details of the RDF as well as property graph implementations with their pros and cons. Additionally, it sheds light on why we have chosen RDF versus graph databases. Although the GlyS3 search engine is based on RDF, the data structure which stores glycans is taken from the MzJava library. Therefore, in the appendixA containing all co-authors pa- pers, we have added the paper "MzJava: An open source library for mass spectrometry data processing.", which defines how the MzJava library has been built and presents the data structure for glycans.
As these two pieces of work focus only on the implementation without showing why GlyS3 is an essential integrative tool, we have added two papers presenting two real use cases: "SugarBindDB, a resource of glycan-mediated host-pathogen in- teractions." and "GlycoSiteAlign: Glycosite Alignment Based on Glycan Structure." 54 Chapter 3. Integrative tools
In the first case, GlyS3 is used to connect ligands in SugarBindDB to full glycan struc- tures in GlyConnect. In the second case, instead, we have adopted GlyS3 to classify glycans in GlyConnect by structural properties (Core-fucose, non-core-fucose, bi- sected, bi-antennary, tri-antennary, tetra-antennary, etc.). This information, stored in GlycoSiteAlign, allows users to select glycosites based on specific features of the structures accommodated in each sites.
We do not have a specific paper which describes EpitopeXtractor since it has been presented in a study together with Glynsight and Glydin’. So, we introduce this article in the next chapter dedicated to explorative tools. RESEARCH ARTICLE Property Graph vs RDF Triple Store: A Comparison on Glycan Substructure Search
Davide Alocci1,2, Julien Mariethoz1, Oliver Horlacher1,2, Jerven T. Bolleman3, Matthew P. Campbell4, Frederique Lisacek1,2*
1 Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva, 1211, Switzerland, 2 Computer Science Department, University of Geneva, Geneva, 1227, Switzerland, 3 Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Geneva, 1211, Switzerland, 4 Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, Australia
Abstract
Resource description framework (RDF) and Property Graph databases are emerging tech- nologies that are used for storing graph-structured data. We compare these technologies OPEN ACCESS through a molecular biology use case: glycan substructure search. Glycans are branched Citation: Alocci D, Mariethoz J, Horlacher O, tree-like molecules composed of building blocks linked together by chemical bonds. The Bolleman JT, Campbell MP, Lisacek F (2015) molecular structure of a glycan can be encoded into a direct acyclic graph where each node Property Graph vs RDF Triple Store: A Comparison on Glycan Substructure Search. PLoS ONE 10(12): represents a building block and each edge serves as a chemical linkage between two build- e0144578. doi:10.1371/journal.pone.0144578 ing blocks. In this context, Graph databases are possible software solutions for storing gly-
Editor: Manuela Helmer-Citterich, University of can structures and Graph query languages, such as SPARQL and Cypher, can be used to Rome Tor Vergata, ITALY perform a substructure search. Glycan substructure searching is an important feature for
Received: July 16, 2015 querying structure and experimental glycan databases and retrieving biologically meaning- ful data. This applies for example to identifying a region of the glycan recognised by a glycan Accepted: November 22, 2015 binding protein (GBP). In this study, 19,404 glycan structures were selected from Glyco- Published: December 14, 2015 meDB (www.glycome-db.org) and modelled for being stored into a RDF triple store and a Copyright: © 2015 Alocci et al. This is an open Property Graph. We then performed two different sets of searches and compared the query access article distributed under the terms of the response times and the results from both technologies to assess performance and accu- Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any racy. The two implementations produced the same results, but interestingly we noted a dif- medium, provided the original author and source are ference in the query response times. Qualitative measures such as portability were also credited. used to define further criteria for choosing the technology adapted to solving glycan sub- Data Availability Statement: All relevant data are structure search and other comparable issues. within the paper and its Supporting Information files.
Funding: This work is supported by the Seventh Framework Programme - Initial Training Network N° 316929 (http://ec.europa.eu/research/fp7/index_en. cfm). The grant recipient is FL. This activity at SIB is supported by the Swiss Federal Government through Introduction the State Secretariat for Education, Research and Nowadays the use of high throughput technologies and optimized pipelines allows life scien- Innovation SERI. DA is supported by EU-ITN (FP7- tists to generate terabytes of data in a reduced amount of time and subsequently feed quickly PEOPLE-2012-ITN # 316929). MPC acknowledges funding support from the Australian National and comprehensively online bioinformatics databases. In this scenario, the interoperability eResearch Collaboration Tools and Resources between data resources has become a fundamental challenge. Issues are gradually being project (NeCTAR). solved in applications involving genome (DNA) or transcriptome (RNA) analyses but
PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 1/17 Property Graph and RDF Triple Store Comparison
Competing Interests: The authors have declared problems remain for less documented molecules such as lipids, glycans (also referred to as that no competing interests exist. “carbohydrate”, “oligosaccharide” or “polysaccharide” to designate this type of molecule) or metabolites. Glycosylation is the addition of glycan molecules to proteins and/or lipids. It is an important post-translational modification that enhances the functional diversity of proteins and influ- ences their biological activities and circulatory half-life. A glycan is a branched tree-like mole- cule that naturally lends itself to graph encoding. However, glycans have long been described in the IUPAC linear format [1], that is, as regular expressions delineating branching structures with different bracket types. Such encoding can generate directional/linkage/topology ambigu- ity and is not sufficient in the handling of incomplete or repeated units. More recently, several encoding formats for glycans have developed based on sets of nodes and edges, e.g., GlycoCT [2], Glyde-II [3,4], IUPAC condensed [5], KCAM/KCF [6,7] or more recently WURCS [8]. To date the GlycoCT format is acknowledged as the default format for data sharing between databases [9] and consequently the most commonly used format for storing structural data. Glycans are composed of monosaccharides (8 common building blocks and dozens of less fre- quent ones as described in MonosaccharideDB (http://www.monosaccharidedb.org) that are cyclic molecules. These monosaccharides are linked together in different ways depending on carbon attachment positions in the cycle as detailed further. In a graph representation of a glycan, each monosaccharide residue is a node possibly asso- ciated with a list of properties and each linkage is an edge also potentially associated with a list of properties. In fact, chemical bonds between building blocks, called glycosidic linkages, are transformed into edges in the acyclic graph structure. An example is shown in Fig 1, where the simplified graphic representation popularised by the Consortium for Functional Glycomics (CFG) [10] (originally proposed by the authors of Essentials in Glycobiology [11]) is matched to a graph. This notation assigns each monosaccharide to a coloured shape (e.g., yellow circle for galactose, shortened as Gal). Shared colours or shapes express structural similarity among monosaccharides. For example, N-Acetylgalactosamine (yellow square) differs from galactose (yellow circle) through a so-called substituent (removal of an OH group and addition of an amino-acetyl group). “Substituent” as a property is precisely the type that qualifies a node. Starting with the CarbBank project in 1987 [12], a range of glycoinformatics resources con- taining glycan-related information has been developed thereby creating a variety of reference databases [13–15]. In the last two years, two articles [16,17] have been published proposing mechanisms for connecting glycan-related (or glycomics) databases. The respective authors suggested moving towards new technologies designed for semantic web, which are adapted to aggregating information from different sources. The outcome for these proposals was a com- mon standard ontology called GlycoRDF [16] that is now being widely adopted by the commu- nity to develop glycomics resources that cooperate and share standard formats enabling federated queries [9]. Nonetheless, in an effort to confirm the relevance of RDF as the technol- ogy of choice for glycomics data representation and integration, comparison with other graph- based technologies such as Property Graph is necessary. In this paper we discuss methods for interconnecting different databases through addressing the question of glycan substructure (motif or pattern) searching. This report should be considered as a preliminary study of feasi- bility and a comparison between different software implementations. Our main goal is to compare the performance of these two technologies (RDF and Property Graph) for glycan substructure search. In this study, Neo4j was chosen as a representative option for Property Graphs and multiple RDF triple stores were tested mainly due to the shared graph query lan- guage. Specifically, we have used for the latter: Virtuoso Open-Source Edition [18], Sesame [19], Jena Fuseki [20] and Blazegraph [21].
PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 2/17 Property Graph and RDF Triple Store Comparison
Fig 1. Glycan CFG encoding and graph encoding. On the left hand side a glycan structure encoded with CFG nomenclature is presented, while the right hand side shows the same structure translated into a graph. Each monosaccharide or substituent becomes a node and each glycosidic bond becomes an edge in the graph. Avoiding any loss of information all the properties of each monosaccharide or substituent are converted in node properties whereas glycosidic bond properties are translated in edge properties. To be more clear the colour code associate with the monosaccharide type is preserved among the images. doi:10.1371/journal.pone.0144578.g001
Searching (sub)structures is meaningful in glycomics. It matches the concept of glycan epi- tope or glycan determinant associated with the part of the whole structure that is recognized by glycan-binding proteins, which include lectins, receptors, toxins, adhesins, antibodies and enzymes [22]. Several substructure search solutions have been developed including for instance, regular expression matching [23] as applied to processing glycans in linear format (IUPAC) [5]. Although these approaches have showed reasonable robustness, they also have intrinsic limitations especially in handling structural ambiguities. Frequently, experimental data is insufficient to assign a precise monosaccharide (e.g., Galactose) to a position in the structure so that it is characterised only by its carbon content (e.g., Hexose). Regular expres- sions do not account for dependencies while a graph description handles inheritance of proper- ties (e.g., Hexose -> Galactose). Emerging technologies capable of storing graph models can be used, namely, Resource Description Framework (RDF) [24] and Property Graph databases [25]. Even though RDF and Property Graph databases can be used to reach the same goal, these two approaches are not synonymous and can be distinguished. RDF is designed for stor- ing statements in the form of subject–predicate–object called triples, whereas a Property Graph
PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 3/17 Property Graph and RDF Triple Store Comparison
is designed to implement different types of graphs such as hypergraphs, undirected graphs, weighted graphs, etc. Nonetheless, all possible types of graphs can be built with triples and stored in an RDF triple store [26]. Furthermore, Property Graphs are node-centric whereas RDF triple stores are edge-centric. For this reason, RDF triple stores use a list of edges, many of which are properties of a node and not critical to the graph structure itself. Moreover, Property Graph databases tend to be optimized for graph traversals where only one big graph is present. With RDF triple stores, the cost of traversing an edge tends to be logarithmic [27]. Finally, the query language is another key point in the comparison: RDF triple stores support SPARQL [28] as a native query language whereas Property Graphs have mainly proprietary languages. Even though Neo4J has a plugin for SPARQL, it relies essentially on its own proprietary lan- guage called Cypher [29]. In either case, a substructure search can be defined with the query language provided by these new technologies. To reach our benchmarking goal in this applica- tion, we first describe the main steps of building substructure search software using alterna- tively an RDF triple store or a Property Graph databases. We then provide a selection of quantitative and qualitative measures investigated during the study such as portability. Finally, we summarise the main achievements and draw some conclusions on the expected characteris- tics of a web application for glycan substructure search.
Material and Methods The development of a software solution to perform glycan substructure searching involves two main tasks 1) the storage of glycan structures and 2) translation of a query into a specific query language for pulling out all the glycan structures that match the query pattern. The glycan encoding and the data storage sections provide details regarding these tasks using native graph stores. The substructure search query shows how specific languages pro- vided by RDF and Property Graph databases can be used.
Data structure As previously described most dedicated glycan databases store structures using string encoding standards. This multiplicity led to create a new layer between the data store and the glycan encoding formats to decouple the glycan structure information from any string-encoding format. A comprehensive framework was developed within the EUROCarbDB project for parsing multiple glycan encoding formats [30]. Even though this framework is still used and includes parsers for each glycan encoding format, for pragmatic reasons we have relied on an in-house library, called MzJava [31]. It is an open source library that provides a data structure and spe- cific readers for multiple glycan encoding formats that are not limited to processing mass spec- trometry data. The MzJava data structure organizes the information from a glycan structure into a directed acyclic graph, which can be directly stored into a graph storage solution. Mono- saccharides, the basic units, which compose the glycan structures, are treated as graph nodes and all the biological properties of these building blocks are stored as separate node properties. Substituents—particular building blocks, which can be combined with basic units—are han- dled separately from monosaccharides and are represented as extra nodes in the data structure. All glycosidic linkages are transformed into edges in the acyclic graph structure. Other edges are introduced for every linkage between monosaccharides and substituents, so-called substituent linkages. Each edge carries a list of properties, which identify different chemical aspects of the bond itself. Each glycan is represented by a set of nodes selected between avail- able monosaccharides and substituents and a set of edges, which connect the nodes (Fig 1).
PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 4/17 Property Graph and RDF Triple Store Comparison
The direct acyclic graph used by the Mzjava data structure can be directly loaded into a native graph store.
Data Storage The data store interacts with the data structure layer that was introduced in the previous sec- tion. Native storage of the direct acyclic graphs used by the data structure layer was the first main requirement for the data store. Because both RDF and Property Graph technologies fulfil this requirement, we have developed two different data stores using each of these technologies. For populating the data stores, a collection of known glycan structures was extracted from GlycomeDB [13], the largest repository of known glycan structures publicly available. It con- tains more than 34,000 structures collected from different online resources. The GlycoCT ver- sion of GlycomeDB [13] was used in this study to compare the performance of RDF versus Property Graph and the MzJava reader used to translate all structures into the supported data structure, which were then stored into both RDF triple store and Property Graph data stores (see “Glycan encoding” section). All the structures with repeats, underdetermined regions or containing not fully characterised monosaccharides were omitted from this study in order to produce consistent results with each query and ease comparison between the two implementa- tions. In the end, the data store contained 19404 possibly redundant glycan structures (redun- dancy comes for instance for the same structure identified in two different species since at this stage, we do not account for taxonomy). Each structure was treated as a single graph, meaning that both implementations contain 19,404 disconnected graphs. In the end, the dataset con- tained 233,633 distinct nodes divided in 19,404 different graphs, with an average of 12 nodes for each graph. Here, the largest is composed of 63 nodes other details about the distribution of nodes among the structures are provided in S10 Table. Details about the development of the two data storage implementations are illustrated in the following sections.
RDF implementation RDF can only deal with triples, statements in the form subject-predicate-object. For this reason, it is necessary to develop a model that translates a glycan structure with all potential biological properties into a list of triples. Fig 2 shows how this model can be used for a simple structure with a monosaccharide and a substituent. The proposed model is based on the GlycoCT standard where the structure is encoded in a residue list and a connectivity list. All monosaccharides and substituents are treated as separate components and annotated in the residue list with a specific ID. The connectivity list contains the linkages between the components annotated in the residue list. Our model follows the same principle: all substituents are treated as separate components as opposed to merging them with their associated monosaccharides in order to avoid contaminating the model with biological assumptions. “Glycan” is the first entity in the model and it has two predicates associated: “closematch” taken from the SKOS ontology and “has_residue”. The “closematch” predicate connects the “Glycan” entity with URLs that identify the specific glycan in other databases. In this way, ref- erences to external databases can be kept in our model for refining the search. The user can always add alternative references using the same predicate. The “Glycan” entity links each monosaccharide and substituent to the “has_residue” predicate, grouping together all the building blocks that belong to the same glycan structure. Every “Residue” entity represents a particular monosaccharide or substituent, so the num- ber of components for each structure is equal to the sum of monosaccharides and substituents.
PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 5/17 Property Graph and RDF Triple Store Comparison
Fig 2. Ontology overview. Overview of the ontology developed for translating glycan structures into RDF/semantic triples. The figure shows all the predicates and the entities used for defining a glycan structures into the RDF triple store. doi:10.1371/journal.pone.0144578.g002
Chemical properties related to the components can be stored using RDF:property. In this model we provide “residue:monosaccharide” and “residue:substituent” as RDF:type for specify- ing the type of components and properties like “monosaccharide:Gal” or “substituent:NAcetyl” can be used for defining the actual molecule. All predicates available are shown in Fig 2 whereas the list of all monosaccharides and sub- stituents already defined in the ontology can be found in S11 Table. The user can extend the ontology adding new predicates or property to encode supplementary information regarding nodes. Component entities are connected to each other following the actual pattern of the glycan structure. In other words, a triple with a “links_To” predicate is added to the triple store for each linkage between two different components. In this ontology no entity represents linkages, we rely on multiple triples for storing the specifications of a linkage. Each time a piece of infor- mation about a specific edge is added, a new triple with the parent node as subject and the child node as object is inserted in the triple store. The information is encapsulated in the
PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 6/17 Property Graph and RDF Triple Store Comparison
predicate itself. The ontology provides the user with several predicates for describing each piece of information related to the linkage. Starting from the root node every linkage between a parent component and a child is added to the triple store following a breadth first search algorithm. Thus we prevent any possible edge duplication in the graph. In the end, each glycan structure stored in the RDF triple store is composed of at least a “Glycan” entity associated with many different “Residue” entities. By introducing a predicate for each linkage and component property, we allow the stratifica- tion of information. The potential loss of performance due to spreading information on differ- ent layers is justified by our wish to preserve approximate search options (e.g. search with missing information or tolerating mismatches). Indeed, merging all properties within a predi- cate lacks the flexibility that is crucial for glycan substructure search tolerating fuzzy matches.
Neo4j implementation The Java API provided by Neo4j was used for storing glycan structures into a Property Graph. Each structure has been added to the same graph and the glycan structure ID has been stored in all the nodes related to a particular structure. The final result is a disconnected graph where structures can be grouped by ID. Spreading the glycan ID, as opposed to keeping it in the root node, is useful to quickly retrieve the identifier when the substructure does not include the root node. In finer details, each monosaccharide or substituent has been added to the graph database as a node whereas each linkage is encoded as a relationship between two nodes. To encode biological properties of components or linkages we have used respectively node and relationship properties. Property Graphs can directly store graphs including nodes and edges properties.
Substructure Search Query Starting from a substructure query, each data store implementation is queried in order to retrieve all glycan structures that contain the query pattern. In fact, there is no difference between a complete structure and a substructure in terms of encoding format. Consequently, the workflow presented in the Glycan encoding section can be used for extrapolating and orga- nizing the substructure information into a common data structure. Then the direct acyclic graph is translated into a technology specific language for performing a graph pattern search among the structures contained in the data store. The languages provided by RDF and Property Graph databases describe a graph pattern and find all the graphs which contain it, and support our software solution for the retrieval of all the structures in the data store that contain a query substructure. The following illustrates the process of translating the data structure graph into specific RDF and Property Graph databases query languages.
RDF SPARQL Query Following the ontology described in Fig 2, glycan substructures can be translated into a SPARQL query. A native support to this query language is provided by each RDF triple store, thereby not tying our solution to any particular product, however, the queries support SPARQL 1.1 [32]. Fig 3A shows the translation process on a substructure where a galactose residue is connected to a glucose with an alpha 1,3 linkage. Every monosaccharide or substituent becomes an entity and each property is encoded in one of the predicates or the properties described in the model. Ses- ame API [19] together with the appropriate JDBC driver has been used for querying Virtuoso
PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 7/17 Property Graph and RDF Triple Store Comparison
Fig 3. Query building example. A. Example of use of the RDF model to build a SPARQL query from a glycan substructure focussing on the translation process. The prefix part of the query is omitted but further detailed examples are provided in the S1 File. B. The same example is shown with building a Cypher query, the native language in Neo4J. Similarly, additional examples are provided in the S1 File. doi:10.1371/journal.pone.0144578.g003
Openlink, Blazegraph, Sesame and Jena Fuseki. Detailed examples of larger structures are pro- vided in the S1 File.
Neo4j Cypher Query In order to interact with the database, Neo4j provides a native query language called Cypher. The translation process of a glycan substructure into a Cypher query is shown in Fig 3B. Two main parts are present in the query: the graph pattern and the property specification. The first part delineates the shape of the substructure while the second specifies the properties of each node or edge. In addition to Cypher, Neo4j provides a native object access Java API to interact with the database. Cypher is known to be slower than the native object access [33], however version 2.2 has improved performance and comes with a new cost-based query optimizer.
Setup All tests were performed using a Dell Precision T7400 with the following hardware features: • CPU: Intel(R) Xeon(R) X5482 • RAM: 56 GB DDR2 ECC
PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 8/17 Property Graph and RDF Triple Store Comparison
• Hard disk capacity: 500 GB • Operating system: Linux Cent OS version 6.6 64-bit • Java JDK version: 8 • RDF triple store: Virtuoso Open-Source Edition version 7.2, Sesame 4.0, Jena Fuseki v2, Bla- zegraph 1.5.3 • Property Graph: Neo4j Community version 2.2 For Virtuoso, NumberOfBuffers and MaxDirtyBuffers were changed to 170,000 and 130,000 respectively. Moreover, the MaxCheckpointRemap was adjusted according to the size of our database. A further increase of the buffers reported the same performance. Neo4j embedded version was preferred over the REST implementation. The embedded version has the advantage of removing any latency introduced by the REST communication between the service and the server. Moreover, it can be accessed directly from the java application with a specific API provided by Neo4j. For building the Neo4j instance, Node_auto_indexing and Relationship_auto_indexing have been activated. In addition, we have set the cache_type option to “strong” in order to keep the whole database in RAM. We have used Java JDK version 8 to run Neo4j, Blazegraph, Jena Fuseki and Sesame setting the heap size to 8 GB.
Results and Discussion In an effort to compare the usage of Property Graph vs. RDF triple store to address the sub- structure search problem, we built a dataset with 19,404 glycan structures extracted from the glycan structure repository GlycomeDB and compared the average query time of two data sets described in S8 and S9 Tables. The size of the queries have been divided into five groups: • Very short: less than 5 residues • Short: between 5 and 15 residues • Middle: between 15 and 25 residues • Large: between 25 and 35 residues • Very large: more than 35 residues The first set contains 128 queries (S8 Table) that represent biologically relevant use-cases and have been identified as glycoepitopes, that is, parts of glycans recognised by glycan-binding pro- teins. This list was obtained from the GlycoEpitope database [34] and further substantiated by information reviewed in [22]. Glycoepitopes are limited in size, i.e. between 2 and 13 residues but approximately 70% of these substructures contain between four and seven residues. This ref- erence set is important for the future implementation of a web application that will perform sub- structure search for glycobiologist users. It contains mainly short and very short queries. The second set (S9 Table) contains 60 queries. The structures were randomly picked in Gly- comeDB, with the prerequisite of spanning between 25 and 60 residues. This set was exclusively created for benchmark purposes and contains only large and very large queries. There is no bio- logically relevant relation in these substructures, but this data set pushes the software to its limits and tests the reliability of different implementations of RDF triple stores and Property Graphs.
PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 9/17 Property Graph and RDF Triple Store Comparison
Quantitative Measures At first an empty query was performed to initiate the test environment and then each query structure, present in S8 (first set) and S9 (second set) was run 10 times and the average query time was calculated for the last nine queries. All time is measured in seconds (s). The results (S1 Table) for the first set of queries show that the Virtuoso RDF triple store and Blazegraph have a better query response in 95% of the queries, the other 5% is cover by Sesame. Neo4j, as the only representative of Property Graph, only performed faster than Virtuoso for 12 queries (Ids 49 to 52 in S1 Table) and never exceeded the performance of Blazegraph. The average query time through the whole set of queries shows that Neo4j performance is compara- ble with Sesame but is still 0.8 s slower than Blazegraph. S2 Table shows the results for the second query set that were technically more instructive. Virtuoso Triple store column is empty because we could not run the benchmark with this RDF triple store. Using the machine described in the setup, we have tested both the development and stable versions of Virtuoso and none of them produced an output result. However, we observed two different behaviours: using the stable version, the machine froze after submitting the query until Virtuoso crashed. With the development version, we were not able to get any query results after hours of computation. We attempted to solve this problem while setting to 5 the swappiness of the operating system, to no avail. In the end, we contacted the Openlink sup- port but have not yet received an answer. RDF technology is relatively new and some imple- mentations still need debugging or meet scaling issues. When compared to the RDF triple stores, Neo4j slowed down as the size of the queries increased. In this case, the difference between Neo4j and Blazegraph, calculated on the average query time in the whole set, is close to 4 seconds. Surprisingly Jena Fuseki, the slowest database in the first benchmark, is almost 2 second faster than Neo4j. The average response time was cal- culated for each query in both datasets and the results are summarised in Fig 4. The average query response time of three glycan determinants shown in Fig 5 is detailed in Table 1.Complete information regarding the results obtained with the two sets is provided in the S4 and S5 Tables. Contrary to our initial intention, we have not fully tested Neo4j with SPARQL since the sup- porting third-party plugin runs approximately 2.4 to 27 times slower than Jena Fuseki [35], which is already the slowest RDF triple store in our benchmark. Dealing with a large discon- nected graph is the key point of the discussion. Property Graphs are in general optimised for storing large connected graphs, for example friend-of-a-friend networks. The data structure is designed to efficiently perform a graph traversal that computes the shortest path between two nodes or retrieves all the friends of a friend. These conditions are not satisfied in our substruc- ture search problem where the graph is a collection of small and disconnected graphs. For each query, the Property Graph database performs a full scan of the collection searching for a spe- cific pattern. In other words the software engine has to search the root of each structure and start the traversal. This matches the use case guiding Neo4j’s implementation and design but not the specificity of glycan substructure search. In contrast, RDF is an information representation with different technological implementa- tions. In most RDF triple stores, a single table of four columns holds one quad, i.e. triple plus graph identifier, per row. When a specific pattern is searched in the structure collection, the engine retrieves all the glycan entities and runs a traversal on each one. The process is sped up by the use of entity indices that allows the RDF engine to find the next structure more effi- ciently than a Property Graph database. In the end, graph pattern searching as implemented by RDF databases appears to match more closely the glycan substructure use case, which (in the absence of bugs) explains why this type of graph database performs well on our benchmark.
PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 10 / 17 Property Graph and RDF Triple Store Comparison
Fig 4. Average query time. The mean value calculated on the response times of each query in both sets is shown in two bar charts. Panel (A) shows the mean query times for the first set and panel (B) contains the values for the second set. The column assign to Virtuoso in the second set of query is empty because we could not record any data due to a problem in running large and very large queries. doi:10.1371/journal.pone.0144578.g004
All the results divided by specific methods can be found in supporting materials S3–S7 Tables.
Qualitative Measures We provide qualitative measures concerning the query language, the ease of usage and the level of support. Although difficult to evaluate, these criteria are significant for choosing which type of technology to use. In the last part we underline some cross-platform issues related to the Vir- tuoso triple store. Query language. As described above RDF triple stores support the World Wide Web Con- sortium (W3C) endorsed SPARQL standard as the common query language. Consequently, the proposed (sub)structure search can be used by different RDF implementations without changing the application. In comparison, Property Graphs are language specific relying on
PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 11 / 17 Property Graph and RDF Triple Store Comparison
Fig 5. 2D structure of query glycoepitopes. The 2-dimensional structure of three well-known glycoepitopes listed in Table 1, namely (A) Lactosamine Type One. (B) Blood Group A. (C) Sialyl Lewis X is shown. Response time for each is shown in Table 1. doi:10.1371/journal.pone.0144578.g005
proprietary languages such as Cypher (Neo4j), which potentially limits application portability and requires drastic change in the software when moving to alternative implementations. The W3C promotes the development of high-quality standards, making SPARQL a good choice for production software. In the case of Neo4j, there are add-on components that allow this Property Graph to be accessed as a triple store and potentially queried with SPARQL. As mentioned earlier, third-party contributors implemented these plugins and there is no guaran- tee of future development and support. In this study, we have not tested these plugins mainly because they have poor performance [35]. SPARQL is translated into a native object access API introducing latency in the query process and negatively impacting performance. Finally, providing a standardised query language gives RDF an important advantage, espe- cially when the stability of the software technology used is crucial. Level of support. Although RDF and Property Graph databases were recently introduced, they have lively communities behind them showing the need for natively storing graph data. RDF and SPARQL standards have extensive support both from the W3C and the triple store
Table 1. Comparison of average response query time for 3 glycoepitopes (see Fig 5). A comparison of the average query time of Property Graph and RDF triple store databases tested in this study (columns) for three well known epitopes (rows). SPARQL and Cypher queries for these glycoepitopes are pro- vided in the S1 File.
Virtuoso Neo4j Sesame Jena Fuseki Blazegraph Lactosamine Type One 0.114 s 0.827 s 0.814s 0.941 s 0.495 s Blood Group A 0.104 s 0.945 s 0.817 s 0.869 s 0.225 s Sialyl Lewis X 0.469 s 1.012 s 0.706 s 1.970 s 0.120 s doi:10.1371/journal.pone.0144578.t001
PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 12 / 17 Property Graph and RDF Triple Store Comparison
vendor websites. Google groups and user groups are available for supporting the development and discussing issues for both technologies. The Property Graph community compared to that of RDF is fragmented due to the lack of standards in query language and API. Most of the support for Neo4j and Cypher comes from the online manual available on the company website. With each new release, the manual is updated and explains how to get the maximum performance out of this database. Despite a detailed documentation and multiple tutorials provided by the Neo4j site, the query language lacks support from other industry partners. Three different query languages were proposed for Neo4j in the past 11 years and the current Cypher may not be the last. The RDF community instead is led by the W3C that periodically updates the standard fol- lowing community suggestions. All implementations of RDF triple store have to follow the W3C recommendations. The actual availability of multiple triple store implementations gives the RDF community more stability compared to Neo4J. Moreover, Open Data [36] and Linked Data movements are pushing the usage on RDF and SPARQL as a way of interconnecting online resources. In bioinformatics, some online resources such as UniProt [37] are already accessible through SPARQL. Regarding the four RDF triple stores tested in this study, each vendor website provides useful information about setting up the environment and tweak the settings. In the end, the possibility of connecting multiple online resources and the stability shown by the RDF community place RDF in a favourable position compared to Property Graph. Ease of usage. The embedded version of Neo4j databases can be easily added to a java project through a jar or a dependency. The API for building the graph is simple and intuitive. A great advantage is provided by the option of directly adding properties to nodes and edges. Theoretically speaking Neo4j and Property Graph databases fit our problem better than RDF, nevertheless they are not designed for multiple disconnected graphs in one instance. For this reason, the structure identifier had to be spread in all the nodes that belong to the same gly- can. The duplication of information in the database implies an increased use of memory, which can be a problem for large datasets. This problem can be addressed but any corresponding solution requires extra nodes or properties leading to the same issue. RDF shows more flexibility in terms of provided API. Different java libraries like Jena or Sesame are available for connecting a java application with every triple store. Moreover, both libraries provide an in memory triple store that can be used for testing purposes. In our study we have used both libraries and there are well documented and easy to use. Virtuoso, like Bla- zegraph and Jena Fuseki, runs as a standalone server with a useful web interface for managing configuration parameters. The main obstacle to design substructure search software with RDF has been the ontology definition. Building up ontology for converting glycan structures into triples has been the most time consuming part. It involves reflecting on how to spread the information through different triples in a way that each piece is easy to retrieve and every software requirement is fulfilled. In contrast, defining a specific ontology for substructure search allows storing multiple discon- nected graphs in the same database without losing performance. The shared API and SPARQL, the common query language, made it possible to run four dif- ferent RDF triple stores without changing our application code but only choosing the right connection driver. In conclusion, Property Graph databases provide a ready to use solution for substructure search whereas RDF needs a specific ontology for tackling the problem. The development of an ontology can be challenging and time consuming but provides a more flexible solution. Cross Platform Issues. We tested both substructure search implementations under Linux and Windows environments with the same dataset of glycan structures.
PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 13 / 17 Property Graph and RDF Triple Store Comparison
Using the Neo4j implementation hardly any difference in terms of query speed and size of results was observed. As Neo4j is implemented in Java, it behaves in the same way on all the systems that can run a Java Virtual Machine, the same situation for Sesame, Jena Fuseki and Blazegraph. However, the Virtuoso implementation produces different query results in differ- ent environments. The structures retrieved in the Windows environment were often a subset of the ones retrieved under Linux. In order to establish which answer was correct all results were checked manually notably all the structures retrieved in the Linux environment were correct and in agreement with the Neo4J results. This may mean that Virtuoso triple store possibly contains inconsistencies between the Windows and the Linux version. We have found a second obstacle during the test with the second query set. In this case, Linux and Windows implementations either crashed after query submission or they caused a crash or a freeze of the machine itself. We could not test the second query set despite several attempts to fix the problem. The novelty of RDF and Property Graph databases technologies can easily explain the presence of issues in the code. Only a large community of users over sev- eral years of development will help resolve this type of problems.
Conclusion In this paper, we delineated two strategies for substructure searching using new technologies like Property Graph and RDF triple store databases. Using two specific sets of substructures we document that in all the cases RDF has been faster than Property Graph and the gap is increas- ing with the size of the query. This led us to conclude that our model with Blazegraph RDF Tri- ple Store reduces the query response time that remains close to one second in all the queries of both tested sets. Qualitative measures were discussed to provide the reader with additional cri- teria for selecting the appropriate technology. Overall, even though both technologies show advantages and disadvantages related to specific qualitative measures, the general lack of stan- dards related to Property Graphs can play a key role in choosing between RDF triple stores and Property Graph databases. In the specific case of substructure search, Property Graphs were seen as a ready-to-use solution for prototyping software. However, when performance and interoperability between different resources is considered, an RDF triple store appears as a more efficient technology. This study lays the foundation to a possible web application for substructure search. The proposed model is destined to evolve and include a wider set of biological properties in future versions.
Supporting Information S1 File. Epitope queries. The file contains the SPARQL and Cypher queries for the epitopes of Table 1. (TXT) S1 Table. Comparison results on the first set of query. Summary of the results obtained using RDF and Property Graph implementations. (XLSX) S2 Table. Comparison results on the second set of query. Summary of the results obtained using RDF and Property Graph implementations. (XLSX) S3 Table. Blazegraph results. Query response times obtain using the Blazegraph implementa- tion. The query structure IDs refer to S8 and S7 Tables. (XLSX)
PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 14 / 17 Property Graph and RDF Triple Store Comparison
S4 Table. Jena Fuseki results. Query response times obtain using the Jena Fuseki implementa- tion. The query structure IDs refer to S8 and S7 Tables. (XLSX) S5 Table. Neo4j results. Query response times obtain using the Neo4j implementation. The query structure IDs refer to S8 and S7 Tables. (XLSX) S6 Table. Sesame results. Query response times obtain using the Sesame implementation. The query structure IDs refer to S8 and S7 Tables. (XLSX) S7 Table. Openlink Virtuso results. Query response times obtain using the Openlink Virtuoso implementation. The query structure IDs refer to S8 Table and S7 Table. (XLSX) S8 Table. First set of query structures. List of 128 query relevant biological epitopes used in this study. For completeness, we provide here the GlycoCT encoded structure, the 2d image (cfg format) and the size, in terms of residues, of each epitope. These structures are mainly very short and short according to the partition explained in the paper. (XLSX) S9 Table. Second set of query structures. List of 60 query epitopes randomly picked up from GlycomeDB. For completeness, we provide here the GlycoCT encoded structure and the size, in terms of residues, of each epitope. These structures are mainly large and very large according to the partition explained in the paper. For limiting the size of the file, 2d images are not pro- vided for this specific dataset. (XLSX) S10 Table. Dataset composition. Further information about the size of the structures con- tained in the dataset. (XLSX) S11 Table. Monosaccharide and substituent. A complete list of all the monosaccharides and substituents used during the study. (XLSX)
Acknowledgments We thank Kiyoko Aoki-Kinoshita, Nobuyuki Aoki and Catherine A. Hayes for fruitful discus- sions and valuable advice.
Author Contributions Wrote the paper: DA FL MPC JTB JM OH. Designed the RDF model with the advice of MPC and JTB: DA. Conceived the property graph implementation: DA OH. Performed the calcula- tions with the assistance of JM: DA. Analysed the data with the advice of FL: DA. Drafted the manuscript: DA FL. Corrected the manuscript: MPC JTB JM OH.
References 1. Sharon N. IUPAC-IUB Joint Commission on Biochemical Nomenclature (JCBN). Nomenclature of gly- coproteins, glycopeptides and peptidoglycans. Glycoconj J. 1986; 3: 123–133. doi: 10.1007/ BF01049370
PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 15 / 17 Property Graph and RDF Triple Store Comparison
2. Herget S, Ranzinger R, Maass K, Lieth C-W v. d. GlycoCT—a unifying sequence format for carbohy- drates. Carbohydr Res. 2008; 343: 2162–2171. doi: 10.1016/j.carres.2008.03.011 PMID: 18436199 3. Sahoo SS, Thomas C, Sheth A, Henson C, York WS. GLYDE-an expressive XML standard for the representation of glycan structure. Carbohydr Res. 2005; 340: 2802–2807. PMID: 16242678 4. York William S., Kochut Krzysztof J., Miller John A., Sahoo Satya, Thomas Christopher, Henson Cory. GLYDE-II—GLYcan structural Data Exchange using Connection Tables. Available: http://glycomics. ccrc.uga.edu/GLYDE-II/GLYDE-description_v0.7.pdf. 5. McNaught AD. International Union of Pure and Applied Chemistry and International Union of Biochem- istry and Molecular Biology. Joint Commission on Biochemical Nomenclature. Nomenclature of carbo- hydrates. Carbohydr Res. 1997; 297: 1–92. PMID: 9042704 6. Aoki KF, Yamaguchi A, Ueda N, Akutsu T, Mamitsuka H, Goto S, et al. KCaM (KEGG Carbohydrate Matcher): a software tool for analyzing the structures of carbohydrate sugar chains. Nucleic Acids Res. 2004; 32: W267–272. doi: 10.1093/nar/gkh473 PMID: 15215393 7. Kotera M, Tabei Y, Yamanishi Y, Moriya Y, Tokimatsu T, Kanehisa M, et al. KCF-S: KEGG Chemical Function and Substructure for improved interpretability and prediction in chemical bioinformatics. BMC Syst Biol. 2013; 7: S2. doi: 10.1186/1752-0509-7-S6-S2 8. Tanaka K, Aoki-Kinoshita KF, Kotera M, Sawaki H, Tsuchiya S, Fujita N, et al. WURCS: The Web3 Unique Representation of Carbohydrate Structures. J Chem Inf Model. 2014; 54: 1558–1566. doi: 10. 1021/ci400571e PMID: 24897372 9. Campbell MP, Ranzinger R, Lütteke T, Mariethoz J, Hayes CA, Zhang J, et al. Toolboxes for a stan- dardised and systematic study of glycans. BMC Bioinformatics. 2014; 15: S9. doi: 10.1186/1471-2105- 15-S1-S9 10. Varki A, Cummings RD, Esko JD, Freeze HH, Stanley P, Marth JD, et al. Symbol nomenclature for gly- can representation. PROTEOMICS. 2009; 9: 5398–5399. doi: 10.1002/pmic.200900708 PMID: 19902428 11. Varki A, editor. Essentials of glycobiology. New York, NY: Cold Spring Harbor Lab. Press; 1999. 12. Doubet S, Albersheim P. CarbBank. Glycobiology. 1992; 2: 505. PMID: 1472756 13. Ranzinger R, Herget S, von der Lieth C-W, Frank M. GlycomeDB—a unified database for carbohydrate structures. Nucleic Acids Res. 2011; 39: D373–D376. doi: 10.1093/nar/gkq1014 PMID: 21045056 14. Campbell MP, Peterson R, Mariethoz J, Gasteiger E, Akune Y, Aoki-Kinoshita KF, et al. UniCarbKB: building a knowledge platform for glycoproteomics. Nucleic Acids Res. 2014; 42: D215–221. doi: 10. 1093/nar/gkt1128 PMID: 24234447 15. Toukach PV, Egorova KS. Carbohydrate structure database merged from bacterial, archaeal, plant and fungal parts. Nucleic Acids Res. 2015; gkv840. doi: 10.1093/nar/gkv840 16. Ranzinger R, Aoki-Kinoshita KF, Campbell MP, Kawano S, Lütteke T, Okuda S, et al. GlycoRDF: an ontology to standardize glycomics data in RDF. Bioinforma Oxf Engl. 2015; 31: 919–925. doi: 10.1093/ bioinformatics/btu732 17. Aoki-Kinoshita KF, Bolleman J, Campbell MP, Kawano S, Kim J-D, Lütteke T, et al. Introducing glyco- mics data into the Semantic Web. J Biomed Semant. 2013; 4: 39. doi: 10.1186/2041-1480-4-39 18. Erling O, Mikhailov I. RDF Support in the Virtuoso DBMS. In: Pellegrini T, Auer S, Tochtermann K, Schaffert S, editors. Networked Knowledge—Networked Media. Berlin, Heidelberg: Springer Berlin Heidelberg; 2009. pp. 7–24. Available: http://link.springer.com/10.1007/978-3-642-02184-8_2. 19. Broekstra J, Kampman A, van Harmelen F. Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. In: Horrocks I, Hendler J, editors. The Semantic Web—ISWC 2002. Berlin, Heidelberg: Springer Berlin Heidelberg; 2002. pp. 54–68. Available: http://link.springer.com/10.1007/3- 540-48005-6_7. 20. Apache Jena [Internet]. Available: https://jena.apache.org/. Accessed 29 Oct 2015. 21. Blazegraph | Blazegraph is ultra-scalable, high-performance graph database with support for the Blue- prints and RDF/SPARQL APIs. It supports high availability, scale-out, and GPU acceleration. [Internet]. Available: https://www.blazegraph.com/. Accessed 29 Oct 2015. 22. Cummings RD. The repertoire of glycan determinants in the human glycome. Mol Biosyst. 2009; 5: 1087. doi: 10.1039/b907931a PMID: 19756298 23. Aoki-Kinoshita KF. [Introduction to glycome informatics]. Seikagaku. 2008; 80: 1038–1041. PMID: 19086411 24. Klyne G, Carroll J. Resource Description Framework (RDF): Concepts and Abstract Syntax [Internet]. Available: http://www.w3.org/TR/rdf-concepts/. 25. What is a Graph Database? In: Neo4j Graph Database [Internet]. Available: http://neo4j.com/ developer/graph-database/. Accessed 2 Jul 2015.
PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 16 / 17 Property Graph and RDF Triple Store Comparison
26. Triplestore [Internet]. Wikipedia, the free encyclopedia. 2015. Available: https://en.wikipedia.org/w/ index.php?title=Triplestore&oldid=666522393. 27. Angles R, Prat-Pérez A, Dominguez-Sal D, Larriba-Pey J-L. Benchmarking Database Systems for Social Network Applications. First International Workshop on Graph Data Management Experiences and Systems. New York, NY, USA: ACM; 2013. pp. 15:1–15:7. doi: 10.1145/2484425.2484440 28. SPARQL Query Language for RDF [Internet]. Available: http://www.w3.org/TR/rdf-sparql-query/. Accessed 2 Jul 2015. 29. What is Cypher? - - The Neo4j Manual v2.2.3 [Internet]. Available: http://neo4j.com/docs/stable/cypher- introduction.html. Accessed 2 Jul 2015. 30. Von der Lieth C-W, Freire AA, Blank D, Campbell MP, Ceroni A, Damerell DR, et al. EUROCarbDB: An open-access platform for glycoinformatics. Glycobiology. 2011; 21: 493–502. doi: 10.1093/glycob/ cwq188 PMID: 21106561 31. Horlacher O, Nikitin F, Alocci D, Mariethoz J, Müller M, Lisacek F. MzJava: An open source library for mass spectrometry data processing. J Proteomics. doi: 10.1016/j.jprot.2015.06.013 32. SPARQL 1.1 Query Language [Internet]. Available: http://www.w3.org/TR/sparql11-query/. Accessed 3 Jul 2015. 33. Holzschuher F, Peinl R. Performance of graph query languages: comparison of cypher, gremlin and native access in Neo4j. ACM Press; 2013. p. 195. doi: 10.1145/2457317.2457351 34. Kawasaki T, Nakao H, Tominaga T. GlycoEpitope: A Database of Carbohydrate Epitopes and Antibod- ies. In: Taniguchi N, Suzuki A, Ito Y, Narimatsu H, Kawasaki T, Hase S, editors. Experimental Gly- coscience. Tokyo: Springer Japan; 2008. pp. 429–431. Available: http://www.springerlink.com/index/ 10.1007/978-4-431-77922-3_104. 35. SPARQL and OWL 2 Inference for Neo4j : Research Group for Communication Systems [Internet]. Available: https://comsys.informatik.uni-kiel.de/res/sparql-and-owl-2-inference-for-neo4j/. Accessed 2 Nov 2015. 36. Auer S, Bizer C, Kobilarov G, Lehmann J, Cyganiak R, Ives Z. DBpedia: A Nucleus for a Web of Open Data. In: Aberer K, Choi K-S, Noy N, Allemang D, Lee K-I, Nixon L, et al., editors. The Semantic Web. Springer Berlin Heidelberg; 2007. pp. 722–735. Available: http://link.springer.com/chapter/10.1007/ 978-3-540-76298-0_52. 37. The UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 2015; 43: D204– D212. doi: 10.1093/nar/gku989 PMID: 25348405
PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 17 / 17 Published online 17 November 2015 Nucleic Acids Research, 2016, Vol. 44, Database issue D1243–D1250 doi: 10.1093/nar/gkv1247 SugarBindDB, a resource of glycan-mediated host–pathogen interactions Julien Mariethoz1, Khaled Khatib2, Davide Alocci1,3, Matthew P. Campbell4, Niclas G. Karlsson5, Nicolle H. Packer4, Elaine H. Mullen6 and Frederique Lisacek1,3,*
1Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland, 2Glycoinformatics Inc., Great Falls, VA, USA, 3Department of Computer Science, University of Geneva, Geneva, Switzerland, 4Biomolecular Frontiers Research Centre, Macquarie University, North Ryde, NSW, Australia, 5University of Gothenburg, Sahlgrenska Academy, Institute of Biomedicine, Department of Medical Biochemistry and Cell Biology, Gothenburg, Sweden and 6The MITRE Corporation, McLean, VA, USA
Received August 27, 2015; Revised October 22, 2015; Accepted October 31, 2015
ABSTRACT portunities influencing cell–cell interactions. Information on these binding events is therefore crucial to feed our un- The SugarBind Database (SugarBindDB) covers derstanding of intercellular communication. However, even knowledge of glycan binding of human pathogen in cases such as host–pathogen interactions, which have lectins and adhesins. It is a curated database; each been extensively studied over decades, information that is glycan–protein binding pair is associated with at recorded across diverse resources (e.g. 2,3), does not cover least one published reference. The core data ele- details of the recognition of host glycans by the pathogen ment of SugarBindDB is a set of three inseparable proteins, known as lectins (or adhesins if the binding part- components: the pathogenic agent, a lectin/adhesin ner is unknown). Nonetheless, these facts have been pub- and a glycan ligand. Each entity (agent, lectin or lished throughout the medical and biochemical literature ligand) is described by a range of properties that since the mid-1970s (4). To bridge this gap, we introduce the are summarized in an entity-dedicated page. Several SugarBind Database (SugarBindDB), a curated database developed to cover knowledge of glycan binding by human search, navigation and visualisation tools are imple- pathogen lectins. mented to investigate the functional role of glycans SugarBindDB was created in 2002 within the MITRE in pathogen binding. The database is cross-linked to Corporation and publicly released in 2005. It was originally protein and glycan-relaled resources such as UniPro- designed as a complement to a pathogen-capture technol- tKB and UniCarbKB. It is tightly bound to the latter ogy based on the binding of viral, bacterial and biotoxin via a substructure search tool that maps each ligand lectins to specific glycans displayed on glycoprotein films. to full structures where it occurs. Thus, a glycan– This approach relied mainly on GlycoSuiteDB (5)and lectin binding pair of SugarBindDB can lead to the UniProtKB/Swiss-Prot (6) to identify glycoproteins bear- identification of a glycan-mediated protein–protein ing a sugar sequence similar to an implicated glycan ligand interaction, that is, a lectin–glycoprotein interaction, of a pathogen lectin. In 2010, the SugarBindDB was trans- via substructure search and the knowledge of site- ferred to the SIB Swiss Institute of Bioinformatics where it was integrated in the ExPASy server (7). With this trans- specific glycosylation stored in UniCarbKB. Sug- fer, SugarBindDB was added to a collection of glycoscience arBindDB is accessible at: http://sugarbind.expasy. databases addressing the increasing bioinformatics needs org. created by recent technological advances in glycomics anal- ysis (8). The database has since been co-developed within an INTRODUCTION international consortium striving to connect and integrate glycomics data with other –omics knowledgebases. To this Glycosylation is the addition of glycan/carbohydrate/oligos end, these databases (9,10) share the same framework with accharide/sugar molecules to proteins and/or lipids. This user-friendly interfaces and are extensively cross-referenced modification enhances the functional diversity of proteins to relevant existing bioinformatics resources. The trans- and influences their biological activities while it gives gly- ferred content described in (11) was gradually augmented colipids an essential role to play in cellular recognition (1). and a new implementation matching that of sister database Glycans not only modify proteins or lipids overlaying the UniCarbKB (10) was launched in 2013. The new version in- surface of cells, but they also offer numerous binding op-
*To whom correspondence should be addressed. Tel: +41 22 379 50 50; Fax: +41 22 379 58 58; Email: [email protected]
C The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Downloaded from https://academic.oup.com/nar/article-abstract/44/D1/D1243/2503058 by guest on 29 July 2018 D1244 Nucleic Acids Research, 2016, Vol. 44, Database issue
troduced user-friendly data browsing and searching as first SugarBindDB uses PostgreSQL (Version 9.2) as the un- described in (12). New features and content are regularly derlying database system that consists of multiple schemas added and released quarterly on average. to ensure data integrity by managing structural, literature A glycan is a branched tree-like molecule synthesized and experimental data collections. At the time of writing from eight common monosaccharide building blocks in eu- the migration to PostgreSQL (Version 9.4) is being carried karyotes and dozens in prokaryotes (1). Details can be out. found in MonosaccharideDB (13). The Essentials of Glyco- Visualisation tools have been developed with the D3.js biology textbook (1) recommends a cartoon symbol for each library (Version 3.4.12). building block and their assembly into a two-dimensional structural notation. This depiction is now widely accepted Standard notation and encoding among glycobiolgists and therefore has been used in refer- ence databases such as GlycomeDB (14) and UniCarbKB The default display of glycan structures is the most (10). It was naturally also adopted for SugarBindDB. Only commonly used notation described in the textbook Es- part of the full glycan structure is often recognized by lectins sentials in Glycobiology (1) and promoted by the Con- and ‘glycan determinant’or‘glycoepitope’ usually designate sortium for Functional Glycomics (CFG: http://www. the particular sub-structure involved directly in the bind- functionalglycomics.org). Two other options of popular no- ing. Over one hundred of these ligands have been charac- tation are the two-dimensional ‘IUPAC condensed’ and terized and are actually catalogued in the GlycoEpitope ‘Oxford’ standards (17). Clicking on ‘Cartoon format’ (see database (15) and in Glyco3D (16). They are sometimes Figure 1) on the upper right hand side changes the dis- given common names, as for instance, the glycan ligands of play. The Essentials/CFG graphic representation assigns a the well-known blood group-related ABO and Lewis anti- coloured shape to each monosaccharide (e.g. yellow circle gen systems. The core data element of SugarBindDB is a for galactose, shortened as Gal) and links components in set of three inseparable components: the pathogenic agent, a graph. Shared colours or shapes generally denote struc- a lectin adhesin and a glycan ligand. Each of these enti- tural similarity among monosaccharides. For example, N- ties is named with as much precision as possible: taxonomic Acetylgalactosamine (yellow square) differs from galactose designation for pathogen agents, protein name for lectins (yellow circle) through a so-called substituent (removal of and epitope name for ligands. Synonyms are listed when- an OH group and addition of an amino-acetyl group). ever reported. When names are missing (which is frequent Glycans have long been encoded in the IUPAC linear for- for lectins and pathogen strain names, especially in older lit- mat (18), that is, as regular expressions delineating branch- erature), then these entities are labelled N/S meaning ‘non ing structures with different bracket types. Such encoding specified’. The database includes additional information to can generate directional/linkage/topology ambiguity and supplement the core data, such as related diseases and af- is not sufficient in the handling of incomplete or repeated fected tissues or organs. SugarBindDB content is displayed units. More recently, several encoding formats for glycans in views. For example, an agent view will show the taxo- have been developed based on sets of nodes and edges, e.g. nomic name linked to the NCBI Taxonomy database, agent GlycoCT (19)andWURCS(20), that are suited to solving properties (e.g. morphology, motility, etc.), the structures of topological ambiguities and providing unequivocal descrip- stored glycan ligands associated with this agent, and the ref- tions. To date the GlycoCT format is acknowledged as the erence(s) providing evidence for the agent–ligand relation- default format for data sharing between glycan databases ship linked to NCBI/PubMed. Each glycan–protein bind- (21) and consequently the most commonly used format for ing pair is associated with at least one published reference. storing structural data. A view also lists a range of links connecting to further in- formation, either internal or external to the database. Data curation The following describes and illustrates the database con- tent as well as the usage of search, navigation and visual- The content of SugarBindDB is derived from screen- isation tools that are implemented to consult or reveal in- ing the literature. This task as well as information ex- formation. The present version of the database accessible traction is performed by experts with selective criteria at: http://sugarbind.expasy.org, contains 1256 (agent, lectin, (e.g. type of method, structure resolution, etc.). Particu- ligand) combinations backed by 174 publications. lar care was given to identifying medical information de- scribing tissues, symptoms and diseases. Ideally controlled vocabularies and ontologies need to be adopted for eas- DATABASE OVERVIEW ier data curation. Protein, tissue and disease description in SugarBindDB rely on UniProtKB guidelines (http://www. Design and implementation uniprot.org/help/controlled vocabulary). Apart from these, SugarBindDB is built with the open-source framework Play the creators of SugarBindDB have opted to keep the au- (Release 2.2.6) written in Java and Scala, which follows thors’ descriptor nomenclature in each published article. the model-view-controller architecture. The views (user- This position is proving less sustainable in view of our ef- interface) are predominantly written in Scala and include forts to now cross-link SugarBindDB with glycan array JQuery and Bootstrap Javascript libraries. The model and data and nomenclature will remain an issue until the con- controller layers are written in Java and the Ebean object- trolled vocabularies and ontologies become more widely relational mapping (ORM) library is used to query the un- adopted. For this manuscript, a comprehensive literature derlying database model. and database screen was undertaken to provide a unified
Downloaded from https://academic.oup.com/nar/article-abstract/44/D1/D1243/2503058 by guest on 29 July 2018 Nucleic Acids Research, 2016, Vol. 44, Database issue D1245
Figure 1. (A) Result page prompted by querying SugarBindDB with the pathogen/agent name ‘Escherichia coli R45’. The page is organized in blocks as a function of the query. In this case, since the query is an agent, the header is the name of the agent. Then, all bindings reported in the database for this agent are shown. The glycan molecule is presented linearly for the sake of simplicity but mousing over the formula prompts a two-dimensional cartoon presenting the Essentials of Glycobiology 2nd Edition/CFG nomenclature symbols. The association boxes are displayed on the right side with a colour code corresponding to the entity type, orange for lectins, red for agents, green for affected area/tissue, pink for disease. Clicking on any of these boxes prompts the list of entities sharing the same binding. The screen capture shows the effect of clicking on disease. (B) Result page prompted by querying SugarBindDB with the ligand sequence: Fuc(a1–2)Gal(b1–3)[Fuc(a1–4)]GlcNAc(b1–3)Gal. In this case, the header is the name of the ligand and two answers are returned. It is shown here that two unrelated pathogens a bacterium (H. pylori) and a virus (Norovirus Norwalk) recognize a similar glycan ligand.
Downloaded from https://academic.oup.com/nar/article-abstract/44/D1/D1243/2503058 by guest on 29 July 2018 D1246 Nucleic Acids Research, 2016, Vol. 44, Database issue
set of glycoepitopes with known synonyms. This dictionary search. Note that the colour code is identical throughout: will soon be available and its use will facilitate compliance red for agents, blue for ligands, orange for lectins, green for with developing standard nomenclature. tissue/affected area and pink for disease. Each entity (agent, lectin or ligand) is described by a range of properties that are summarized in the entity- BROWSING AND SEARCHING dedicated page. These properties are displayed on the right There are two options for accessing information regarding side of the page and linked internally or to external re- pathogens and their associated lectins: browsing or query- sources. When an internal link is activated, it leads to a new ing the database. These two approaches are available from page where all corresponding ligands will be shown. For in- the menu bar at the top of the homepage. Browsing is possi- stance, clicking on any of the agent properties (e.g. ‘flag- ble from the upper bar menu shaded in grey. It is of course, ellated’) will prompt the set of all stored ligands that are less targeted than querying, though in the end, the same in- known to be bound by flagellated bacteria. This is a typical formation can be reached. internal link. Lower to the right in an entity page, other in- The query page is prompted from the homepage by click- ternal links appear in coloured blocks that indicate a type ing on ‘Start here with SugarBind’. Query terms belong of association (colour coded as in Figure 1) and the corre- to six predefined categories: agent, ligand, lectin, affected sponding number of links shown in brackets. area, references or diseases. Auto-completion is activated External links are summarized and sorted by categories upon user input. Queries may combine several terms either in Table 1. A bacterial agent is cross-linked to its corre- within the same category, for example a list of pathogen sponding summary page in HAMAP (High quality Auto- names (‘agent’ option selected), or across categories when mated and Manual Annotation of microbial Proteomes), a the ‘multi-criteria’ option is selected. bacteria-oriented subproject of the UniProtKB protein an- Each query result page displays a set of lectin–ligand notation scheme (22). The HAMAP summary page speci- pairs. It is structured in blocks, each corresponding to ei- fies basic properties (Gram positive/negative, (an)aerobic, ther a known or an unspecified (N/S) strain of the corre- motility, etc.) and whether the organism was or not fully sponding pathogen. These blocks are grouped under en- sequenced. HAMAP’s coverage is completed by links to tity category headers matching the categories of entities Genomes OnLine Database (GOLD) that provides equiv- of which names were input in the query window. Figure alent information (23). The same principle is applied to 1A shows an example of the result of querying with ‘Es- viruses with cross-links to ViralZone (24). Note that Viral- cherichia coli R45’ to illustrate this point. The header in Zone applies reciprocal cross-referencing to SugarBindDB. this case is the agent. The screen capture in Figure 1A Publications reporting glycan binding very seldom pro- shows the various actions a user can undertake to visu- vide a protein name, let alone a sequence accession num- alize more information such as mousing over ligand se- ber for a lectin. Lectins are also very poorly annotated quences to see a 2D representation, or clicking on fur- in bacterial genomes and in UniProtKB, while dedicated ther information as detailed in the next section. For ex- databases contain sparse information due to limited exper- ample, clicking on the ‘Disease’ box pulls down a list of imental data (25). That is why we are currently devoting ef- known diseases associated with the listed bindings. Fig- forts to include more protein sequence-based information ure 1B shows another example of query results, but this that can be derived from mining the literature as well as time while inputting a ligand, namely: Fuc(a1–2)Gal(b1– databases. Even though this will only provide rough anno- 3)[Fuc(a1–4)]GlcNAc(b1–3)Gal. The ligand search takes tation, it may provide some leads for further investigation. names or strings of monomers. In this particular exam- It will also feed our procedure for inferring lectin names by ple, 2 ligands in the database match or contain the entered similarity. string, the first one in Helicobacter pylori and the second in Finally, we are gradually including glycan structure array Norovirus Norwalk. As hinted from clicking on the disease interaction cross-links to enhance the knowledge of binding boxes, the former is known to cause gastritis and stomach to specific glycoepitopes. Published glycan array work such cancer while the latter is an identified cause of gastroenteri- as (26) is also stored in the databases of the CFG. tis. Ligand similarity brings out tissue similarity. Interactive graphs Navigation via internal and external cross-links Lectin–ligand pairs associated with distinct pathogens can As described in (10), connectivity within the database sup- be visualized all together by clicking on the ‘view result as ports easy navigation and potential discovery through un- graph’ box located above a result list (see Figure 1). A new suspected similarities. Coloured association boxes are ubiq- window/tab is automatically opened, showing, first, a so- uitous in the display of search results and are instanti- called Sankey graph, which maps all information on a hier- ated in each page where they occur. Upon a click, these archical graph defined in the following order: agent–lectin– boxes show the list of entities sharing the same properties ligand. The same data are plotted in Directed force graph as stored in the database. As in the case illustrated in Fig- below the Sankey graph. Each agent name (pathogen) is an ure 1B, these associations may point to further similarities initial node linked to its associated strains. Each strain is a that are not necessarily known. For example, it is not es- node linked to its associated lectin(s) and each lectin is an- tablished that Fuc(a1–2)Gal(b1–3)[Fuc(a1–4)]GlcNAc(b1– other node linked to the glycan structure that it binds. Each 3)Gal is specifically recognized by gastro-enteric pathogens ligand is a final node. The links are visualized as grey con- but this binding is suggested by the results of the ligand nections. Clicking on any node highlights the path that goes
Downloaded from https://academic.oup.com/nar/article-abstract/44/D1/D1243/2503058 by guest on 29 July 2018 Nucleic Acids Research, 2016, Vol. 44, Database issue D1247
Table 1. List of current, soon available and planned cross-references in SugarBindDB, categorized by the type of information provided by the added link Database name URL Annotation type Ref Current PubMed http://www.ncbi.nlm.nih.gov/pubmed Supporting evidence of binding - Agents Taxonomy http://www.ncbi.nlm.nih.gov/taxonomy Pathogen taxonomy - HAMAP http://hamap.expasy.org Summary description of bacteria with link (22) to possible genome sequence GOLD https://gold.jgi-psf.org Summary description of bacteria with link (23) to possible genome sequence ViralZone http://viralzone.expasy.org/ Summary description of viruses with link to (24) corresponding viral proteins Lectins UniProtKB http://www.uniprot.org Lectin functional annotation (6) Glyco3D/lectin http://glyco3d.cermav.cnrs.fr/search.php? Lectin three-dimensional structure (16) type = lectin CFG arrays http://www.functionalglycomics.org/ Known binding patterns of lectins (not (36) glycomics/publicdata/primaryscreen.jsp curated data) Ligands UniCarbKB http://unicarbkb.org/ Full structures containing ligands (10) Next release Lectins Pfam http://pfam.xfam.org/ Lectin domain classification (37) UniRef http://www.uniprot.org/uniref Lectin amino acid sequence classification (6) GlycanBuilder https://code.google.com/p/glycanbuilder/ Interface for building and searching glycan (30) structure Prospective Agents BCSDB http://csdb.glycoscience.ru/bacterial Known bacterial glycans possibly (32) recognized by human lectins PACDB http://jcggdb.jp/search/PACDB.cgi Pathogen–ligand binding data Unpub Lectins Glycosciences lab arrays https://glycosciences.med.ic.ac.uk/data.html Known binding patterns (curated data) (35) Ligands Glyco3D http://glyco3d.cermav.cnrs.fr 3D models of ligands (16) Glycam http://glycam.org 3D models of ligands (38) GlycoEpitope http://www.glycoepitope.jp Glycan epitope characterisation (15) GlycomeAtlas https://rings.t.soka.ac.jp/GlycomeAtlasV3/ Tissue expression (33) GUI.html
through it via a colour change to orange. Note that clicking ligands to full structures containing these substructures is on any name in the graphs leads directly to the correspond- the most natural link between SugarBindDB and UniCar- ing entity page. bKB. We thus implemented a substructure search, the result Figure 2 shows the Sankey graph for two strains of Heli- of which is directly accessible from ligand pages. cobacter pylori and the paths (highlighted in orange) linking In a graph representation of a glycan, each monosaccha- the two well-documented lectins BabA and SabA to their ride residue is a node possibly associated with a list of prop- glycan ligands. The graph convincingly shows the speci- erties and each linkage is an edge also potentially associated ficity of the two lectins in recognising two distinct types with a list of properties. An RDF ontology that translates of glycan molecules. While an obvious preference for O- a glycan structure with all potential biological properties glycoepitopes characterizes BabA binding, SabA selectively into a list of triples is described in (28). The proposed on- binds extended oligosaccharides attached to lipids. These tology is based on GlycoCT. All monosaccharides and sub- observations need careful scrutiny, since the glycoconjugate stituents are treated as separate components and annotated type (protein or lipid) may reflect both limits in the state- in the residue list with a specific ID. The connectivity list of-the-art binding assay techniques as well as in our knowl- contains the linkages between the components annotated edge of glycan expression. With the SugarBindDB interac- in the residue list. Our ontology follows the same principle: tive graphic view, the user can investigate lectin or ligand all substituents are treated as separate components as op- specificity, or alternatively, compare the behaviour of dif- posed to merging them with their associated monosaccha- ferent pathogens that infect the same tissues and generate rides in order to avoid contaminating the model with bio- similar clinical symptoms. logical assumptions. Following this ontology, glycan sub- structures are translated into a SPARQL query. A native support to this query language is provided by each RDF Substructure search triple store, thereby not tying our solution to any particular The recent introduction of semantic web technologies in product; however, the queries support SPARQL 1.1. Sesame glycobioinformatics (27) has led us to rely on the standard- API (29) together with the Java driver provided by Openlink ized Resource Description Framework (RDF) format for that has been used for querying Virtuoso RDF triple store. efficient structure matching. In fact, matching glycoepitope
Downloaded from https://academic.oup.com/nar/article-abstract/44/D1/D1243/2503058 by guest on 29 July 2018 D1248 Nucleic Acids Research, 2016, Vol. 44, Database issue
Figure 2. This Sankey graph highlights the specificity of the two lectins of Helicobacter pylori strain J99 in binding a variety of glycan determinants as stored in the database. Clicking on these lectins SabA and BabA in the graph changes the path colour from grey to orange. Then the specificity of each lectin is brought out by the absence of common path leading from the lectin to the glycan ligand. Moreover, the glycans that are recognized by each lectin make two distinct structural groups.
Structure matching is pre-calculated and included in the This example emphasizes a simple and hypothesis-driven database. Each match is linked to the corresponding Uni- procedure that can be followed to identify protein–protein CarbKB glycan structure entries. interactions mediated by glycans in four steps:
1. Start with a SugarBindDB glycan–protein binding pair and recall the UniProtKB accession number of the cor- responding lectin; Protein–protein interactions mediated by glycans 2. Map the corresponding ligand to full glycan structures SugarBindDB can be used to identify protein partners in- of UniCarbKB (via the pre-calculated sub-structure teracting via glycans. This is illustrated in Figure 3. In this search); example, VP1, a viral lectin of the Norovirus Norwalk, 3. Use cross-links of selected glycan structures to UniPro- is recorded as binding the so-called B antigen (tri-) gly- tKB to retrieve information on carrier glycoproteins; can determinant/glycoepitope. The VP1 protein is linked 4. Correlate UniProtKB accession numbers of steps 1 and 3 to UniProtKB information and details of its glycan deter- to hypothesize interactions between pathogen lectin and minant can be viewed in the corresponding ligand page of host glycoprotein. SugarBindDB. In this page, the precalculated substructure search reports 34 matches of reported glycan structures that The initial glycan–protein binding pair of SugarBindDB contain this determinant in UniCarbKB. Examining the is the essential piece of information to start with in order to corresponding full glycan structure entries shows that the identify potentially new protein–protein interactions. vast majority of these structures are borne by mucins. Fig- ure 3 shows one such entry and the link to the associated FUTURE PROSPECTS glycoprotein (MUC-4). With this association, it can then be hypothesized that the VP1 viral lectin interacts with a hu- One of the next features of SugarBindDB will be the man mucin via its glycosylation. Needless to say, it is one integration of the Glycan Builder tool (30) that pro- potential interaction among many other possible ones. The vides an interface for creating a glycan structure with the careful examination of the alternative 33 structures, even CFG/Essentials of Glycobiology symbol notation. Once though similar through a shared relation to mucins, is im- drawn by the user, the 2D representation will be di- perative prior to establishing the interaction as a fact. rectly used for querying. In addition, we plan to increase
Downloaded from https://academic.oup.com/nar/article-abstract/44/D1/D1243/2503058 by guest on 29 July 2018 Nucleic Acids Research, 2016, Vol. 44, Database issue D1249
Figure 3. This figure shows how protein interacting partners can be found by following SugarBindDB cross-links. VP1, a viral lectin of theNorovirus Norwalk strain binds the B antigen (tri-) glycan determinant. The VP1 protein is linked to UniProtKB Q83884. The substructure search associates the B antigen (tri-) glycan determinant with 34 full glycan structure entries of UniCarbKB. One of these 34 matches is shown as an example. This structure is linked to UniProtKB Q99102 describing the host glycoprotein (MUC-4) on which the glycan is attached. The dashed line suggests a possible interaction between VP1 and MUC-4.
the connectivity of the database as highlighted in Ta- targets for cross-linking. Last but not least, we plan to en- ble 1 (prospective). We will interconnect further with the hance the annotation of lectins by substantially increasing Glyco3D databases and related tools to better visualize 3D the number of references to glycan array experiments from interactions (16,31). We also foresee the importance of com- the CFG as well as from the Glycosciences Lab of the Im- bining the current information in SugarBindDB with the perial College in London (35). These two sources are par- potential host response to infection via the human lectin ticularly rich for characterising viral lectins. recognition of bacterial glycans as recorded in BCSDB (32). Furthermore, glycan tissue profiling is likely to help deci- pher the role of glycans; and we are planning integration ACKNOWLEDGEMENTS with GlycomeAtlas data (33). In fact, the latter two cited re- We thank Chloe´ Loiseau and Tiphaine Mannic for their sources follow among others, the current coordinated move contribution towards improving SugarBindDB annota- towards RDF-based data integration that is already shap- tions. ing future developments of glycoscience databases (21,34). Within our consortium, UniCarbKB represents data in RDF and the RDF scheme of SugarBindDB is in prepa- FUNDING ration. In parallel, the tight integration of all resources of Swiss National Science Foundation [SNSF 31003A 141215, the Japan Consortium for Glycobiology and Glycotech- 2012–14]; Australian National eResearch Collaboration nology databases (JCGGDB) relying on GlycoRDF grants Tools and Resources project [NeCTAR RT016, 2012–2013]; access to a wealth of information. Collaboration is ongo- EU [FP7-PEOPLE-2012-ITN # 316929]; Swiss Federal ing to build appropriate SPARQL queries and achieve our Government through the State Secretariat for Education, goal of boosting connectivity. In particular, GlycoEpitope Research and Innovation SERI; ExPASy is maintained by or PACBD, the closest resources to SugarBindDB (though the web team of the SIB Swiss Institute of Bioinformatics not including specific lectin information) will be our first and hosted at the Vital-IT Competency Center. Funding for
Downloaded from https://academic.oup.com/nar/article-abstract/44/D1/D1243/2503058 by guest on 29 July 2018 D1250 Nucleic Acids Research, 2016, Vol. 44, Database issue
open access charge: Swiss Federal Government through the glycopeptides and peptidoglycans: JCBN recommendations 1985. State Secretariat for Education, Research and Innovation Glycoconj. J., 3, 123–133. SERI. 19. Herget,S., Ranzinger,R., Maass,K. and Lieth,C.-W. v. d. (2008) GlycoCT––a unifying sequence format for carbohydrates. Carbohydr. Conflict of interest statement. None declared. Res., 343, 2162–2171. 20. Tanaka,K., Aoki-Kinoshita,K.F., Kotera,M., Sawaki,H., Tsuchiya,S., Fujita,N., Shikanai,T., Kato,M., Kawano,S., Yamada,I. et al. (2014) WURCS: the Web3 unique representation of REFERENCES carbohydrate structures. J. Chem. Inf. Model., 54, 1558–1566. 21. Campbell,M.P., Ranzinger,R., L¨utteke,T., Mariethoz,J., Hayes,C.A., 1. Varki,A. (ed). (2009) Essentials of glycobiology, 2nd edn. Cold Spring Zhang,J., Akune,Y., Aoki-Kinoshita,K.F., Damerell,D., Carta,G. Harbor Laboratory Press, NY. et al. (2014) Toolboxes for a standardised and systematic study of 2. Xiang,Z., Tian,Y. and He,Y. (2007) PHIDIAS: a pathogen-host glycans. BMC Bioinformatics, 15, S9. interaction data integration and analysis system. Genome Biol., 8, 22. Pedruzzi,I., Rivoire,C., Auchincloss,A.H., Coudert,E., Keller,G., de R150. Castro,E., Baratin,D., Cuche,B.A., Bougueleret,L., Poux,S. et al. 3. Urban,M., Pant,R., Raghunath,A., Irvine,A.G., Pedro,H. and (2015) HAMAP in 2015: updates to the protein family classification Hammond-Kosack,K.E. (2015) The Pathogen-Host Interactions and annotation system. Nucleic Acids Res., 43, D1064–D1070. database (PHI-base): additions and future developments. Nucleic 23. Reddy,T.B.K., Thomas,A.D., Stamatis,D., Bertsch,J., Isbandi,M., Acids Res., 43, D645–D655. Jansson,J., Mallajosyula,J., Pagani,I., Lobos,E.A. and Kyrpides,N.C. 4. Heyningen,S.V. and null (1974) Cholera toxin: interaction of (2015) The Genomes OnLine Database (GOLD) v.5: a metadata subunits with ganglioside GM1. Science, 183, 656–657. management system based on a four level (meta)genome project 5. Cooper,C.A., Joshi,H.J., Harrison,M.J., Wilkins,M.R. and classification. Nucleic Acids Res., 43, D1099–D1106. Packer,N.H. (2003) GlycoSuiteDB: a curated relational database of 24. Masson,P., Hulo,C., De Castro,E., Bitter,H., Gruenbaum,L., glycoprotein glycan structures and their biological sources. 2003 Essioux,L., Bougueleret,L., Xenarios,I. and Le Mercier,P. (2013) update. Nucleic Acids Res., 31, 511–513. ViralZone: recent updates to the virus knowledge resource. Nucleic 6. UniProt Consortium. (2014) Activities at the Universal Protein Acids Res., 41, D579–D583. Resource (UniProt). Nucleic Acids Res., 42, D191–D198. 25. Perez,S.,´ Rivet,A. and Imberty,A. (2014) 3D-Lectin Database. In: 7. Stockinger,H., Altenhoff,A.M., Arnold,K., Bairoch,A., Bastian,F., Endo,T, Seeberger,PH, Hart,GW, Wong,C-H and Taniguchi,N Bergmann,S., Bougueleret,L., Bucher,P., Delorenzi,M., Lane,L. et al. (eds). Glycoscience: Biology and Medicine. Springer, Tokyo, pp. 1–7. (2014) Fifteen years SIB Swiss Institute of Bioinformatics: life science 26. Stevens,J., Blixt,O., Paulson,J.C. and Wilson,I.A. (2006) Glycan databases, tools and support. Nucleic Acids Res., 42, W436–W441. microarray technologies: tools to survey host specificity of influenza 8. Hart,G.W. and Copeland,R.J. (2010) Glycomics hits the big time. viruses. Nat. Rev. Microbiol., 4, 857–864. Cell, 143, 672–676. 27. Ranzinger,R., Aoki-Kinoshita,K.F., Campbell,M.P., Kawano,S., 9. Hayes,C.A., Karlsson,N.G., Struwe,W.B., Lisacek,F., Rudd,P.M., L¨utteke,T., Okuda,S., Shinmachi,D., Shikanai,T., Sawaki,H., Packer,N.H. and Campbell,M.P. (2011) UniCarb-DB: a database Toukach,P. et al. (2015) GlycoRDF: an ontology to standardize resource for glycomic discovery. Bioinforma. Oxf. Engl., 27, glycomics data in RDF. Bioinforma. Oxf. Engl., 31, 919–925. 1343–1344. 28. Alocci,D., Mariethoz,J., Horlacher,Oliver, Campbell,Matthew and 10. Campbell,M.P., Peterson,R., Mariethoz,J., Gasteiger,E., Akune,Y., Lisacek,Frederique (2015) Graph Database vs RDF Triple Store: A Aoki-Kinoshita,K.F., Lisacek,F. and Packer,N.H. (2014) Comparison on Glycan Substructure Search. PLoS ONE. UniCarbKB: building a knowledge platform for glycoproteomics. 29. Broekstra,J., Kampman,A. and van Harmelen,F. (2002) Sesame: A Nucleic Acids Res., 42, D215–D221. Generic Architecture for Storing and Querying RDF and RDF 11. Shakhsheer,B., Anderson,M., Khatib,K., Tadoori,L., Joshi,L., Schema. In: Horrocks,I and Hendler,J (eds). The Semantic Web –– Lisacek,F., Hirschman,L. and Mullen,E. (2013) SugarBind database ISWC 2002. Springer, Heidelberg, Vol. 2342, pp. 54–68. (SugarBindDB): a resource of pathogen lectins and corresponding 30. Ceroni,A., Dell,A. and Haslam,S.M. (2007) The GlycanBuilder: a glycan targets. J. Mol. Recognit. JMR, 26, 426–431. fast, intuitive and flexible software tool for building and displaying 12. Mariethoz,J., Khatib,K., Campbell,M.P., Packer,N.H., Mullen,E. glycan structures. Source Code Biol. Med., 2,3. and Lisacek,F. (2014) SugarBindDB SugarBindDB, a Resource of 31. Perez,S., Tubiana,T., Imberty,A. and Baaden,M. (2015) Pathogen Pathogen Lectin-Glycan Interactions Lectin-glycan Three-dimensional representations of complex carbohydrates and interactions. In: Endo,T, Seeberger,PH, Hart,GW, Wong,C-H and polysaccharides–SweetUnityMol: A video game-based computer Taniguchi,N (eds). Glycoscience: Biology and Medicine. Springer, graphic software. Glycobiology, 25, 483–491. Tokyo, pp. 1–7. 32. Toukach,P.V. (2011) Bacterial carbohydrate structure database 3: 13. Rojas-Macias,M.A., Loss,A., Bohne-Lang,A., Frank,M. and principles and realization. J. Chem. Inf. Model., 51, 159–170. L¨utteke,T. (2015) Databases and Tools of GLYCOSCIENCES.de 33. Konishi,Y. and Aoki-Kinoshita,K.F. (2012) The GlycomeAtlas tool Web Server. In: Taniguchi,N, Endo,T, Hart,GW, Seeberger,PH and for visualizing and querying glycome data. Bioinforma. Oxf. Engl., Wong,C-H (eds). Glycoscience: Biology and Medicine. Springer, 28, 2849–2850. Tokyo, pp. 233–239. 34. Aoki-Kinoshita,K.F., Bolleman,J., Campbell,M.P., Kawano,S., 14. Ranzinger,R., Herget,S., von der Lieth,C.-W. and Frank,M. (2011) Kim,J.-D., L¨utteke,T., Matsubara,M., Okuda,S., Ranzinger,R., GlycomeDB–a unified database for carbohydrate structures. Nucleic Sawaki,H. et al. (2013) Introducing glycomics data into the Semantic Acids Res., 39, D373–D376. Web. J. Biomed. Semant., 4, 39. 15. Kawasaki,T., Nakao,H. and Tominaga,T. (2008) GlycoEpitope: A 35. Liu,Y., Palma,A.S. and Feizi,T. (2009) Carbohydrate microarrays: Database of Carbohydrate Epitopes and Antibodies. In: key developments in glycobiology. Biol. Chem., 390. Taniguchi,N, Suzuki,A, Ito,Y, Narimatsu,H, Kawasaki,T and 36. Smith,D.F., Song,X. and Cummings,R.D. (2010) Use of Glycan Hase,S (eds). Experimental Glycoscience. Springer, Tokyo, pp. Microarrays to Explore Specificity of Glycan-Binding Proteins. In: 429–431. Methods in Enzymology. Elsevier, Vol. 480, pp. 417–444. 16. Sarkar,A., Drouillard,S., Rivet,A. and Perez,S. (2015) Databases of 37. Finn,R.D., Bateman,A., Clements,J., Coggill,P., Eberhardt,R.Y., Conformations and NMR Structures of Glycan Determinants. Eddy,S.R., Heger,A., Hetherington,K., Holm,L., Mistry,J. et al. Glycobiology, 25, 1480–1490. (2014) Pfam: the protein families database. Nucleic Acids Res., 42, 17. Harvey,D.J., Merry,A.H., Royle,L., Campbell,M.P., Dwek,R.A. and D222–D230. Rudd,P.M. (2009) Proposal for a standard system for drawing 38. Kirschner,K.N., Yongye,A.B., Tschampel,S.M., structural diagrams of N- and O-linked carbohydrates and related Gonzalez-Outeiri´ no,J.,˜ Daniels,C.R., Foley,B.L. and Woods,R.J. compounds. Proteomics, 9, 3796–3801. (2008) GLYCAM06: a generalizable biomolecular force field. 18. Sharon,N. (1986) IUPAC-IUB Joint Commission on Biochemical Carbohydrates. J. Comput. Chem., 29, 622–655. Nomenclature (JCBN). Nomenclature of glycoproteins,
Downloaded from https://academic.oup.com/nar/article-abstract/44/D1/D1243/2503058 by guest on 29 July 2018 Technical Note
pubs.acs.org/jpr
GlycoSiteAlign: Glycosite Alignment Based on Glycan Structure Alessandra Gastaldello,*,†,‡ Davide Alocci,†,‡ Jean-Luc Baeriswyl,§,† Julien Mariethoz,†,‡ and Frederique Lisacek*,†,‡,§
† Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, 7 route de Drize, 1227 Geneva, Switzerland ‡ Computer Science Department CUI, University of Geneva, 1227 Geneva, Switzerland § Section of Biology, Faculty of Sciences, University of Geneva, 1211 Geneva, Switzerland
ABSTRACT: GlycoSiteAlign is a tool designed to align amino acid sequences of variable length surrounding glycosylation sites depending on the knowledge of glycan structure. It is an exploratory resource intended for the identification of characteristic amino acid patterns of unique glycan−protein interactions. GlycoSiteAlign uses data from the UniCarbKB and UniProtKB databases, and it is hosted on ExPASy, the Swiss Institute of Bioinformatics resource portal. The user can select either specific or general glycan features, set the length of the protein fragments, and trigger an alignment with the option of including 90% homologous proteins. The tool previews and downloads alignments that may reveal amino acid patterns corresponding to selected features (e.g., “fucosylated” versus “non-fucosylated”). GlycoSiteAlign will integrate new data as they become available to confirm and expand results. It is presented as a promising tool for assessing and refining the knowledge about the constraints that link a particular glycan structure to a particular glycosite, and in the long term, this application could help improve prediction tools. KEYWORDS: glycoprotein, sequence alignment, glycosylation, glycan structure, glycosite, amino acid patterns
■ INTRODUCTION ribonuclease-2, and the β subunit of the interleukin-12, display 8−10 Protein glycosylation is the enzyme-catalyzed covalent attach- this type of glycosylation. ment of a glycan to a polypeptide resulting in the formation of a Notable advance has been made in the comprehension of glycoprotein. It is one of the most common cotranslational and glycan function and characterization of their structure, but post-translational modifications taking place in the endoplasmic some fundamental aspects of glycan biology remain poorly reticulum (ER) and Golgi apparatus. There are three major known and understood. Among these are the mechanisms of glycosylation and the correlation between the peptide Downloaded via UNIV OF GENEVA on July 30, 2018 at 15:15:19 (UTC). types of protein glycosylation: N-, O-, and C-glycosylation. In N-glycosylation, glycans are attached to a nitrogen atom of an sequences around glycosylation sites and the glycan structures. asparagine (N) residue in the following protein consensus This unknown correlation challenges the prediction of glycosyla- sequence: N-X-S or T, where S is a serine, T a threonine, and tion sites on proteins, also limited by the scarcity of data.
See https://pubs.acs.org/sharingguidelines for options on how to legitimately share published articles. Indeed, the experimental determination of glycosylation sites is X any amino acid except proline (P). In O-glycosylation, glycans ffi fi are instead linked to an oxygen atom of an S or a -T residue di cult to achieve because a large amount of puri ed proteins is and, rarely, to a hydroxyproline or a tyrosine.1,2 No particular needed, and glycosylation has been shown to be an organism- fi 11 ff consensus sequence has been detected for O-glycosylation, even and tissue-speci c process. In recent years, remarkable e ort if it occurs usually in a portion of sequence rich in hydroxy has been made to boost both the discovery and the mapping of amino acids and, as shown by sequence alignment studies, O- and N-glycosylation sites. For example, a genetic engineering approach using human cell lines and developed by Clausen proline residues frequently occur around the glycosylation sites, 12 especially at amino acid positions of −1 and +3 (by convention, and colleagues has enabled proteome-wide discovery of position 0 corresponds to the glycosylated residue).3,4 To the GalNAc-type O-glycosylation sites. As described in this work, fi contrary, the presence of charged amino acids at these many previously unknown sites were identi ed. The systematic sites tends to prevent the binding of glycans.5,6 N-linked and mapping of N-sites was undertaken earlier and gave rise, for O-linked glycosylation are the most abundant types of example, to UniPep, a database of human N-linked glycosites fl glycosylation in mammals.7 In C-glycosylation, also called found on proteins isolated from plasma, cerebrospinal uid, C-mannosylation, glycans get attached to the carbon atom of a various tissues, and cell sources. This database can be searched tryptophan (W) residue, and the W-X-X-W motif was identified as the acceptor consensus sequence for glycan binding. Few Received: May 26, 2016 human proteins, such as the thrombospondin-1 (TSP-1), the Published: August 15, 2016
© 2016 American Chemical Society 3916 DOI: 10.1021/acs.jproteome.6b00481 J. Proteome Res. 2016, 15, 3916−3928 Journal of Proteome Research Technical Note
Figure 1. Diagram of the mode of operation of GlycoSiteAlign. Information about glycan features, glycosylated proteins, and sites are retrieved from UniCarbKB, and glycoprotein sequences are retrieved from UniProtKB. For each feature, sequence fragments around glycosylation sites are selected and aligned; 90% homologous proteins can be added to the alignment. with different parameters and displays information such as the but to the best of our knowledge, these methods fail to take subcellular localization of the protein, the predicted N-linked into consideration an important factor that is likely to largely glycosites and their location, the mass spectrometrically contribute to constraining the peptide features of glycosylated identified glycopeptides, and their sequence along with relevant sites and the outcome of the glycosylation process: the struc- annotations and a full protein topology.13 Currently, the database ture of the attached glycans. This oversight unfortunately contains 1522 unique N-linked glycosites. reflects the scarcity of fully characterized glycoprotein Despite the clear value of these data sets and the develop- structures. However, the release of new data in UniCarbKB21 ment of new technology such as SimpleCell, the experimental that upgraded GlycoSuiteDB22 enables now investigating the determination of glycosylation sites only partially describes the relationship between features of glycosylation site and glycan overall process and explains the low level of full glycoprotein structures. Preliminary studies were already performed to characterization and annotation. To compensate for the correlate, for example, the N-glycan type with the accessibility lack of data, in silico prediction has been and continues to of glycosylation sites on glycoproteins.23,24 More recently, a − be developed. Tools such as NetOGlyc,14 16 NetNGlyc,17,18 quantitative and qualitative analysis of the glycoprofile of GlycoEP,19 and GlycoMine20 rely on machine-learning 474 N-linked glycosites on 169 mammalian N-glycoproteins techniques (neural network and Support Vector Machine) and established a correlation between N-glycan structural motifs are trained to recognize the protein sequence context, the and the structure of the carrier glycosylation sites. This in silico secondary structure, and the accessibility of the glycosylation study highlights the contribution of solvent accessibility and sites. Note that NetOGlyc and NetNGlyc have been available physicochemical properties of protein chains surrounding the since the 1990s and, over the years, the NetOGlyc server has glycosites.25 Following a similar path, the web application considerably improved its sensitivity, particularly following presented in this paper and named GlycoSiteAlign has been the massive information input provided by Clausen and col- developed to contribute to the assessment of this relationship leagues.12 More recently, the GlycoEP server was developed to while accounting for a variety of glycan properties and the amino improve prediction models for N-, O-, and C-linked glycosites in acid sequence found in the glycosite vicinity. It is built upon the eukaryotic protein sequences.19 The method used data sets of assumption that glycoproteomics will soon deliver new data sets, experimentally verified eukaryotic glycoproteins extracted from but as this new data is still often limited to glycoprotein site the SWISS-PROT June 2011 release, in which the redundancy associated with glycan composition, we designed GlycoSiteAlign was reduced at the level of protein sequence and of glycosite to be queried with such fuzzy data. The tool is intended to help patterns. The GlycoMine project also proposes a novel investigate the link between glycan structures and the attachment bioinformatics approach to improve the prediction performance site prior to revising prediction methods. for all the three major types of glycosylation in humans.20 The GlycoSiteAlign groups and aligns amino acid regions of method uses intensive feature selection techniques to choose and glycoproteins that surround glycosylation sites carrying glycans to integrate several informative features (e.g., sequence-based with selected structural features. In this way, particular amino structural and functional features) to predict the glycosylation acid patterns and motifs (in addition to the well-known sites in a protein of interest. The authors stated that this approach patterns of N-X-S or -T for N-glycosylation and W-X-X-W for allows GlycoMine to outperform other tools, including NetNGlyc C-mannosylation) can be more easily identified. GlycoSiteAlign and NetOGlyc. does not predict glycosylation sites. It was designed as an The improvement, mostly in terms of the accuracy of exploratory tool for bringing out the expected specificity that prediction, brought by the tools described so far is important, constrains the attachment of a particular glycan structure at a
3917 DOI: 10.1021/acs.jproteome.6b00481 J. Proteome Res. 2016, 15, 3916−3928 Journal of Proteome Research Technical Note particular glycoprotein site. In the short term, the application Table 1. Search Categories and Glycan Features Found in will help investigate the mutual requirements conditioning Each Category glycan attachment to a glycoprotein. In the long run, observed regularities could be used to improve existing prediction tools search type features or to develop a new one. refined search GlycoSiteAlign is hosted on ExPASy, the bioinformatics UniCarbKB e.g., 4, 25, 96, 482, 541, 659, 705, 1031, 1119, 1231, 1315, ID 2859, and 7344 resource portal of the SIB Swiss Institute of Bioinformatics, determinants A (type 2), Lewis x (SSEA-1), O-mannose Lewis x, I antigen under the section dedicated to glycomics (http://www.expasy. disialyl T antigen, T (Tf antigen), Sda/CT antigen, a-Gal org/glycomics). In GlycoSiteAlign, the user can choose one or antigen more glycan features, and the application selects the group sialyl Tn antigen, Sialyl T antigen, HNK-1 antigen, type 2 of proteins linked to the glycans with those features. The LN2 application isolates fragments of adjustable length (by default, polysialic acid, type 1 LN, 3-sialyl-LN (type 2), sialylated 20 amino acids) on each side of each glycosylated residue. LDN These fragments are then aligned. The resulting alignments are HNK-1 on O-mannose, O-linked mannose, LacdiNAc fi (LDN) returned as text les and .svg images and also made available for type 2 LN (N-LacNAc), fucosylated LDN, 2,6-branched download. These outputs can be investigated to find recurring O-mannose amino acid patterns. The application will integrate new data as 4′-sulfated LDN, GM4 it is produced, thereby supporting the gradual refinement of our flexible search understanding of glycosylation. composition hexose, HexNAc, deoxyhexose, NeuAc, NeuGc, pentose, sulfate, phosphate, HexA, methyl, acetyl, other ■ EXPERIMENTAL PROCEDURES mass range from 164 to 4548 Da coarse search Mode of Operation of GlycoSiteAlign glycan fucosylated, nonfucosylated, core-fucosylated, noncore- structural fucosylated Data Collection. GlycoSiteAlign takes as input glycan properties structures and outputs aligned amino acid sequences surround- bisecting N-acetylglucosamine core-fucosylated, bisecting ing glycosylation sites. To accomplish this, GlycoSiteAlign N-acetylglucosamine nonfucosylated sialylated, nonsialylated, bisecting N-acetylglucosamine needs three types of data: glycan features, the position of sialylated glycosylation sites on glycoproteins, and glycoprotein sequences bisecting N-acetylglucosamine nonsialylated, bisecting (see the diagram in Figure 1). N-acetylglucosamine The glycan features are retrieved from UniCarbKB (www. core structures N-linked unicarbkb.org, November 18th, 2015 release) and are available hybrid, complex, high-mannose, truncated, cylose, hybrid to the user as filtering options. They include: UniCarbKB xylose, complex xylose, high-mannose xylose, truncated identifier (ID), monosaccharide composition, glycan determi- -xylose, not available nants, mass and several structural properties and core types. O-linked The full list is provided in Table 1. 0 (Tn antigen), 1 (T antigen), 2, not available The glycosylation sites are retrieved from UniCarbKB as N- and O-linked well; they are represented by the glycosylated amino acid(s), core not available the UniProt IDs of the proteins in which they occur, and the O- and C-linked position in the corresponding sequences (Figure 1). core not available The full sequences of glycoproteins are extracted from advanced search combine all of the above listed features UniProtKB (www.uniprot.org, November 2015 release), and glycan only portions surrounding glycosylation sites are selected features (Figure 1). The data is organized in dictionaries; a set of dictionaries most specific features, as reported in Table 1. Glycan composi- associates a list of UniCarbKB glycan IDs to each glycan feature, tion and mass range are two features allowing the user to be and a single dictionary associates these IDs with glycosylation specificyetstillflexible (“flexible search” in Table 1). In contrast, sites and UniProtKB protein IDs. Another dictionary is built to glycan structural properties and core structures are con- relate the latter with glycoprotein sequences. sidered as general glycan features (“coarse search” in Table 1). Selection of Protein-Sequence Fragments. Once the The granularity of the search need not be set. Mixed criteria can user has specified the desired glycan features and the frag- also be established in the advanced search page, where glycan ment length n (20 by default), GlycoSiteAlign retrieves the structures can be selected upon various and simultaneous corresponding UniCarbKB IDs and then the UniProtKB IDs, features (for example, an N-linked high-mannose that is both the glycosylation sites, and the glycoprotein sequences on fucosylated and sialylated). which the glycans are attached (see the diagram in Figure 1). Furthermore, to increase the amount of protein fragments From the latter, GlycoSiteAlign selects a portion of n amino and, consequently, the quality of the resulting alignments, acids on each side of each glycosylation site, creating fragments GlycoSiteAlign accepts experimentally determined glycan to be aligned (Figure 1). composition(s) and the corresponding glycosylation sites as a The number of retrieved glycans IDs and, as a consequence, direct input by the user. of the final glycoprotein fragments depends on the type of the Homologous Protein-Sequence Retrieval. GlycoSiteA- specified glycan features. Due to the currently limited knowledge lign proceeds with the alignment only if at least two glyco- of glycan structures at specific sites, the more specific the feature, protein fragments are selected. In the other cases, to compensate the fewer the glycosylation sites and the glycoprotein fragments. for the lack of data, GlycoSiteAlign offers the user the option In particular, UniCarbKB ID and determinants options are the to include in the alignment 90% homologous proteins.
3918 DOI: 10.1021/acs.jproteome.6b00481 J. Proteome Res. 2016, 15, 3916−3928 Journal of Proteome Research Technical Note
Figure 2. Collage of the screenshots of two selection pages of GlycoSiteAlign. (A) Composition page where users can select a glycan composition up to 12 monosaccharides and provide their own glycosites. (B) Glycan structural properties page where 11 different structural features can be selected by the user. In both pages, there are extra boxes to choose the type of protein linkages: N-, O-, or C-.
Indeed, UniProt entries include a field called “similar proteins” In the current version, GlycoSiteAlign only takes into considera- that covers the information on homology. The UniProt tion a similarity (homology) of 90% to increase the probability Reference Clusters 90 (UniRef90) is the result of clustering of having conserved amino acids in the site positions. protein sequences using the CD-HIT algorithm,26 such that Alignment. The total length of the fragments varies each cluster is composed of sequences that have at least 90% depending on the glycosylation types. If n is the number of sequence homology. UniRef90 is included in the UniProt selected amino acids (20 by default), in the event of O-linked “similar protein” section, and it is used to select 90% homol- sites, the fragments will have a length of n + 1 because there is ogous proteins by GlycoSiteAlign. Starting from the IDs of the no consensus motif. In this case, GlycoSiteAlign selects the available glycoproteins (from now on called reference proteins), n amino acids on the right side of the glycosylation site starting GlycoSiteAlign retrieves the IDs along with the sequences of the from the amino acid found immediately after the glycosylation 90% homologous proteins from UniProtKB and aligns them position. In the case of N-linked sites, the fragment length will with the sequences of the reference proteins using a third-party be n + 3 because the consensus sequence is composed of three clients for Python 2.7 provided by the European Bioinformatics amino acids (N-X-S or -T). With respect to the previous case, Institute.27 This service uses Clustal Omega28 to do multiple- there are two “exceeding” amino acids, and so GlycoSiteAlign sequence alignment. This step is necessary for matching the selects the fragment on the right side of the glycosylation site, position of each glycosylated site on the reference proteins with starting from the third amino acid after the glycosylation the positions of the hypothetical glycosylated sites on the homo- position. In C-glycosylation, the consensus motif has four amino logous proteins. If a site position is conserved, GlycoSiteAlign acids (W-X-X-W), and so the fragments have a total length of selects amino acid fragments surrounding the site found on both n + 4 amino acids. GlycoSiteAlign selects the n-long fragment the reference protein and the similar protein, otherwise, the on the right side of the glycosylation site from the fourth amino nonconserved site on the 90% homologous protein is discarded. acid after the glycosylation position. To constrain the alignment
3919 DOI: 10.1021/acs.jproteome.6b00481 J. Proteome Res. 2016, 15, 3916−3928 Journal of Proteome Research Technical Note
Table 2. Number of Considered Glycan Features and glycoprotein fragments of the consensus sequences characterizing Number of Resulting Alignments and of Aligned Glycan each type of glycosylation site: position n + 1 for O-glycosylation, Structures without 90% Homologous Proteins positions from n +1ton + 3 for N-glycosylation, and positions from n +1ton + 4 for C-glycosylation. In this way, only the aligned glycan feature type features alignments structures portions found on the left and right sides of the glycosylation consensus sequences are actually aligned. UniCarbKB ID 563 140 264 Output. GlycoSiteAlign returns the aligned sequences in composition 238 159 488 both text files and .svg images. Each text file contains the mass 234 160 485 UniCarbKB ID(s) and the glycan structure(s) of selected structural property 11 11 563 glycan(s) at the top of the page (one line equals one structure) core structure 10 N-linked 9 followed by the aligned sequences (one line equals one 4 O-linked 4 sequence), which are matched with the UniProtKB IDs, the 1 N- and O-linked 1 562 position of the glycosylated amino acids with reference to 1 C- and O-linked 1 determinant 1 ABO blood group 1 the whole original sequence, and the protein name. .svg images are made programmatically through Jalview, a system for 2 Lewis 2 30 9 antigen 8 547 viewing, editing, and analyzing multiple-sequence alignments. 14 others 13 The color scheme Clustal X has been chosen to paint and highlight whatever amino acid pattern was occurring in align- ments. Each residue in the alignment is assigned a color if the outside the consensus sequences, GlycoSiteAlign uses COBALT, amino acid profile of the alignment at that position meets some fi a constraint-based alignment tool. This tool nds a collection of minimum criteria specific for the residue type.31 pairwise constraints derived from database searches, sequence Web Interface similarity, and user input, then combines and incorporates them into a progressive multiple alignment.29 The constraints that The web interface of GlycoSiteAlign is built using Flask, GlycoSiteAlign transfers to COBALT are the positions on the a BSD licensed micro web application framework written in
Figure 3. Alignment corresponding to the N-linked “high-mannose” glycan with UniCarbKB Id 1061. (A) Alignment without similar proteins. (B) Alignment with similar proteins. To highlight the appearance of short amino acids patterns such as LLCL (L = leucine), HAV (H = histidine, A = alanine, and V= valine), and VFV (F = phenylalanine), the alignment was cut out; for the whole image, we refer the reader to the application site.
3920 DOI: 10.1021/acs.jproteome.6b00481 J. Proteome Res. 2016, 15, 3916−3928 Journal of Proteome Research Technical Note
Figure 4. Alignment corresponding to the N-linked “core-fucosylated” glycan with UniCarbKB ID 1099. (A) Alignment without similar proteins. (B) Alignment with similar proteins. It shows the increase of the short amino acid pattern HVH already present in A and the appearance of charged amino acids bands stained in magenta and red. The alignment was cut out; for the whole image, we refer the reader to the application site. Python and based on the Werkzeug toolkit and Jinja2 template ■ RESULTS AND DISCUSSION engine.32 The home page proposes a drop-down menu through which the General Overview user can choose the desired glycan features. These are divided in Table 2 summarizes the type and the number of glycan features four categories as already shown in Table 1. In addition, a box can retrieved from UniCarbKB, how many alignments without the be checked to include 90% similar proteins in the alignments. The inclusion of 90% homologous proteins arose for each feature, Go button validates the selection and prompts a new input page. and the number of glycan structures involved. In total, 563 Figure 2A shows an example of the Composition page. glycan structures, 220 glycosylation sites, and 137 glycoproteins The user can choose either a precise number or a range of were recovered from the UniProtKB 2015 release. monosaccharides; if a field is not filled, the number is set to 0, while an asterisk will sets the number to any. The bigger N-Glycosylation Sites fi the number of asterisks, the less speci c the choice, and the The alignments without similar proteins show mostly the more glycans and glycoproteins will be selected and aligned. presence of hydrophobic amino acids, glycine (G), cysteine As mentioned before, users can enter their own experimentally (C), and P residues in specific positions around the glycosyla- determined glycan composition(s) together with the protein tion sites. In the Clustal X color scheme chosen to graphically sequence(s) and glycosite(s) uploading a fasta file or copying represent the alignments, these are highlighted as blue data in the corresponding box. Figure 2B shows the page for glycan structural properties for the same example. (hydrophobic residues and C residue) vertical and horizontal bands and as yellow and orange spots (for P and G residues, The search by UniCarbKB ID prompts a page where glycan 31 structure IDs of UniCarbKB can be directly input if known. respectively). This is illustrated in Figures 3Aand4A, showing “ ” Otherwise, the page provides a link to the UniCarbKB Web site the alignments of the high-mannose glycan with UniCarbKB for querying the database and retrieving the proper IDs. ID 1061 and of the “core-fucosylated” glycan with UniCarbKB The output page of GlycoSiteAlign displays the results of ID 1099, respectively. While the first shows mostly bands of the alignments, which can be previewed and downloaded as text single hydrophobic amino acids, the latter shows also amino acid files and .svg images. pairs and trios.
3921 DOI: 10.1021/acs.jproteome.6b00481 J. Proteome Res. 2016, 15, 3916−3928 Journal of Proteome Research Technical Note
Figure 5. Alignment corresponding to the O-linked “non-core-fucosylated” glycan with UniCarbKB ID 657. (A) Alignment without similar proteins. It shows the EGF-like repeat pattern NGG(T/S)C and the TSP1 repeat pattern CSV preceding and the CGGG following the glycosylated amino acid. (B) Alignment with similar proteins. It shows how the addition of 90% homologous proteins strengthens patterns in A and reveals other patterns (e.g., the magenta band on the left). The alignment was cut out; for the whole image, we refer the reader to the application site.
O-Glycosylation Sites glycan 657) seems to contradict previous observations that cysteine in these positions could prevent O-glycosylation.4 With respect to N-linked proteins, the alignments without “ ” similar proteins show, in most cases, fewer blue bands or Another interesting alignment corresponds to glycan Core-2 , longer patterns or both but confirm an important presence of which shows an abundance of charged amino acids (E = glutamic acid, R= arginine, and K= lysine) at specific positions such as P residues all around the glycosylation sites and, in particular, − at the positions of −1 and +3. In addition, C and G residues are +1 + 15, + 19, and 14 (Figure 6A). often associated with proline. For example, the alignments of C-Glycosylation Sites glycans with UniCarbKB IDs 656 and 657, glucose and fucose, For the last major type of glycosylation, four glycosylated sites respectively, include Epidermal Growth Factor(EGF)-like are found in the human protein TSP-1 and two in two other repeats and TSP-1 repeats that are indeed rich in C and G different glycoproteins. Similarly to the previous described align- residues. In particular, in the glycoprotein fragments associated ments, they show mostly that specific positions are preferentially with glycan 656, there are two EGF-like repeat patterns occupied by hydrophobic amino acids (not shown). starting from position −2 to position +9 and other positions displaying a high degree of conservation (not shown). In glycan Introduction of Homologous Proteins 657 (fucose) alignment, there are two distinct types of amino N-Glycosylation Sites. In most cases, the introduction of acid patterns preceding or following the glycosylated amino the 90% homologous proteins to the alignments confirms the acid. The first pattern NGG(T/S)C immediately upstream is presence of the most frequent amino acids (usually hydro- associated mostly with the presence of a leucine (L) or a phobic and G and C residues) found in the alignments without T at position +5 and is characteristic of EGF-like repeats.33 similar proteins. However, in some cases, it brings to light The second pattern is a consensus sequence typically found longer patterns or amino acid residues or both not previously in the TSP-1 domain: CSV (V = valine) preceding the glycosyla- highlighted. A pair of examples of this are shown in Figures 3B tion site and CGGG following34 (Figure 5A). Interestingly, the and 4B. The first, the alignment of the “high-mannose” glycan presence of C residues within these motifs and close to glycosyla- with UniCarbKB ID 1061, reveals amino acid patterns of two tion sites (position −2 and +3 for glycan 656 and position +1 for and four residues preceding the glycosylation sites, such as the
3922 DOI: 10.1021/acs.jproteome.6b00481 J. Proteome Res. 2016, 15, 3916−3928 Journal of Proteome Research Technical Note
Figure 6. Alignment corresponding to “Core-2” O-linked glycans. (A) Alignment without similar proteins. It shows the presence of charged amino acids E (magenta band) and R and K (red bands) preceding and following the glycosylated amino acid found at position 21. (B) Alignment that includes similar proteins. It shows how the addition of 90% homologous proteins confirms and singles out charged amino acids in specific positions (magenta and red bands) and displays other patterns composed of hydrophobic amino acids (blue band) and H (histidine). The alignment was cut out; for the whole image, we refer the reader to the application site. pattern LLCL always associated with VF and GG associated examples of alignment by UniCarbKB ID were shown in with SVFV or the pattern GGYYVY (Y= tyrosine) not present Figures 3 and 4. in the alignment shown in Figure 3A. In contrast, Figure 7 compares the alignment of N-linked In Figure 4B, red and magenta bands, not present in sites corresponding to the general property “core-fucosylated” Figure 4A, underline charged amino acids (E, D (aspartic acid), (in panel A), with the alignments of N sites corresponding to K, and R), and blue-cyan strips show other patterns, such as the the specific features “Type 1 LN” determinant (in panel B) and HVH trio. “UniCarbKB ID 2553” (in panel C). In both B and C, certain O-Glycosylation Sites. The addition of 90% homologous amino acids such as hydrophobic L, F, A, Y (tyrosine), W, and proteins to the alignments has mostly the effect of increasing V or of the charged D and E (in panel C) are obviously more hydrophobic, charged amino acids and P and G residues around abundant. However, in panel A, the distribution of G and P glycosylated sites. This brings out amino acid patterns that are around glycosylated sites is more dominant. usually found in EGF-like repeats and TSP1 domains. The Figure 8 compares the alignments of O-linked sites correspond- Figure 5B illustrates this situation. On the left side at position ing to the general glycan features “Core-1” (in panel A) and − 20, there is a preponderance of the charged amino acid D “sialylated” (in panel B) with those of the precise glycan features (magenta band) followed by TSP1 (more rarely) or EGF-like “UnicarbKB ID 650” (in panel C) and the “Type 2 LN repeats (CSV(S/T)CG and NGG(S/T)C, respectively), the N-LacNAc” (in panel D) determinant. Similarly to N-linked latter being also associated with other amino acids than the sites in Figure 7, sequence fragments aligned according to more D residue. The presence of P and G residues and of hydrophobic general features (in panels A and B) do not show particular amino acids such as valine and alanine was recognized as a amino acid patterns or conservation of specific amino acids in condition favoring O-glycosylation in many studies.4 “ ” multiple positions except for a great amount of P residues. To The alignment corresponding to Core-0 shows with the the contrary, alignments in panels C and D show the presence introduction of similar proteins the presence of constrained of charged (E, D, K, and R) and hydrophobic amino acids amino acids around glycosylated sites. In this case, charged fi fi “ ” in speci c positions. The glutamic acid (E) residue at +1 in residues at speci c positions (+12, + 15, and +16). Core-2 the alignment proposed in panel D and in −5 in the align- shows a remarkable conservation of these types of residues ment in panel B confirm previous observations stating a higher (Figure 6B) that were present also in the alignment without 4 fi frequency of this amino acid in these positions. similar proteins (Figure 6A). The both of them con rm hydro- With the current data collection, we have to point out that phobic amino acids in specific positions. for the majority of the alignments corresponding to specific Comparison between Broad and Fine Criteria of Glycan glycan features, the number of protein fragments is still very Selection limited. This could in part explain the finding of more amino Overall, the alignments corresponding to general glycan acids patterns and a bigger concentration of certain amino acids features (such as glycan structural properties, core structures, in specific positions in the alignments corresponding to finer large mass range, and composition) show less-common amino glycan features. Nonetheless, for example, we verified that the acid residues and patterns between the sequence fragments addition of homologous proteins confirmed the absence or the than those produced with specific features (UniCarbKB ID, presence of specific patterns corresponding to more-general determinants, narrow mass range, and composition). A pair of and more-specific features, respectively, in alignments shown in
3923 DOI: 10.1021/acs.jproteome.6b00481 J. Proteome Res. 2016, 15, 3916−3928 Journal of Proteome Research Technical Note
Figure 7. Comparison of N-linked sites alignments corresponding to general and specific glycan features. (A) Alignment of sites attached to “core- fucosylated” glycans (general feature). It shows the absence of amino acid patterns and the presence of just a blue band on the left side. The alignment was cut out; for the whole image, we refer the reader to the application site. (B) Alignment corresponding to the “Type 1 LN” glycan determinant. (C) Alignment of the “core-fucosylated” glycan with UniCarbKB ID 2553. These two specific alignments show a stronger presence of certain amino acids in specific positions (blue bands and blue and magenta spots) with respect to the alignment in A.
Figure 7 and in Figure 8. This situation arises in the majority thanks to a finer determination of the amino acid composition of the alignments. Furthermore, the aligned fragments usually and frequency around glycosylated sites. To underline this belong to a limited group of organisms (e.g., mammals), which aspect, similarly to the authors of GlycoEP,19 we represented reduces the coverage of the different life kingdoms or the amino acid frequency of some alignments using frequency groups inside the same kingdom (e.g., plants or reptiles) or weblogos. The amino acids residues have been colored accord- both. To extend this coverage and increase the reliability of the ing to their chemistry. Polar residues in green, neutral in purple, alignments, more glycoprotein fragments are necessary. That is basic in blue, acidic in red, and hydrophobic in black. why updating the data regarding glycans and glycosylation sites Figure 9 shows the weblogos of all of the N- and O-linked is essential. sites (general features) in panels A and D, respectively, of Contribution to the Improvement of Prediction N-sites attached to glycans with a mass included in the range of 3000−3300 Da in panel B, of N-sites linked to the In the previous section, through the use of Jalview images, a-Gal antigen in panel C, of O-sites attached to the glycan we highlighted how the alignments corresponding to specific determinant “sialyl T antigen” in panel E, and of O-sites glycan features correlate with specific amino acid residues occupied by the glycan with UniCarbKB ID 4 in panel F, all around glycosylation sites and reinforce patterns found in of which are intended as specific features. In the alignment of alignments corresponding to general features. In the long term, N- and O-linked sites, showed in panels A and D, there is a with a bigger amount of data, this could contribute to increas- great variety in the type of residues in each position, and the ing the accuracy and specificity of glycosylation site prediction difference in their frequency (height of letters) is scarce.
3924 DOI: 10.1021/acs.jproteome.6b00481 J. Proteome Res. 2016, 15, 3916−3928 Journal of Proteome Research Technical Note
Figure 8. Comparison of O-linked sites alignments corresponding to general and specific glycan features. (A) Alignment of sites attached to “Core-1” glycans (general feature). It shows the absence of amino acid patterns and the presence of just a blue band on the right side. (B) Alignment of sites attached to “sialylated” glycans (general feature). As in (A), there is no particular pattern except for an important presence of proline and a blue band on the right side of the glycosylated site. The alignments in (A) and (B) were cut out; for the whole images, we refer the reader to the application site. (C) Alignment corresponding to the glycan with UniCarbKB ID 650. It shows the charged and hydrophobic amino acids in specific positions. (D) Alignment of the protein fragments attached to the “Type 2 LN N-LacNAc” glycan. As in (C), there is a stronger presence of certain amino acids in specific positions (blue, magenta, and red bands) with respect to the alignment in (A) and (B).
The alignment in B shows instead a lesser variety of the type −2to−7(Figure 9E,F), or it is present at the positions −14, of residues and, in addition, a greater difference between the −17, −19, and −20 (Figure 9D,F). In all cases, GlycoSiteAlign frequency of the most common residues and the others than in helps distinguishing the various occurrences. We cannot the alignment of N-linked sites showed in A. This is even more rule out that specific patterns only reflect data biases. Possible visible for the alignment represented in panel C, where the variety patterns are only indicative. Again, the tool was designed to of amino acid residues is even smaller than in panel B. build potential explanations as opposed to producing established With respect to the alignments in panels B and C, those results. represented in panels E and F show a bigger difference in the frequencies between the different amino acids and a lesser ■ CONCLUSIONS variation in the amino acid residue types. As stated before, this The qualitative analysis of the results discussed in the previous is due to the smaller number of aligned fragments for O-linked section, showm as the comparison of amino acid regions sites. Nonetheless, in several positions, the most-frequent carrying similar glycan structures, can single out individual or amino acid is the charged glutamic acid (E). In particular, patterns of amino acids otherwise difficult to highlight in wider it mostly precedes the glycosylated site at the positions from alignments. Indeed, the more specific the aligned glycan feature,
3925 DOI: 10.1021/acs.jproteome.6b00481 J. Proteome Res. 2016, 15, 3916−3928 Journal of Proteome Research Technical Note
Figure 9. Comparison of frequency weblogos between general and specific glycan features. A and D: alignments of N- and O-linked sites, respectively (general features). It shows the high variation of the type of amino acid residues around glycosylation sites. (B) Alignment of N-sites attached to glycans with a mass included between 3000 and 3300 Da. (C) Alignment corresponding to the glycan determinant “a-Gal antigen” (N-linked). (E) Alignment corresponding to the glycan determinant “sialyl T antigen” (O-linked). (F) Alignment of sites attached to glycans with UniCarbKB Id 4 (GalGalNAc, O-linked). All of these are specific features. With respect to (A), in (B) and (C) there is less variation in the type of amino acids in each position along the sequence fragments, especially in the alignment shown in (C). Moreover, the difference in the height of letters (amino acid frequency) is more pronounced. The same can be observed in (E) and (F) with respect to (D). the more likely it is to find these patterns or specific residues features, respectively. This could indicate a real tendency of (or both) around glycosylation sites. Moreover, the introduc- glycan structures with distinct features to have a specificaffinity tion of 90% homologous proteins is helpful in displaying for particular amino acid sequences. To clarify this point and new motifs, confirming patterns displayed in the absence of confirm that structure-based alignments could improve pattern these proteins, or both. Some of the patterns highlighted by recognition and, in turn, prediction methods, further align- GlycoSiteAlign (for example, EGF-like repeats, PTS rich- ments with both a larger number of glycoprotein fragments regions, and some hydrophobic amino acids such as valine and and a finer subdivision of the glycan structures are needed. alanine) are already well-known to contain or favor O-glycosylation GlycoSiteAlign will then be regularly updated with new data (or both).4,12,35 In contrast, other patterns revealed by coming from glycoproteomic research, a field becoming GlycoSiteAlign and found close to glycosylation sites, such as increasingly productive. The next data set in preparation that cysteine and charged residues, were considered as having a will soon be integrated relates to the N-glycans of immuno- negative influence on this type of glycosylation (more precisely, globulins. More generally, GlycoSiteAlign is tuned on the evolu- cysteine at the positions of +1 and +3 and glutamic acid at the tion of UniCarbKB content. However, the option of including position of −6).4 This observation emphasizes the potential for user-input experimentally determined glycan structures and GlycoSiteAlign to recognize amino acid patterns and residues glycosylation sites in the alignment produced by GlycoSiteAlign usually “diluted” or masked in alignments that take into con- is yet another means of compensating for the current lack sideration only the glycosylation type (N-, O-, and C-linked). of data and potentially accounting for data not stored in The frequency weblogos we used to represent the alignments UniCarbKB. of all N- and O- sites and of some specific glycan features GlycoSiteAlign is also limited by the bias in experimental confirmed this “dilution” effect and accentuated the ability of data that are not necessarily representative of the entire pool finer alignments to emphasize patterns. of known glycosylation sites and glycoproteins. There are, for We acknowledge that our results are in part explained by instance, more data of full structures on N-sites than on O-sites the fact that the more specific the glycan feature, the fewer simply because the former is simpler to generate than the latter. glycoprotein fragments are aligned. Nonetheless, the introduc- Moreover, glycosylation sites extracted from 90% similar proteins tion of 90% homologous proteins often confirms the absence are inferred following sequence alignment and only rely on the or the presence of patterns in more-general and more-specific conservation of a few residues.
3926 DOI: 10.1021/acs.jproteome.6b00481 J. Proteome Res. 2016, 15, 3916−3928 Journal of Proteome Research Technical Note
To improve the usefulness of GlycoSiteAlign as an (7) Thaysen-Andersen, M.; Larsen, M. R.; Packer, N. H.; Palmisano, exploratory tool and possibly a future prediction tool, in the G. Structural analysis of glycoprotein sialylation-Part I: pre-LC-MS next release, we are planning to include 2D and 3D structural analytical strategies. RSC Adv. 2013, 3, 22683−22705. constraints to define glycosylation sites. Indeed, as described (8) Doucey, M.-A.; Hess, D.; Blommers, M. J.; Hofsteenge, J. already in several studies, it seems that the primary structure Recombinant human interleukin-12 is the second example of a − of the glycosylation sites is necessary but not sufficient to C-mannosylated protein. Glycobiology 1999, 9, 435 441. determine if a site will be glycosylated.36,37 With the introduc- (9) Hofsteenge, J.; Blommers, M.; Hess, D.; Furmanek, A.; Miroshnichenko, O. The Four Terminal Components of the tion of 3D structure in GlycoSiteAlign, we aim to produce Complement System AreC-Mannosylated on Multiple Tryptophan more-reliable alignments and to deepen the knowledge of the Residues. J. Biol. Chem. 1999, 274, 32786−32794. relationship between specific glycan structures and glycosyla- fi (10) Furmanek, A.; Hofsteenge, J. Protein C-mannosylation: facts tion site features. Such re nement could also be achieved by and questions. Acta Biochim. Pol. 1999, 47, 781−789. accounting for information on site heterogeneity, but quantitative (11) Croset, A.; Delafosse, L.; Gaudry, J.-P.; Arod, C.; Glez, L.; data remain limited in present glycoproteomics studies. Losberger, C.; Begue, D.; Krstanovic, A.; Robert, F.; Vilbois, F.; Finally, GlycoSiteAlign can serve MS data analysis. For Chevalet, L.; Antonsson, B. Differences in the glycosylation of example, the search by glycan composition and mass can be recombinant proteins expressed in HEK and CHO cells. J. Biotechnol. used to quickly investigate what kind of glycoprotein carries 2012, 161, 336−348. a certain glycan structure with experimental masses. The (12) Steentoft, C.; et al. Precision mapping of the human O-GalNAc search by mass can also be useful for assessing if a protein is glycoproteome through SimpleCell technology. EMBO J. 2013, 32, glycosylated in the case of a discrepancy between a predicted 1478−1488. and an experimentally determined mass. (13) Zhang, H.; et al. UniPep-a database for human N-linked In conclusion, GlycoSiteAlign shows potential as a tool for glycosites: a resource for biomarker discovery. Genome biology 2006, 7, assessing the constraints conditioning the relationship between R73. glycan structures and glycoprotein sites. Using the tool may (14) NetOGlyc 4.0 Server. http://www.cbs.dtu.dk/services/ help in the understanding of glycoproteins and glycan func- NetOGlyc/ (accessed Mar 22, 2016). (15) Hansen, J. E.; Lund, O.; Tolstrup, N.; Gooley, A. A.; Williams, tions and their physiological roles, pathological roles, or fi K. L.; Brunak, S. NetOglyc: prediction of mucin type O-glycosylation both, therefore, as a contribution to the eld of functional sites based on sequence context and surface accessibility. Glyco- glycoproteomics. conjugate J. 1998, 15, 115−130. (16) Gupta, R.; Brunak, S. Prediction of glycosylation across the ■ AUTHOR INFORMATION human proteome and the correlation to protein function. Pac. Symp. Corresponding Authors Biocomput. 2002, 310−322. (17) NetNGlyc 1.0 Server. http://www.cbs.dtu.dk/services/ *E-mail: [email protected]; phone: +41 (0)22 NetNGlyc/ (accessed Mar 22, 2016). 379 01 59. * (18) Julenius, K.; Mølgaard, A.; Gupta, R.; Brunak, S. Prediction, E-mail: [email protected]; phone: +41 (0)22 379 conservation analysis, and structural characterization of mammalian 58 98. mucin-type O-glycosylation sites. Glycobiology 2004, 15, 153−164. Notes (19) Chauhan, J. S.; Rao, A.; Raghava, G. P. In silico platform for The authors declare no competing financial interest. prediction of N-, O-and C-glycosites in eukaryotic protein sequences. PLoS One 2013, 8, e67008. ■ ACKNOWLEDGMENTS (20) Li, F.; Li, C.; Wang, M.; Webb, G. I.; Zhang, Y.; Whisstock, J. C.; Song, J. GlycoMine: a machine learning-based approach for We thank our UniCarbKB collaborators Nicki Packer for predicting N-, C-and O-linked glycosylation in the human proteome. fruitful discussions and Matthew Campbell for his help on using Bioinformatics 2015, 31, 1411. the data. This work is supported by the Swiss Federal (21) Campbell, M. P.; Peterson, R.; Mariethoz, J.; Gasteiger, E.; Government through the State Secretariat for Education, Akune, Y.; Aoki-Kinoshita, K. F.; Lisacek, F.; Packer, N. H. Research and Innovation (SERI). ExPASy is maintained by UniCarbKB: building a knowledge platform for glycoproteomics. the web team of the Swiss Institute of Bioinformatics and Nucleic Acids Res. 2014, 42, D215. hosted at the Vital-IT Competency Center. (22) Cooper, C. A.; Joshi, H. J.; Harrison, M. J.; Wilkins, M. R.; Packer, N. H. GlycoSuiteDB: a curated relational database of ■ REFERENCES glycoprotein glycan structures and their biological sources. 2003 update. Nucleic Acids Res. 2003, 31, 511−513. (1) Taylor, C. M.; Karunaratne, C. V.; Xie, N. Glycosides of (23) Go, E. P.; Irungu, J.; Zhang, Y.; Dalpathado, D. S.; Liao, H.-X.; hydroxyproline: some recent, unusual discoveries. Glycobiology 2012, Sutherland, L. L.; Alam, S. M.; Haynes, B. F.; Desaire, H. Glycosylation 22, 757−767. Site-Specific Analysis of HIV Envelope Proteins (JR-FL and CON-S) (2) Trinidad, J. C.; Schoepfer, R.; Burlingame, A. L.; Medzihradszky, Reveals Major Differences in Glycosylation Site Occupancy, Glyco- K. F. N-and O-glycosylation in the murine synaptosome. Mol. Cell. − form Profiles, and Antigenic Epitopes Accessibility. J. Proteome Res. Proteomics 2013, 12, 3474 3488. − (3) Wilson, I.; Gavel, Y.; Von Heijne, G. Amino acid distributions 2008, 7, 1660 1674. around O-linked glycosylation sites. Biochem. J. 1991, 275, 529−534. (24) Harazono, A.; Kawasaki, N.; Itoh, S.; Hashii, N.; Ishii-Watabe, (4) Hema Thanka Christlet, T.; Veluraja, K. Database analysis of A.; Kawanishi, T.; Hayakawa, T. Site-specific N-glycosylation analysis O-glycosylation sites in proteins. Biophys. J. 2001, 80, 952−960. of human plasma ceruloplasmin using liquid chromatography with (5) O’Connell, B.; Tabak, L. A.; Ramasubbu, N. The influence of electrospray ionization tandem mass spectrometry. Anal. Biochem. flanking sequences on O-glycosylation. Biochem. Biophys. Res. Commun. 2006, 348, 259−268. 1991, 180, 1024−1030. (25) Thaysen-Andersen, M.; Packer, N. H. Site-specific glycoproteo- (6) Nehrke, K.; Ten Hagen, K. G.; Hagen, F. K.; Tabak, L. A. Charge mics confirms that protein structure dictates formation of N-glycan distribution of flanking amino acids inhibits O-glycosylation of several type, core fucosylation and branching. Glycobiology 2012, 22, 1440− single-site acceptors in vivo. Glycobiology 1997, 7, 1053−1060. 1452.
3927 DOI: 10.1021/acs.jproteome.6b00481 J. Proteome Res. 2016, 15, 3916−3928 Journal of Proteome Research Technical Note
(26) Li, W.; Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22, 1658−1659. (27) EBI. EMBL-EBI Web Services. http://www.ebi.ac.uk/Tools/ webservices/services/msa/clustalo_rest (accessed Mar 21, 2016). (28) Sievers, F.; Wilm, A.; Dineen, D.; Gibson, T. J.; Karplus, K.; Li, W.; Lopez, R.; McWilliam, H.; Remmert, M.; Söding, J.; Thompson, J. D.; Higgins, D. G. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 2011, 7.53910.1038/msb.2011.75 (29) Papadopoulos, J. S.; Agarwala, R. COBALT: constraint-based alignment tool for multiple protein sequences. Bioinformatics 2007, 23, 1073−1079. (30) Waterhouse, A. M.; Procter, J. B.; Martin, D. M.; Clamp, M.; Barton, G. J. Jalview Version 2a multiple sequence alignment editor and analysis workbench. Bioinformatics 2009, 25, 1189−1191. (31) Jalview. Jalview Clustal X Colour Scheme. http://www.jalview. org/help/html/colourSchemes/clustal.html (accessed Mar 3, 2016). (32) Ronacher, A. Flask: web development, one drop at a time. http://flask.pocoo.org/ (accessed Mar 3, 2016). (33) Davis, C. G. The many faces of epidermal growth factor repeats. New Biol. 1990, 2, 410−419. (34) Lawler, J.; Hynes, R. O. The structure of human thrombospondin, an adhesive glycoprotein with multiple calcium- binding sites and homologies with several different proteins. J. Cell Biol. 1986, 103, 1635−1648. (35) Shao, L.; Luo, Y.; Moloney, D. J.; Haltiwanger, R. S. O- glycosylation of EGF repeats: identification and initial characterization of a UDP-glucose: protein O-glucosyltransferase. Glycobiology 2002, 12, 763−770. (36) Zielinska, D. F.; Gnad, F.; Wisniewski,́ J. R.; Mann, M. Precision mapping of an in vivo N-glycoproteome reveals rigid topological and sequence constraints. Cell 2010, 141, 897−907. (37) Lam, P. V. N.; Goldman, R.; Karagiannis, K.; Narsule, T.; Simonyan, V.; Soika, V.; Mazumder, R. Structure-based comparative analysis and prediction of N-linked glycosylation sites in evolutionarily distant eukaryotes. Genomics, Proteomics Bioinf. 2013, 11,96−104.
3928 DOI: 10.1021/acs.jproteome.6b00481 J. Proteome Res. 2016, 15, 3916−3928 3.2. Concluding remarks 93
3.2 Concluding remarks
Integrative tools are essential to bridge two aspects of glycomics. In this chapter, we have introduced GlyS3 and EpitopeXtractor which are connecting glycan structures with ligands recognised by GBPs. Using GlyS3 we can define which glycan struc- tures can interact with a specific GBP. On the contrary, EpitopeXtractor is capable of listing all the GBPs that could potentially interact with a specific glycan.
95
Chapter 4
Explorative tools
4.1 Overview
In this chapter, we introduce all explorative tools which have been developed dur- ing the thesis. These software applications help scientists in their daily data analysis bringing out new hypotheses and highlighting novel results. In particular, we focus on PepSweetener, Glynsight and Glydin’. The first helps researchers with the man- ual annotation of glycopeptides and has been developed with Marcin Domagalski in collaboration with Daniel Kolarich’s group of glycomics. My role was to design the data storage for the peptide search space and significantly contribute in the imple- mentation of the web interface. The second, Glynsight, proposes a standard way to visualise and compare glycan profiles. It is associated with GlyConnect, which will be presented in the next chapter, to suggest putative structures which match specific glycan compositions. It is also bridged with Glydin’ via EpitopeXtractor, which has been introduced in the previous chapter. The third and last tool, Glydin’, shows a curated list of 600 glycan epitopes in a network representation. This software ap- plication has been developed by a master student, Marie Ghraichy, and enhanced during this PhD. Each ligand is connected to other once based on the concept of sub- structures. To build the network we have used GlyS3 which has been introduced in the integrative tool chapter.
To give an in-depth description of these three tools, this chapter contains two pa- pers: "Understanding the glycome: an interactive view of glycosylation from glyco- compositions to glycoepitopes" and "PepSweetener: A Web-Based Tool to Support 96 Chapter 4. Explorative tools
Manual Annotation of Intact Glycopeptide MS Spectra". The first details Glynsight and Glydin’ showing how to move from one to the other using EpitopeXtractor. It also suggests the importance of this collection of tools by performing the analysis of a dataset published by (Anugraham et al., 2014). The second describes the imple- mentation of PepSweetener and proposes a collection of use cases. Glycobiology, 2018, vol. 28, no. 6, 349–362 doi: 10.1093/glycob/cwy019 Advance Access Publication Date: 6 March 2018 Glyco-Informatics
Glyco-Informatics Understanding the glycome: an interactive view of glycosylation from glycocompositions to glycoepitopes Davide Alocci2,3, Marie Ghraichy2,4, Elena Barletta2,3, Alessandra Gastaldello2,3, Julien Mariethoz2,3, and Frederique Lisacek2,3,5,1
2Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, 7 Route de Drize, 1227 Geneva, Switzerland, 3Computer Science Department CUI, University of Geneva, 7 Route de Drize, 1227 Geneva, Switzerland, 4Division of Immunology, University Children’s Hospital Zurich, Zurich, Switzerland, and 5Section of Biology, University of Geneva, Geneva, Switzerland
1To whom correspondence should be addressed: Tel: +41-22-379-0195, Fax: +41-22-379-0250; e-mail: [email protected]
Received 25 January 2018; Revised 16 February 2018; Editorial decision 28 February 2018; Accepted 2 March 2018
Abstract Nowadays, due to the advance of experimental techniques in glycomics, large collections of gly- can profiles are regularly published. The rapid growth of available glycan data accentuates the lack of innovative tools for visualizing and exploring large amount of information. Scientists resort to using general-purpose spreadsheet applications to create ad hoc data visualization. Thus, results end up being encoded in publication images and text, while valuable curated data is stored in files as supplementary information. To tackle this problem, we have built an interactive pipeline com- posed with three tools: Glynsight, EpitopeXtractor and Glydin’. Glycan profile data can be imported in Glynsight, which generates a custom interactive glycan profile. Several profiles can be compared and glycan composition is integrated with structural data stored in databases. Glycan structures of interest can then be sent to EpitopeXtractor to perform a glycoepitope extraction. EpitopeXtractor results can be superimposed on the Glydin’ glycoepitope network. The network visualization allows fast detection of clusters of glycoepitopes and discovery of potential new targets. Each of these tools is standalone or can be used in conjunction with the others, depending on the data and the specific interest of the user. All the tools composing this pipeline are part of the Glycomics@ExPASy initia- tive and are available at https://www.expasy.org/glycomics.
Key words: glycan profile, glycoepitope/web tool, glycoinformatics, visualization
Introduction et al. 2011). In fact, glycomics is starting to play an important role In the last decade, high-throughput experimental techniques for gly- in biomarker discovery (Adamczyk et al. 2012; Ruhaak et al. 2013; can identification and quantification have been improved (Ruhaak Taniguchi and Kizuka 2015) especially related to cancer. Aberrant et al. 2009; Ruhaak et al. 2010; Kemna et al. 2017; Planinc et al. glycosylation patterns have been observed in several types of cancer 2017; Adamczyk et al. 2018). Furthermore, several studies have cells (Jin et al. 2017), in diseases like diabetes (Keser et al. 2017)or been published with large collections of samples generating consid- in processes like inflammation (Schnaar 2016). Changes in glycan erable amount of glycome profiling data (Knezevic´ et al. 2009; Pucic´ structural properties take place not only in reaction to a pathological
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected] 349
Downloaded from https://academic.oup.com/glycob/article-abstract/28/6/349/4923253 by guest on 29 July 2018 350 D Alocci et al.
state but also according to aging, body index mass and other fac- collection of curated glycoepitopes. EpitopeXtractor analysis results can tors like smoking (Knezevic et al. 2010; Krištic´ et al. 2014; Gudelj be sent to Glydin’ to be mapped on the network. Glynsight, et al. 2015). Each of these high-throughput studies has generated a EpitopeXtractor and Glydin’ are web-based tools developed by the large quantity of the so-called glycan profiles definedbyalistof Proteome Informatics Group at SIB Swiss Institute of Bioinformatics. The glycan compositions associated with their quantification in a spe- first two have been developed using web components in synergy with D3. cific sample. js where Glydin’ is build with only HTML, CSS and D3.js. All three tools Although glycoinformaticians have created software solutions to are part of the collection hosted by Glycomics@Expasy (www.expasy. support some aspects of glycan identification and quantification org/glycomics) (manuscript in preparation). (Ceroni et al. 2008; Jansen et al. 2015, 2016), most of the analyses We validated this approach while processing data found in sup- are still carried out manually. This is both time-consuming and labor- plementary material of Anugraham et al. (2014). In this study, the intensive. The lack of tools for comparing and visualizing glycan pro- glycan profiles of healthy vs. ovarian cancer cells were generated, files confines scientists to using general purpose spreadsheet applica- analyzed and compared. We show that most results presented in the tions to create ad hoc visualizations. Thus, results end up being article could be confirmed by viewing and inspecting the profiles dis- encoded in publication images and text, while valuable curated data played in Glynsight and suggest the use of this tool as a substantial are stored in tables and published as Supplementary material. In this time saver in extracting relevant information. We also illustrate how scenario, GlycoViewer (Joshi et al. 2010) was the first most innovative our pipeline leads to narrow down explanations on the role of dif- and pioneering visualization to summarize and compare sets of glycan ferentially expressed glycans in the comparative mode. Finally and structures. Although it did not account for the quantification data at in the same pipeline, we match the results regarding glycan determi- the time, this software could summarize up to hundreds of structures nants and provide further contextual information for characterizing in a single figure allowing the generation of high-level views of glycans potential ligands. from one protein, or a cell, or a tissue or a whole organism. Then, GlycomeAtlas (Konishi and Aoki-Kinoshita 2012) can be considered as the first successful visualization tool for quantitative glycomic data. Results fi This software has been speci cally designed for exploring glycan pro- We used the published dataset (Anugraham et al. 2014)toillustrate fi les produced by Consortium for Functional Glycomics (CFG) (Raman and validate the use of the pipeline. This study on epithelial ovarian et al. 2005, 2006; Ismail et al. 2011; North et al. 2012). Although cancer (EOC) compared the membrane glycosylation of two ovarian GlycomeAtlas can be used to easily navigate in CFG data, it cannot be noncancer epithelial with four cancerous cell lines. Data were extrapo- extended nor run with other datasets. GlycoPattern (Agravat et al. lated from a study published by the American Society for Biochemistry 2014), together with GlycanMotifMiner (Cholleti et al. 2012), represent and Molecular Biology, which highlights the potential for specific another effort to visualize and mine CFG data. The implementation structural and isomeric changes in N-glycans in detecting and poten- of dendrograms and heatmaps supports the detection of similar tially treating ovarian malignant tumors. We kept the headings of the motifs in glycan binding proteins (GBP) and the comparison of sev- paper to emphasize the similarity between the original results and those eral glycan array experiments. Both software have been designed and output by our pipeline. developed within CFG to define a common way of storing, visualiz- ing and analyzing data for the consortium. To tackle the scarcity of visualization tools that do not depend on Relative quantitation of N-glycans of membrane the origin of the dataset, and to speed up the analysis of glycan pro- proteins from noncancer and cancer cell lines files and link the latter to functional information, we introduce a ser- The profiles created in Glynsight show the glycan compositions in nine ies of three tools that can be run independently as well as chained to groups according to the incrementally ordered content in monosacchar- one another to promptly visualize and link information at different ides (see Material and methods). In the individual mode, each experi- levels of granularity and highlight functional properties of glycans. ment was displayed and every profile showed that glycans whose Glynsight is the first tool of the pipeline. It has been designed to visu- compositions contain 9–12 monosaccharides are more highly expressed alize and compare glycan profiles and it processes glycan composi- than the others (Figure 1A). In the differential mode, we systematically tions. Glynsight is using an innovative visualization to bring out compared the cancer and noncancer cell lines. The greatest quantitative hidden patterns inside or among glycan profiles. This web-based tool difference in glycan compositions is found between HOSE 17.1 (non- is provided with an upload function for spreadsheets and allows the cancer) and SKOV3 (cancer) (Figure 1A). In qualitative terms the indi- user to export each profile as an image or the whole collection as a vidual compositions of IGROV1 are the most distinct compared to the JSON file. Glynsight is also connected to GlyConnect (https://glyconnect. others. This is potentially reflecting its high carcinogenic properties pre- expasy.org, manuscript in preparation), our glycomics dashboard, to viously found in nude mice (Bénard et al. 1985; Kurbacher et al. 2011; suggest potential glycan structures related to specific compositions. Buehler et al. 2013)(Figure1B). Note that 33 out of 55 individual gly- Secondly, EpitopeXtractor is a unique tool for extracting glycan determi- can compositions are present in all cell lines. Of the remaining 20, 17 nants from a list of glycan structures. It is based on our substructure were found exclusively in cancer cell lines (red—in IGROV1 are all search engine, GlyS3 (Alocci et al. 2015),andacuratedcollectionofgly- expressed), and three are found only in both noncancer cell lines can determinants (Kawasaki, Nakao, and Tominaga 2008; Cummings (H4N4F2, H5N4F2 and H5N4F3—blue). 2009; Pérez et al. 2015; Mariethoz et al. 2016). Potential glycan struc- In order to determine the possible quantitative differences between tures highlighted in Glynsight can be forwarded to EpitopeXtractor to the glycans, in the membrane proteins of the cancer and noncancer cell extract glycan determinants. The last tool of the pipeline is Glydin’, a gly- lines, Anugraham and coauthors statistically analyzed the N-glycans coepitope network viewer created for exploring and visualizing the rela- common to all the cell lines and divided them into four classes: high tionship between glycan determinants based on shared monosaccharide mannose, hybrid, complex (neutral and sialylated) and core fucosy- composition or the glycosyltransferase needed to get one from the other. lated (Anugraham et al. 2014). Such division is not necessary using Glydin’, like EpitopeXtractor, is based on GlyS3 and relies on the same Glynsight since the differential display immediately shows that the
Downloaded from https://academic.oup.com/glycob/article-abstract/28/6/349/4923253 by guest on 29 July 2018 Interactive view of glycosylation 351
Fig. 1. Differential display of glycan compositions between HOSE 17.1 (noncancer cell line) and (A) SKOV3 (cancer cell line) / (B) IGROV1 (cancer cell line). Compositions are displayed in columns and in each column the overall number of monosaccharides is constant. The height of a bar represents the difference between the expression of a specific composition in HOSE 17.1 with that of the same composition in (A) SKOV3 and (B) IGROV1. Brighter blue (respectively, red) bars highlight differences arising from compositions unique to HOSE 17.1 (respectively, (A) SKOV3 and (B) IGROV1) while pale blue (respectively, red) bars show differences arising from compositions expressed in all experiments but higher (respectively, lower) in HOSE 17.1 than in (A) SKOV3 and (B) IGROV1. For example, H5N4F2 is over-expressed only in HOSE 17.1 (and 0 in (A) SKOV3 and (B) IGROV1) while H5N4F1 is over-expressed in HOSE 17.1 in contrast with its nonzero but lesser expression in (A) SKOV3 and (B) IGROV1.
compositions with a high mannose content (compositions with only monomers. To begin with, the relative intensities of fucosylated compo- Hex (H) and HexNAc (N), where Hex > 4andHexNAc= 2) appear sitions, as reported in (Anugraham et al. 2014), were multiplied by their highly expressed in cancer cells lines compared to noncancer ones; the respective quantity of Fuc. At this point, in order to obtain the relative compositions with more than four HexNAc (N-acetylglucosamine and total intensity for each cell line, the intensities previously calculated on N-acetylgalactosamine) are less represented in the nontumor cell lines each composition were summed, and these sums were divided by the than the tumor. total amount of monosaccharides present in all fucosylated composi- The compositions involving fucosylation and sialylation were ana- tions. Six values were obtained, one for each cell line (Table SI). Then, lyzed according to the method introduced by Hayes and coauthors for the same procedure was applied to the sialylated compositions the conversion of raw data (Hayes et al. 2012). For each composition (Table SII). In the end, the expression of fucosylated compositions we therefore considered not only the type of monosaccharide but also shows more variation in the cancer cell lines, mostly in IGROV1, but is the corresponding number of Fuc and N-acetylneuraminic acid (NeuAc) higher in the two noncancer cell lines. In contrast, the expression of
Downloaded from https://academic.oup.com/glycob/article-abstract/28/6/349/4923253 by guest on 29 July 2018 352 D Alocci et al.
sialylated compositions was not distinguishable but rather equally dis- H6N5F1 is more likely to stem from H6N5, identified as nonbisected tributed in either cancer/noncancer cases. and represented in pale blue, than from H5N5F1. As previously sta- ted fucosylation, appears to be a prerogative of noncancer cells. In contrast, H5N5F1 is more likely to stem from the possible bisection Bisecting GlcNAc of H5N4F1, than from the fucosylation of H5N5, through the add- Bisecting GlcNAc is added to a glycan structure through a reaction ition of an N-acetylhexosamine (Figure 2), which is line with (Miwa catalyzed by a specific enzyme encoded by the MGAT3 gene. As et al. 2012), who observed that “the presence of a bisecting GlcNAc demonstrated by qRT-PCR analysis, the lower transcript abundance on a biantennary N-glycan terminating in GlcNAc prevents the subse- of MGAT3 observed for mouse and human normal ovary contrasts quent action of the core fucosyltransferase FUT8”. with the large increase in MGAT3 transcripts observed in all cases We proceeded with the analysis of other compositions. Overall, of epithelial endometrioid carcinoma analyzed in both mouse and only nine were identified in the database as bisected. These composi- human (Abbott et al. 2008). This means that this specific glycan syn- tions were only expressed in the cancer cell lines except for H3N5F1 thesis reaction will be under-expressed in the noncancer cell lines that is weakly expressed in noncancer cell lines. The other 16 com- compared to EOC cell lines. Overall, 25 compositions (HexNAc > positions match nonbisecting structures and are mostly expressed in 4), with a hypothetical bisecting GlcNAc were considered in our cancer cell lines. Individual analysis of glycan compositions that are analysis (File S3A). Most of the glycan compositions with an odd differentially expressed in cancer cell lines confirmed the association number of HexNAc appeared, as expected, in the EOC cell lines. of nonbisecting GlcNAc with the disease. In particular, the most For example, H5N5F1 present only in the cancer lines, reaches a over-expressed compositions in cancer are exclusive to IGROV1 very high expression in OVCAR3 (5.93) and H5N5F2 is exclusively and SKOV3 cell lines. present in IGROV1. As a consequence, almost all the compositions containing five HexNAc are either more abundant or exclusively represented in cancer cells (Figure 2). In contrast, H6N5F1 showed LacdiNAc-type N-glycans (glycoepitopes) a differential range value of 2.26 to −2.52 towards the nontumor To evaluate our full pipeline, we considered the profiles of HOSE HOSE 17.1 cell line when compared with the cancer ones (the range 17.1 and SKOV3, since they show the most distinctive quantitative for HOSE 6.3 was still significant, greater than 1). trends. Starting from Glynsight in differential mode (Figure 1A), the Such differential expression led us to consider the possible glycan differentially expressed compositions in each profile were divided in structures known to match these specific compositions (H6N5F1). two groups: cancer (bright and pale red bars) and noncancer (bright This information is recorded in the GlyConnect database (https:// and pale blue bars). Using the connection with Glyconnect, we glyconnect.expasy.org) (manuscript in preparation). None of the enriched the information from compositions to glycan structures, reported 24 structures, searched in GlyConnect in “ALL” tissues, dis- selecting all the N-glycans that are expressed in Homo sapiens and played a bisecting GlcNAC, which is rarely found in EOC cells. The match a selected composition in one of the two groups. No restric- same appears when “Urogenital System” is selected as specific tissue, tion was imposed on tissue. In the end, the cancer-related group con- narrowing down the structures to eleven (File S3B). Consequently, tains 163 structures whereas the noncancer group contains 245.
Fig. 2. Influence of HexNAc number variation shown in the differential display between HOSE 17.1 (noncancer cell line) and SKOV3 (cancer cell line). The differ- ential display shown in Figure 1 is reproduced here and complemented with orange (chosen for the readability of the figure but modifiable as explained in Material and methods) connectors between different bars to single out the addition of HexNAc (N). User-selected brighter connections emphasize the addition of a HexNAc to compositions with four HexNAc (N4) and highlight the potential for a bisecting GlcNAc. A change in the bar color from pale blue (over-expressed in noncancer cell lines) to bright red (only expressed in cancer cell lines) can also be observed.
Downloaded from https://academic.oup.com/glycob/article-abstract/28/6/349/4923253 by guest on 29 July 2018 Interactive view of glycosylation 353
Fig. 3. Mapping of 19 cancer-related glycoepitopes on the Glydin’ network. Glycoepitopes related to cancer composition (bright and pale red bars) from the dif- ferential display of SKOV3 and HOSE 17.1 (Figure 1A), have been extracted using EpitopeXtractor. The 19 glycoepitopes unique to cancer are mapped on the Glydin’ network and appear as red nodes. They tend to cluster together. The three most populated clusters are shown: (A) this cluster contains Sialyl Tn Antigen and Sialyl LacdNAc, (B) Blood group A like cluster and (C) High-mannose cluster.
Subsequently, the glycan determinants present in the two groups for the cancer group and to {H4N4F2, H5N4F2, H5N4F3} for the of glycans were submitted to EpitopeXtractor. The cancer structures noncancer group. Following the same steps as described above led to matched 80 glycoepitopes and the noncancer, 82. These data can be single out the same glycoepitope subnetworks. However, comparing directly sent to Glydin’, the Glycoepitope network. Due to the high Figure 3AwithFigure6 shows that LacdiNAc (LDN in the figure) is amount of glycoepitopes extracted, an accurate visual comparison is only present in the second analysis (Figure 6) whereas both contain a challenge, both in EpitopeXtractor and in Glydin’. To ease this the sialylated LacdiNAc. comparison, only the unique glycoepitopes distinctly identified in The presence of LacdiNAc in cancer-related glycans is brought out each group were considered, in an attempt to detect exclusive char- in these results and matches the observations made by Anugraham and acteristics. This led to keep 19 and 20 glycan determinants, respect- coauthors from the structures identified by mass spectrometry. ively, in the cancer and noncancer sets. However, our pipeline also selects blood group antigens and Lewis The 19 glycan determinants only found in the cancer set, are singled related glycoepitopes as potential distinctive properties of respectively out as red nodes in the Glydin’ network. Inspecting the subnetworks cancer and noncancer-related profiles. We noted that other studies on surrounding these nodes of interest, three main groups of glycan deter- ovarian cancer (Christiansen et al. 2014) have reported the presence of minants emerged: (i) blood group A related determinants, (ii) Sialyl Tn Sialyl Tn antigen in association with cancer cells. This determinant is antigen with Sialylated LacdiNAc and (iii) mannose related glycoepi- also distinguished as related to cancer in our analysis. topes (Figure 3). Similarly in the noncancer case, Lewis related glycoepi- All the steps of the glycoepitope extraction analysis, including topes, in particular X, C and A-like, blood group B and H antigen the IDs of GlyConnect structure and Glydin’ glycoepitope, are avail- appear as distinctive traits of noncancer glycan structures (Figure 4). able in the Supplementary data (File S3D). This file also includes the The above shows results obtained from compositions present in links for exploring the cancer and noncancer results in Glydin’. either of the two groups defined from Glynsight as cancer (red bars) and noncancer (blue bars). A more focused analysis of HOSE 17.1 and SKOV3 glyco-profiles was performed using only the compositions Gene expression of specific glycosyltransferases in exclusively expressed either in cancer or noncancer cell lines, i.e. those ovarian cancer cell lines displayed in bright red and bright blue bars in Glynsight. As shown in Anugraham and coauthors have also reported on differential enzym- Figure 1A, this limits the compositions to {H4N5, H5N5, H4N5F1, atic regulation of the glycosylation pathway in each cancer and non- H3N6F1, H5N5F1, H4N5F1S1, H4N5F1S2, H5N5F1S1, H6N6F1} cancer cell lines, via transcriptomics experiments. As mentioned in
Downloaded from https://academic.oup.com/glycob/article-abstract/28/6/349/4923253 by guest on 29 July 2018 354 D Alocci et al.
Fig. 4. Mapping of 20 noncancer-related glycoepitopes on the Glydin’ network. Glycoepitopes were generated from noncancer-related glycan compositions shown as bright and pale blue bars in the Glynsight differential display of SKOV3 and HOSE 17.1 (Figure 1A) and have been extracted using EpitopeXtractor. The 20 glycoepitopes unique to noncancer are mapped on the Glydin’ network and appear as red nodes. They tend to cluster together. The four most populated clusters are shown: (A) and (C) Lewis X/A clusters (B) Blood group H cluster and (D) Blood group B cluster.
Discussion, glycan compositions were searched in GlyConnect in also found, representing sialyl Le(x) elements in addition to conven- order to match all human N-linked structures stored in the database, tional N-glycans established for human serum transferrin (hST) (van irrespective of the tissue information. In this section, we filtered the Rooijen et al. 1998), and also in a tri-sialyl oligosaccharide isolated human N-linked structure set with the “Urogenital System” require- from human serum (Nakagawa et al. 1995). ment. This smaller selection in Glynsight individual profiles of The (α1–6) fucosylation linkage is related to the FUT8 gene SKOV3 and OVCAR3 is exclusively composed of structures with expression. The (α−1,6)-fucosylated structures are qualitatively bisecting GlcNAc. associated with the highly tumorigenic cancer line IGROV1 and to This observation matches the statistically significant differential cancer in general, while in noncancer essentially two compositions expression of the MGAT3 gene between the cancer and noncancer (H5N4F1 and H5N4F1S1) are the ones that dimmed and shifted cell lines reported by Anugraham and coauthors. Furthermore, we our conclusions towards the noncancer cell lines because of their have observed that all fucosylated glycan compositions are highly extremely high values. expressed in noncancer cell lines and that the core and non-core Bi-fucosylated and tri-fucosylated compositions, i.e. H4N4F2, fucosylated glycans could be distinguished. H5N4F2 and H5N4F3 (possibly derived from H4N4F1 and H5N4F1) We examined the fucosylation linkage while dividing the compo- were found only in noncancer cell lines while H4N5F2S1, H3N6F3, sitions according to the fucose (Fuc) residue number (F1, F2 and F3). H5N5F2, H4N5F3, H3N6F2, H4N5F2 and H3N5F2 were shown The mono-fucosylated compositions were searched in GlyConnect uniquely in IGROV1. This supports either GlcNAc bisection or FUT8 (N-linked, Homo sapiens and Urogenital System) and out of all the gene expression in cancer cells. Considering the possible non-core fuco- reported structures showing core fucosylation (140/151 structures— sylation linkages for F2 and F3, (α1–2) and (α1–3) were the other two 92,7%), a majority displayed a higher level of (α1–6) fucosylation options in structures matched in the GlyConnect database. It is not (File S3C). Six mono-fucosylated compositions (H6N5F1, H6N6F1, possible to determine which fucosyltransferase gene is related to these H5N4F1S1, H5N5F1S1, H5N4F1S2 and H6N5F1S3), matched struc- compositions, but looking at the quantitative RT-PCR results of the tures with the non-core (α1–2) and the (α1–3) bond appeared to be on mRNA transcripts (Anugraham et al. 2014) we can suggest a correl- display (File S3C). Clicking on those structures redirects to entries in ation with the higher expression of the FUT2 enzyme and possibly link our database where the reference articles pointed at evidence of pos- the high expression of FUT4 to (α−1,3)-fucosylation (Figure 5). sible (α1–2)-linked Fuc residue while not excluding core-fucosylated Anugraham and coauthors reported differences in the sialylation analogues (Stroop et al. 2000). The (α1–3)-fucosylated N-glycans were gene expression. Despite our ability to spot sialylated compositions,
Downloaded from https://academic.oup.com/glycob/article-abstract/28/6/349/4923253 by guest on 29 July 2018 Interactive view of glycosylation 355
Fig. 5. Influence of fucosylation shown in the differential display between HOSE 17.1 (noncancer cell line) and (A) SKOV3 (cancer cell line) and (B) IGROV1 (can- cer cell line). The differential display shown in Figure 1 is reproduced here and complemented with red connectors between bars to single out the addition of a fucose (fucosylation). The brightness of the connectors is user-defined. Connectors are faded by default and brightened upon a click. Fucosylated compositions are mostly over-expressed in noncancer cell lines. Connectors from H5N4 to H5N4F3 via H5N4F1 and H5N4F2 were brightened to emphasize compositions unique to noncancer HOSE 17.1 (bright blue) with respect to both (A) SKOV3 and (B) IGROV1. However, HOSE 17.1 vs. (A) SKOV3 and (B) IGROV1 shows a high- er density of red connectors in (B). The expression of more fucosylated glycans is possibly correlated with the higher expression of fucosyltransferases observed in IGROV1 cell line by (Anugraham et al., 2014).
exclusively in the cancer cell lines (H4N5F1S1, H4N5F2S1, H5N5F1S1, dataset on membrane glycosylation in EOC. Data were extracted H4N5F1S2) we could not relate this information with the expression of from spreadsheets in Supplementary material and imported into ST3GAL or ST6GAL genes since both (α1–3) and (α2–6) sialylation Glynsight, the entry point of our pipeline. cannot be distinguished in Glynsight. Glynsight performed a fast and in-depth comparison of the pro- files relative to six cell lines reported in the dataset. SKOV3 and IGROV1 appeared to be the most differentiated cell lines, IGROV1 fi fi Discussion being the only pro le where all the glycan compositions were signi - cantly expressed. This might lead to a subsequent and more careful Validation of experimental results analysis of this specific cancer cell line. In an effort to illustrate and validate the use of Glynsight, Three fucosylated compositions H4N4F2, H5N4F2 and H5N4F3, EpitopeXtractor and Glydin’, we have analyzed a recently published not previously discussed in the examined article (Anugraham et al.
Downloaded from https://academic.oup.com/glycob/article-abstract/28/6/349/4923253 by guest on 29 July 2018 356 D Alocci et al.
Fig. 6. LacdiNAc subnetwork in Glydin’. Glycan determinant cluster that emerges when mapping glycoepitopes from compositions exclusively expressed in can- cer (bright red bars) related to the differential display of SKOV3 and HOSE 17.1 (Figure 1A). Glycoepitopes shown in red are the ones extracted by EpitopeXtractor. The presence of LacdiNAc (LDN) confirms the observation made in (Anugraham et al. 2014). Other red nodes can be seen as possible targets for further experimental studies.
2014), were identified as discriminant. These compositions can only be incompleteness of Glyconnect, but it enables functional interpretation characteristically found in noncancer cells (Figure 5), in both HOSE through running EpitopeXtractor and executing in silico glycoepitope cell lines. Furthermore, we confirmed a higher mannose content in can- extraction. By simulating the identification of potential ligands, cer cells lines compared to noncancer as well as a GlcNAc bisection, EpitopeXtractor combined with Glydin’ helps estimating the binding contrary to fucosylated glycan compositions that were found to be characteristics of highly expressed compositions. This saves time and expressed in noncancer cell lines. The GlcNAc bisection has been con- work required to map glycoepitopes. To our knowledge, this in silico firmed to be most likely related to the expression of MGAT3 gene, glycoepitope extraction and mapping from differentially expressed gly- which encodes the enzyme adding the monosaccharide in a beta 1–4 can compositions is new in glycoinformatics. With the data tested, linkage to the core of the N-glycans (Abbott et al. 2008; Anugraham EpitopeXtractor marked the evidence of Lewis x/a as unique glycoepi- et al. 2014), contrasting with fucosylated structures that were discov- topes, underlining a specific characteristic of noncancer glycan compo- ered to be qualitatively associated to cancer. Whereas increased sitions. However, this observation does not reflect a reliable correlation MGAT3 transcripts were identified in the EOC through RT-PCR between the presence of Lewis epitopes and noncancer cell lines. It experiments (Anugraham et al. 2014), as supported by our Glynsight partly contradicts a previous study (Escrevente et al. 2006), where analysis, fucosylation and sialylation genes could not be described. As m130 and GG ovarian carcinoma cell lines were shown to express Glynsight links each composition to a collection of putative glycan high levels of Lewis x, sialyl Lewis a and Lewis b that have been asso- structures via GlyConnect the structural variety of these structures ciated with major cancer activities. Further experimental work would makes it difficult to map unique enzyme information and correlate be needed to determine the specificity of Lewis x glycoepitopes. with gene expression data. We could however hypothesize on the Nonetheless, EpitopeXtractor correctly reported the presence of FUT8 gene associated with the (α1–6) fucosylation linkage. Structures LacdiNAc in cancer-related structures, which was proven by Anugraham containing Fuc(α1–6) proved to be equally distributed in all highly and coauthors using mass spectrometry. In addition, Sialyl Tn Antigen expressed compositions. emerged as a candidate to distinguish cancer from noncancer cell lines. Each structure is associated with tissue expression information that These epitopes have been already described in relation to ovarian can- cannot yet be confidently leveraged due to the scarcity of structure-to- cer in (Christiansen et al. 2014). tissue correlations. The precision of the proposed approach will increase with regular updates of knowledge in GlyConnect via new Potential for further applications annotated glycan structure import from public results. At the time of writing, GlyConnect contains 3404 glycan structures of which 1557 With this approach, glycoepitope analysis shows the potential for areexpressedinhuman. bringing out new hypotheses and helping scientists narrow down the range of glycan determinants of interest. Note that we currently mainly focus on mammalian N- or O-linked glycans. Novelty brought by the pipeline Our pipeline, in particular Glynsight, can be tested as a discovery Extracting and displaying glycan structures highlights trends in the pos- tool in the research field as well as quality control tool for pharmaceut- sible monosaccharide arrangements of each composition. Such a collec- ical companies. Batch to batch glycosylation analysis has become more tion of putative structures may not be comprehensive due to the and more important to assess the quality of therapeutic monoclonal
Downloaded from https://academic.oup.com/glycob/article-abstract/28/6/349/4923253 by guest on 29 July 2018 Interactive view of glycosylation 357
antibodies (mAbs) (Planinc et al. 2017). Glynsight proposes an innova- and consequently their quantity was added together to obtain a single tive way of displaying glycan profile data, which can be used as a sig- value. The Glynsight session file is available in Supplementary data nature for each batch. Moreover, the differential display can easily (File S1) and can be used to reproduce the results. catch discrepancies among batch glycan profiles leading to accurate and precise quality assurance. For confidential data, a desktop version Tools of Glynsight is available upon request. Glycan profile interactive visualization can potentially be used in The pipeline which allows the user to explore data from composi- a wide range of studies, from glycoproteomics (Stadlmann et al. tions to glycoepitopes is split in three different web tools: Glynsight, 2017) to therapeutic mAbs glycan analysis (Cymer et al. 2017), to EpitopeXtractor and Glydin. Each tool was designed to carry out a fi compare as well as publish results avoiding the usage of endless speci c task and can be used either alone or in conjunction with the spreadsheets and tables. others. To conclude, Glynsight is tackling several issues, which arise in The next sections give an in depth overview of each tool, show- the analysis of glycan profiles. In addition, our strategy for perform- ing the main features and how to use them. ing fast glycan determinant extraction, using EpitopeXtractor and ’ mapping with Glydin suggests potential new targets for glycan bio- Glynsight marker discovery. Quantitative mass-profiling experiments allow the monosaccharide compositions of distinct sets of isobaric glycans (i.e. structures hav- ing the same mass) to be identified and the abundance of each iso- Future plans baric set to be estimated, thereby producing a glycome in the form In the future we plan to expand the GlyConnect knowledge base as of a glycan composition-based profile. Usually, a profile consists in a well as to either enlarge or diversify pipelines adding new tools and list of independent glycan compositions associated with correspond- new strategies. In the course of 2018, a centralized glycan epitope ing intensities. These profiles are the input of Glynsight. ’ database will be connected with EpitopeXtractor and Glydin . This The first step implemented in Glynsight is to bring out relation- ’ will allow for users contribution to knowledge of glycan determi- ships between input compositions. To that end, compositions are ’ nants. We have also scheduled changes in the Glydin display to incrementally ordered and structured in columns where all composi- accommodate more information. For example the edges of the net- tions with a constant overall number of monosaccharides are listed. work will be associated with enzyme knowledge while the modes For instance, if the profile is an N-glycome, then the first column of will be associated with expression data to highlight glycoepitopes in the profile is the N-core H3N2 (5 monosaccharides in total) and the over or under-expressed glycans. Finally, we also plan to release a next is H3N3 or H3N2F1, H4N2, H2N3F1 (6 monosaccharides in Glynsight version for the simultaneous comparison of larger collec- total), etc. In this way, the implicit inclusion of H3N2 in H3N3 or fi tions of pro les. To feed our future plan further, we also collect H3N2F1 is made obvious. More generally, this presentation of the comments and feedback from our collaborators and users since profile highlights the monosaccharide shared between compositions. matching their needs is our primary goal. Suggestions are an endless In this way, Glynsight shows the relationships between compositions source of ideas for developing better and smarter tools. in an interactive way. This allows the user to follow an increment in galactosylation or fucoslyation, etc., as explained below. The interface can be used to retrieve all the potential glycan Material and methods structures, which match a particular composition. Potential struc- Data tures are not generated using software but are retrieved from We have selected a recently published dataset (Anugraham et al. 2014) curated databases accessible via the GlyConnect platform (https:// from which we extracted the details in Supplementary material. We glyconnect.expasy.org). Glynsight is available online at https:// considered Table I in (Anugraham et al. 2014) and transformed all glycoproteome.expasy.org/glynsight/. N-glycan structures into compositions (Table I). We left O-glycan data due to the small range that was included. We define glycan compos- Glycan profile upload ition as the number of monosaccharides, which composes the glycan To begin with, the user needs to upload a set of glycan profiles. This out of a list of 7 main building blocks: Hexose (Hex), N-acetylhexosa- operation is made simple through a multiple file upload in Glynsight mine (HexNAc), Fuc, NeuAc, N-Glycolylneuraminic acid (NeuGc) under “Import Experiments” in “Manage Experiments” menu. Each including Sulfate for O-linked. We use the simple notation shown in glycan profile must be stored in a comma-separated value (CSV) Table I. For instance the N-linked core is H3N2. type of file. All CSV files must contain two columns: This data was then converted into the Glynsight format and loaded into our software. Potential isomers were not taken into consideration 1. glycan composition and 2. glycan abundance as percentage or any other unit.
Table I. Notation for monosaccharides As shown in the example file (File S2), the glycan composition col- “ ” Monosaccharide Encoding umn must be named composition whereas glycan expression col- umn must be named “quantification”. Hex H When the file is correctly imported, a grey tick will appear next HexNAc N to the filename and a new entry will be available in the experiment Fuc F list situated on the left side of the Glynsight interface. By default the NeuGc G name of the file is taken as the name of the experiment, but clicking NeuAc S on “Rename Experiments” in “Manage experiments” allows the Sulfate s user to change any experiment name.
Downloaded from https://academic.oup.com/glycob/article-abstract/28/6/349/4923253 by guest on 29 July 2018 358 D Alocci et al.
If a wrong experiment has been mistakenly uploaded, the button associated color in a specific glycan encoding format, the color of the “Remove selected experiment” removes only the selected experiment edge can be adapted using the color picker under the switch. By (s) in the list. If a complete reset of Glynsight is needed, clicking on default, colors are following the Symbol Nomenclature for Glycans “Remove all experiments” will wipe out all the experiments uploaded. (SNFG) (e.g. red for Fuc, purple for sialic acid). In the upper right corner of the viewer tab, a button with an arrow pointing down, allows the user to take a screenshot of the Manage a session viewer for sharing or documenting purposes. The format of the A Glynsight session is a collection of uploaded glycan profiles. A pro- image is set to scalable vector graphics. Custom settings of the cessed collection can be downloaded using “Download session” but- viewer are kept when a new experiment is loaded. Consequently, ton under the “Manage Session” menu. Glynsight packs all the selecting a new experiment will only modify the height of each bar information into a single JSON file ready to be shared and uploaded reflecting the new data. It is important to note that all glycan com- for future usage. Once the session file is uploaded, using “Upload ses- positions showed in the viewer are taken from the first imported sion” button, Glynsight will automatically show the collection of gly- profile. In other words, if the user uploads experiments where the can profiles under the “Experiment list”. In this case, reuploading all second contains more compositions than the first, then the extra the glycan profiles is avoided, in an attempt to minimize mistakes. compositions will not show in the viewer. To avoid this problem, it is imperative to upload first, the profile with the largest set of Display mode compositions. Glynsight has been designed for two main usages: visualize and compare glycan profiles. Each is characterized by a distinct mode Differential display and a characteristic display. A single glycan profile is visualized in Differential Display is an adapted version of the Individual Display the “individual display mode” while in the “differential display for comparing two glycan profiles. In this mode, the list of experi- mode” the comparison of two profiles is shown. ments in the left side menu is duplicated to provide the user with the The user can smoothly switch between the two modes by select- option of selecting an experiment in each of the two lists. Glynsight ing the preferred one in the “Display mode” left menu. Switching subtracts intensities of each pair of identical compositions. The display mode automatically triggers a change in the Experiment list viewer shows red and blue bars whose height for each composition enabling the selection of multiple experiments. The following sec- represents the result of the subtraction. For any given composition, tions explain how to use the individual and the differential displays. if its intensity in the first experiment is higher than in the second, then the bar is blue. In the opposite situation, the bar is red. We also introduced a distinction to spot compositions unique to an experi- Individual display ment. Brighter blue (resp. red) highlights differences arising from We assume here that “Individual Display” is selected in the left menu. compositions unique to the first (resp. second) experiment while Selecting a glycan profile in the experiment list triggers an update the toned down blue (resp. red) shows differences arising from composi- viewer. Glynsight retrieves the data related to the profile from the local tions expressed in both experiments but higher in the first (resp. storage in the browser and updates the view to fit with the new data. second) than in the second. Note that white and grey bars are still As detailed above, the viewer shows compositions incrementally present with the same meaning as previously defined. ordered and structured in columns. Within each column, they are ordered by increasing numbers of HexNAc from top to bottom. A bar next to each composition represents the relative intensity Potential structures in the glycan profile, whereas the bar color gives a visual indication To link compositions with potential glycan structures, Glynsight of the expression level. Four different colors are used: searches in GlyConnect, our dashboard for curated glycan data (manuscript in preparation), for all reported glycans that match a par- • Green corresponds to highly expressed glycan compositions ticular composition. This process is transparent to the user and is • Yellow corresponds to lowly expressed glycan compositions. initiated by clicking on a composition of interest or its corresponding • White corresponds to unexpressed glycans. bar in the main viewer. Then, potential structures are visualized in the • fi Grey corresponds to glycans present in other pro les but not in structure tab under the viewer. Results can be filtered by type of gly- the one currently displayed. can attachment to protein, species and tissue. Each can be removed by Since the separation between highly or lowly expressed glycans clicking on the red cross in the corner of each image. Clicking on the cannot be fixed but it is relative to the experiment, the user can set image goes directly to the structure page of the database. Finally, the “ ” the “Expression threshold” in the upper right part of the viewer Send to EpitopeXtractor button sends the list of glycan structures under Scale Setting. This threshold defines the upper limit of the yel- to the next tool in the pipeline described in this paper. It performs the low bars and the lower bound of the green bars. extraction of glycoepitopes/glycan determinants. In addition to the “Expression threshold”, Scale Setting can also operate on the display scale represented by the leftmost purple bar EpitopeXtractor and set by default to 5%. The value (number) and the unit (text) The part of a complex carbohydrate molecule that is recognized by can be modified to reflect the quantification unit. In some cases, a a GBP is referred to as “glycoepitope” or “glycan determinant”.Itis larger value improves the interpretation of variations. the glycan structure required for recognition by a GBP (Wang et al. In the upper left part of the viewer, the Edge Selection setting is 2014). Glycan determinants are uniquely recognized by antibodies composed of six different switches with color pickers. Using the corre- or by lectins, the generic category of GBPs. Here, glycoepitope and sponding switch, the user can highlight a specific monosaccharide glycan determinant are used as synonyms. increment, which will be visible in the form of an edge (or connector) EpitopeXtractor is the second step of our pipeline as well as a connecting two composition bars. Since each monosaccharide has an standalone tool. Its purpose is to extract all the glycoepitopes
Downloaded from https://academic.oup.com/glycob/article-abstract/28/6/349/4923253 by guest on 29 July 2018 Interactive view of glycosylation 359
contained in a glycan structure or a set thereof. Several lists of glycan Further detail on the purpose and usage of Glydin’ is provided in the determinants have been described in the literature or listed in data- next section. bases, often with overlaps, we have created a collection which sum- marizes the public knowledge about glycoepitopes. EpitopeXtractor is using this knowledge to extract glycoepitopes from glycan structures. Glydin’ A more detailed description of the sources selected to build our collec- Glycoepitopes have been collected by different authors and recorded tion of glycoepitopes is given in the section related to the Glydin’ tool. in different ways in a few unrelated databases or simply printed in EpitopeXtractor is powered by GlyS3 (Alocci et al. 2015), tables of articles. In all cases, glycoepitopes are listed as independent another in house tool, designed to search glycan substructures. The entities despite their compositional similarity and the common set of branched structure of glycans complicates this task and we proposed enzymes required to synthesize them. We designed Glydin’ (Glycan a solution in GlyS3 taking into account possible structure fuzziness. dynamics) to (i) merge distinct sources of information and standard- In a nutshell, EpitopeXtractor performs substructure search, via ize the description glycoepitopes provided in different formats and GlyS3, for each glycoepitope in our curated list and outputs all (ii) map the similarity between these entities in an interactive net- matches. work. Glydin’ is as a result, a tool to visualize and explore the rela- EpitopeXtractor is available online at http://glycoproteome. tionships between glycoepitopes based on either their shared expasy.org/epextractor. To our knowledge, it is the first online tool monosaccharide composition or the glycosyltransferase needed to extracting glycoepitopes from one or more glycan structure(s) based get one from the other. on a multi-source and curated list of glycan determinants. Glydin’ is available online at http://glycoproteome.expasy.org/ epitopes/.
Glycan structure input There are three ways of importing glycan structures to be processed Glycan determinant data sources for glycoepitope extraction: The extent of glycan determinants has not been yet defined. Several independent sources are available where information on glycoepi- • Draw a glycan using the graphical input using GlycanBuilder topes can be found. The following four resources cover the current (Ceroni et al. 2007). Note that in the near future, SugarSketcher cur- knowledge on glycoepitopes: rently only prototyped (https://glycoproteome.expasy.org/sugarsketcher/) will supersede GlycanBuilder. • The article “The repertoire of glycan determinants in the human gly- • fi De ne a structure with the text input pasted or written in come” (Cummings 2009) provided a table of 122 glycoepitopes. GlycoCT condensed (Herget et al. 2008). This format is widely • The 2014 version of GlycoEpitope database (http://www. used in glycoinformatics and many converters are available from glycoepitope.jp/)(Kawasaki, Nakao, and Tominaga 2008)reviewed other formats (Campbell et al. 2014). by Dr Catherine Hayes (formerly a member of the Dept of Medical • Insert a list of comma-separated glycan IDs from our in house Biochemistry and Cell Biology at the University of Gothenburg, database GlyConnect. Sweden) generated a list of 156 glycoepitopes. • The SugarBind database (https://sugarbind.expasy.org/)(Mariethoz Extraction results et al. 2016) contains 200 different glycoepitopes with described Running EpitopeXtractor can take several minutes due to the large structures. • amount of epitopes present in the standard collection (approximately The BiOligo database (http://glyco3d.cermav.cnrs.fr/search.php? = 500 different glycoepitopes). When the extraction is completed, the type bioligo), part of the Glyco3D portal, provides more than results show all the SNFG coded images of the epitopes extracted from 250 annotated entries of glycan determinants (Pérez et al. 2015). the input glycan(s). Clicking on each picture redirects the user to a tool called Glydin’, a glycoepitope viewer presented in the next section. Preprocessing structural data When multiple glycans are submitted, the results are shown as a The four different sources of glycoepitopes cited above were inte- conceptual map highlighting the relation between selected glycoepi- grated into one unique and nonredundant collection. However, the topes and input structures. This interactive map displays glycoepi- compilation of the sources was made difficult due to the variety of topes in a central vertical list radiating to structures that contain the formats used to encode structures. Glycan determinant structures in substructures. Hovering on a glycoepitope highlights all the struc- the different sources are described with IUPAC formulae that are tures including this particular substructure whereas moving the not standardized. Consequently, the same epitope referenced in two mouse on a structure will shows all the glycoepitopes found in this different sources may have two different encodings and will be con- selected glycan. sidered as two different entities by an automatic text parser. Using several online converters, all glycoepitopes were first translated to Visualization with Glydin’ GlycoCT, the most common structure encoding format. In many By default EpitopeXtractor outputs a list of glycoepitopes associated cases, the four sources provided names for the glycoepitopes and with a set of glycans. However, the underlying relationships between those with none were left as “unnamed”. In the end, a list of 558 glycoepitopes that partially share composition should also be high- unique glycan determinants was compiled. lighted and explored. This can be managed with Glydin’, a tool we developed to map and display glycoepitope similarities. In the upper right corner of the Result window, the user can click on a button Data visualization with a network icon to send all the extracted glycoepitopes to Glydin’. Glydin’ displays the 558 glycoepitopes as nodes in a network. Two Then, a new browser page opens and shows the glycoepitope network networks are mapped to emphasize two different types of shared of Glydin’ where all the selected epitopes are singled out as red nodes. properties:
Downloaded from https://academic.oup.com/glycob/article-abstract/28/6/349/4923253 by guest on 29 July 2018 360 D Alocci et al.
1. In the “Substructure Based Map”, two nodes are connected if one Supplementary data glycoepitope is a substructure of the other. Supplementary data is available at Glycobiology online. 2. In the “Enzyme Base Map”, two nodes are connected as the result of a glycosyltransferase adding a monosaccharide to the other. Funding
Substructure network European Union FP7 Innovative Training Network [grant number 316,929] Using GlyS3 (Alocci et al. 2015), the edge list representing the net- and by the Swiss Federal Government through the State Secretariat for work was created. Each node was searched as a substructure in the Education, Research and Innovation (SERI). ExPASy is maintained by the remaining 557. Practically, when an epitope A is found to be a sub- web team of the Swiss Institute of Bioinformatics and hosted at the Vital-IT structure of an epitope B, then node A and node B are connected, Competency Center. and an edge with the IDs of the epitopes A and B is added to the edge list. This resulted in a network where each node is connected to Acknowledgements itself since it is a substructure of itself. Also, if a node A is connected to B and node B is connected to node C, then a redundant edge that We thank Dr David Falk and Prof Manfred Wuhrer for helpful discussions on the genesis of Glynsight and feedback on the early prototypes. We also links nodes A and C is created and overcrowds the network. Finally, acknowledge Prof Nicolle Packer and Dr Niclas Karlsson, our main colla- to clean all the redundant edges, a transitive reduction of the graph borators in the Glycomics@ExPASy initiative. was applied.
Conflict of interest statement Enzymatic network The enzymatic network can be represented as an evolution of the None declared. substructure network. Starting from the same entities, more con- straints have been added to the substructure search. This network is Abbreviations rooted. Roots are single monosaccharides (Hex and HexNAc). An edge between two nodes is added only if one is a substructure of the CFG, consortium for functional glycomics; CSV, comma-separated value; other and if it covers a subregion that includes one of the roots. EOC, epithelial ovarian cancer; Fuc, fucose; GBP, glycan binding protein; Including the root as part of the substructure is the key to shift from Hex, hexose; HexNAc, N-acetylhexosamine; mAbs, monoclonal antibodies; NeuAc, N-acetylneuraminic acid; NeuGc, N-glycolylneuraminic acid; SNFG, being a substructure to an extension. In the end, like the substruc- symbol nomenclature for glycans; SVG, scalable vector graphics. ture network, a filtering step using the transitive reduction was applied to the graph. References
Network properties Abbott KL, Nairn AV, Hall EM, Horton MB, McDonald JF, Moremen KW, The “Substructure Based Map” shows a network as one large con- Dinulescu DM, Pierce M. 2008. Focused glycomic analysis of the N- nected component whereas the “Enzyme Base Map” is a collection linked glycan biosynthetic pathway in ovarian cancer. Proteomics. 8(16): 3210–3220, https://doi.org/10.1002/pmic.200800157. of connected components each stemming from one of the five main Adamczyk B, Jin C, Polom K, Muñoz P, Rojas-Macias MA, Zeeberg D, Borén monosaccharides. In either map all nodes are labeled with their gly- M, Roviello F, Karlsson NG. 2018. Sample handling of gastric tissue and can determinant name. Hovering over a node highlights the glycoe- O-glycan alterations in paired gastric cancer and non-tumorigenic tissues. Sci pitope and its neighboring nodes and shows their names. The Rep. 8(1):242, https://doi.org/10.1038/s41598-017-18299-6. glycoepitopes called “unnamed” were not given names in the ori- Adamczyk B, Tharmalingam T, Rudd PM. 2012. Glycans as cancer biomar- ginal sources. Nodes have distinct shades of color. Glycoepitopes kers. Biochim Biophys Acta. 1820(9):1347–1353, https://doi.org/10. with the strongest shade are referenced in all four sources and 1016/j.bbagen.2011.12.001. appear in the foreground. Color fades as the number of source refer- Agravat SB, Saltz JH, Cummings RD, Smith DF. 2014. GlycoPattern: A web – ences decreases. In the background the lightest shade corresponds to platform for glycan array mining. Bioinformatics. 30(23):3417 3418, nodes referenced in only one source. Upon clicking on a node, a https://doi.org/10.1093/bioinformatics/btu559. Alocci D, Mariethoz J, Horlacher O, Bolleman JT, Campbell MP, Lisacek F. small window that shows the corresponding glycoepitope details 2015. Property graph vs RDF triple store: A comparison on glycan sub- opens in the upper right corner and next to it, a target icon appears. structure search. PLoS One. 10(12):e0144578, https://doi.org/10.1371/ Clicking on this target centers the network on the node of the gly- journal.pone.0144578. coepitope detailed in the window. This window contains four sec- Anugraham M, Jacob F, Nixdorf S, Everest-Dass AV, Heinzelmann-Schwarz tions. The first one shows the glycoepitope SNFG cartoon V, Packer NH. 2014. Specific glycosylation of membrane proteins in epi- representation. The second one (“Referenced In”) specifies where it thelial ovarian cancer cell lines: Glycan structures reflect gene expression was referenced and links to the original sources. The third section and DNA methylation status. Mol Cell Proteomics. 13(9):2213–2232, (“Referenced As”) states the names that it was given in the sources, https://doi.org/10.1074/mcp.M113.037085. and the last one (“Related To”) contains the names of the connected Buehler M, Brian T, Leboucq A, Jacob F, Caduff R, Fink D, Goldstein DR, glycoepitopes. Clicking on the link to a related glycoepitope will Heinzelmann-Schwarz V. 2013. Meta-analysis of microarray data identi- fies GAS6 expression as an independent predictor of poor survival in refresh the window with information of that new glycoepitope. A ovarian cancer. BioMed Res Int. 2013:238284, https://doi.org/10.1155/ double click on a node highlights it and the nodes related to it while 2013/238284. the others fade out to the background. A zoom option is also avail- Bénard J, Da Silva J, De Blois MC, Boyer P, Duvillard P, Chiric E, Riou G. able. Zooming in when centered on a glycoepitope expands the 1985. Characterization of a human ovarian adenocarcinoma line, graph around it. Above a certain zoom level, the glycoepitope names IGROV1, in tissue culture and in nude mice. Cancer Res. 45(10): are automatically displayed for all the nodes. 4970–4979.
Downloaded from https://academic.oup.com/glycob/article-abstract/28/6/349/4923253 by guest on 29 July 2018 Interactive view of glycosylation 361
Campbell MP, Ranzinger R, Lütteke T, Mariethoz J, Hayes CA, Zhang J, vasculitis. EBioMedicine. 17(January):108–118, https://doi.org/10.1016/j. Akune Y, Aoki-Kinoshita KF, Damerell D, Carta G et al. 2014. ebiom.2017.01.033. Toolboxes for a standardised and systematic study of glycans. BMC Keser T, Gornik I, Vucˇkovic´ F, Selak N, Pavic´ T, Lukic´ E, Gudelj I, Bioinf. 15(Suppl 1):S9, https://doi.org/10.1186/1471-2105-15-S1-S9. Gašparovic´ H, Biocˇina B, Tilin T et al. 2017. Increased plasma N- Ceroni A, Dell A, Haslam SM. 2007. The GlycanBuilder: A fast, intuitive and glycome complexity is associated with higher risk of type 2 diabetes. flexible software tool for building and displaying glycan structures. Source Diabetologia. 60(12):2352–2360, https://doi.org/10.1007/s00125-017- Code Biol Med. 2(August):3, https://doi.org/10.1186/1751-0473−2-3. 4426-9. Ceroni A, Maass K, Geyer H, Geyer R, Dell A, Haslam SM. 2008. Knezevic A, Gornik O, Polasek O, Pucic M, Redzic I, Novokmet M, Rudd GlycoWorkbench: A tool for the computer-assisted annotation of mass PM, Wright AF, Campbell H, Rudan I et al. 2010. Effects of aging, body spectra of glycans. J Proteome Res. 7(4):1650–1659, https://doi.org/10. mass index, plasma lipid profiles, and smoking on human plasma N-glycans. 1021/pr7008252. Glycobiology. 20(8):959–969, https://doi.org/10.1093/glycob/cwq051. Cholleti SR, Agravat S, Morris T, Saltz JH, Song X, Cummings RD, Smith Knezevic´ A, Polasek O, Gornik O, Rudan I, Campbell H, Hayward C, DF. 2012. Automated motif discovery from glycan array data. OMICS J Wright A, Kolcic I, O’Donoghue N, Bones J et al. 2009. Variability, herit- Integr Biol. 16(10):497–512, https://doi.org/10.1089/omi.2012.0013. ability and environmental determinants of human plasma N-glycome. Christiansen MN, Chik J, Lee L, Anugraham M, Abrahams JL, Packer NH. J Proteome Res. 8(2):694–701, https://doi.org/10.1021/pr800737u. 2014. Cell surface protein glycosylation in cancer. Proteomics. 14(4–5): Konishi Y, Aoki-Kinoshita KF. 2012. The GlycomeAtlas tool for visualizing 525–546, https://doi.org/10.1002/pmic.201300387. and querying glycome data. Bioinformatics. 28(21):2849–2850, https:// Cummings RD. 2009. The repertoire of glycan determinants in the human gly- doi.org/10.1093/bioinformatics/bts516. come. Mol BioSyst. 5(10):1087–1104, https://doi.org/10.1039/b907931a. Krištic´ J, Vucˇkovic´ F, Menni C, Klaric´ L, Keser T, Beceheli I, Pucˇic´-Bakovic´ Cymer F, Beck H, Rohde A, Reusch D. 2017. Therapeutic monoclonal anti- M, Novokmet M, Mangino M, Thaqi K et al. 2014. Glycans are a novel body N-glycosylation—Structure, function and therapeutic potential. biomarker of chronological and biological ages. J Gerontol A Biol Sci Biologicals. November. https://doi.org/10.1016/j.biologicals.2017.11.001. Med Sci. 69(7):779–789, https://doi.org/10.1093/gerona/glt190. Escrevente C, Machado E, Brito C, Reis CA, Stoeck A, Runz S, Marmé A, Kurbacher CM, Korn C, Dexel S, Schween U, Kurbacher JA, Reichelt R, Altevogt P, Costa J. 2006. Different expression levels of alpha3/4 fucosyl- Arenz PN. 2011. Isolation and culture of ovarian cancer cells and cell transferases and lewis determinants in ovarian carcinoma tissues and cell lines. Methods Mol Biol. 731:161–180, https://doi.org/10.1007/978-1- lines. Int J Oncol. 29(3):557–566. 61779-080-5_15. Gudelj I, Keser T, Vucˇkovic´ F, Škaro V, Goreta SŠ, Pavic´ T, Dumic´ J, Mariethoz J, Khatib K, Alocci D, Campbell MP, Karlsson NG, Packer NH, Primorac D, Lauc G, Gornik O. 2015. Estimation of human age using Mullen EH, Lisacek F. 2016. SugarBindDB, a resource of glycan-mediated N-glycan profiles from bloodstains. Int J Legal Med. 129(5):955–961, host-pathogen interactions. Nucleic Acids Res. 44(D1):D1243–D1250, https:// https://doi.org/10.1007/s00414-015-1162-x. doi.org/10.1093/nar/gkv1247. Hayes CA, Nemes S, Issa S, Jin C, Karlsson NG. 2012. Glycomic work-flow Miwa HE, Song Y, Alvarez R, Cummings RD, Stanley P. 2012. The bisecting for analysis of mucin O-linked oligosaccharides. In: McGuckin M, GlcNAc in cell growth control and tumor progression. Glycoconjugate J. Thornton D, editors. Mucins, Methods in Molecular Biology. Humana 29(8–9):609–618, https://doi.org/10.1007/s10719-012-9373-6. Press. p. 141–163, https://doi.org/10.1007/978-1-61779-513−8_8. Nakagawa H, Kawamura Y, Kato K, Shimada I, Arata Y, Takahashi N. Herget S, Ranzinger R, Maass K, Lieth C-WVD. 2008. GlycoCT—A unifying 1995. Identification of neutral and sialyl N-linked oligosaccharide struc- sequence format for carbohydrates. Carbohydr Res. 343(12):2162–2171, tures from human serum glycoproteins using three kinds of high- https://doi.org/10.1016/j.carres.2008.03.011. performance liquid chromatography. Anal Biochem. 226(1):130–138, Ismail MN, Stone EL, Panico M, Lee SH, Luu Y, Ramirez K, Ho SB, Fukuda M, https://doi.org/10.1006/abio.1995.1200. Marth JD, Haslam SM et al. 2011. High-sensitivity O-glycomic analysis of North SJ, von Gunten S, Antonopoulos A, Trollope A, MacGlashan DW, mice deficient in core 2 {beta}1,6-N-acetylglucosaminyltransferases. Glycobiology. Jang-Lee J, Dell A, Metcalfe DD, Kirshenbaum AS, Bochner BS et al. 21(1):82–98, https://doi.org/10.1093/glycob/cwq134. 2012. Glycomic analysis of human mast cells, eosinophils and basophils. Jansen BC, Falck D, de Haan N, Ederveen ALH, Razdorov G, Lauc G, Glycobiology. 22(1):12–22, https://doi.org/10.1093/glycob/cwr089. Wuhrer M. 2016. LaCyTools: A targeted liquid chromatography-mass Planinc A, Dejaegher B, Heyden YV, Viaene J, Praet SV, Rappez F, spectrometry data processing package for relative quantitation of glyco- Antwerpen PV, Delporte C. 2017. Batch-to-batch N-glycosylation study peptides. J Proteome Res. 15(7):2198–2210, https://doi.org/10.1021/acs. of infliximab, trastuzumab and bevacizumab, and stability study of beva- jproteome.6b00171. cizumab. Eur J Hosp Pharm. 24(5):286–292, https://doi.org/10.1136/ Jansen BC, Karli RR, Bondt A, Ederveen ALH, Palmblad M, Falck D, ejhpharm-2016-001022. Wuhrer M. 2015. MassyTools: A high-throughput targeted data process- Pucic´ M, Knezevic´ A,VidicJ,AdamczykB,NovokmetM,PolasekO,GornikO, ing tool for relative quantitation and quality control developed for glyco- Supraha-Goreta S, Wormald MR, Redzic´ I et al. 2011. High throughput iso- mic and glycoproteomic MALDI-MS. JProteomeRes. 14(12):5088–5098, lation and glycosylation analysis of IgG-variability and heritability of the IgG https://doi.org/10.1021/acs.jproteome.5b00658. glycome in three isolated human populations. Mol Cell Proteomics. 10(10): Jin C, Diarmuid TK, Skoog EC, Padra M, Adamczyk B, Vitizeva V, Thorell M111.010090, https://doi.org/10.1074/mcp.M111.010090. A, Venkatakrishnan V, Lindén SK, Karlsson NG. 2017. Structural Pérez S, Sarkar A, Rivet A, Breton C, Imberty A. 2015. Glyco3D: A portal for Diversity of Human Gastric Mucin Glycans. Molecular & Cellular structural glycosciences. In: Glycoinformatics, 241–58. Methods in Proteomics: MCP, March. https://doi.org/10.1074/mcp.M117.067983. Molecular Biology. New York, NY: Humana Press, https://doi.org/10. March. 1007/978-1-4939-2343-4_18. Joshi HJ, von der Lieth C-W, Packer NH, Wilkins MR. 2010. GlycoViewer: Raman R, Raguram S, Venkataraman G, Paulson JC, Sasisekharan R. 2005. A tool for visual summary and comparative analysis of the glycome. Glycomics: An integrated systems approach to structure-function relationships Nucleic Acids Research 38 (Web Server issue). W667–W670, https://doi. of glycans. Nat Methods. 2(11):817–824, https://doi.org/10.1038/nmeth807. org/10.1093/nar/gkq446. Raman R, Venkataraman M, Ramakrishnan S, Lang W, Raguram S, Kawasaki T, Nakao H, Tominaga T. 2008. GlycoEpitope: A database of Sasisekharan R. 2006. Advancing glycomics: Implementation strategies at carbohydrate epitopes and antibodies. In: Experimental Glycoscience. the consortium for functional glycomics. Glycobiology. 16(5):82R–90R, Tokyo: Springer. p. 429–431, https://doi.org/10.1007/978-4-431-77922- https://doi.org/10.1093/glycob/cwj080. 3_104. Rooijen J. J. van, Jeschke U, Kamerling JP, Vliegenthart JF. 1998. Expression of Kemna MJ, Plomp R, van Paassen P, Koeleman CAM, Jansen BC, Damoiseaux N-linked sialyl Le(x) determinants and O-Glycans in the carbohydrate moiety JGMC, Cohen Tervaert JW, Wuhrer M. 2017. Galactosylation and sialyla- of human amniotic fluid transferrin during pregnancy. Glycobiology. 8(11): tion levels of IgG predict relapse in patients with PR3-ANCA associated 1053–1064.
Downloaded from https://academic.oup.com/glycob/article-abstract/28/6/349/4923253 by guest on 29 July 2018 362 D Alocci et al.
Ruhaak LR, Deelder AM, Wuhrer M. 2009. Oligosaccharide analysis by gra- Stadlmann J, Taubenschmid J, Wenzel D, Gattinger A, Dürnberger G, phitized carbon liquid chromatography-mass spectrometry. Anal Bioanal Dusberger F, Elling U, Mach L, Mechtler K, Penninger JM. 2017. Chem. 394(1):163–174, https://doi.org/10.1007/s00216-009-2664-5. Comparative glycoproteomics of stem cells identifies new players in ricin Ruhaak LR, Miyamoto S, Lebrilla CB. 2013. Developments in the identifica- toxicity. Nature. 549(7673):538, https://doi.org/10.1038/nature24015. tion of glycan biomarkers for the detection of cancer. Mol Cell Stroop CJ, Weber W, Gerwig GJ, Nimtz M, Kamerling JP, Vliegenthart JF. Proteomics. 12(4):846–855, https://doi.org/10.1074/mcp.R112.026799. 2000. Characterization of the carbohydrate chains of the secreted form of Ruhaak LR, Zauner G, Huhn C, Bruggink C, Deelder AM, Wuhrer M. 2010. the human epidermal growth factor receptor. Glycobiology. 10(9):901–917. Glycan labeling strategies and their use in identification and quantifica- Taniguchi N, Kizuka Y. 2015. Glycans and cancer: Role of N-glycans in can- tion. Anal Bioanal Chem. 397(8):3457–3481, https://doi.org/10.1007/ cer biomarker, progression and metastasis, and therapeutics. Adv Cancer s00216-010-3532-z. Res. 126:11–51, https://doi.org/10.1016/bs.acr.2014.11.001. Schnaar RL. 2016. Glycobiology simplified: Diverse roles of glycan recogni- Wang Y-C, Peterson SE, Loring JF. 2014. Protein post-translational modifica- tion in inflammation. J Leukocyte Biol. 99(6):825–838, https://doi.org/ tions and regulation of pluripotency in human stem cells. Cell Res. 24(2): 10.1189/jlb.3RI0116-021R. 143, https://doi.org/10.1038/cr.2013.151.
Downloaded from https://academic.oup.com/glycob/article-abstract/28/6/349/4923253 by guest on 29 July 2018 TECHNICAL BRIEF Glycoinformatics www.clinical.proteomics-journal.com PepSweetener: A Web-Based Tool to Support Manual Annotation of Intact Glycopeptide MS Spectra
Marcin Jakub Domagalski,* Davide Alocci, Andreia Almeida, Daniel Kolarich, and Fr´ed´erique Lisacek*
Purpose: PepSweetener is a web-based visualization tool designed to 1. Introduction facilitate the manual annotation of intact glycopeptides from MS data Glycosylation is one of the most common regardless of the instrument that produced these data. post-translational modifications found Experimental design: This exploratory tool uses a theoretical glycopeptide on proteins. These glycosylated proteins dataset to visualize all peptide-glycan combinations that fall within the error are involved in a wide variety of biological pathways and fulfill numerous physiolog- range of the query precursor ion. PepSweetener simplifies the determination ical functions.[1] A comprehensive anal- of the correct peptide and glycan composition of a glycopeptide based on its ysis of protein-bound glycans requires precursor mass. The theoretical glycopeptide search space can be customized the determination of various parameters in an advanced query mode that specifies potential proteins/peptides, glycan such as the identification of the peptide compositions, and several experimental parameters. and site of glycosylation, the individual glycan structure(s) present at a specific Results: PepSweetener displays the results on an interactive heat-map chart site (microheterogeneity), and degree of where theoretical glycopeptide tile colors correspond to ppm deviations from site occupancy (macroheterogeneity).[2] the query precursor mass. Additionally, a visualization chart incorporates Several glycoforms can coexist on each glycan composition filtering, sorting by mass and tolerance, and an in silico site of glycosylation, and each single gly- peptide fragmentation diagram is provided to further support the correct cosylation site can exhibit a distinct pat- glycopeptide identification. tern of glycosylation. These patterns can also represent signatures of changes in Conclusions and clinical relevance: PepSweetener efficiently allows the cellular pathways that result from in- selection of the most probable intact glycopeptide mass matches and speeds flammation events or diseases such as up the verification process. It is validated on serum protein samples and cancer.[3,4] Therefore, the ability to fully immunoglobulins. The tool is publicly hosted on ExPASy, the SIB Swiss characterize a glycoprotein’s glycosyla- Institute of Bioinformatics resource portal tion site occupancy and glycan hetero- geneity is critical to better understand (http://glycoproteome.expasy.org/pepsweetener/app/). the biological and functional role of pro- tein glycosylation in health and disease.[5] MS has become the method of choice to study site-specific gly- Dr. M. J. Domagalski, D. Alocci, Dr. F. Lisacek cosylation as it offers high sensitivity, high throughput, and can Proteome Informatics Group easily be combined with a variety of different separation tech- SIB Swiss Institute of Bioinformatics [6] Geneva, Switzerland niques such as LC or CE. The analysis of intact glycopeptides E-mail: [email protected]; [email protected] by MS faces the challenge that the glycan portion cannot be pre- Dr. M. J. Domagalski, D. Alocci, Dr. F. Lisacek dicted as glycan biosynthesis is not a template-driven process, Computer Science Department CUI but the consequence of a complex biosynthetic machinery.[7] This University of Geneva renders standard proteomics such as database search algorithms Geneva, Switzerland useless.[8] Glycopeptides also have a lower ionization efficiency A. Almeida, Dr. D. Kolarich Institute for Glycomics compared to nonglycosylated peptides, making these compounds Gold Coast Campus more difficult to detect in a complex mixture of peptides and Griffith University glycopeptides.[9] Southport, QLD, Australia A common way to overcome the complexity associated with Dr.F.Lisacek glycopeptide analysis is to analyze the glycan and peptide moi- Section of Biology ety in separate MS/MS experiments after the glycan structures University of Geneva [10] [11] Geneva, Switzerland were released either enzymatically or chemically. The analy- sis of released glycans provides information on the glycan struc- The ORCID identification number(s) for the author(s) of this article tures and their relative abundance on a protein in contrast with can be found under https://doi.org/10.1002/prca.201700069 the analysis of the deglycosylated peptides that gives easy access DOI: 10.1002/prca.201700069 to protein identity and confirms the sites of N-glycosylation. In
Proteomics Clin. Appl. 2017, 1700069 1700069 (1 of 10) C 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.advancedsciencenews.com www.clinical.proteomics-journal.com
example, in the case of core 1 structures, the characteristic Clinical Relevance pattern includes peptide ion, water loss ion (−18), peptide + N + + Glycoproteins are key molecules involved in a plethora of intra- ion -acetylgalactosamine ( 203), and peptide ion N + + [8] and intercellular interactions, and their correct function is often -acetylgalactosamine galactose ( 365). These high mass directly correlated with their individual glycosylation. A large ions of glycopeptide stubs are particularly useful for glycan majority of FDA-approved cancer marker proteins are glyco- sequencing and determination of the peptide mass, but usually proteins, and software such as PepSweetener that is workflow give weak peptide fragmentation. Overall, these glycopeptide independent, helps getting closer and faster to deciphering spectra provide useful information on the composition and protein and site-specific glycosylation signatures in health and partially the structure of glycans; however in most cases the disease. With this momentum, novel tools and opportunities detailed glycan structure beyond composition is not generated to understand onset and progression of diseases such as can- by these techniques. cer and chronic inflammatory diseases will soon emerge. This provides huge potential for disease detection and monitoring. 3. Existing Tools for Software-Assisted Glycopeptide Spectra Annotation Enormous progress has been made in software tools for ana- complex mixtures, however, it is almost impossible to corre- lyzing glycopeptide MS data over the past years. Despite these late glycan structures with individual proteins and sites of gly- advances, automated interpretation of intact glycopeptides re- cosylation. Analysis of glycosite heterogeneity can thus only be mains an evolving field as existing software packages often are performed on intact glycopeptides by annotating MS data for specifically tailored to the needs of individual laboratories, as glycan structure distribution and MS/MS data for glycopeptide reviewed in ref. [15]. The majority of currently available gly- identity.[9,12,13] coproteomics software packages implement library-oriented or de novo approaches adjusted specifically for characterization of both peptide and sugar parts of the glycopeptide. More re- 2. Manual Annotation of Glycopeptide Spectra cent database search packages include desktop applications: MAGIC,[17] SimGlycan,[18] GlycoQuest,[8,13] GlycoPep Evaluator,[6] In the early stages of glycoproteomics, time consuming and GlycoPeptideSearch,[19] GP Finder,[20] Sweet-Heart,[21] Byonic,[22] potentially experimentalist-biased manual interpretation of and web services: GlycoMaster DB,[23] GlycoPep Detector,[24] and glycopeptide spectra has been the only data analysis option. Even GlycoPep Grader.[25] In these tools, the adjustments introduced to today this remains the preferred way for glycopeptide spectra the traditional proteomics database search strategy include, most annotation of complex glycopeptide carrying more than a single importantly, glycopeptide spectra detection, use of theoretical gly- site of glycosylation.[13,14] Still, the diversity and complexity of gly- can databases, and parallel identification of the peptides and gly- coproteomics workflows, which usually involve multiple MS ex- cans. Alternatively, GPQuest[26] is using a spectral library match- periments with complementary fragmentation methods, remain ing algorithm to assign peptides in glycopeptide spectra. Unfor- a tough challenge for software developers. Ion trap, Orbitrap, and tunately, library-oriented tools are limited by the lack of compre- Q-TOF CID (as well as other vibrational dissociation methods: hensive and curated collections of glycan sequences.[27] In con- HCD, CAD, IRMPD) produce MS/MS spectra dominated by gly- trast, de novo sequencing algorithms implemented in GlyPID[28] can fragments, while signals derived from the fragmentation of and SweetSEQer[29] resolve the problem of missing glycan li- peptide backbone bonds are less intense in these spectra.[13,15,16] braries, but their major flaw is a lack of statistical validation.[15] In contrast, activated electron dissociation methods produce Since completely automated annotation of generic glycan mass mainly peptide fragments leaving the glycan moiety intact, which spectra remains an unfeasible task, a range of computer pro- can be very useful for identification of glycosylation sites.[15] Gly- grams accelerating manual interpretation of glycopeptide spec- can derived oxonium ions remain the most convenient diagnostic tra is commonly used. GlycoMod has been one of the first and ions used to annotate collision-induced dissociation MS/MS still highly useful tools for glycopeptide annotation.[30] This web- ion scans for the presence of glycopeptides in protein digests. based tool can predict the possible oligosaccharide compositions Glycopeptide diagnostic ions obtained with CID are usually low from experimentally determined masses. When provided with molecular weight, protonated oxonium ions: m/z 163.06011+ the protein or peptide sequence, GlycoMod can also suggest pos- (Hexose; abbr. Hex), m/z 204.08671+ (N-acetylhexosamine; abbr. sible glycan compositions present on glycopeptides based on 1+ 1+ [31] HexNAc), m/z 366.1395 (HexNAc1Hex1), m/z 292.1027 the corresponding mass input. GlycoWorkbench is a comple- (N-acetylneuraminic acid; abbr. NeuAc), and m/z 657.23491+ mentary desktop application, which computes theoretical frag- (HexNAc1Hex1NeuAc1). Especially, the N-acetylhexosamine ment ions for user defined glycan structures. This tool can also residue and its water-eliminated derivatives are indicative match a set of glycan structures against the glycan MS/MS peak for N- as well as O-glycans.[16] Besides low mass diagnostic list. GlycoMod and GlycoWorkbench are very useful applications ions, N-glycopeptide CID spectra usually contain characteristic for the assignment and verification of potential glycan compo- patterns resulting from fragmentation of the trimannosyl core sitions, but these programs are of limited use in answering and starting from peptide + N-acetylglucosamine (y+HexNAc glycoproteomics questions. GlycoX[32] is a MATLAB program, ion). Likewise, O-glycopeptides usually contain fragmentation which extended the oligosaccharide calculator functionality with patterns specific for a particular type of core structure, for autoisotope filtering and determination of glycosylation sites
Proteomics Clin. Appl. 2017, 1700069 1700069 (2 of 10) C 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.advancedsciencenews.com www.clinical.proteomics-journal.com
posed solutions explains the experimental data best. The soft- ware does not provide statistical validation and does not score results, solely because at this point in time the scarcity of glyco- proteomics experimental data as well as the challenge of defin- ing a proper parameter set currently prevents the design and the implementation of such an algorithm in a reliable and platform- independent manner. Thus, peptide and glycans are by default sorted by ppm accuracy, but can be also sorted by mass and com- position/sequence to facilitate manual data evaluation. A pop-up cloud with the UniProt identifier of source protein is displayed when mousing over the peptide label with its sequence and mass. A cloud with the glycan mass pops up when mousing over the glycan composition label. Peptide fragmentation diagram is dis- played on mouse click on the glycopeptide tile. The chart can be further restricted with a glycan composition filter and sorted by peptide and glycan mass or ppm distance. These features make PepSweetener a very simple and generic tool that is entirely Figure 1. Heat map chart presenting theoretical O-glycopeptides that MS platform independent for the investigation of glycopeptide match precursor ion m/z 930.0893+. This figure shows the results of a spectra. search performed over the dataset created by the in silico proteolytic di- gestion of human IgD (UniProt accession: P01880) with trypsin. 5. Usage simultaneously from pronase digests. The availability of the tool PepSweetener has two working modes: simple and advanced is limited to personal request and processing complex datasets search. In both modes the search input is a comma-separated generate a high rate of false positives. list of potential intact glycopeptide masses or precursor ion m/z GlycoMod was taken one step further through the develop- values and the tolerance range. For ions, the charge state is de- ment of GlycoSpectrumScan,[33] a web-based application using fined by placing the charge value in parentheses, for example, a combinatorial approach for the analysis of site-specific glycan 930.0890(+3). The error window on experimental glycopeptide microheterogeneity. The tool is currently not accessible, but in mass values is specified in ppm or Da units. Due to the very large principle processes two datasets: the identified protein sequence size of the simple search dataset, maximum tolerance value is re- and its determined glycan monosaccharide compositions. Then stricted to ±100 ppm or ±0.1 Da. In the advanced mode the toler- it computes all glycoform combinations for each potential gly- ance range has no restriction. However, the size of the advanced copeptides, uses this information to screen the experimentally search dataset is restricted to 10 000 000 combinations. The sim- acquired MS data and finally identifies potential glycopeptides. ple search does not require inputting potential peptide sequences Scanning MS data against all possible theoretical peptide glyco- or glycans but allows to search precursor ion m/z values against forms significantly reduces the chances of false negatives. The a predefined database containing combination of all predicted tool also allows for the relative quantitation of different glyco- tryptic peptides containing N-glycosites in the human proteome forms taking various charge stages into account, but is not visu- (single missed cleavage allowed) and all human N-glycans of re- alizing alternative compositions within a specified error window, ported in the November 2015 release of UniCarbKB[34] extended though such a feature would be helpful in identification of false in-house by manually mining and curating published articles. positive matches. Regardless of the number of N-glycosylation sequons within a peptide sequence, only singly glycosylated peptides are gener- ated. This represents approximately 180 million theoretical com- 4. PepSweetener Concept binations, which are stored in a NoSQL database on the ExPASy server. The simple mode was designed for N-glycopeptide anal- The design of PepSweetener meets some of the objectives of Gly- yses of complex human samples. It should be used only with coSpectrumScan. Both tools screen experimental MS data of in- high accuracy data and low tolerance threshold. Due to the very tact glycopeptides against a dataset of theoretical glycoforms built large size of the theoretical dataset, it does not come as a sur- using a combinatorial approach and knowledge from comple- prise that the number of matches found in this mode can be mentary MS experiments on released glycans and deglycosylated very large. In the simple search mode, PepSweetener is limited to peptides. PepSweetener, however, is not indicating a single best the current state of knowledge of glycan compositions identified match, but it is visualizing all of the potential peptide and glycan on human proteins. The tool may not identify the correct glycan combinations on an interactive heat-map chart (Figure 1) with composition if it has not been reported previously in our in- peptides listed in rows, glycan compositions listed in columns house glycan composition file. Nevertheless, even if actual glycan and glycopeptide tiles colored accordingly to the ppm distance composition is not present in the search dataset, very of- from the query precursor mass. This interactive and global view ten the analysis of the best matching hits still allows for of possible matches helps the scientist to simultaneously ana- potential glycopeptide prediction and therefore determination lyze multiple possible glycoforms and decide which of the pro- of the glycan mass, which in turn can simplify the search
Proteomics Clin. Appl. 2017, 1700069 1700069 (3 of 10) C 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.advancedsciencenews.com www.clinical.proteomics-journal.com
Figure 2. The search space generator panel of the advanced search interface. Three vertically oriented input boxes allow for the selection of peptides, glycans, and fixed modifications, respectively.
of glycan composition with GlycoMod. GlycoMod is not re- computed using lists of peptide or glycan masses if sequences or stricted to any glycan library and can find all possible glycan compositions are unknown. compositions from its experimentally determined mass. In PepSweetener’s advanced mode the search space is custom- defined by specifying potential peptide sequences and glycan 6. Design and Implementation compositions (Figure 2), which are converted into a dataset of all glycoform combinations of each peptide. This mode allows the PepSweetener is a web application written in JavaScript using generation of N- as well as O-glycopeptide search datasets. Input Google Polymer, D3.js, and fishTones.js libraries. The tool was peptides can be automatically generated for a chosen protein by developed according to the W3C web components standards. providing its UniProt accession number, selecting one of the pro- This innovative technology allows for the reduction of code com- teolytic enzymes and number of missed cleavages. Alternatively, plexity through the isolation of related groups of HTML markup, peptides can be uploaded in the form of a comma-separated list CSS styles, and JavaScript code into smaller units that perform in a plain text file. Uploaded peptides may carry modifications, encapsulated functions within the context of a single web page. which should be included by placing the modification name These types of HTML modules are called custom elements, and or corresponding mass shift within braces after the modified can be easily incorporated and used within any webpage. The residue, for example, K{Acetyl}Q{Deamidated}H{79.966}ER. PepSweetener graphical interface contains a set of isolated cus- Additionally, selected fixed modifications can be added using tom elements that perform specific functions, for example, gen- checkboxes located at the bottom of the dataset generation form. erate search dataset, handle search query, generate visualization The list of peptides can be also filtered by occurrences of the N- chart, filter glycopeptides by composition, or perform proteolytic X-S/T sequon, where X is any amino acid except proline, or alter- cleavage. Reusable PepSweetener custom elements can be found natively by the presence of serine and threonine residues. Pep- at https://bitbucket.org/account/user/sib-pig/projects/PE. Inter- tide filtering with N-glycosylation sequon is dependent on how active visualization of search results is implemented using D3.js peptides were submitted. When peptides are generated using library. The D3.js makes use of SVG, HTML5, and CSS stan- proteolytic cleavages, sequons are mapped at the protein level dards for producing scalable and dynamic visualizations in web saving all sites of glycosylation where the sequon is incomplete browsers. Peptide mass calculations, peptide theoretical MS/MS after proteolytic digestion. When peptides are uploaded in a text spectrum visualization, and fixed peptide modifications are file, those with partial N-X-S/T sequons will be filtered out un- handled using the fishTones.js library.[35] Fishtones.js default less this filtering option is turned off. The input glycan repertoire modifications dictionary is loaded with public data from the can be chosen as the whole glycan composition dataset or its re- Unimod repository.[36] Additionally, peptides uploaded in the ad- striction to N-glycans, or it can be uploaded in a text file with a vanced mode can carry modifications not recognized by fish- comma-separated list of compositions. The list of hypothetical tones.js. These peptides should contain the monoisotopic mass compositions can be generated from the experimental masses of change value within squared brackets. Analogically, any mass released glycans using GlycoMod. The search dataset can also be change can be added to glycans uploaded in the advanced mode
Proteomics Clin. Appl. 2017, 1700069 1700069 (4 of 10) C 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.advancedsciencenews.com www.clinical.proteomics-journal.com
Table 1. List of glycan constituents recognized by PepSweetener. nations within a tolerance range from the precursor mass than in the case of peptides.[37] The inference of glycan structure us- Code Full name ing MS is more difficult than solving its composition. PepSweet-
Hex Hexose (e.g., galactose, mannose, glucose) ener is not using any information about glycan structure there- fore it cannot visualize theoretical fragmentation of the suspected HexNAc N-acetylhexoseamine glycan. NeuAc N-Acetylneuraminic acid (sialic acid) NeuGc N-Glycolylneuraminic acid dHex Deoxyhexose (e.g., fucose, rhamnose) 8. Validation with Experimental Data Pen Pentose (e.g., xylose) PepSweetener is a robust search engine that based on a partic- HexA Hexuronic acid ular precursor mass falling within a specified ppm error range Kdn Ketodeoxynononic acid suggests possible glycopeptides while displaying the results in Neu Neuraminic acid an easily understandable manner. Complex samples such as hu- NAc Acetate man serum are major source of clinical markers and biomark- S Sulphate ers. A substantial part of these markers are glycoproteins in- P Phosphate volved in disease progression and thus evaluation of micro- and macro-heterogeneity of those is of great interest and valuable in- formation for disease management.[3,4] The analysis of intact gly- copeptides in a complex sample still comes with analytical chal- to represent monomers not present in PepSweetener’s glycan lenges and the success of the experiments depends highly on constituent library (Table 1). Variable modifications are not sup- sample preparation and applied MS-based parameters and tools. ported, as these would greatly increase the size of the dataset. The The simultaneous variation of the collision energies in Q-TOF simple search dataset is stored in the SSDB key-value NoSQL instruments during glycopeptide fragmentation analysis yields database. The SSDB database is accessible through the webdis both peptide and glycan fragmentation.[13,38,39] For validation of web service, simple HTTP server, which executes commands the simple and advanced search modes of PepSweetener, we on SSDB and returns a response in JSON format. The ad- used tryptic (glyco)peptides of human serum and immunoglob- vanced search dataset is stored in web browser’s memory us- ulins (Igs) datasets generated by nanoRP-LC-ESI-Q-TOF tandem ing session storage. In both modes, glycopeptide datasets are MS.[13] clustered by molecular mass into bins of 0.1 Da to improve In Figure 3, the MS/MS spectra of an enriched tryptic search performance. The application source code is available at glycopeptide of human serum with the precursor mass [M + + https://bitbucket.org/sib-pig/pepsweetener. The tool is accessi- 3H]3 1099.4587 shows the neutral losses of the glycan moiety, ble in the “glycomics” section of the ExPASy server at the follow- oxonium ions, and as well as peptide fragmentation. The glycan ing URL: http://glycoproteome.expasy.org/pepsweetener/app/. mass can be easily deduced from subtracting the peptide mass [M + H]+ 1033.5436 from the precursor mass. Furthermore, GlycoMod provides glycan compositions that can be cor- 7. Data Interpretation rectly chosen according to neutral losses from the glycan moiety of the glycopeptide and oxonium ions. The N- The first step in the manual annotation of intact glycopeptide glycopeptide of Figure 3 carries an N-glycan with composition spectra is the identification of peptide yn-1 ion at the high-mass Hex5HexNAc5dHex1NeuAc1. This is also supported by the end of the spectrum. This information can be found by track- oxonium ions m/z 528.191+ and m/z 657.231+ corresponding to ing fragmentation patterns consisting of peptide and glycan-core the glycan fragments Hex2HexNAc1 and Hex1HexNAc1NeuAc1, fragments and peptide fragment ions if those are observable. In respectively. Even though the glycan composition is easily order to simplify the selection of theoretical glycopeptides using accessed, the peptide sequence needs further protein search the observed precursor mass of the peptide ion, glycan–peptides engines to be successfully identified. PepSweetener integrates combinations can be sorted by intact peptide mass. The mass of the theoretical human N-glycoproteome using UniProt thus the peptide is displayed next to the label with its sequence on the the precursor mass 1099.4587(+3) was queried with a 2 ppm left side of the chart. Peptide identification may be considered tolerance range containing at least 5 Hex, 5 HexNAc, 1 Fuc, and unambiguous only if it is supported by the identification of its 1 NeuAc (Figure 3). The results indicate that 14 N-glycopeptides fragment ions in the corresponding glycopeptide MS/MS spec- fall into the search parameters carrying the glycan composition tra. Theoretical peptide fragment ions can be checked through Hex5HexNAc5Fuc1NeuAc1. The most probable peptide sequence visualizing peptide y- and b-ion series displayed under the chart is then chosen taking into consideration the observed peptide on a mouse click on the corresponding glycoform tile. The iden- mass [M + H]+1033.5436 and the peptide fragmentation shown tification of the peptide mass simplifies glycopeptide assignment in the MS/MS spectra of Figure 3. The N-glycopeptide [M + to solving glycan composition. The remaining mass after subtrac- 3H]3+ 1099.4587 with the peptide sequence 249Gly-Arg257 corre- tion of the peptide mass from the precursor ion mass should ac- sponds to immunoglobulin E protein, accession number P01854; count for the mass of the attached glycan. With high accuracy MS, IgE_HUMAN and carries a core fucosylated monosialylated glycans can be compositionally characterized on the basis of their bisected N-glycan at Asn252. intact mass, since a low number of possible glycan monomer In-depth site-specific investigation of Igs glycosylation has masses is more likely to reduce the number of possible combi- already been well described by several authors.[33,40–44] These
Proteomics Clin. Appl. 2017, 1700069 1700069 (5 of 10) C 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.advancedsciencenews.com www.clinical.proteomics-journal.com
Figure 3. Query of the glycopeptide [M + 3H]3+ 1099.4587 using the simple search of PepSweetener and corresponding manually annotated MS/MS fragment spectra. The m/z 1099.4587(+3) was searched within a 2 ppm tolerance window using the glycan filter for at least 5 Hex, 5 HexNAc, 1 dHex, and 1 NeuAc. The glycan mass was calculated from the precursor mass and peptide mass [M + H]1+ 1033.5436 and searched in GlycoMod. The results were sorted out by ppm distance of the precursor mass to the theoretical N-glycopeptide. Thirty-three N-glycopeptides were matching the selected criteria, but only 14 N-glycopeptides carry the glycan composition Hex5HexNAc5Fuc1NeuAc1. The theoretical peptide fragmentation given by PepSweetener matches the peptide sequence 249Gly-Arg257 of human IgE protein accession number is P01854; IgE_HUMAN observed in the MS/MS spectra. This glycopeptide has been described as carrying a core fucosylated monosialylated bisected N-glycan.[40] The y- and b-ion series are highlighted in red and blue, respectively. glycoproteins have N-andO-glycosylation sites and thus repre- gested by PepSweetener. Further information of oxonium ions sent good test samples to validate PepSweetener advanced search (ion m/z 6571+) allows confirmation of the sialylated nature of features. To this end, tryptic (glyco)peptides derived from Igs the attached N-glycan to Asn367. All together this information (IgA1+2, sIgA [secretory component, joining chain, IgA1+2], supports the suggested N-glycopeptide of IgD 364Thr-Arg370 car- IgD, IgE, IgG1+2 and IgM) were subjected to nanoRP-LC-ESI- rying a monosialylated bisected N-glycan already reported in the Q-TOF MS/MS with optimized collision energies.[13] Human literature.[13,44] IgD and IgA whose UniProt accession numbers are, respectively, The advanced search mode of PepSweetener is also help- P01880 and P01876 were in silico digested with trypsin allow- ful for solving MS/MS spectra of intact O-glycopeptides. In ing two missed cleavages and carbamidomethylation was set as Figure 5, the glycopeptide mass [M + 5H]5+ 1178.5038 was fixed modification. For IgD P01880 only in silico N-glycopeptides queried within 5 ppm tolerance and containing at least one were selected by filtering peptides containing the N-glycosite Hex, HexNAc, and NeuAc. In fact, the assignment of this IgA sequon, whereas for IgA P01876 (glyco)peptides were filtered O-glycopeptide with sequence 89His-Arg126 and total glycan com- with at least one serine or threonine residue. The UniCarbKB position Hex4HexNAc4NeuAc1 is a challenge to manual and au- glycans were selected and theoretical (glyco)peptides were gen- tomated annotation made by ProteinScape 4.0 (Bruker) as de- erated for each glycoprotein. In Figure 4 the precursor mass scribed in.[13] The presence of several ions of the y and b series in 964.4075(3+) of IgD_HUMAN was searched with 3 ppm of er- the MS/MS spectra that match the theoretical peptide fragmenta- ror tolerance containing at least one Hex, HexNAc, and NeuAc. tion ions are important for the correct assignment of such a large Only one of the theoretical N-glycopeptides with the peptide peptide sequence (Figure 5). Due to the particular sequence of 364 370 sequence Thr-Arg carrying the glycan with composition this tryptic O-glycopeptide the y27 appears as a prominent triply Hex5HexNAc5NeuAc1 was undoubtedly supported by the gly- and doubly charged product ion including some glycan frag- copeptide fragmentation (Figure 4). The neutral losses of the ments. Additionally, the presence of oxonium ions m/z 204.081+, glycan moiety of the glycopeptide until the peptide mass [M + m/z 366.141+,andm/z 657.231+ corresponding to the glycan + H] 774.4486 with additional presence of both y- and b-ion se- fragments HexNAc, Hex1HexNAc1,andHex1HexNAc1NeuAc1, ries are crucial for the identification of the peptide sequence sug- respectively, confirms the only hit predicted by PepSweetener.
Proteomics Clin. Appl. 2017, 1700069 1700069 (6 of 10) C 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.advancedsciencenews.com www.clinical.proteomics-journal.com
Figure 4. Query of the glycopeptide mass [M + 3H]3+ 964.4075 of IgD P01880 using PepSweetener advanced search and corresponding manually annotated MS/MS fragment spectra. The protein accession number of human IgD is P01880; IgD_HUMAN was digested with trypsin in silico with two missed cleavages and carbamidomethylation used as fixed modification. Combinations of IgD theoretical peptides and glycans were generated. The search of the observed precursor mass 964.4075(+3) was performed using a 3 ppm window with at least one Hex, HexNAc, and NeuAc. The peptide sequence 364Thr-Arg370 was clearly supported by the MS/MS fragmentation spectra.[13,44] The y- and b-ion series are highlighted in red and blue, respectively.
Figure 5. Query of the glycopeptide [M + 5H]5+ 1178.5038 of IgA P01876 using the advanced search of PepSweetener and the corresponding manually annotated MS/MS spectra. (a) The protein accession number of human IgA is P01876; IgA_HUMAN was digested with trypsin in silico with two missed cleavages and carbamidomethylation used as fixed modification. Together with UniCarbKB glycans, the theoretical peptide-glycans dataset for IgA was generated. The search was performed using a 5 ppm tolerance window and at least one Hex, HexNAc, and NeuAc. As a result, only one match was 89 126 possible and correctly assigned for the sequence His-Arg .TheO-glycopeptide carries O-glycans of total composition Hex4HexNAc4NeuAc1.(b) The corresponding spectra in which y- and b-ion series are highlighted in red and blue, respectively.
Proteomics Clin. Appl. 2017, 1700069 1700069 (7 of 10) C 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.advancedsciencenews.com www.clinical.proteomics-journal.com
Figure 5. Continued.
Unfortunately, it is not possible to identify the O-glycosylation user-defined sets of peptides and glycan compositions. The sim- sites due to the high number of serine or threonine residues on ple mode is suited to the analysis of high resolution MS data of the glycopeptide. human tryptic N-glycopeptides. The advanced search mode has Finally, we note that any PepSweetener users who may need a higher potential in the analysis of samples that contain only a the quick calculation of glycopeptide masses can use a calculator single protein or are a mixture of known proteins. Numbers of pad developed to complement PepSweetener. SwissMassAbaccus hits in this mode are much smaller, which simplifies the selec- is accessible online as well as available for download on any plat- tion of the most probable solution and allows for a larger mass form and devices from http://glycoproteome.expasy.org/swiss- tolerance range. Additionally, peptides and glycans used to build mass-abacus/ custom datasets can contain modifications, or result from prote- olysis with enzymes other than trypsin. In the next release of Pep- Sweetener (late 2017), all N-glycan compositions will be linked to [45] 9. Conclusion potential structures that will be displayed in the SNFG notation along with contextual information when available (published). In this report, we introduced PepSweetener, a new tool for the Note that a professional helpdesk made the fame of the ExPASy computer-assisted annotation of glycopeptide spectra. The visual- server[46] and further improvement of PepSweetener can easily be ization of potential glycoform combinations actually streamlines achieved through feedback from users. the annotation of intact glycopeptide MS spectra. PepSweetener On principle, PepSweetener was not designed as an automated is web-based and undertakes the matching of experimental MS annotation tool, but rather as a tool compatible with different precursors of potential glycopeptides with a database of theoret- experimental setups and protocols, supporting novice or expert ical peptide and glycan combinations. The tool accepts a list of scientists in MS manual annotation of glycoproteomics experi- precursor ion m/z values and returns a heat map chart containing ments. To be considered a stand-alone glycopeptide annotation all matching glycoforms within the predefined tolerance range. tool, PepSweetener is missing some of the features available in This simple process reduces the amount of time needed for the software packages for automated annotation, for example, gly- manual annotation of glycopeptide spectra. Precursor matching copeptide spectra filtering directly from MS/MS data or statistical can be done in a simple mode, that is, against a dataset consist- validation. It is envisaged that PepSweetener will evolve as expe- ing of combinations of all human N-glycosites containing pep- rience in annotating spectra builds up and allows the determina- tides with the most comprehensive set of glycan compositions, tion of relevant parameters and values warranted for automation. or in advanced mode, that is, against a custom dataset created by Furthermore, existing software transformation into publically
Proteomics Clin. Appl. 2017, 1700069 1700069 (8 of 10) C 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.advancedsciencenews.com www.clinical.proteomics-journal.com accessible web services would also significantly help tool combi- [12] D. Kolarich, P. H. Jensen, F. Altmann, N. H. Packer, Nat. Protocols nation into complex workflows reflecting the reality of analytical 2012, 7, 1285. work. [13] H. Hinneburg, K. Stavenhagen, U. Schweiger-Hufnagel, S. Pengelley, In conclusion, PepSweetener is a useful, MS platform- W. Jabs, P. H. Seeberger, D. V. Silva, M. Wuhrer, D. Kolarich, J. Am. independent exploratory tool that significantly speeds up the an- Soc. Mass Spectrom. 2016, 27, 507. [14] M. Windwarder, T. Yelland, S. Djordjevic, F. Altmann, Glycoconj. J. notation of intact glycopeptide MS data. 2015, 33, 387. [15] H. Hu, K. Khatri, J. Zaia, Mass Spectrom Rev. 2017, 36, 475. [16] M. Wuhrer, M. I. Catalina, A. M. Deelder, C. H. Hokke, J. Chromatogr. Abbreviations B Anal. Technol. Biomed. Life Sci. 2007, 849, 115. [17] K. S. Lynn, C. C. Chen, T. M. Lih, C. W. Cheng, W. C. Su, C. H. Chang, Hex, hexose; HexNAc, N-acetylhexosamine; NeuAc, N-acetylneuraminic C. Y. Cheng, W. L. Hsu, Y. J. Chen, T. Y. Sung, Anal. Chem. 2015, 87, acid 2466. [18] N. S. Meitei, A. Apte, S. I. Snovida, J. C. Rogers, J. Saba, J. Proteomics 2015, 127, 211. [19] K. B. Chandler, P. Pompach, R. Goldman, N. Edwards, J. Proteome Acknowledgments Res. 2013, 12, 3652. [20] J. S. Strum, C. C. Nwosu, S. Hua, S. R. Kronewitter, R. R. Seipert, R. This work was supported by the European Union, Seventh Framework Pro- J. Bachelor, H. J. An, C. B. Lebrilla, Anal. Chem. 2013, 85, 5666. gramme, Gastric Glyco Explorer Initial Training Network (grant number 316929). ExPASy is maintained by the web team of the Swiss Institute of [21] S. W. Wu, S. Y. Liang, T. H. Pu, F. Y. Chang, K. H. Khoo, J. Proteomics Bioinformatics and hosted at the Vital-IT Competency Center. Authors de- 2013, 84,1. clare no conflict of interest regarding this manuscript. [22] M. Bern, Y. J. Kil, C. Becker, Curr. Protocols Bioinform. 2012, Chapter 13, Unit13 20. [23] L. He, L. Xin, B. Shan, G. A. Lajoie, M. Ma, J. Proteome Res. 2014, 13, 3881. Conflict of Interest [24] Z. Zhu, D. Hua, D. F. Clark, E. P. Go, H. Desaire, Anal. Chem. 2013, 85, 5023. The authors have declared no conflict of interest. [25] C. L. Woodin, D. Hua, M. Maxon, K. R. Rebecchi, E. P. Go, H. Desaire, Anal. Chem. 2012, 84, 4821. [26] S. Toghi Eshghi, P. Shah, W. Yang, X. Li, H. Zhang, Anal. Chem. 2015, Keywords 87, 5181. [27] B. Tissot, S. J. North, A. Ceroni, P. C. Pang, M. Panico, F. Rosati, A. computer-assisted glycopeptide assignment; data analysis; glycoinformat- Capone, S. M. Haslam, A. Dell, H. R. Morris, FEBS Lett. 2009, 583, ics; glycoproteomics; MS 1728. [28] Y. Wu, Y. Mechref, I. Klouckova, A. Mayampurath, M. V. Novotny, H. Received: May 24, 2017 Tang, Rapid Comm. Mass Spectrom. 2010, 24, 965. Revised: August 18, 2017 [29] O. Serang, J. W. Froehlich, J. Muntel, G. McDowell, H. Steen, R. S. Lee, J. A. Steen, Mol. Cell. Proteomics 2013, 12, 1735. [30] C. A. Cooper, E. Gasteiger, N. H. Packer, Proteomics 2001, 1, 340. [31] A. Ceroni, K. Maass, H. Geyer, R. Geyer, A. Dell, S. M. Haslam, J. Proteome Res. 2008, 7, 1650. [1] K. W. Moremen, M. Tiemeyer, A. V. Nairn, Nat. Rev. Mol. Cell Biol. [32] H. J. An, J. S. Tillinghast, D. L. Woodruff, D. M. Rocke, C. B. Lebrilla, 2012, 13, 448. J. Proteome Res. 2006, 5, 2800. [2] C. E. Parker, V. Mocanu, M. Mocanu, N. Dicheva, M. R. Warren, in [33] N. Deshpande, P. H. Jensen, N. H. Packer, D. Kolarich, J. Proteome Neuroproteomics (Ed: O. Alzate), CRC Press, Boca Raton, FL 2010. Res. 2010, 9, 1063. [3] S. Pan, R. Chen, R. Aebersold, T. A. Brentnall, Mol. Cell Proteomics [34] M. P. Campbell, R. Peterson, J. Mariethoz, E. Gasteiger, Y. Akune, K. 2011, 10, R110 003251. F. Aoki-Kinoshita, F. Lisacek, N. H. Packer, Nucleic Acids Res. 2014, [4] A. Almeida, D. Kolarich, Biochim. Biophys. Acta 2016, 1860, 1583. 42, D215. [5] M. Thaysen-Andersen, N. H. Packer, B. L. Schulz, Mol. Cell Proteomics [35] A. Masselot, Genentech Bioinformatics & Computational Biology De- 2016, 15, 1773. partment, https://github.com/Genentech/fishtones-js, 2014. [6] Z. Zhu, H. Desaire, Annu. Rev. Anal. Chem. (Palo Alto Calif.) 2015, 8, [36] D. M. Creasy, J. S. Cottrell, Proteomics 2004, 4, 1534. 463. [37] D. C. Dallas, W. F. Martin, S. Hua, J. B. German, Brief. Bioinform. 2013, [7] A. Varki, R. D. Cummings, J. D. Esko, H. H. Freeze, P. Stanley, C. R. 14, 361. Bertozzi, G. Hart, M. E. Etzler, Essentials of Glycobiology, 2nd ed., Cold [38] E. D. Dodds, Mass Spectrom. Rev. 2012, 31, 666. Spring Harbor Laboratory Press, New York 2009. [39] K. Vekey, O. Ozohanics, E. Toth, A. Jeko, A. Revesz, J. Krenyacz, L. [8] P. Hufnagel, A. Resemann, W. Jabs, K. Marx, U. Schweiger-Hufnagel, Drahos, Int. J. Mass Spectrom. 2013, 345, 71. in Discovering the Subtleties of Sugars. Proceedings of the 3rd Beilstein [40] R. Plomp, P. J. Hensbergen, Y. Rombouts, G. Zauner, I. Dragan, C. A. Glyco-Bioinformatics Symposium (Eds: G. Martin, C. K. E. Hicks), Pots- Koeleman, A. M. Deelder, M. Wuhrer, J. Proteome Res. 2014, 13, 536. dam, Germany 2015, 85 pp. [41] J. Huang, A. Guerrero, E. Parker, J. S. Strum, J. T. Smilowitz, J. B. [9] K. Stavenhagen, H. Hinneburg, M. Thaysen-Andersen, L. Hartmann, German, C. B. Lebrilla, J. Proteome Res. 2015, 14, 1335. D. Varon Silva, J. Fuchser, S. Kaspar, E. Rapp, P. H. Seeberger, D. [42] T. S. Mattu, R. J. Pleass, A. C. Willis, M. Kilian, M. R. Wormald, A. C. Kolarich, J. Mass Spectro. 2013, 48, 627. Lellouch, P. M. Rudd, J. M. Woof, R. A. Dwek, J. Biol. Chem. 1998, 273, [10] W. Zhang, H. Wang, L. Zhang, J. Yao, P. Yang, Talanta 2011, 85, 499. 2260. [11] K. D. Greis, B. K. Hayes, F. I. Comer, M. Kirk, S. Barnes, T. L. Lowary, [43] Y. Wada, A. Dell, S. M. Haslam, B. Tissot, K. Canis, P. Azadi, M. Back- G. W. Hart, Anal. Biochem. 1996, 234, 38. strom, C. E. Costello, G. C. Hansson, Y. Hiki, M. Ishihara, H. Ito, K.
Proteomics Clin. Appl. 2017, 1700069 1700069 (9 of 10) C 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.advancedsciencenews.com www.clinical.proteomics-journal.com
Kakehi, N. Karlsson, C. E. Hayes, K. Kato, N. Kawasaki, K. H. Khoo, Prestegard, R. L. Schnaar, H. H. Freeze, J. D. Marth, C. R. Bertozzi, M. K. Kobayashi, D. Kolarich, A. Kondo, C. Lebrilla, M. Nakano, H. Nari- E. Etzler, M. Frank, J. F. Vliegenthart, T. Lutteke, S. Perez, E. Bolton, matsu, J. Novak, M. V.Novotny, E. Ohno, N. H. Packer, E. Palaima, M. P. Rudd, J. Paulson, M. Kanehisa, P. Toukach, K. F. Aoki-Kinoshita, B. Renfrow, M. Tajiri, K. A. Thomsson, H. Yagi, S. Y. Yu, N. Taniguchi, A. Dell, H. Narimatsu, W. York, N. Taniguchi, S. Kornfeld, Glycobiol- Mol. Cell Proteomics 2010, 9, 719. ogy 2015, 25, 1323. [44] J. N. Arnold, C. M. Radcliffe, M. R. Wormald, L. Royle, D. J. Harvey, [46] P. Artimo, M. Jonnalagedda, K. Arnold, D. Baratin, G. Csardi, E. de M. Crispin, R. A. Dwek, R. B. Sim, P. M. Rudd, J. Immunol. 2004, 173, Castro, S. Duvaud, V. Flegel, A. Fortier, E. Gasteiger, A. Grosdidier, 6831. C. Hernandez, V. Ioannidis, D. Kuznetsov, R. Liechti, S. Moretti, K. [45] A. Varki, R. D. Cummings, M. Aebi, N. H. Packer, P. H. See- Mostaguir, N. Redaschi, G. Rossier, I. Xenarios, H. Stockinger, Nu- berger, J. D. Esko, P. Stanley, G. Hart, A. Darvill, T. Kinoshita, J. J. cleic Acids Res. 2012, 40, W597.
Proteomics Clin. Appl. 2017, 1700069 1700069 (10 of 10) C 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim 4.2. Concluding remarks 121
4.2 Concluding remarks
Exploratory tools help scientists perform data analysis in a faster and more accurate way. The three tools presented in this chapter, introduce a collection of new fea- tures which were missing in the glycomics analysis workflows. PepSweetner adds an intuitive web interface to manually curate glycopeptides which cannot be auto- matically identified by expert algorithms (Liu et al., 2017a; Bern, Kil, and Becker, 2012). Glynsight is the first tool which empowers users to display and compare gly- can profiles with a new dedicated interface. To conclude, Glydin’ proposes a novel way of displaying glycan epitopes in a network visualisation where connections are based on substructure relationships.
123
Chapter 5
Knowledge-based tools
5.1 Overview
After integrative and explorative tools, we present here the knowledge-based tool which, in our case, is represented by GlyConnect. This software application rep- resents the ultimate "swiss army knife" to explore relationships between glycans, proteins, taxons, tissues and diseases. GlyConnect stores curated data, and high throughput studies information from the literature. It has a focus on glycoproteins but, at the same time, we worked hard to take glycobiology out of its isolation and connect the information with external sources. Therefore, GlyConnect merges two different data integration techniques which have been presented in the introduc- tion: link integration and service-oriented architecture. Indeed, all pieces of in- formation stored in the database are cross-referenced with external bioinformatics resources like UniProt and NeXtProt for proteins, GeneCards for genes, NCBI tax- onomy browser for species, Uberon and Brenda for tissues and Disease Ontology for diseases. Beside link integration, Glyconnect combines a collection of in-house tools using the Service Oriented Architecture developed under the Glycomics@ExPASy initiative. Users can send GlyConnect query features to perform protein sequence alignments using GlycoSiteAlign. They can also extract glycoepitopes from a collec- tion of glycan structures with EpitopeXtractor. Finally, some proteins have a direct connection to LiteMol, an external protein structure viewer, to display the 3D struc- ture of the protein together with glycans. GlyConnect integrates in-house tools but, at the same time, its data is integrated into other software applications. This is the 124 Chapter 5. Knowledge-based tools case of PepSweetener which uses a composition list from GlyConnect and Glysight which takes advantage of the composition to structures service available with the GlyConnect API. GlycoSiteAlign uses GlyConnect curated data to perform protein sequence alignments (Figure 9). Data integration is part of every aspect of GlyCon- nect, so that all diagrams and custom visualisations are designed to enhance the hypothesis building process by combining different types of information selected by the user.
To conclude, GlyConnect has become a trusted source of external resources. At the moment, the list of glycan compositions, already used in PepSweetener, has become the UniMod (Creasy and Cottrell, 2004) reference for glycan masses that are possibly attached to proteins. We are working in the same way to cross-reference ProSight (Fellers et al., 2015).
We present here the manuscript that has been submitted to the Journal of Proteome Research. GlyConnect: glycoproteomics goes visual, interactive and analytical
Davide Alocci1,2, Julien Mariethoz1,2, Alessandra Gastaldello1,2, Elisabeth Gasteiger3, Niclas G Karlsson4, Daniel Kolarich5,6, Nicolle H Packer5,6,7, Frédérique Lisacek1,2,8
1. Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Rue Michel-Servet 1, CH-1211 Geneva, Switzerland 2. Computer Science Department, University of Geneva, Switzerland 3. Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland 4. Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, University of Gothenburg, Sweden 5. Institute for Glycomics, Griffith University, Southport, Queensland, Australia 6. ARC Centre for Nanoscale BioPhotonics, Macquarie University and Griffith University, Australia 7. Department of Molecular Sciences, Macquarie University, Sydney, NSW, Australia 8. Section of Biology, University of Geneva, Switzerland
Keywords: Data integration, database, software, glycoprotein, glycomics, glycoproteomics, glycan structure, mass spectrometry, data curation, data visualisation
Abstract
Knowledge of glycoproteins, their site-specific glycosylation patterns and the glycan structures that they present to their recognition partners in health and disease are gradually being built on using a range of experimental approaches. The data from these analyses are increasingly being standardised and presented in various sources, from supplemental tables in publications to localised servers in investigator laboratories. Bioinformatics tools are now needed to collect this data and enable the user to search, display and connect glycomics and glycoproteomics to other sources of related proteomics, genomics and interactomics information. We here introduce GlyConnect (https://glyconnect.expasy.org/), the central platform of the Glycomics@ExPASy portal for glycoinformatics. GlyConnect has been developed to gather, monitor, integrate and visualize data in a user-friendly way in order to facilitate the interpretation of collected glycoscience data. GlyConnect is designed to accommodate and integrate multiple data types as they are increasingly produced.
1 Introduction
The importance of identifying proteoforms is increasingly stressed in the proteomics community and the intensive work involved in assessing their variety and extent was recently discussed1. In that overview, the case of glycoforms was singled out and the contribution of glycoproteomics discussed. The heterogeneity of glycoproteins and the variety of techniques that are used for their study has raised obstacles to the automation of the analytical procedures. Progress in this area has also been limited by the lack of standardised, open access software for integrating glycomics with other -omics data. A current consensus regarding advances in glycosciences is the acknowledgement of the bottleneck created by limited bioinformatics support.
The range of databases and tools that can support functional studies of mammalian glycoproteins are extremely diverse and include numerous databases and software tools including glycan structure collections, mass spectrometry and chromatographic analysis tools, glycoenzyme classification, glycoprotein 3D structure as well as glycan and lectin array interaction databases (see overview of current resources in recent reviews2 3 4. These have all been developed since the landmark glycoinformatics efforts were begun with CarbBank5 and first connected to other resources in glycobiology by GLYCOSCIENCES.de6 and the Consortium
7 for Functional Glycomics (CFG) . A few years ago, we initiated the development of interactive and exploratory tools to populate the Glycomics@ExPASy portal8. As the next step, we introduce here GlyConnect, a central platform integrating glycoproteomic data and related software in a user-friendly resource. By design, Glyconnect will enable its use by all biologists investigating biomolecular functions. In particular, emphasis is put on usability and data visualisation.
Despite its importance, data visualisation remains a challenge in glycobiology. The adoption of a cartoon notation9 10 simplifying the representation of glycans was a major step in spreading interest in these molecules and easing their use in databases and software. Nonetheless, only a handful of visualisation tools have improved some aspects of glycan identification and structural representation. Glycoviewer11 is the first example of a data visualisation tool that allows glycoscientists to visualise, summarise and compare different glycomes. GlycomeAtlas12 provides an interactive interface for exploring CFG array data . In
2 parallel, three-dimensional modelling had also generated interactive images of glyco- molecules13 and glyconjugates14 as recently reviewed15. Recently, GlycoDomainViewer16 was introduced as a visual "integrative tool for glycoproteomics that enables global analysis of the interplay between protein sequences, glycosites, types of glycosylation, and local protein fold /domain and other PTM contexts". GlycoDomainViewer integrates experimental data as well as knowledge data sources presenting an extensive collection of information to explore the possible effect of glycosylation on a protein. To build custom visualisation, we have tried to mix different points of view. Glycobiologists can start from their knowledge about glycans and move towards, for example, proteins or disease, whereas proteomics experts focus on the glycosylation of a protein(s) of interest. GlyConnect presents a series of interactive diagrams that help the user understand relations between glycans, proteins, tissues, diseases and taxonomy. Search results are presented in a conceptual map, which shows the connections between glycans and proteins via glycan compositions. User-friendly visualisation allows non- expert scientists to explore the data and break the barrier between different disciplines. To bridge glycomics with proteomics, we have cross-referenced each protein with UniProt17 and neXtProt18, a knowledge platform for human proteins, providing the connection to proteomics. We have also associated each glycoprotein with GeneCards19, a searchable and integrative database that compiles information on all annotated and predicted human genes. Recently, we have also started a series of “Glycome pages” that summarise all the information available for the glycome specific for a tissue or body fluid. The first page has been focusing on the human plasma N-glycome.
The present article highlights the features of GlyConnect and through examples of data searches, describes the potential of the platform for revealing unexpected associations as well as formulating new questions regarding glycosylated proteins.
Material and Methods
Data
The molecular entities that are described in the database are shown in Figure 1.
3
Figure 1: Depiction of the cell membrane harbouring a glycoprotein. The cell membrane accommodates a transmembrane protein which is decorated by a glycan structure attached on the asparagine residue (N) part of the characteristic sequon NQT; N is the glyco(sylation) site. Additionally, a subregion of the glycan structure, called ligand or glyco-epitope or glycan determinant, interacts with a glycan-binding protein or lectin. Peptide and composition are notions related to the experimental techniques such as mass spectrometry used to identify intact glycoproteins. In glycoproteomics, proteins are digested into peptides to be analysed, and attached glycans are usually elucidated as compositions, a list of all the building blocks that compose a structure.
Database content
GlyConnect data content spans information on glycan structures and compositions, glycoproteins, glycosylation sites, taxonomy, tissue expression and diseases. Approximately 70% of the data in GlyConnect is manually curated. It includes the full content of the GlycoSuiteDB database20 licensed by SIB as well as curated data by Robyn Peterson (former Packer lab member). This information is labelled in the database as reviewed. It also contains data extracted from the supplementary material of several large-scale glycoproteomics studies recently published. Supplementary material from21 22 was downloaded. This data is
4 partially curated only to reduce redundancy and enhance the consistency of the database. For example, when several accession numbers are listed for a single protein, these are reviewed and merged whenever possible. Additional filters such as the minimum size of a peptide (at least seven amino acid-long) validating a glycosite, are also applied. This data has introduced a bias towards a majority of human glycoproteins in the current version of the database.
Cross Links and cross references
Information in the database is supported with a broad range of cross-references. Table I summarises them all, with corresponding URLs.
Table I: Cross-references in GlyConnect
Cross-reference Resource URL
Glycan GlyTouCan glytoucan.org
Protein UniProtKB uniprot.org
neXtProt nextprot.org
UniLectin unilectin.eu
Gene GeneCards genecards.org
3D Structures Protein Data Bank ebi.ac.uk/pdbe/ (via Litemol)
Reference PubMed ncbi.nlm.nih.gov/pubmed
DOI doi.org
Taxonomy NCBI Taxonomy ncbi.nlm.nih.gov/taxonomy
Disease Disease Ontology disease-ontology.org
Peptide dbSNP ncbi.nlm.nih.gov/projects/SNP/
5 Database model
The data scheme is inherited from the GlycoSuiteDB database. The database content was transferred from Proteome Systems Ltd to SIB Swiss Institute of Bioinformatics under a license agreement in 2009. This content was structured in a spreadsheet like manner with a single table. To manage this non-normalised and error-prone architecture, we applied the current usual database normalisation techniques23. This normalisation process helped to target and remove errors and duplications. The concept of normalisation consists mostly in splitting data tables into smaller tables to reduce data redundancy and improve data integrity. We applied data modelling techniques to guarantee the uniqueness of glycan structures, compositions, proteins, expression tissues, diseases and references supporting the information. The normalisation process was applied to preserve the relatedness of information established in publications and ease navigation between entities (glycan, protein, expression tissue, disease, etc.) by keeping contextual information. The redesigned database now complies with the Essential Tuple Normal Form (ETNF) and as such, prevents tuple redundancy as well as observes at least the four first normal forms of the relational model24. This now makes the refactored dataset robust, consistent and easier to extend. The latter was verified when adding the large-scale glycoproteomics data. This process, which entailed integrating new concepts like peptide and proteoform while fitting the existing data model, was rapidly implemented and applied.
Since GlyConnect is a glycoprotein-centric database, glycosylation remains the core concept. The glycosylation notion of the database is built using the association of a glycan structure with a given protein at a specific amino acid position. Moreover, the glycosylation is supported by one or more curated articles and the given article may relate the glycosylation to a disease. The conceptualisation and entity relationships reflect knowledge of the glycobiology that is modelled.
In addition to the relational data model, an RDF (Resource Description Framework) version that reflects the GlyConnect data, is under construction. This process relies on a Glycoconjugate ontology currently under development. Providing an RDF version of the GlyConnect data will support federated queries with UniProt17, NextProt18, GlyTouCan25 or any other triplestore of interest. As a result, the database is integrated using RDF and semantic web technology.
6 Encoding and display of glycans
Many formats have been created to encode and represent glycan structures26. In GlyConnect we have selected the current most popular in glycoinformatics. Glycan structures are stored in the GlycoCT format27. They are displayed by default with the SNFG notation9 promoted in the reference manual Essentials of Glycobiology28. This cartoon representation of glycans simplifies the otherwise fastidious atomic description. Structures can also be viewed in the IUPAC condensed29 and the Oxford30 formats.
Design and Implementation
GlyConnect is intended as a dashboard to monitor and facilitate the interpretation of glycomics and glycoproteomics data. Its design reflects this aim. Users can access GlyConnect at different levels using the Application Programming Interface (API) or through a web interface. GlyConnect data are stored using PostgreSQL 9.5, and they are part of the Glycomics@ExPASy initiative8. Additionally, GlyConnect has full access to an internal glycoepitope database and SugarBindDB31, a database of pathogen-glycan interactions. The API is built with the open source framework Play (version 2.5) that combines Java and Scala code and allows the separation between model, view and controller. In GlyConnect the model is represented by the mapping class built with Ebean, an object-relational mapping (ORM) which converts each table row from the database into a java object. Controllers link models and views and provide API calls.
As shown in Figure 2, the central element of GlyConnect is the database that contains both reviewed and unreviewed data (lower left frame). There are two main types of access: Search and Browse. As a result, the web interface of GlyConnect is a mixture of technologies. The Search Interface (upper left frame of Figure 2) available under the “Search” tab has been developed using web components and in particular the library Google Polymer (Version 2).
7
Figure 2: Glyconnect Overview. This figure summarises the Glyconnect project. The upper part focuses on the two interfaces available in GlyConnect: Search and Browse. The Search interface proposes an intuitive query builder for exploring glycans and proteins. Results are shown in a conceptual map familiarly called octopus for its shape. The Browse interface displays data in lists of items and proposes a global view of the data. The lower part of the figures, from left to right, highlights the data input and the cross-references. GlyConnect includes data reviewed by experts (manual annotation) and unreviewed data from high- throughput studies. Each piece of information has been cross-referenced with databases listed in the cross-references section.
Web components are reusable user interface widgets, which are built using a mixture of HTML5, CSS3 and Javascript. In GlyConnect we have used Google Polymer, which speeds up the development of custom web components fulfilling the standard established by the W3C. The Browse Interface (upper right frame of Figure 2) available under the “Browse” tab is built with Play as stated above. Pages displayed in that section are views generated from the database built with Scala and include Twitter Bootstrap as well as several other Javascript libraries to improve user-friendliness.
8 Search interface
The search interface is the main entry point to query data available in the GlyConnect database. As hinted in its name, GlyConnect outputs relationships between glycans and proteins or glycans and tissues or taxonomy. Then, users can further search either proteins or glycans. Starting with proteins is rather straightforward while querying with glycans offers more than one option. Since glycan structures are notoriously heterogeneous, the search interface allows the user to query by monosaccharide composition or by structural features such as fucosylated or sialylated structures or by structural motif.
Search by protein
A protein name or UniProt accession number can be selected from a dropdown list among the ones present in the database. Since each protein can be found in multiple species, a box displays all species available and lets the user choose the ones of interest. After selecting the species, clicking on the search button will perform the query.
Search by monosaccharide composition
Glycan compositions can be input manually or uploaded in a file. Alternatively, the “Monosaccharide selectors” tab provides a wizard to define compositions in the correct format. Clicking on the search button launches the query.
Search by linkage type
The database queries reflect the classical grouping of glycans based on the distinct types of attachment to proteins: N-Linked, O-Linked and free glycans (released from conjugates). The interface proposes three levels of selection of structural features organised in three successive tabs. In the first tab, the broad classification of N-linked (Oligomannose, Complex, etc) or O-linked (core 1,2,3, etc) is provided in the form of selectable items to be dropped in the query box. In the second tab, another series of more distinct properties is available for selection with the same principle of dropping items into the query box. A complete list of all recorded features is given in Table IIa,b.
9 Table IIa: Glycan structural properties used in GlyConnect Category Property type Property name
Compositional Fucosylation Fucosylated
Core-fucosylated
Non-fucosylated
Non-core-fucosylated
Compositional Galactosylation Galactosylated
Non-galactosylated
Compositional Sialylation Sialylated
Non-sialylated
Mono-sialylated
Di-sialylated
Over-two-sialylated
Compositional Xylosylation Xylosylated
Core-xylosylated
Non-xylosylated
Non-core-xylosylated
Structural Antenna Bi-antennary
Tri-antennary
Tetra-antennary
Over-tetra-antennary
Structural Bisection Bisecting
Non-bisecting
Structural Truncation Truncated
Non-truncated
10 Table IIb: Glycan cores in the database Glycan type Core name Free Complex Free Lactose Free No-core Free None Free Truncated N-Linked Complex N-Linked High-Mannose N-Linked Hybrid N-Linked No-core N-Linked Pauci-Mannose N-Linked Truncated N-Linked Undefined core
N-Linked/O-Linked No-core N-Linked/O-Linked Truncated If N-Linked
O-Linked Core 0
O-Linked Core 1 O-Linked Core 10
O-Linked Core 11 O-Linked Core 12 O-Linked Core 13 O-Linked Core 14
O-Linked Core 2 O-Linked Core 3 O-Linked Core 4 O-Linked Core 5 O-Linked Core 6 O-Linked Core 7
O-Linked Core 8 O-Linked Core 9
11 O-Linked No-core O-Linked Undefined core O-Linked/C-Linked No-core Undefined type No-core Undefined type Undefined core
In the interface, the colour code for each box/feature matches that of the SNFG notation (red in relation to fucose, yellow in relation to galactose, etc). The selection, if any, of features in one tab remains active when performing a selection in another tab. In this way, the user can combine criteria and search, for example, High Mannose structures (first tab), core- fucoslyated (second tab), then make a choice in the third tab from a list of structural motifs (limited to known determinants, such as Sialyl Lewis X, Blood group B, etc). In the third tab, an SNFG picture of each motif helps the user in selecting motifs of interest. Clicking on the search button launches the query.
Conceptual Map (aka octopus)
The results of each query performed in the search interface are displayed as a conceptual map, the shape of which evokes an octopus. We use this evocative image to describe the content. For the sake of clarity, we have used the simplest and most compact notation for compositions that is presented in Table III.
Table III: Notation of monosaccharides
Full name Short name Encoding Symbol
Hexose Hex H circle
N-Acetylhexosamine HexNAc N square
Fucose Fuc F triangle
N-Glycolyl-Neuraminic acid NeuGc G diamond (Sialic acid) (purple)
N-Acetyl-Neuraminic Acid NeuAc S diamond
12 (Sialic acid) (light blue)
Sulphate S s S
The centre of the map contains the list of monosaccharide compositions of the glycans. Information in the left and right parts of the map connected with “tentacles” to compositions can change according to the option selected in the toolbar above the map. Several possibilities are given to the user:
● Protein - Glycan ● Protein - Taxonomy ● Protein - Disease ● Disease - Glycan ● Tissue - Disease ● Tissue - Glycan Each possibility reflects the values that will populate the map. For example, if Protein - Taxonomy is selected, then protein names on the left side will be connected to corresponding taxa on the right side via monosaccharide compositions in the centre. Every time a new map option is selected, the octopus is updated: and the shape adjusted; only the central information on glycan compositions remains.
When the search criteria are too loose, the map is likely to be overpopulated and impractical. A message prompt is then displayed and invites the user to apply stricter criteria. This of course can be done by relaunching a search with additional criteria. We also included a quick fix with a series of filters located above the conceptual map. The taxonomy filter reduces the size of the octopus by restricting the results to taxa of interest. Two combo boxes compose the taxonomy filter, the first limits the search to taxonomy classes, e.g., Mammalia, whereas the second sets a specific species. The date filter restricts the results to data evidenced by articles published within a set time bracket.
The conceptual map represents the entry point to navigate query results. Mouse-hovering on a box, in the centre of the map, highlights the tentacles that have a connection with the selected composition. Those tentacles can be coloured in red by clicking on the box. A second click on the box, restore the grey colour deselecting the tentacles. By clicking on each dot of
13 the left and right sides, the user can obtain more details about a particular glycan, protein, taxonomy, disease or tissue. Protein and glycan pages share the same look of the search interface whereas all the other pages (Taxonomy, Disease, Tissue) redirect the user to the browse interface.
Since each extremity of a tentacle links to information that is either reviewed or unreviewed, different colours distinguish evidence checked by expert curators from evidence extracted from high-throughput studies. Light-blue has been assigned to reviewed contents while grey identifies unreviewed information. In the octopus, dots at the tip of tentacles and composition boxes in the central part change colours according to the status of the associated content.
Results obtained in GlyConnect can be forwarded to GlycoSiteAlign32 that aligns the amino acid sequences surrounding glycosites depending on the properties of the attached glycans. Clicking on “Align glycosites with GlycoSiteAlign” sends all the glycan information (compositions and structural features such as “fucoslyated”) on glycosites of selected proteins to GlycoSiteAlign. This will result in displaying the amino acid sequence alignment (by default, 20 positions on each side of the glycosylated residue).
Protein and Glycan pages
Protein and glycan pages have been engineered to: (1) display relevant information stored in the database and (2) whenever possible, integrate tools developed as part of the Glycomics@ExPASy initiative. The overall display is similar for both glycans and proteins:
(a) A star preceding the name of the entity at the top of a page qualifies the status of the information contained in that page: a star filled in yellow qualifies reviewed information whereas an empty star identifies unreviewed data. (b) Cross-references to external databases are listed at the top of a page (c) A page is composed of several sections that are delineated with blue bars (d) Sankey graphs are shown in a section labelled “Associations” to display the relationships between the glycoprotein or the glycan structure of interest and other entities of the database, namely, glycosites, tissues and diseases. These graphs are currently not hyperlinked to the database (see future work)
14 (e) PubMed references provided as supporting evidence are shown in a section labelled “References”; links to DOI are also included.
Specificity of glycan pages
In a glycan page, the different sections are arranged in tabs coloured in blue (resp. grey) when selected (resp. not selected). A specific tab labelled "Family" contains the result of a composition search across the database shown as a list of the possible arrangements of the same monosaccharide building blocks displayed as SNFG images. This list can in turn serve as input to EpitopeXtractor33. This tool will perform an extraction of all the known glyco- epitopes, based on a curated list of almost 600 glycan determinants33. This input can be modified by the user as items of the list can be removed acting on a red icon next to each glycan image.
Specificity of protein pages
In a protein page, the different sections are arranged vertically with headers boxed in blue. A specific section of the page contains site-specific information and it is labelled "Sites". Each site is presented in a collapsible container featuring the glycosylated amino acid and its relative position on the protein (e.g. "Asn-234") unless evidence is partial and the site is unknown and labelled as such. The header is red (resp. grey) if the evidence is reviewed (resp. unreviewed). Once collapsed, glycans that were identified on that site are listed as SNFG images. In each site-specific section, when compositions co-occur with defined structures, matching structures to a composition are grouped within a single frame. Furthermore, the defined released structures at unknown site(s) if any, are also matched to reported compositions for possible site assignment. They are singled out with a grey border in the interface. Lastly, when the information is extracted from mass spectrometry data, the corresponding list of identified glycopeptides is provided. Peptides are listed after the glycan images when the information is available.
In a protein page, all SNFG images of glycan structures that are listed are individually linked to their corresponding glycan page.
15 Browse Interface
The Browse interface is the entry point to explore data available in the GlyConnect database. Each molecular entity of Figure 1 is accessible in the top menu (grey bar). A colour code is preserved throughout to distinguish each of the nine data types: taxonomy [red], protein [dark yellow], tissue [green], (glycan) structure [light blue], (glycan) composition [purple], disease [pink], (published) reference [dark blue], (glycosylation) site [brown], (glycosylated) peptide [orange]. Each the coloured boxes (except the last two) of the homepage as shown in Figure 2, links directly to the list of corresponding entries. Neither “Sites” nor “Peptides” are displayed as independent entities.
The interface is built to facilitate navigation from one entity to the other thereby supporting easy data exploration. For example, starting with the list of structures available either from the homepage by clicking on the “Structures” button or selecting “Structures” in the top menu (grey bar) and selecting a particular one by clicking on it will open a page with the details of that structure (SNFG image, IUPAC code, GlycoCT code, GlyTouCan25 cross-reference). Statistics are updated in the coloured boxes showing the number and the names of proteins, tissues, species, etc, where this structure is found along with the references that support these relationships. Any of the pieces of information displayed is clickable and further exploration of protein/tissue/species/disease/site/peptide is made possible. With this framework, any entity is a potential starting point and contextual information can be collected through the navigation from one data type to the other.
Since the coverage of entities (proteins, glycans, etc) is ever-growing, filters are implemented to reduce the size of lists displayed on the screen. For example, proteins can be filtered by name or UniProt accession number and species by taxa. Note that this display contrasts with the Search interface, as lists of proteins can be handled simply by selecting appropriate terms in the filter area.
RESTful API
The Play web application of GlyConnect has two main components. It supports the web pages of the Browse Interface as well as other data related services using a RESTful API (application program interface) architecture. The purpose of the API is mainly to feed the
16 Search Interface and other frontends with json formatted responses. When a query is built from the search interface, the search parameters are sent to the API through a URL that in turn sends a json representation of the result. The search option parameters, the conceptual map, the glycan and protein pages and the sankey graphs are built with json data. The role of the API is to forward data from the database to the service processing the data in a standard format that is easy to parse. Some other services are accessible through the API and used in external applications. Since a glycan structure is usually searched in the database using the GlycoCT format, the API provides services to convert glycan encoding formats. This includes a textual converter from IUPAC of glycan to GlycoCT and an image generator that uses GlycanBuilder libraries34 and takes GlycoCT codes to return a cartoon representation (SNFG, Oxford, IUPAC 2D) in an image file (PNG, SVG BMP, SVG) that can be saved. Another API service compiles a list of glyco-epitopes. A composition search is also available and used by Unimod35 to redirect users to GlyConnect. Finally, the API provisions mappings of GlyConnect with UniProt, GeneCards19 and neXtProt to feed each respective data integration pipeline.
Glycome Pages
The search and browse interfaces of GlyConnect allow users to search a specific glycan/protein but it is impossible to visualise and compare multiple glycans/proteins at the same time. For this reason, we have initiated the generation of more integrated “Glycome” pages. Each of these pages is destined to focus on a specific glycoproteome by providing combined information on protein and glycan expression. The first glycome page is dedicated to the N-glycome of the human plasma. All the data has been extracted from Clerc et al.36 that reviewed human plasma protein N-glycosylation. In this article, the glycosylation of plasma proteins is put into perspective so that the contribution of individual proteins to a total plasma N-glycosylation profile is examined at the released glycan level. The result of this work is stored in a large table that shows glycoproteins in relation to glycan species in the form of monosaccharide compositions. We transformed this static table into interactive diagrams linked to our database. This was achieved by creating ad hoc visualisation, as follows:
1) The "Protein Abundance" pane shows the relative abundance of the 24 plasma glycoproteins listed by Clerc et al. 36 Each protein is associated with a coloured disk,
17 and the dimension of the disk is proportional to the relative abundance of the protein in the plasma. The disks are displayed in a spiral clockwise from the largest to the smallest. Hovering over a disk shows the protein name. Next to the spiral, glycoproteins are shown by names and grouped by main biological processes as per their Gene Ontology terms collected in the corresponding UniProt entries. Clicking on a name in any of these groups opens the related protein page in GlyConnect (search interface) whereas clicking on a disk triggers a change in the "Glycan Abundance per Protein" pane. 2) The “Single glycan abundance” pane shows the relative abundance of a single glycan composition expressed on each of the 24 glycoproteins listed in the previous pane. 3) The “Glycan abundance per protein” pane illustrates the relative abundance of each glycan composition found on a selected glycoprotein. The relative abundance is visualised using a bar that can change colour according to the threshold defined by the user. This interface is built on the same principle as that of Glynsight33 where the dependencies between compositions (for example H4N4F1 contains H4N4 and the two differ by one fucose) are included in the visualisation of the glycome profile. As stated before, to select a different protein simply involves clicking on a different disk in the "Protein Abundance" pane. Also, clicking on a bar in the “Glycan abundance per protein” pane will trigger a change in the “Single glycan abundance” and “Glycan structures” panes.
4) The “Glycan structures” pane shows all currently reported, possible glycan structures that match a specific monosaccharide composition retrieved from GlyConnect glycan structures. The “Glycan structures” pane changes its content according to the glycan composition selected in the “Glycan abundance per protein” pane.
Results
Integrated platform for glycobiology
GlyConnect includes data extracted from a broad range of published articles ranging from the characterisation of the global glycome of glycans in a biological system, a handful of glycans attached to an individual protein or the full screening of a cell glycoproteome. The
18 content of the database reflects the current bias found in the literature. At present, it is clearly leaning towards released glycan characterisation or glycoprotein site-specific characterisation of mammalian N-glycan compositions. This is a direct consequence of the recent improvements made for handling high throughput mass spectrometry techniques in glycomics (the glycans released from proteins) and glycoproteomics (the analysis of intact glycopeptides). The current statistics are visible on the homepage of the Browser section. In addition to the GlyConnect core data, the database also integrates cross-references (see Table I) and tools for users to process the database content and build hypotheses based on previously unseen associations. As shown in Figure 3, GlyConnect data is accessible through two distinct interfaces; while the Browse interface simply displays information stored in the database (unless filtered), the Search Interface relies on the API to extract only the information relevant to the search criteria.
Figure 3: Glyconnect dashboard. This figure describes in finer details the two interfaces of GlyConnect featuring in Figure 2 focusing on cross-references and in-house tools integration. The information stored in the database, can be explored using both the Browse and Search interfaces. The Browse interface is composed of eight types of pages: Taxonomy, Tissue,
19 Glycan, Protein, Site and Reference. All pages have the same look and feel but each one is cross-referenced with different sources (green arrow) according to the content (i.e. disease page is connected to disease ontology). The Search interface proposes an interactive query builder and a conceptual map, called octopus, to show the results. From the octopus, it is possible to run GlycoSiteAlign, which performs the alignment of amino acid sequences based on glycan features. Additionally, the search interface has dedicated protein and glycan pages and uses the browser pages for Taxonomy, Tissue and Disease. These specific glycan and protein pages are reachable only from the octopus and they have been integrated with external tools. Selected protein pages have a link to a PDB 3D- viewer, called LiteMol, which displays the glycoprotein together with sugars in SNFG 3D. Sugar pages are connected with EpitopeXtractor which extracts ligands from a collection of structures based on a list of almost 600 glycan epitopes. The list of extracted epitopes can be plotted on Glydin', a network visualisation of glycan epitopes. Some of these epitopes are cross-referenced to SugarBindDB, a database on host-pathogen interactions mediated by glycans. All GlyConnect data is supported by a PubMed reference.
Figure 3 illustrates the overall connectivity of the resource. Focusing on the Search interface (Bottom of Figure 3), users can explore search results by clicking on glycans and proteins presented in the octopus. Besides showing all the information stored in the database, glycan and protein pages are bound to respectively LiteMol37 and EpitopeXtractor. When available, LiteMol shows a 3D model of the glycoprotein together with the glycan moieties using the SNFG 3D notation14. In complement, EpitopeXtractor is integrated into each glycan page to present all structures with the same composition as the glycan described in the page, and then performs a glycan epitope extraction. This tool checks which known ligands are present in a selected set of structures by relying on a curated reference set of almost 600 epitopes33. Additionally, the results of EpitopeXtractor can be overlaid on the glycoepitope network proposed by Glydin', the glycan epitope network viewer33. Finally, as mentioned in the Material and Methods section, a query can be directly forwarded to GlycoSiteAlign32 to perform a protein sequence alignment based on selected glycan traits.
The glycan and protein pages in the Search and Browse interfaces share the same information in their respective pages, especially regarding cross-references. To begin with, each page contains the PubMed IDs and article summary details including the DOI when the
20 publication information is available as supporting evidence for the glycan assignment. Furthermore, protein pages link to protein (UniProt, NeXtProt for human) and gene databases (GeneCards for human) whereas glycan pages are cross-linked to GlyTouCan, the glycan structure repository. Through GlyTouCan, cross-talk between GlyConnect and the MS/MS spectral database, UniCarb-DB38 , will be ready for release later in 2018. Additionally, the Browser interface has a collection of other pages for taxonomy, tissue and disease (Top of figure 3). Each of these page types shows specific cross-references to expert resources (NCBI taxonomy browser, Brenda and Uberon or standard tissue references and disease ontology).
Although the Search and Browse interfaces show the same information, they are specifically designed to answer different questions. We can illustrate the specificity of the Search interface with a query that cannot be expressed in the Browse interface. For example, assume our aim is to find all proteins that carry O-linked glycan structures containing the sialyl-GalNAc epitope which actually also is the motif of the sialyl-Tn antigen in Homo sapiens, i.e., NeuAc(a1-6)-GalNAc. In the Search interface clicking on the O-linked button restricts the search only to O-linked glycans. Then, the sialyl-GalNAc epitope is selected in the “Determinants” tab and the query is run. The resulting octopus shows connections between 50 reported compositions matching 83 O-linked glycan structures containing the sialyl-GalNAc epitope and 96 proteins known to carry structures including this motif. Clicking on a composition highlights in red the relationship(s) between corresponding glycan structures and glycoproteins as reported in the database. Some relations are unique, and some are multiple as shown in the two views of Figure 4.
21
Figure 4: Octopus result map. The figure shows the octopus result map generated by querying all O-linked glycans which contain Sialyl-Tn antigen in Homo sapiens. (A) The composition N1S1 is the exact composition of the Sialyl-Tn antigen and the octopus shows connections highlighted in red. On the structure side (right) only the Sialyl-Tn antigen appears and on the protein side (left) 13 proteins are connected. (B) The composition H1N1S1 is the exact composition of core 1 and core 8 O-glycans which differ by one galactose from the Sialyl- Tn antigen motif/ sialyl-GalNAc and the octopus shows corresponding connections highlighted in red. On the structure side, only core 1 and core 8 appear and on the protein side (left) 18 proteins are connected, six of which are common with those shown for the Sialyl-Tn antigen motif/ sialyl-GalNAc.
Due to the large number of hits, the octopus shows a crowded situation that might be hard to explore. When this happens, a message prompt invites the user to reduce the size of results, using the taxonomy filter above the octopus. In this example, selecting “Mammalia” and “Homo sapiens” restricts the map to 31 reported compositions matching 44 O-linked glycan structures and 49 proteins carrying those glycans. Composition N1S1 (one N-Acetylhexosamine, one N-acetylneuraminic acid) describes the specific sialyl-Tn antigen structure, and it is connected to only one glycan structure in the database. In Figure 4A, on the right side of the octopus, GlyConnect shows this composition in relation with 13 proteins. The “Disease - Glycan” version of the octopus accessible in the toolbar above the map shows
22 that the sialyl-GalNAc epitope (alternatively the sialyl-Tn antigen) appears in connection with several types of cancers like breast cancer, adenocarcinoma and leukaemia, suggesting an association with cancer tissues and/or body fluids. All currently reported compositions that contain the selected sialyl-GalNAc epitope, are listed in the centre list of the octopus. Selection of H1N1S1, for example, allows to further investigate the specificity of the connections. Figure 4B shows the results of this selection in the octopus. H1N1S1 matches two structures in the database. The two red links reveal that these matches contain the searched motif (sialyl-GalNAc epitope) with the addition of a galactose. These are two anomers, the core 1 and core 8 structures. Obviously, Figure 4A differs from Figure 4B in which 18 red connectors go through H1N1S1 to proteins. Only six protein links are shared between Figure 4A and B. Those seven (13-6) proteins specifically associated with the sialyl-GalNAc epitope can be further considered by clicking on the protein name. This example can be seen as the simulation of adding a galactose to an initial O-linked structure on the attachment to a glycoprotein.
Navigation and visualisation to popularise glycobiology
The Search interface as illustrated in the previous paragraphs is convenient to show the various views representing the molecular entities of Figure 1 as well as their characteristic properties in a given context. The query and the results are presented so as to simplify usage for non-specialists in glycobiology yet obtaining useful information. In complement to Search, the Browse interface also supports users in getting a more global picture of the data. Each type of entity (as shown in Figure 1) can be listed and items easily filtered by names and each item is linked its individual page whether a protein, a glycan structure or a composition or a glycosite. Furthermore, each type of contextual information such as taxon, tissue or disease can also be listed and filtered. In these cases, an individual page (e.g. tissue page) will show all glycans known to be expressed in that tissue according to collected and cited published references.
Irrespective of the starting point, GlyConnect browsing will navigate across the data for as long as the user clicks on related information. Starting from a composition page to the matching glycan structures and the related tissues from which these glycans have been described. If available, information on any links to a disease is provided in a dedicated page
23 that will list all glycans known to be expressed in this disease. This list can then, where known, be related to the glycoproteins expressed. The application supports endless clicking from one page or one list to the other. Exploring glycosite and peptide data is currently more simply handled in the Browse interface where each peptide evidence for a site can be explored. In particular, despite the unreviewed status of large scale glycoproteomics datasets, a few dozen protein references were merged in order to limit redundancy and peptides were grouped when they obviously belong to the same protein. A telling example is B4DPN0, an unreviewed UniProt entry matching Beta-2-glycoprotein 1 (APOH, P02749). B4DPN0 with tryptic peptide YTTFEYPNTINFSCNTGFYLNGADSAK was listed in the identification data with three sialylated compositions attached (H5N4S1, H5N4S2, H5N4F2S1). It appeared that the corresponding site resulted from an S->N mutation that creates a new glycosite at Asn-107 in the original P02749 sequence: 97YTTFEYPNTISFSCNTGFYLNGADSAK123. This mutation is reported in P02749 and corresponds to variant dbSNP:rs1801692. Merging this information into the GlyConnect record of P02749 shows a summary of three defined sites matching the UniProt entry annotation confirmed at Asn-162 (5 structures, 1 composition), Asn-183 (1 composition) and Asn-253 (2 structures and 19 compositions) with the additional Asn-107 as a post-mutation event unreported so far. Note that one potential glycosite listed at Asn-193 in P02749 does not appear in GlyConnect. It may be due to the large size of the tryptic peptide in which this glycosylation site is embedded (>3200 Da) that subsequently defied detection within the mass spectrometric experiment. Alternatively of course, it may not be glycosylated. This example illustrates the potential for exploring site annotation in GlyConnect, which is not complete but may reveal previously unknown associations.
Glycosite and peptide information is particularly useful for other (glyco)proteomics studies of different biologies/disease when data originates from glycoproteomics studies. This will be integrated in the next release of neXtProt (autumn 2018). All tools are now ready to feed this data to the proteomics tab of a neXtProt entry. Likewise, GeneCards now includes GlyConnect data and on-going collaboration with the developers of that resource will gradually determine the relevance of integrating more information. Efforts to cross-reference to other databases will continue to include glyco-information in a variety of bioinformatics resources and popularise the glycobiology viewpoint.
24 A reference for MS-based glycoproteomics
GlyConnect is designed to support glycoproteomics along with the tool collection of Glycomics@ExPASy. To begin with, it is already a trusted source of Unimod35 with regards to listing masses of glycans that are possibly attached to proteins. The complete composition file of GlyConnect was shared with Unimod such that most of Unimod glycan mass entries now link to the corresponding GlyConnect composition page. From there, the user can navigate the related data as explained in the paragraphs above. In the same spirit, work is ongoing for cross-referencing ProSight39 and GlyConnect to support glycopeptide analysis in top down proteomics.
The major impact of integrating large scale glycoproteomics studies results in GlyConnect is the potential to substantiate UniProt annotation. Less than 4000 human proteins are annotated with strong evidence of N-glycosylation. Though the consensus on the actual number of glycoproteins is not yet reached, the estimate of glycoforms1 is expected to exceed the magnitude of 104. Despite the uneven level of reviewing, many N-glycosylation sites are predicted on the basis of the occurrence of the well-known sequon N-X-S/T (X¹P). This sequence-based prediction is annotated in UniProt entries as the result of “sequence analysis” and is associated the ECO:0000255 evidence tag as set by the Evidence Ontology (http://www.evidenceontology.org/).
In order to assess the potential for refining site annotation in UniProt through the accumulation of high throughput glycoproteomics data, we collected the set of human UniProt entries (release 2018_07) that contain only putative N-glycosites tagged ECO:0000255. In this evaluation, entries with at least a PubMed reference associated to an N-glycosite (ECO:0000269) were excluded. This resulted in a set of 3270 UniProt proteins out of which 289 were present in GlyConnect. We select here a few representative examples of the usefulness of GlyConnect in refining questions or raising potentially new questions from putting into perspective experimental data.
We first highlight the results regarding mammaglobulin-A (Q13296) and mammaglobulin-B (O75556) reflecting experimental data extracted from the unreviewed reference21 that characterised intact glycopeptides in breast tumours. Corresponding UniProt entries are poorly informative regarding the function of these 57 % identical proteins. Q13296 indicates
25 expression in breast cancer but not O75556. Sequence prediction spots two N-glycosites in mammaglobulin-A (Q13296) at Asn-53 and Asn-68. Asn-68 is conserved in mammaglobulin-B (O75556) while at position 53 a N->D mutation disables glycosylation. As it turns out Asn-68 has only been detected glycosylated in mammaglobulin-B (O75556) with two compositions matching oligomannose structures (H8N2, H9N2). However, the unique Asn-53 site in mammaglobulin-A (Q13296) is associated with 26 possible compositions, 11 of which are sialylated and 13 are fucoslyated (eight are in common, i.e., both sialylated and fucoslyated). The identified tryptic glycopeptide was 20 amino acids long. Two compositions (H6N5F3S1 and H8N7S1) did not show a structure match in the database, while for another two compositions, single matched structures were found. Interestingly, of these single matches, the first one (H4N4F3) points to a structure containing two terminal Lewis x glycoepitopes and the second (H6N6S3) has a bisecting GlcNAc. Although these compositions could also describe not yet reported structures that are thus not recorded in the database, both these characteristics are interesting enough to justify further investigations. The 22 remaining compositions all have multiple potential matches in the database though each can be associated with mutually exclusive features, such as oligomannose structures (7), triantennary (4), tetrantennary (2) and another three potentially with a bisecting GlcNAc (H3N3, H3N3F1, H3N4). Note that mammaglobulin-A and B are alternatively spliced but splicing does not affect the glycosylation sites. The distinguishing glycosylation pattern of mammaglobulin-A that is known to be involved in breast cancer definitely provides substance for studying the pathological role of these protein glycoforms.
Another remarkable pair of similar proteins with empty comments regarding their function is Prominin-1 (O43490) and Prominin-2 (Q8N271). While the above described mammaglobulins are 93 and 95 amino acid long, the length of prominins is 865 and 834, respectively. Consequently, significantly more sequon motifs can be expected. Surprisingly, Q8N271 reports only Asn-270 as a possible site (sequence-based prediction), while O43490 shows the prediction of eight N-sites (Asn-220, 274, 395, 414, 548, 580, 729, 730). As it turns out, only Asn-270/Asn-274 is conserved in the alignment of these two proteins (see supp. Figure 1). The glycoproteomics results of our source21 show selective N-glycosylation of two expected sites in prominin-1(Asn-220, 274) and two unexpected sites (Asn-707, 725) in prominin-2. The conserved site is glycosylated in prominin-1 (Asn-274) but not in prominin-2
26 (Asn-270). Nevertheless, it is also possible that the tryptic glycopeptide containing Asn-270 was not detected due to its size (4849.899 Da without N-glycan). Nonetheless, these data indicate a tendency for N-terminal glycosylation in prominin-1 while the C-terminal part of prominin-2 is N-glycosylated. Only one non-sialylated and di-fucosylated composition is reported at Asn-220 on prominin-1, whereas twelve out of fifteen structures were reported to be sialylated at Asn-274. The two N-sites of prominin-2 are associated with less but very clear-cut compositions; Asn-707 contains three highly fucosylated structures (at least 3 fucoses) and Asn-725 only two sialylated compositions that are part of the list associated with Asn-274. Again, these details could become useful in the design of a functional study of prominins.
A last instructive example is Apolipoprotein F (Q13790) that is predicted to bear a single N-glycosylated Asn-118 in the corresponding UniProt entry. Glycoproteomics studies confirm the occurrence of two compositions at Asn-118, one of which (N4H6F2) is highly specific in our database since it only points at structures carrying the Blood group A glycoepitope (of course, this may only reflect a bias in our data). The second composition (H5N4S2) is more common and matches multiple structural entries in the database. Interestingly, a second site not listed in UniProt Q13790 is suggested by experimental data at Asn-139 where the same disialylated composition (H5N4S2) is described.
Stored data of many other proteins, especially membrane and transmembrane, offer similar potential leads to hypothesize on the specific properties of N-glycosylation and its potential regulatory role.
Seed assumptions on site assignment of glycans
The accumulation of site-specific compositional data increasingly supports potential site assignment of previously identified structures that were characterised after they were released from the glycoproteins. For example, over the years the glycans on several abundant human proteins such as fibronectin or the epidermal growth factor receptor have been characterised in successive and independent MS-based experiments. The corresponding data populated the database with recurring and new glycan structures and/or compositions. Recent compositional inclusions from intact glycopeptide glycoproteomic experiments are all site-specific. As explained in the Material and Methods section, structures are mapped to
27 compositions at the protein level so that defined structures listed in the “unknown site” category can be matched with site-specific compositions obtained from glycoproteomic data. Fibronectin at site Asn-1007 is reported to bear two defined released structures and nine compositions. The two structures match two of the separately reported compositions that also match two defined released structures which had been reported on fibronectin without known sites (grey border). A third defined structure reported without a known site(s) matches another composition (H5N4F1) on Asn-1007. No other structure listed in the “unknown site” section was found to match any of the six remaining compositions listed at this site. A sharper contrast regarding fucosylated structures is seen between Asn-528 and Asn-1236. Released structures with two fucoses have been reported without any site assignment and match compositions only seen on Asn-528. On Asn-1236 however, no fucosylated structure released from fibronectin is matched to any of the 18 listed compositions (8 fucosylated/10 non- fucoslyated). This trend would need to be confirmed but the likelihood of selective site glycosylation is indicated.
Following the same line of reasoning, data stored for the epidermal growth factor receptor (EGFR) reveals a definite preference for oligomannose glycans at Asn-352 and Asn-361 while Asn-413 appears to carry fucose-rich complex structures. Note that 42 of the 50 (92%) reported structures at unknown sites on EGFR are fucosylated.
Overall, only 21 human N-glycoproteins of the database are associated with both site- specific compositions and defined released structures on unknown sites. An average of 62% of defined released structures reported on unknown sites could be assigned to a specific site by matching a composition (details are provided in Supp.Table I).
Discussion
Glycosylation is a major posttranslational modification (PTM) and its characterisation is essential in deciphering the combinatorial patterns of PTMs that convey information through ‘PTM crosstalk’ on the protein. This is considered as one of the next most important challenge in proteomics40. Prior to solving PTM crosstalk, it is necessary to qualify and quantify site- specific modifications, in particular the highly heterogeneous and variable glycosylation.
28 GlyConnect is proposed as a solution to conquer this challenge and fill the existing gap for querying integrated data in glycobiology. As mentioned in the Introduction, the existing range of databases and tools that can support functional studies in glycobiology is limited, occurs on individual servers, is often not maintained and are not integrated. Previously, some level of tool integration was achieved on a more chemically-oriented basis such as GLYCOSCIENCES.de6 or an array focussed basis in the CFG databases7. GlyConnect is designed to shift integration towards overall biological questions that involve glycoproteins. It needs to be acknowledged that this lack has also been due to that of high-throughput methodologies in glycomics and glycoproteomics. These fields have just started catching up to other -omics and generate reliable data that can now provide the data backbone of integrated tools as presented here with GlyConnect.
Mass spectrometry techniques associated to powerful bioinformatics tools have generated a significant portion of proteomic biological knowledge over the past two decades. Given the rapid development happening in the glycoproteomics space, these methodologies will also provide in-depth data on glycoproteins and their site-specific glycosylation patterns in health and disease. Tools such as GlyConnect enable the user to search, display and connect these complex data to other databases and provide interested scientists with an opportunity to find information that otherwise would be difficult to grasp. Nevertheless, still a lot of effort is necessary to develop workflows that enable structuring and standardising the data derived from the incoming flow of glycosylation-based data. Programs such as the MIRAGE initiative that define the standards for reporting of glyco-related data41 are one important step towards ensuring a minimum data and information standard.
Results presented above show that unearthed information in the numerous Excel spreadsheets submitted as supplementary material can already feed the annotation of glycoproteins. Our immediate target is therefore to build upon these types of resources and reinforce the early identified trends described in the Results section that emphasised that much more glyco-related information can be gained from existing knowledge if it is accessible. At the time of writing, straightforward literature search on PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) reveals the existence of several datasets available for integration to feed the GlyConnect pipeline. We are aware that trends in the currently obtained data (human N-glycans) induces bias that is reflecting the major interest in
29 glycosylation in human diseases. This is an important and vital research, but alternative dedicated curation programmes that cover also other branches of life such as bacteria, other animals, plants, viruses and parasites need also to be investigated. The here presented tools, however, are designed to provide the backbone for such projects once increasing amounts of reliable data are available.
GlyConnect currently supports multiple types of comparison and associations. As shown in Figure 4, the octopus reveals the relations between glycoproteins and their individual glycan compositions and related structures. Moreover, the several versions of the octopus combine different concepts such as tissue and glycan, disease and glycan, etc, to highlight glycan- dependent connections that otherwise are impossible to gather from current high- throughput data outputs. GlyConnect supports scientists in making connections between compositions, glycoproteins and their changes when acquired by global glycoproteomics methods. If combined with parallel global glycomics studies that shine light on the specific structures present in a sample, educated structure predictions can be made from the compositional data. Although not yet fully exploited, Sankey graphs also bring another view on associations between entities depicted in Figure 1. Work is in progress to refine and enhance this variety of views. A future step of data integration will soon include UniLectin, a knowledgebase of carbohydrate-binding proteins that we co-develop. This will allow to display the potential set of lectins that can bind queried glyco-epitopes with the help of EpitopeXtractor and cross-links to the Glydin’ epitope network. Despite the limitation of compositional data that will not always be specific for a peculiar glyco-epitope without orthogonal data, this does not limit the ability of GlyConnect to associate different data sources into an informative and visual output. Suggestions made throughout the results of the present paper have been made with caution and not stated as facts but as leads to refine questions or guide reasoning on the data.
Conclusion
GlyConnect is an evolving platform. It is the central engine of the Glycomics@ExPASy collection. Our near future plans include the definition of filters based on protein functional properties. We also plan to implement the evidence ontology (http://www.evidenceontology.org/) to qualify the reliability of information in the database.
30 Substantial work is needed to cross-reference the Carbohydrate Active Enzyme database CAZy42 wisely and bring out glycan biosynthetic enzyme information in GlyConnect. Lastly, we are engaged in community-based efforts to promote the usage and automate the analysis of glycomics and glycoproteomics data. At this stage, we are moving towards data sharing via repositories38 as well as testing the performance of the currently available glycoproteomics MS software with benchmark datasets HUPO Human Glycoproteomics Initiative (HGI: https://www.hupo.org/Glycoproteomics-(B/D-GPP)). GlyConnect is destined to become the curated and integrated component in this larger glycoinformatics picture.
References
(1) Aebersold, R.; Agar, J. N.; Amster, I. J.; Baker, M. S.; Bertozzi, C. R.; Boja, E. S.; Costello, C. E.; Cravatt, B. F.; Fenselau, C.; Garcia, B. A.; et al. How Many Human Proteoforms Are There? Nature Chemical Biology 2018, 14 (3), 206–214. (2) Lisacek, F.; Mariethoz, J.; Alocci, D.; Rudd, P. M.; Abrahams, J. L.; Campbell, M. P.; Packer, N. H.; Ståhle, J.; Widmalm, G.; Mullen, E.; et al. Databases and Associated Tools for Glycomics and Glycoproteomics. In High-Throughput Glycomics and Glycoproteomics; Lauc, G., Wuhrer, M., Eds.; Springer New York: New York, NY, 2017; Vol. 1503, pp 235–264. (3) Egorova, K. S.; Toukach, P. V. Glycoinformatics: Bridging Isolated Islands in the Sea of Data. Angewandte Chemie International Edition 2018. (4) Campbell, M. P.; Aoki-Kinoshita, K. F.; Lisacek, F.; York, W. S.; Packer, N. H. Glycoinformatics. In Essentials of Glycobiology; Varki, A., Cummings, R. D., Esko, J. D., Stanley, P., Hart, G. W., Aebi, M., Darvill, A. G., Kinoshita, T., Packer, N. H., Prestegard, J. H., et al., Eds.; Cold Spring Harbor Laboratory Press: Cold Spring Harbor (NY), 2015. (5) Doubet, S.; Bock, K.; Smith, D.; Darvill, A.; Albersheim, P. The Complex Carbohydrate Structure Database. Trends Biochem. Sci. 1989, 14 (12), 475–477. (6) Lutteke, T. GLYCOSCIENCES.de: An Internet Portal to Support Glycomics and Glycobiology Research. Glycobiology 2006, 16 (5), 71R-81R. (7) Raman, R.; Venkataraman, M.; Ramakrishnan, S.; Lang, W.; Raguram, S.; Sasisekharan, R. Advancing Glycomics: Implementation Strategies at the Consortium for Functional Glycomics. Glycobiology 2006, 16 (5), 82R-90R. (8) Mariethoz, J.; alocci, D.; Gastaldello, A.; horlacher, O.; Gasteiger, E.; Rojas-Macias, M.; Karlsson, N. G.; Packer, N.; Lisacek, F. Glycomics@ExPASy: Bridging the Gap. Molecular & Cellular Proteomics 2018, mcp.RA118.000799. (9) Varki, A.; Cummings, R. D.; Aebi, M.; Packer, N. H.; Seeberger, P. H.; Esko, J. D.; Stanley, P.; Hart, G.; Darvill, A.; Kinoshita, T.; et al. Symbol Nomenclature for Graphical Representations of Glycans. Glycobiology 2015, 25 (12), 1323–1324. (10) Varki, A.; Cummings, R. D.; Esko, J. D.; Freeze, H. H.; Stanley, P.; Marth, J. D.; Bertozzi, C. R.; Hart, G. W.; Etzler, M. E. Symbol Nomenclature for Glycan Representation. PROTEOMICS 2009, 9 (24), 5398–5399.
31 (11) Joshi, H. J.; von der Lieth, C.-W.; Packer, N. H.; Wilkins, M. R. GlycoViewer: A Tool for Visual Summary and Comparative Analysis of the Glycome. Nucleic Acids Research 2010, 38 (suppl_2), W667–W670. (12) Konishi, Y.; Aoki-Kinoshita, K. F. The GlycomeAtlas Tool for Visualizing and Querying Glycome Data. Bioinformatics 2012, 28 (21), 2849–2850. (13) Pérez, S.; Sarkar, A.; Rivet, A.; Breton, C.; Imberty, A. Glyco3D: A Portal for Structural Glycosciences. In Glycoinformatics; Lütteke, T., Frank, M., Eds.; Springer New York: New York, NY, 2015; Vol. 1273, pp 241–258. (14) Sehnal, D.; Grant, O. C. Rapidly Display Glycan Symbols in 3D Structures: 3D-SNFG in LiteMol. Journal of Proteome Research 2018. (15) Woods, R. J. Predicting the Structures of Glycans, Glycoproteins, and Their Complexes. Chemical Reviews 2018, 118 (17), 8005–8024. (16) Joshi, H. J.; Jørgensen, A.; Schjoldager, K. T.; Halim, A.; Dworkin, L. A.; Steentoft, C.; Wandall, H. H.; Clausen, H.; Vakhrushev, S. Y. GlycoDomainViewer: A Bioinformatics Tool for Contextual Exploration of Glycoproteomes. Glycobiology 2018, 28 (3), 131– 136. (17) UniProt Consortium, T. UniProt: The Universal Protein Knowledgebase. Nucleic Acids Research 2018, 46 (5), 2699–2699. (18) Gaudet, P.; Argoud-Puy, G.; Cusin, I.; Duek, P.; Evalet, O.; Gateau, A.; Gleizes, A.; Pereira, M.; Zahn-Zabal, M.; Zwahlen, C.; et al. NeXtProt: Organizing Protein Knowledge in the Context of Human Proteome Projects. Journal of Proteome Research 2013, 12 (1), 293– 298. (19) Safran, M.; Dalah, I.; Alexander, J.; Rosen, N.; Iny Stein, T.; Shmoish, M.; Nativ, N.; Bahir, I.; Doniger, T.; Krug, H.; et al. GeneCards Version 3: The Human Gene Integrator. Database 2010, 2010 (0), baq020–baq020. (20) Cooper, C. A.; Joshi, H. J.; Harrison, M. J.; Wilkins, M. R.; Packer, N. H. GlycoSuiteDB: A Curated Relational Database of Glycoprotein Glycan Structures and Their Biological Sources. 2003 Update. Nucleic Acids Res. 2003, 31 (1), 511–513. (21) Hu, Y.; Shah, P.; Clark, D. J.; Ao, M.; Zhang, H. Reanalysis of Global Proteomic and Phosphoproteomic Data Identified a Large Number of Glycopeptides. Analytical Chemistry 2018, 90 (13), 8065–8071. (22) Bollineni, R. C.; Koehler, C. J.; Gislefoss, R. E.; Anonsen, J. H.; Thiede, B. Large-Scale Intact Glycopeptide Identification by Mascot Database Search. Scientific Reports 2018, 8 (1). (23) Darwen, H.; Date, C. J.; Fagin, R. A Normal Form for Preventing Redundant Tuples in Relational Databases. In Proceedings of the 15th International Conference on Database Theory - ICDT ’12; ACM Press: Berlin, Germany, 2012; p 114. (24) Kent, W. A Simple Guide to Five Normal Forms in Relational Database Theory. Communications of the ACM 1983, 26 (2), 120–125. (25) Tiemeyer, M.; Aoki, K.; Paulson, J.; Cummings, R. D.; York, W. S.; Karlsson, N. G.; Lisacek, F.; Packer, N. H.; Campbell, M. P.; Aoki, N. P.; et al. GlyTouCan: An Accessible Glycan Structure Repository. Glycobiology 2017, 1–5. (26) Campbell, M. P.; Ranzinger, R.; Lütteke, T.; Mariethoz, J.; Hayes, C. A.; Zhang, J.; Akune, Y.; Aoki-Kinoshita, K. F.; Damerell, D.; Carta, G.; et al. Toolboxes for a Standardised and Systematic Study of Glycans. BMC Bioinformatics 2014, 15 (Suppl 1), S9. (27) Herget, S.; Ranzinger, R.; Maass, K.; Lieth, C.-W. v. d. GlycoCT—a Unifying Sequence Format for Carbohydrates. Carbohydrate Research 2008, 343 (12), 2162–2171.
32 (28) Essentials of Glycobiology, 2nd ed.; Varki, A., Ed.; Cold Spring Harbor Laboratory Press: Cold Spring Harbor, N.Y, 2009. (29) Sharon, N. Nomenclature of Glycoproteins, Glycopeptides and Peptidoglycans (Recommendations 1985). Pure and Applied Chemistry 1988, 60 (9), 1389–1394. (30) Harvey, D. J.; Merry, A. H.; Royle, L.; Campbell, M. P.; Dwek, R. A.; Rudd, P. M. Proposal for a Standard System for Drawing Structural Diagrams of N- and O-Linked Carbohydrates and Related Compounds. Proteomics 2009, 9 (15), 3796–3801. (31) Mariethoz, J.; Khatib, K.; Alocci, D.; Campbell, M. P.; Karlsson, N. G.; Packer, N. H.; Mullen, E. H.; Lisacek, F. SugarBindDB, a Resource of Glycan-Mediated Host–Pathogen Interactions. Nucleic Acids Research 2016, 44 (D1), D1243–D1250. (32) Gastaldello, A.; Alocci, D.; Baeriswyl, J.-L.; Mariethoz, J.; Lisacek, F. GlycoSiteAlign: Glycosite Alignment Based on Glycan Structure. Journal of Proteome Research 2016, 15 (10), 3916–3928. (33) Alocci, D.; Ghraichy, M.; Barletta, E.; Gastaldello, A.; Mariethoz, J.; Lisacek, F. Understanding the Glycome: An Interactive View of Glycosylation from Glycocompositions to Glycoepitopes. Glycobiology 2018, 28 (6), 349–362. (34) Ceroni, A.; Dell, A.; Haslam, S. M. The GlycanBuilder: A Fast, Intuitive and Flexible Software Tool for Building and Displaying Glycan Structures. Source Code for Biology and Medicine 2007, 2 (1), 3. (35) Creasy, D. M.; Cottrell, J. S. Unimod: Protein Modifications for Mass Spectrometry. PROTEOMICS 2004, 4 (6), 1534–1536. (36) Clerc, F.; Reiding, K. R.; Jansen, B. C.; Kammeijer, G. S. M.; Bondt, A.; Wuhrer, M. Human Plasma Protein N-Glycosylation. Glycoconjugate Journal 2016, 33 (3), 309–343. (37) Sehnal, D.; Deshpande, M.; Vařeková, R. S.; Mir, S.; Berka, K.; Midlik, A.; Pravda, L.; Velankar, S.; Koča, J. LiteMol Suite: Interactive Web-Based Visualization of Large-Scale Macromolecular Structure Data. Nature Methods 2017, 14 (12), 1121–1122. (38) Rojas-Macias, M. A.; Mariethoz, J.; Andersson, P.; Jin, C.; Venkatakrishnan, V.; Aoki, N. P.; Shinmachi, D.; Ashwood, C.; Madunic, K.; Zhang, T.; et al. E-Workflow for Recording of Glycomic Mass Spectrometric Data in Compliance with Reporting Guidelines. 2018. (39) Fellers, R. T.; Greer, J. B.; Early, B. P.; Yu, X.; LeDuc, R. D.; Kelleher, N. L.; Thomas, P. M. ProSight Lite: Graphical Software to Analyze Top-down Mass Spectrometry Data. PROTEOMICS 2015, 15 (7), 1235–1238. (40) Venne, A. S.; Kollipara, L.; Zahedi, R. P. The next Level of Complexity: Crosstalk of Posttranslational Modifications. PROTEOMICS 2014, 14 (4–5), 513–524. (41) Kolarich, D.; Rapp, E.; Struwe, W. B.; Haslam, S. M.; Zaia, J.; McBride, R.; Agravat, S.; Campbell, M. P.; Kato, M.; Ranzinger, R.; et al. The Minimum Information Required for a Glycomics Experiment (MIRAGE) Project: Improving the Standards for Reporting Mass-Spectrometry-Based Glycoanalytic Data. Molecular & Cellular Proteomics 2013, 12 (4), 991–995. (42) Lombard, V.; Golaconda Ramulu, H.; Drula, E.; Coutinho, P. M.; Henrissat, B. The Carbohydrate-Active Enzymes Database (CAZy) in 2013. Nucleic Acids Res. 2014, 42 (Database issue), D490-495.
33 Author Contributions
Conceptualization, D.A., J.M. and F.L.; methodology, D.A., J.M. and F.L.; software, D.A., A.G., and J.M.; validation, N.G.K, D.K., N.H.P., E.G. and F.L.; writing—original draft preparation, D.A. and F.L.; writing—review and editing, all.; supervision, N.H.P and F.L.; funding acquisition, N.H.K and F.L
Funding
This work is supported by the Swiss Federal Government through the State Secretariat for Education, Research and Innovation (SERI). The ExPASy portal is maintained by the web team of the Swiss Institute of Bioinformatics and hosted at the Vital-IT Competency Center. Part of the tools described above were developed with the support of the European Union FP7 Innovative Training Network [ITN 316929] and the Swiss National Science Foundation [SNSF 31003A_141215]
Acknowledgements
We thank the neXProt and GeneCards teams for constructive interactions.
34
Supplemental Figure 1
O43490|PROM1_HUMAN ------MALVLGSLL------LLGLCGNSFSGGQPSSTDAPKAWNYELPATNYETQDS 46 Q8N271|PROM2_HUMAN MKHTLALLAPLLGLGLGLALSQLAAGATDCKFLGPAEHLTFTPAA------RARWLAPRV 54 :* :** * * . .* * * :* * :.: :
O43490|PROM1_HUMAN HKAGPIGILFELVHIFLYVVQPRDFPEDTLRKFLQKAYESKIDYDKPETVILGLKIVYYE 106 Q8N271|PROM2_HUMAN RAPGLLDSLYGTVRRFLSVVQLNPFPSELVKALLNELASVK------VNEVVRYE 103 : * :. *: *: ** *** . **.: :: :*:: . * ::* **
O43490|PROM1_HUMAN AGIILCCVLGLLFIILMPLVGYFFCMCRCCNKCGGEMHQRQKENGPFLRKCFAISLLVIC 166 Q8N271|PROM2_HUMAN AGYVVCAVIAGLYLLLVPTAGLCFCCCRCHRRCGGRVKTEHKALA-CERAALMVFLLLTT 162 ** ::*.*:. *:::*:* .* ** *** .:***.:: .:* . * .: : **:
O43490|PROM1_HUMAN IIISIGIFYGFVANHQVRTRIKRSRKLADSNFKDLRTLLNETPEQIKYILAQYNTTKDKA 226 Q8N271|PROM2_HUMAN LLLLIGVVCAFVTNQRTHEQMGPSIEAMPETLLSLWGLVSDVPQELQAVAQQFSLPQEQV 222 ::: **:. .**:*::.: :: * : ..: .* *:.:.*:::: : *:. :::.
O43490|PROM1_HUMAN FTDLNSINSVLGGGILDRLRPNIIPVLDEIKSMATAIKETKEALENMNSTLKSLHQQSTQ 286
Q8N271|PROM2_HUMAN SEELDGVGVSIGSAIHTQLRSSVYPLLAAVGSLGQVLQVSVHHLQTLNATVVELQAGQQD 282 243SSVYPL…LEPAIR288 :4849.621 Da :*:.:. :*..* :** .: *:* : *:. .:: : . *:.:*:*: .*: . :
O43490|PROM1_HUMAN LSSSLTSVKTSLRSSLNDPLCLVHPSSETCNSIRLSLSQLNSNPELRQLPPVDAELDNVN 346 Q8N271|PROM2_HUMAN LEPAIREHRDRLLELLQEARCQ-----GDCAGALSWARTLELGADFSQVPSVDHVLHQLK 337 *. :: . : * . *:: * * . *: . :: *:* ** *.:::
O43490|PROM1_HUMAN NVLRTDLDGLVQQGYQSLNDIPDRVQRQTTTVVAGIKRVLNSIGSDIDNVTQRLPIQDIL 406 Q8N271|PROM2_HUMAN GVPEANFSSMVQEENSTFNALPALAAMQTSSVVQELKKAVAQQPEGVRTLAEGFPGLEAA 397 338GVPEAN…SVVQELK374 :3924.899 Da .* .:::..:**: .::* :* . **::** :*:.: . ..: .::: :* :
O43490|PROM1_HUMAN SAFSVYVNNTESYIHRNLPTLEEYDSYWWLGGLVICSLLTLIVIFYYLGLLCGVCGYDRH 466 Q8N271|PROM2_HUMAN SRWAQALQEVEESSRPYLQEVQRYETYRWIVGCVLCSVVLFVVLCNLLGLNLGIWGLSAR 457 * :: :::.*. : * ::.*::* *: * *:**:: ::*: *** *: * . :
O43490|PROM1_HUMAN ATPTTRGCVSNTGGVFLMVGVGLSFLFCWILMIIVVLTFVFGANVEKLICEPYTSKELFR 526 Q8N271|PROM2_HUMAN DDPSHPEAKGEAGARFLMAGVGLSFLFAAPLILLVFATFLVGGNVQTLVCQSWENGELFE 517 *: . .::*. ***.********. *:::*. **:.*.**:.*:*: : . ***.
568GTYGTL…ELESLK602 :3869.894 Da O43490|PROM1_HUMAN VLDTPYLLNEDWEYYLSGKLFNKSKMKLTFEQVYSDCKKNRGTYGTLHLQNSFNISEHLN 586 Q8N271|PROM2_HUMAN FADTPGNLPPSMNLSQ----LLGLRKNISIHQAYQQCKEGAALWTVLQLNDSYDLEEHLD 573 . *** * . : : : ::::.*.*.:**:. . : .*:*::*:::.***:
O43490|PROM1_HUMAN INEHTGSISSELESLKVNLN-IFLLGAAGRKNLQDFAACGIDRMNYDSYLAQTGKSPAGV 645 Q8N271|PROM2_HUMAN INQYTNKLRQELQSLKVDTQSLDLLSSAARRDLEALQSSGLQRIHYPDFLVQIQRPVVKT 633 **::*..: .**:****: : : **.:*.*::*: : :.*::*::* .:*.* : . .
O43490|PROM1_HUMAN NLLSFAYDLEAKANSLPPGNLRNSLKRDAQTIKTIHQQRVLPIEQSLSTLYQSVKILQRT 705 Q8N271|PROM2_HUMAN SMEQLAQELQGLAQAQDNSVLGQRLQEEAQGLRNLHQEKVVPQQSLVAKLNLSVRALESS 693 .: .:* :*:. *:: . * : *:.:** ::.:**::*:* :. ::.* **: *: :
O43490|PROM1_HUMAN GNGLLERVTRILASLDFAQNFITNNTSSVIIEETKKYGRTIIGYFEHYLQWIEFSISEKV 765 Q8N271|PROM2_HUMAN APNLQLETSDVLANVTYLKGELPAWAARILRNVSECFLAREMGYFSQYVAWVREEVTQRI 753 . .* ..: :**.: : :. : :: :: : :: : :***.:*: *:. .:::::
O43490|PROM1_HUMAN ASCKPVATALDTAVDVFLCSYIIDPLNLFWFGIGKATVFLLPALIFAVKLAKYYRRMDSE 825 Q8N271|PROM2_HUMAN ATCQPLSGALDNS-RVILCDMMADPWNAFWFCLAWCTFFLIPSIIFAVKTSKYFRPIRKR 812 *:*:*:: ***.: *:**. : ** * *** :. .*.**:*::***** :**:* : ..
O43490|PROM1_HUMAN DVYDDVETIPMKNMENGNNGYHKDHVYGIHNPVMTSPSQH 865 Q8N271|PROM2_HUMAN LSSTSSEETQL------FHIPRVTSLKL------834 . * : :* :* .::
Supplemental Table I
162 Chapter 5. Knowledge-based tools
5.2 Concluding remarks
Although this is only a start, GlyConnect is the first integrated dashboard for study- ing glycans and glycoproteins. The whole software is shaped to enhance the connec- tion between glycomics and other ’omics’. At the same time, the integration of mul- tiple curated data sources enhances the data analysis process giving users a range of different hypotheses to explore. 163
Chapter 6
Discussion
Since each tool has been already discussed in its respective chapter, in this section, we focus on what has been achieved globally and what are the main issues that we have encountered in our work. We also include a technical discussion on the three technologies that have been used to build the ecosystem in figure 1.15: service- oriented architecture (SOA), resource description framework (RDF) and web com- ponents. The technical discussion provides an overview of our experience showing the advantages and disadvantages of each technology.
6.1 Achievements
6.1.1 Reassemble the glycoprotein puzzle
The challenges of studying glycosylation in a living system have led glycoscien- tists to adopt a reductionist approach. The separation into sub-problems is mainly based on technical and technological considerations. Glycoproteins are approached by protein scientists by removing glycans leaving only the first monosaccharide to study glycosylation sites and glycopeptides. Glycans can be removed enzymatically or via chemical processes. In this context, the fundamental analytical techniques are high-performance liquid chromatography (HPLC), ultra-performance liquid chro- matography (UPLC), capillary electrophoresis (CE), and mass spectrometry (MS).
In contrast, chemists leave proteins out and only consider the detached glycans to elucidate their structures and chemical properties. In this case, MS is considered the 164 Chapter 6. Discussion
FIGURE 6.1: This picture shows the spread of information across dif- ferent resources from a biological perspective. The green rectangle contains all proteins related information together with glycosites and peptide data, and it represents neXtProt and UniProt. The orange rectangle combines glycan data with the information contained in the green rectangle, and it describes GlyConnect data. The blue rect- angle holds lectins and ligand information and is connected to Sug- arBindDB and Glydin. Two red arrows explain the possibility of con- necting the orange and the blue rectangle using EpitopeXtractor (or- ange to blue) and GlyS3 (blue to orange). On the contrary, the orange and the green rectangle are connected since they contain overlapping data. primary technique when small amounts of glycans are available, otherwise NMR is also a very popular technology. Additionally, HPLC or UPLC are used as a separa- tion method before a sample is introduced into the mass spectrometer. In fact, the technology to analyse glycans and proteins as a complex has been limited to indi- vidual cases usually reported in the PDB for several decades. Recent development in mass spectrometry has provided a more systematic means of studying glycocon- jugates but the resolution is such that only glycan composition is revealed. In this context, these higher-throughput methods identify glycoproteins and glycosites but, the elucidation of glycan structures usually requires further experiments. 6.1. Achievements 165
As the first achievement of this thesis, we have used bioinformatics to put together the pieces of the puzzle created by the different experimental techniques used inde- pendently. In particular we could bridge information about glycosites, glycoproteins and glycans. This has been the main work behind GlyConnect and is identified by the orange polygon in figure 6.1. GlyConnect integrates curated data mainly taken from GlycoSuiteDB (Cooper et al., 2003) and unreviewed data from high-throughput glycoproteomics experiments. Where it is possible, proteins have been cross-linked with their respective entries in UniProt and NextProt (green rectangle figure 6.1) whereas glycan structures are associated with GlyTouCan identifiers. When a gly- can is not present in GlyTouCan, GlyConnect directly sends a request to the registry to get a new identifier.
Contrary to many bioinformatics databases, which focus on data integration with limited attention to visualisation, GlyConnect proposes a series of interactive dia- grams which help the user understand relations between glycan, protein, tissue, dis- ease and taxonomy. Besides the usual search by protein, GlyConnect offers a visual query interface which allows retrieving a class of sugars based on structural prop- erties (i.e. the number of antennas, bisection, core fucose, etc.) and the presence of specific glycan determinants (i.e. blood group antigens, Lewis antigens, etc.). Search results are present in a conceptual map, familiarly called octopus for its shape, which shows the connections among glycans and proteins passing through glycan compo- sitions. Additionally, users can select the information to be shown on the tentacles choosing between disease, tissue, taxonomy. User-friendly visualisation allows sci- entists to easily explore the data and break the barrier between different disciplines. To build custom visualisation, we have tried to mix different points of view. Gly- cobiologists can start from their knowledge about glycans and move towards, for example, proteins or disease, whereas proteomics experts can analyse the glycome starting with their protein of interest. Even medical doctors, who are an essential channel towards clinical medicine, can easily navigate and understand which gly- can structures are involved in the progression of a disease.
Although the GlyConnect audience is growing every day, our work has just begun. We are working on adding peptide data and extend the coverage of information 166 Chapter 6. Discussion through including more high-throughput datasets. To bridge glycomics with pro- teomics, we have cross-referenced each protein with UniProt and neXtProt (Lane et al., 2012), a knowledge platform for human proteins, providing a connection to pro- teomics. Instead, to link glycomics with genomics, we have associated each protein with GeneCards (Stelzer et al., 2016), a searchable and integrative database which provides information on all annotated and predicted human genes. Recently, we also have started a series of “Glycome pages” which summarise all the information available for a specific tissue glycome. The first has been focusing on human plasma N-glycome, but we are working to extend and populate this section.
The limited availability of glycomics data represents the primary issue for GlyCon- nect. Researchers usually publish only results, leaving all experimental data in the supplemental material. Since journals do not regulate this section, authors are using custom formats and spread information across multiple spreadsheets, which sub- stantially extends the curation phase. In the worst cases, the information is incom- plete leading to a waste of data that could be avoided. For these reasons, we encour- age glycoscientists to publish all the information needed to understand their data and to use well-defined standards. In any case, to effectively tackle this problem, a broad discussion which includes the prominent groups in glycomics together with journal editors is needed. On our side, we are creating a pipeline to curate high- throughput data automatically and, in the future, we could allow authors to register their dataset on GlyConnect directly.
6.1.2 Bridging glycoproteins and carbohydrate-binding proteins
A contribution to bridging glycoproteins with glycan-binding proteins (GBP) like lectins has been the second outcome of this thesis. As it happens for glycans and gly- coproteins, GBPs are also studied in isolation. The most common technique for iden- tifying the target of a carbohydrate-binding protein is glycan microarray. It uses a plate, covered with synthetic glycans, which are screened with a fluorescent-labelled GBP. The fluorescence level evaluates the strength of the binding between glycans and a GBP. From now on, we will alternatively use glycan epitope, glycan deter- minant and ligand to identify the subregion of a glycan recognised by a GBP. The 6.1. Achievements 167 orange polygon and the blue rectangle, in figure 6.1, shows the actual separation of information. Within Glycomics@ExPASy, data and knowledge about glycoproteins and glycan structures are stored in GlyConnect while GBPs and associated glycan epitopes are deposited into SugarBindDB and Glydin’. As in GlyConnect, GBPs in SugarBindDB are cross-referenced to UniProt. However, there was no possibility to connect GlyConnect and SugarBindDB via glycan epitopes directly. To solve this issue we have developed two distinct tools called GlyS3 an EpitopeXtractor.
FIGURE 6.2: The figure shows how to build new hypotheses on gly- can mediated protein-protein interactions using GlyConnect, Sug- arBindDB, UniProt and GlyS3. To start, we take a GBP in Sug- arBindDB, and we select all ligands recognised. Then we use those in GlyS3 to extract all sugars in GlyConnect which contain these specific motifs. In the end, we leverage GlyConnect knowledge to extract all proteins carrying those structures and their UniProt identifiers. Since GBPs in SugarBindDB are cross-referenced in UniProt as well, we can link entries in UniProt which communicates via glycan interactions.
GlyS3 is capable of identifying all glycan structures that have a specific subregion, performing the substructure search task. Therefore, using this tool, we can select 168 Chapter 6. Discussion a collection of structures in Glyconnect starting from a GBP and its associated lig- and(s) in SugarBindDB. In a broader picture, if glycan structures in Glyconnect are associated with glycoproteins, we can make new hypotheses on glycan-mediated protein-protein interactions (Figure 6.2). In the future, we can use GlyS3 to cross- reference GlyConnect structures with other resources containing GBP information. This applies to UniLectin, a curated database about 3D structures of carbohydrate- binding proteins called lectins. UniLectin is a joint project between the Proteome Informatics Group of Swiss Institute of Bioinformatics and the Centre de recherches sur les macromolécules végétales (CERMAV).
GlyS3 connects one GBP to a list of glycoproteins, creating one-to-many relation- ships. However, as in an entity-relationship model, there are two types of associa- tions: one-to-one and many-to-one. The first case is quite rare since glycoproteins can have different sites, and each of them can expose various sugars. For example, a monoclonal antibody has one conserved N-linked glycosylation in the fragment crystallizable (Fc) at position N297, but this specific site can accommodate 22 dif- ferent glycans (Yu et al., 2016). The variety of structures associated with a site and the possibility to have multiple sites on a single protein induce a many-to-one rela- tionship between GBPs and glycoprotein. Indeed, a collection of different glycans carries several ligands which can be recognised by different GBPs.
To explore the many-to-one associations, we have developed EpitopeXtractor, which does the opposite work of GlyS3. This software application verifies which ligands are contained in one or more glycan structures. EpitopeXtractor is searching against almost 600 glycan determinants which are extracted from four different sources pub- licly available:
• SugarBindDB database (Mariethoz et al., 2014)
• GlycoEpitope database (Okuda, Nakao, and Kawasaki, 2017)
• Glyco3D database (Pérez et al., 2016)
• A list extracted from “The repertoire of glycan determinants in the human gly- come.” (Cummings, 2009) 6.1. Achievements 169
To know what are the GBPs that could interact with a glycoprotein, we send its as- sociated glycan structures from GlyConnect to EpitopeXtractor. Glycan epitopes se- lected by the extraction can be highlighted on Glydin’, the epitope network viewer. Glydin’ shows the structural relationships among different glycan epitopes and con- nects each epitope with its source of information. In this way, users can get a list of GBPs that interacts with the starting glycoprotein (Figure 6.3).
FIGURE 6.3: The figure shows how to build new hypotheses on gly- can mediated protein-protein interactions using GlyConnect, Sug- arBindDB, UniProt and EpitopeXtractor and Glydin’. To start, we pick up a protein in Glyconnect and send all the glycans associated to EpitopeXtractor. This software application extracts all the epitopes contains in this pool of structures and displays them on Glydin’. If ligands are presents in SugarBindDB, we can go back to the glycan binding proteins. Since both SugarBindDB and GlyConnect are cross- referenced with UniProt, we can discover new connections between UniProt entries which are mediated by glycans.
GlyS3 and EpitopeXtractor solve the problem of connecting glycoprotein informa- tion with GBPs simulating an in-silico glycan array. However, there are still several issues to be tackled. The curation of glycan epitopes represents the first limitation of our work. To create a collection of ligands, which is used both for EpitopeXtractor 170 Chapter 6. Discussion and Glydin’, we have used four different sources, cited above, which have dupli- cates and use different names. The lack of central repository and standard names for glycan determinants is limiting the possibility of building better tools. Addition- ally, these software applications connect glycoproteins and GBPs using only glycan structural data and miss information about the strength of the binding. Possible fu- ture improvements would be adding glycan microarray data to provide users with experimental evidence of each in-silico prediction.
Another issue, especially for GlyS3, is the lack of a ranking system. GlyS3 proposes a list of glycan structures which contains a specific subregion without ranking them. When results are exceeding few hundreds, it becomes challenging to analyse them all. Due to this fact, a ranking algorithm is needed to show first the most relevant results. The main problem is that ranking methods can have many different vari- ables that could change according to user needs. Moreover, glycans are tree-like structures, and their alignment represents a mathematical challenge. Recently, Aoki- Kinoshita et al. (Hosoda, Akune, and Aoki-Kinoshita, 2017) have published the first algorithm to perform multiple glycan alignments. At this point in time, we are still in the phase of implementing and testing their approach to align GlyS3 results. In the following years, we would like to extend GlyS3 substructure search to other gly- coforms like glycosaminoglycan (GAGs) and glycolipids. In particular, the repetitive structure of GAGs facilitates the encoding process in GlyS3; however, the length of these polysaccharides can be an issue for the storage. Although we have not thor- oughly studied this problem, a solution would be to store only different disaccharide units and restrict the search to few repeats of those.
6.1.3 Providing tools for data analysis in glycomics
The third and last achievement regards the creation of exploratory tools for assist- ing data analysis in glycomics. Glycomics is still a new field and most of the high- throughput software applications, for example in mass spectrometry, include min- imal support for glycans. Although vendors are starting to look into glycomics, glycoscientists are still obliged to combine proteomics and other ‘omics’ tools with general purpose applications like Excel to perform their analysis. In this context, we 6.1. Achievements 171 have designed a set of exploratory tools which help glycobiologists speeding up the analysis of experimental data.
As a first attempt, we developed PepSweetener, a web-tool which aids researchers in the manual annotation of glycopeptides. This software application computes a space of possible glycopeptides combining peptides and glycan compositions. Once the search space is defined, the m/z of a precursor ion is used to identify putative glycopeptides and perform the annotation. Search results are presented in an in- teractive heat map which dynamically shows the error between the precursor ion and the selected glycopeptide. PepSweetener does not use sophisticated algorithms to detect glycopeptides but allows users to tailor the search space and explore the possible combinations of glycan and peptide masses. The added value of PepSweet- ener resides in providing scientists with all the information needed to manually cu- rate glycoproteomics experiments in one single tool. Additionally, it is indepen- dent from any instrument or workflow used to produce glycoproteomics data. In proteomics, tools like Mascot have eased the annotation process giving scientists a complete automatic pipeline to analyse their samples. Currently, the proteomics community believes in Mascot annotations, but there has been a time where pep- tides were annotated manually. High-throughput MS software applications provide minimal support for Glycomics and glycoproteomics which are still at the manual annotation stage. This year, the HUPO Human Glycoproteomics Initiative (HGI: https://www.hupo.org/Glycoproteomics-(B/D-GPP)) has launched a call to bench- mark the currently available glycoproteomics MS software. In this context, the lack of gold standard datasets prevents the generation of expert algorithms which could automatically predict glycopeptides. The introduction of PepSweetener is just the first step towards the development of a fully automated pipeline. By speeding up the annotation of glycopeptides, PepSweetener allows scientists to generate a substan- tial amount of precious data that will be used to design more expert tools. Waiting for a consistent amount of data, we are continually working to adapt this applica- tion to glycoscientists’ needs. In the future, we will use the new published datasets to define a possible semi-automated annotation workflow. 172 Chapter 6. Discussion
Subsequently, we centred our attention on quantitative glycan profiles. The situa- tion in this context is even worse than with glycoproteomics experiments since there is no tool available for data analysis. Experimental data is collected in spreadsheets using many different dialects to describe glycan compositions. Some studies are using a nomenclature called Oxford which combines composition information with some structural features as the number of antennas (See chapter1 - Glycan com- position format). Others are just listing the different monosaccharides with their respective quantity. In a context where each lab uses its format and provides specific handmade visualisations, we introduced Glynsight, the first tool to visualise and compare glycan profiles. Glynsight provides a custom sankey visualisation which shows the relationships among different glycans and, at the same time, displays their relative abundance in an interactive way. Scientists can use Glynsight to visu- alise their profiles and perform a pairwise comparison. Since glycan profiles come only with glycan composition information, Glynsight uses GlyConnect data to sug- gest putative glycan structures for each composition. As for GlyConnect, Glynsight is bridged with EpitopeXtractor to explore the repertoire of glycan epitopes presents in a glycan profile.
As one of the first tool for quantitative glycomics, Glynsight tries to standardise the analysis and the publication of glycan profiles. However, many details are still miss- ing especially regarding the encoding format of glycan composition. As mentioned before, each laboratory provide its own dialect and, at the moment, Glynsight is only supporting one format which is detailed in “Understanding the glycome: an interactive view of glycosylation from glycocomposition to glycoepitopes” (See chapter4). As a fu- ture goal we will extend our application to other encoding formats, but at the same time, we would like the glycomics community to agree on one or two general stan- dards for glycan compositions. Thus, we will use our energy to add useful features to Glynsight instead of spending time building several different parsers for all the encoding format in use today. Another feature under development is the possibility of comparing a different set of glycan profiles in one go. This possibility is quite essential when the number of samples becomes significant and it allows to create of a consensus profile for a specific group of samples. We have already tried to analyse 6.1. Achievements 173 an IgG N-glycan dataset extracted from a community-based study in a Han Chinese population (Yu et al., 2016), but it is quite challenging to delineate a clean separation of samples in groups. We wish that the use of Glynsight could accelerate the creation of new datasets that could be used to sharpen this tool.
6.1.4 Standards and community feedbacks
As said many times in this discussion, we are struggling to get the right amount of valuable data to enhance tools whether these are integrative, explorative or knowl- edge base applications. The glycomics community is still small and fragmented. The fragmentation not only regards techniques and methodology applied to eluci- date glycoconjugates but also resides in the way data are encoded and published. This is one of the most crucial issues that glycoscientists need to rapidly solve to keep the pace of proteomics and genomics.
Although in 2017, there are papers still talking about new encoding standards for glycan structures, the MIRAGE initiative represents the first substantial project to harmonise, at least, the minimum information required for a glycan experiment. If other initiatives take place to standardise the way data are published, glycomics could benefit from a fast improvement. Thus, bioinformatics tools will be supported by a growing amount of data, and automatic prediction will become more accurate. This virtuous circle will start if glycobiologists agree on a collection of encoding formats and look at bioinformatics as a possibility to speed up their analyses and enhance their results.
Another issue is represented by the poor community engagement towards bioinfor- matics. Tools need to be tested, and developers need feedbacks and comments from expert users. Without a feedback loop, software applications cannot be upgraded with smarter algorithms providing better predictions. In our experience, we had a fruitful collaboration with Daniel Kolarich and his colleagues to build PepSweet- ener. The contribution of experts was a key factor to understand their needs and shape a tool that supports them in each aspect of their glycopeptide identification workflow. However, this is not always the case, and most of the tools described in 174 Chapter 6. Discussion
Tool Service Digests glycan structures using exoglycosidase en- GlycoDigest zymes
GlyS3 Perform substructure search task Extracts glycan epitope form glycans using a curated EpitopeXtractor list of almost 600 glycan determinants Draws glycan structures from GlycoCT encoding for- StructureDrawing mat Access knowledge based data in Glyconnect (Multi- Glyconnect ple API calls) Access glycopeptide search space built using Uniprot PepSweetener human glycoproteins (digested with trypsin) and GlyConnect glycan composition
TABLE 6.1: List of data standard initiatives. Courtesy of “Data inte- gration in biological research: an overview” (Lapatas et al., 2015). the thesis are fully developed in-house trying to tackle issues that we have identified in glycomics. We need more support from experts if we want to continue providing cutting-edge software solutions for glycomics.
6.2 Technical discussion
6.2.1 Service Oriented Architecture
Almost all the web tools described in the thesis are relying on service-oriented ar- chitecture (SOA). Admittedly, each tool uses a mixture of the web services shown in Table II.
There are three main reasons why we have decided to use a SOA:
• The possibility of integrating multiple technologies
• The possibility of adding new services with a minimal effort
• The ease of combining legacy with newly developed tools
The SOA behind Glycomics@ExPASy is based on a mixture of different technologies which communicate using REST calls. Indeed, the REST interface represents the 6.2. Technical discussion 175 only constraint to be part of the ecosystem. Most of the services are built using Java in synergy with the Jersey library and deployed on a Glassfish server. On the con- trary, the GlyConnect API is built using the Play framework and Pepsweetener data storage uses the duo Webdis - SSDB. SOA has facilitated the integration of all these technologies leaving developers with the freedom of choosing their preferred one. Additionally, the possibility of adding new services regardless of their implementa- tion allows us to quickly build a prototype and extend tools for fulfilling the need of the community. Glys3, for example, extends the features of SugarBindDB per- forming the substructure search task to connect a glycan determinant with glycan structures in GlyConnect.
SOA also helps in the integration between legacy and brand-new tools. This has been the case for GlycoDigest (Gotz et al., 2014) a desktop application for digest- ing glycan structure. GlycoDigest logic has been packed in a jar and exposed using REST API as a web service. To perform this transformation, we deleted the graphical input, and connected the digest function to a REST call. In less than a day GlycoDi- gest logic was exposed on the internet and all the tools were able to digest glycan structures using this new web service.
The growth of dedicated services aggregating many different technologies allows the rapid evolution of the Glycomics@ExPASy initiative. On the contrary, the main- tenance of tools becomes more and more demanding. Developers, which are usually employed for a short period, need to be familiar with many different programming languages. To curb the problem, although we are using different technologies, we have tried to use only the Java programming language for developing web services. Additionally, in the case of a legacy service which becomes painful to maintain, the encapsulation ensured by the SOA allows developers to redesign it without touch- ing the other ones.
To conclude we introduce a security problem. Indeed, the web services in Table 2 are vulnerable to informatic attacks since they are open to everyone and can be used multiple times without any restriction. There are solutions available to secure services, but none of these is readily applicable to our situation. Many big players like Facebook and Twitter force users to register to access their infrastructures. In 176 Chapter 6. Discussion this way, they can monitor who is using a service and ban fraudulent users. In our case, preventing fraudulent actions is quite hard because we cannot force users to register. We are still searching for a concrete and easy solution which allows us to monitor the use of each web service and prevent misuse.
6.2.2 Resource description framework
In the paper “Property Graph vs RDF Triple Store: A comparison on Glycan Sub- structure Search”, which has been presented in the second chapter of this thesis, we have compared RDF with Neo4J, a graph database, fully developed in Java, which allows storing and querying any graph using Cypher language (See chapter3). Al- though they have comparable performance on our dataset, which contains almost 60000 glycan structures extracted from GlycomeDB, we preferred RDF because it is standardised by W3C which ensures long-term support. Additionally, several RDF endpoints can be linked through federated queries. Due to the spread of semantic web in life science as exemplified in (Gaudet et al., 2015; Redaschi and UniProt Con- sortium, 2009; Jupp et al., 2014) and looking at reaching our goal of interconnecting glycomics with other "omics", RDF represented the best choice available.
Besides the advantage proposed by RDF standardisation, glycans are tree-like molecules which can be naturally encoded in RDF triples. Each monosaccharide becomes a re- source identified by a URI and predicates account for glycosidic linkages. The only problem of encoding glycans with triples is the impossibility of adding properties on edges. These lead to a multitude of predicates to encode all the properties of a linkage. Instead of having one link between two nodes, in our case, we are obliged to have a list of different predicates, each one encoding a piece of information. For example, if a galactose and glucose are connected with an alpha 1-4 linkage, the information is encoded in five different triples:
• Gal → Links_to → Glc
• Gal → is_GlycosidicLinkage → Glc
• Gal → has_anomerCarbon_1 → Glc 6.2. Technical discussion 177
• Gal → has_linkedCarbon_4 → Glc
• Gal → has_anomerConnection_alpha → Glc
The situation is different for monosaccharides. In this case, we have introduced an ontology which allows defining classes and subclasses. For example, when glucose is added to the graph the RDF engine, recognises it as a monosaccharide using the ontology. With large datasets, linkage encoding can become a severe problem due to the number of triples needed to represent each linkage. Despite this limitation, this framework is by far the best to perform substructure search if compared with regular expressions, which have been used in the past. Although we have never made a comparison between these two implementations, the flexibility introduced by RDF represents one of the main advantages. Bugs can be found, and correct in a short amount of time thanks to the verbosity of RDF as opposed to the conciseness of the regular expressions. Interestingly, RDF efficiently encodes the uncertainty that is common in glycan structures. The system can be extended by adapting and enlarging the ontology underneath. When a new requirement imposes the change of the ontology, the system can be rebuilt starting from the raw data. Regarding the GlyS3, glycan structures encoded with GlycoCt format are kept in a text file and used to rebuild the endpoint if the ontology changes to accommodate a new condition. The parser in GlyS3 reads each structure from the file and converts them into triplets which are then stored in the RDF endpoint.
From a technical point of view, we have noted some stability issues with selected triple stores in RDF implementations. In our case, we registered many crashes using the Virtuoso triple store. Therefore, we moved to Blazegraph which runs smoothly, although we have seen some peaks in volatile memory consumption (RAM) which cannot be explained by a high load of the server. This phenomenon is possibly caused by the optimisation process, which is running from time to time and, for example, enhances the ordering of the RDF graph on disk. Another issue is repre- sented by the online query optimisation which, in some cases, is still poor. Never- theless, with the spreading of the semantic web, RDF implementations and query optimisations are improving fast, and we expect reliability to grow at the same level as relational databases within a few years. 178 Chapter 6. Discussion
Another problem we faced with our glycan dataset is the effectiveness of federated queries. Connecting several endpoints using SPARQL is still challenging due to the novelty of this technology. Even when the query runs evenly, the whole process op- erates at the speed of the slowest endpoint resulting in substantial query execution time. In life science, the feasibility of federated queries is limited by the poor sta- bility of endpoints and the ending of some projects due to funding issues. At this stage, we cannot comment on federated queries since we are still in the process of connecting our ontology with other publicly available ones. Ideally, we should com- plete the RDF model of SugarBindDB and GlyConnect. Once these two ontologies will be deployed on a triple store, we we will be able to perform federated queries within our services as well as via external triple stores.
6.2.3 Web components
The graphical interface of the tools developed in the thesis share several common parts. As an example, most of our tools needed a graphical interface to draw gly- cans or text input to paste string-encoded structures. This requirement called for a technology supporting the creation and reuse interface widgets. There are two types of web technologies that fulfil this need: web components and standalone frame- works like Angular and ReactJS. The main difference between the two groups is that web components are part of the HTML and DOM standards (See chapter1 - Web components) whereas javascript frameworks rely on standalone libraries. Follow- ing the same principle of choice between RDF and proprietary graphs, we decided to use web components that are standardised by W3C.
Vanilla web components, built only with the W3C standard, are still not mature. The definition of a small component is still too verbose and needs an in-depth knowledge of the standard API. This led us to select a library called Polymer. This library, de- veloped by Google, is built on top of the W3C standard and simplifies the process of developing a component. Polymer is not a standalone framework like Angular or ReactJS, but it is only tackling some of the pain points which are presents in the development of vanilla web components. Since the W3C standard is slow to evolve, 6.2. Technical discussion 179 the Polymer library is building a layer on top of it to meet basic needs like defining components and loading modules.
Also, it provides a catalogue of web components, called Polymer Elements, which can be used to create simple interfaces. For example, buttons, input fields and many other HTML components are ready to be used, and they can be customised using CSS. All components developed in this thesis take advantage of Polymer elements, offering the same look-and-feel among different web applications. As an example, buttons and input components have been used to build SwissMassAbacus, an en- hanced glycopeptide mass calculator for mass spectrometrists.
Another important feature of web components is the encapsulation ensured by the shadow dom, one of the four technologies that composed web components pre- sented in the introduction (See chapter1 - Web components). The advantage is two- fold: on the one side it limits the conflicts between different parts of the web page, on the other side it can be used to encapsulate legacy javascript libraries and HTML interfaces that need a possible redesign. We have used this technique with Glycan- Builder, an interface to draw glycan structures on the web implemented a few years ago. Since we favour tool re-use and in any event, could not develop a new interface within a short amount of time, we have encapsulated GlycanBuilder in a Polymer component and used it in several web tools. Finally, the frequent timeouts of Gly- canBuilder led us to create a new interface called SugarSketcher (See AppendixB). This tool, fully developed in Javascript, allows drawing glycan structures directly on the browser without the need for a backend server. Now we can swap the two components without redesigning each tool. SugarSketcher is currently integrated as a graphic query interface in CSDB (Toukach, 2011). Although web components have increased our throughput in tool prototyping and development, it is still a young technology with some drawbacks. The principal issue regards the update process. Google releases a new version of Polymer every year, introducing several changes in the way components are built. Changes follow the update of web components standard which continuously evolves. Despite the availability of converters from a version to another, to perform a clean upgrade, the process needs to be handled 180 Chapter 6. Discussion
FIGURE 6.4: Web components compatibility chart. The figure shows which of the four technologies (HTML templates, Custom Elements, Shadow Dom, HTML imports) which compose Web components are supported by Chrome, Opera, Safari, Firefox, Microsoft Edge. Cour- tesy of webcomponents.org. manually resulting in a substantial amount of time spent to upgrade a few compo- nents.
Performance and compatibility are raising further issues. Web components are new, and not necessarily supported by all browsers. Figure 6.4 shows which of four tech- nologies behind web components, which have been presented in the introduction (See chapter1 - Web components), are supported by the five most used browser on the market. As we can see, only Chrome and Opera natively implement the web components, but other suppliers are lagging behind. Firefox is working on custom elements and shadow dom, but is "on hold", together with Safari, for HTML im- ports. The worst situation is represented by Microsoft Edge which natively provides only with HTML templates. To run web components on unsupported browsers we use polyfills. These libraries are loaded on demand and integrate the standard API to run web components. However, the use of polyfills decreases the performance of the browser, which may affect user experience. To summarise, web components are a useful resource to develop interfaces quickly. They enhance the reuse of compo- nents, and allow the encapsulation of legacy web interfaces. However, as a novel technology which is partially supported by most of the common browsers, it is still under development. This entails periodical major changes in the way components 6.2. Technical discussion 181 are built and imposes regular updates.
183
Chapter 7
Conclusion
In the discussion, we have elaborated on achievements and issues that we encoun- tered in the course of this thesis. Although we have given some hints about future work, in the conclusion we summarise what is already planned for the next step of each tool.
7.1 GlyConnect
7.1.1 Data provenance
In the introduction, we have presented data provenance as the traceability of each piece of information (See chapter1 - Data provenance). Right now, each GlyConnect entry provides one or more reference(s) to the scientific article(s) from which infor- mation stored in the database has been extracted. However, an entry is a compilation from different sources and a specific piece of information is not precisely associated with a scientific paper. So, we are working on tagging each atomic data with the ap- propriate source. This level of detail is already implemented in databases but needs to be showed in the web application. Additionally, we will provide an Evidence & Conclusion Ontology (ECO) number (Chibucos et al., 2017) describing the types of scientific evidence(s) that supports the annotation. For example, UniProt provides an ECO number next to each piece of information giving users the option of tracing back to the data source. 184 Chapter 7. Conclusion
7.1.2 Semi-automated curation pipeline
At the moment, data are inserted directly in GlyConnect database, and for each mod- ification, it is still necessary to use sophisticated SQL queries. To speed up the cu- ration process of reviewed and unreviewed data, we have planned to build a semi- automated pipeline which would support curators in the insertion of new pieces of information. The pipeline would require a graphical interface to guide the curator across the different steps of the input process. Additionally, more procedures could be automated. For example, when a new glycan is added to GlyConnect, the cura- tor needs to specify the structural features of the sugar (i.e. fucosylated, bisected, more than three antennas, etc.) and which glycan epitopes are present. This step can be easily automated using composition information and leveraging GlyS3 and EpitopeXtractor for defining the glycan epitopes in the structures of interest.
7.1.3 Cross-references
Each entry in GlyConnect is already cross-referenced with several bioinformatics re- sources (e.g. UniProt, neXtProt, GlyTouCan, etc). However, we are planning to add more cross-references decreasing the gap between glycomics and other "omics". As a first step, we plan to include all structures of UniCarb-DB with a proper cross- reference. UniCarb-DB collects O-glycan structures which have been identified us- ing tandem mass spectrometry (MS/MS). As it stores experimental pieces of evi- dence that have led to the elucidation of each structure, proofs of the determination of a glycan structure can be verified. As a second step, information provided in the Carbohydrate-Active enZYmes Database (CAZy) (Terrapon et al., 2017) should be added in glycan structure pages. Each glycan structure would be associated with a list of carbohydrate-active enzymes that have contributed to its creation. This is es- sential since glycans do not have a template in the DNA, but can be connected to ge- nomics and transcriptomics via the enzymes needed to synthesise them. Conversely, each carbohydrate-active enzyme with its respective cross-reference to CAZy will be linked to structures requiring its activity. 7.2. GlyS3 185
7.2 GlyS3
7.2.1 Ontology refinement
The ontology behind GlyS3 has been especially designed to perform substructure search. Monosaccharides are identified by their specific names, for example, glu- cose, galactose, mannose, etc. However, some experimental techniques cannot dis- tinguish between glucose and galactose (e.g. mass spectrometry), but can only elu- cidate the number of carbons for each monosaccharide. For this reason, galactose, glucose and mannose are all part of the generic class Hexose (Figure 1.9). Therefore, we have planned to add the class attribute to each monosaccharide in the ontology. In this way, users would be able to perform the substructure search selecting a gen- eral class when they are not sure about the specific monosaccharide.
7.3 PepSweetener
7.3.1 Dataset automatic update
To build the glycopeptide search space in the simple search of PepSweetener, we have used two main datasets: the glycan compositions present in GlyConnect and the human glycoproteins in UniProt. As a first step, the human glycoproteins have been digested with trypsin, and only the peptides with an N-glycan consensus se- quence were kept. Then, each peptide has been combined with all the composition in GlyConnect to create all possible glycopeptides. The creation of the PepSweetener glycopetide search space has been carried with a fully manual approach, using a col- lection of scripts which are hard to maintain and have to be run at each release of UniProt or GlyConnect. Therefore, in the near future, we are planning to create an automatic pipeline which automates the creation of the glycopeptide search space each time an update of the two databases comes along. 186 Chapter 7. Conclusion
7.4 Glynsight
7.4.1 Multiple dataset analysis
Glynsight has been designed to visualise glycan profiles and perform pairwise com- parison. However, this approach is not appropriate when dozens to thousands of samples are screened to generate equally as many profiles. A solution to reduce the number of profiles to compare would be to group samples by similarity and create consensus glycan profiles for each of those. This will require an internal workflow for the similarity search prior to determining a consensus profile out of a group of samples selected by the users.
Consensus glycan profiles are essential to define glycan signatures of specific dis- eases that can be used as a diagnostic tool. Additionally, they can be used to assert the quality of the glycosylation in different batches of antibodies and other glyco- sylated molecules. Indeed, using Glynsight, the glycan profile of each batch can be compared against the standard glycan signature to verify if the glycosylation level is correct.
We have already tried to work on an IgG N-glycan dataset extracted from a community- based study in a Han Chinese population (Yu et al., 2016). The study proposed a dataset of 244 males, 457 females with age between 23-68 years old. In this exper- iment, for each sample, N-glycans have been released from IgG and then analysed using liquid chromatography. Each specimen produces a chromatogram which is divided into peaks. To be more precise, scientists assign a number to each peak starting from one. We have used those peaks as features for K-means clustering; however, none of the groups had or could be reconducted to biological meaning. Male and females of different ages were mixed in each of them. As a second trial, we have used Principal Component Analysis (PCA) to separate control from cases in a dataset provided by Genos, a Croatian SME specialized in high-throughput gly- coanalysis. Using the first two components, we were able to divide patients into four groups but, as before, none of them could separate controls from cases and vice versa. Right now, we are searching for bigger datasets and investigating other separation techniques like t-SNE. 7.5. Glydin’ 187
7.5 Glydin’
7.5.1 Data enhancement and interface improvements
In Glydin’, glycan epitopes are visualised as nodes of a network. Edges, instead, are added to the network when a ligand A is a substructure of a ligand B. At the moment, only nodes carry information about the source and the structure of each glycan epitope. So, as the next step, we suggest to enhance the information relative to both nodes and edges. We know that glycan epitopes are playing an important role in host-pathogen interactions and that aberrant glycan epitopes are a sign of car- cinogenesis. For this reason, each glycan epitope should be correlated with disease information directly extracted from the literature. Moving to the edges, if ligand A is connected to ligand B, it means that A is a substructure of B and it is possible to transform A in B using a set of enzymes. Since, in many cases, the difference between two ligands is limited to a few monosaccharides, enzyme information on edges defining which reaction is needed to move from A to B would be useful. All carbohydrate-related enzymes will be taken from CAZy, with a direct reference to this database.
To conclude we are studying a way to have a more precise depiction of the network. Since EpitopeXtractor is connected with Glydin’, the extraction of glycan determi- nants from a specific list of glycans can be shown through Glydin’ network. How- ever, the current algorithm that draws the network changes node positions at each refresh of the page. Thus, it is virtually impossible to superimpose two different net- works making comparisons impossible. We are still trying to display the network with different algorithms, but we have not found a satisfactory solution.
7.6 Final thoughts
In summary, we have systematically introduced bioinformatics in glycobiology by introducing three type of software application: integrative, explorative and knowledge- based. The integrative tools connect different areas of glycobiology. In our case, 188 Chapter 7. Conclusion
GlyS3 and EpitopeXtractor are designed to move from a structure to a list of gly- can epitopes and vice-versa. Explorative tools, instead, like PepSweetener, Glyn- sight and Glydin’ are built to facilitate the work of glycobiologists when it comes to data analysis and visualisation. In the end, knowledge-based tools, like GlyConnect, bring out the information available in the literature and enable visual navigation. Al- though a lot of work remains to be done, we have already produced a set of tools which can be used via the web. Besides the planned steps explained in this section, our primary concern, right now, is the reduced availability of suitable data sources which can be used to enhance this collection of software applications. Addition- ally, we hope that the glycomics community will more and more follow a restricted number of standards facilitating our work. 189
Appendix A
Other Co-author papers Journal of Proteomics 129 (2015) 63–70
Contents lists available at ScienceDirect
Journal of Proteomics
journal homepage: www.elsevier.com/locate/jprot
MzJava: An open source library for mass spectrometry data processing☆
Oliver Horlacher a,b, Frederic Nikitin a,DavideAloccia,b, Julien Mariethoz a, Markus Müller a,b,⁎, Frederique Lisacek a,b,⁎ a Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva 1211, Switzerland b Centre Universitaire de Bioinformatique, University of Geneva, Geneva 1211, Switzerland article info abstract
Article history: Mass spectrometry (MS) is a widely used and evolving technique for the high-throughput identification of Received 2 May 2015 molecules in biological samples. The need for sharing and reuse of code among bioinformaticians working Received in revised form 17 June 2015 with MS data prompted the design and implementation of MzJava, an open-source Java Application Program- Accepted 22 June 2015 ming Interface (API) for MS related data processing. MzJava provides data structures and algorithms for Available online 30 June 2015 representing and processing mass spectra and their associated biological molecules, such as metabolites, glycans and peptides. MzJava includes functionality to perform mass calculation, peak processing (e.g. centroiding, filter- Keywords: Java ing, transforming), spectrum alignment and clustering, protein digestion, fragmentation of peptides and glycans Mass spectrometry as well as scoring functions for spectrum–spectrum and peptide/glycan-spectrum matches. For data import and Proteomics export MzJava implements readers and writers for commonly used data formats. For many classes support for the Glycomics Hadoop MapReduce (hadoop.apache.org) and Apache Spark (spark.apache.org) frameworks for cluster comput- Hadoop ing was implemented. The library has been developed applying best practices of software engineering. To ensure Spark that MzJava contains code that is correct and easy to use the library's API was carefully designed and thoroughly tested. MzJava is an open-source project distributed under the AGPL v3.0 licence. MzJava requires Java 1.7 or higher. Binaries, source code and documentation can be downloaded from http://mzjava.expasy.org and https://bitbucket.org/sib-pig/mzjava. This article is part of a Special Issue entitled: Computational Proteomics. © 2015 Elsevier B.V. All rights reserved.
1. Introduction from different search engines and integrate these with quantitative data. Later, TOPP was introduced as a management system for generic Mass spectrometry (MS) has become a central analytical technique proteomics workflows [5], based on OpenMS [6]. OpenMS contains a to characterise proteins, lipids, carbohydrates and metabolites in well-designed Application Programming Interface (API), which makes complex samples [1,2]. The diversity of biological questions possibly it useful not only as a toolbox but also as a code base for software devel- addressed with MS is reflected in a wide range of experimental opers. ProteoWizard [7] is another C++ open source project for the workflows. Analysing data generated through these workflowsisauto- conversion of proteomic MS/MS file formats and processing of MS/MS mated though most of the time, the variability of applications requires spectra. More recently, the Java programming language gained popular- software customisation and/or extension. This situation is well de- ity in many fields and especially in bioinformatics due to its portability scribed in a recent review [3] and justifies the development of libraries across different computer platforms and the availability of powerful of MS related software to facilitate code reuse. Software libraries are and comprehensive class libraries that facilitate and accelerate software meant to benefit the developers' own group and collaborators as well development. For example, the Chemistry Development Kit (CDK, as the wider computational proteomics and glycomics communities. http://sourceforge.net/projects/cdk/) is an open source library for Many early open source contributions dedicated to the automated cheminformatics and computational chemistry [8]. BioJava (http:// analysis of proteomic data were coded in C++ or Perl. The trans- biojava.org/, [9]) addresses the bioinformatics community and provides proteomic pipeline (TPP) [4] is an assembly of C++ programs and classes and tools for protein structure comparison, alignments of DNA Perl scripts to process and statistically validate MS/MS search results and protein sequences, analysis of amino acid properties, protein modifications and prediction of disordered regions in proteins as well ☆ This article is part of a Special Issue entitled: Computational Proteomics. as parsers for common file formats. ⁎ Corresponding authors at: Swiss Institute of Bioinformatics, University of Geneva, CMU, Rue Michel-Servet 1, 1211 Geneva, Switzerland. A broad range of Java based open source solutions was developed for E-mail addresses: [email protected] (M. Müller), the proteomics community as comprehensively reviewed in [3]. Our [email protected] (F. Lisacek). focus being mainly on MS/MS data processing, the following refers to
http://dx.doi.org/10.1016/j.jprot.2015.06.013 1874-3919/© 2015 Elsevier B.V. All rights reserved. 64 O. Horlacher et al. / Journal of Proteomics 129 (2015) 63–70 such dedicated toolboxes. Compomics is a collection of tools mainly for The development of MzJava entailed refactoring a substantial part of MS/MS data analysis [10,11]. It also contains the Compomics-utilities the JPL aiming at producing high quality and efficient code. The outcome class library [12], which provides the code base tools. The Java is meant to be structured as a coherent API as opposed to bundling a col- Proteomics Library (JPL, javaprotlib.sourceforge.net) provides an API lection of code pieces. MzJava follows Java naming and behaviour con- to process MS/MS spectra and their annotation. It served as the code ventions and provides builders with fluent interfaces for constructing base for several software projects dealing with MS/MS searches [13], complex objects (http://en.wikipedia.org/wiki/Fluent_interface). To spectral libraries and open modification searches [14], and data- help prevent misuse and to make MzJava easy to use in multithreaded independent quantification [15]. The PRIDE tool suite (http://pride- environments, mutable objects are avoided as much as possible. Objects toolsuite.googlecode.com) contains a set of pure Java libraries, tools that need to be mutable were designed to always be in a valid state. and packages designed to process and visualize versatile MS proteomics data [3]. It also contains a well-designed Java API (ms-data-core-api) 2.2. Development methodology facilitating the implementation of customized solutions [16].Inrecent years, several open source Java-based laboratory information systems The methodology that was employed to develop MzJava follows best (LIMS) storing and processing proteomic or glycomic data were practice for scientificcomputing[32–35] and is influenced by agile described [17–19]. software development, especially test driven development (TDD) [36] Glycomics MS data are still seldom produced in high throughput set- and continuous integration (CI) [37]. In brief, TDD describes a short de- ups but this field evolves quickly and the need for automation is velopment cycle for adding improvements or new functionality. First growing. A reference software for glycan MS and MS/MS annotation is the developer writes an, initially failing, automated test that defines GlycoWorkbench [20]. This tool is modular and mostly known for its the improvement or new functionality. Code is then written to make convenient user interface to support glycan structural assignment. the test pass. This is followed by refactoring the new code to bring it Theoretical glycan spectra can be calculated with its fragmentation up to acceptable quality. Fig. 1A summarizes the TDD cycle. Automated tool based on the mechanisms and nomenclature described by Domon tests are written using JUnit (http://junit.org/)andMockito(http:// and Costello [21]. Importantly it relies on recognised standard descrip- mockito.org/). CI is a software engineering practice in which changes tion of monosaccharides and full structures [22]. to the code are automatically tested whenever they are added to the The scalability of existing solutions is currently one of the codebase. The goal of CI is to provide rapid feedback so that changes greatest challenges. The ever-growing size of MS datasets and the that introduce issues or break the build can be corrected rapidly. Jenkins need to process spectra by the tens of millions imposes the use of (http://jenkins-ci.org/) is used to automate the CI. distributed data processing frameworks such as Hadoop MapReduce Code quality scores are tracked using SonarQube (http://www. [23] (hadoop.apache.org) and Apache Spark [24] (spark.apache.org). sonarqube.org/). Quality profiles were slightly altered from the Sonar Hadoop is already used in bioinformatics, primarily for next- way with Findbugs profiles. SonarQube is also used to evaluate the generation sequencing analysis [25] but also for proteomics quality of comparable libraries such as BioJava, ms-data-core-api, [26–28], while Spark was more recently introduced [29,30]. Hadoop jmzml (a library to handle mzML files [38]), and compomics-utilities. MapReduce is an implementation of the MapReduce programming MzJava is a Maven project developed using IntelliJ Idea (https://www. model described by Dean and Ghemawat [23].Sparkextendson jetbrains.com/idea/). Fig. 1B provides a more detailed view on the the functionality and performance of Hadoop by allowing in- development cycle and the tools used. memory data storage and providing additional functions [24]. We introduce MzJava a Java class library designed to ease the 2.3. Architecture development of custom data analysis software by providing building blocks that are common to most MS data processing software. MzJava The MzJava architecture is modular and consists of three main addresses the scaling issues by adding classes to interface with Hadoop modules: and Spark. Furthermore, new code was included for processing 1. The core module contains functionality that is common to all MS data glycomics data. In fact, MzJava originates from merging JPL and another 2. The proteomics module contains functionality specifictopeptidesand unpublished Java MS codebase. During this merge the code was proteins comprehensively refactored and refinedinanefforttoproduceaconsis- 3. The glycomics module contains functionality specific to glycans. tent and well-designed API. Best practices in software engineering such as test driven development and continuous integration were applied Fig. 2 illustrates the organisation of the modules and highlights the during the implementation. Code quality metrics of MzJava are contin- central position of the core module that overlaps with the proteomics uously tracked to maintain high quality standards. These metrics are and glycomics modules. used to benchmark MzJava in relation to other packages. 3. Results 2. Materials and methods The MzJava core consists of three main parts: mol, ms and io (Fig. 2). 2.1. Development aims The mol-core comprises classes to work with chemical compositions and their masses. The Atom class for example represents the mass and MzJava is mostly centred on MS/MS identification and annotation. isotopic abundances of a chemical element. The Composition class This bias towards an identification-related API reflects our research deals with assemblies of atoms defined by their stoichiometric chemical focus. The use of the MzJava API is intended for writing software that formulae. NumericMass is used to represent objects that have a mass but is capable of processing large data sets. Consequently the API is designed no known composition. The main classes in ms-core deal with peak lists to be extensible, flexible and efficient. During development we found (PeakList, list of m/z-intensity pairs) and their associated meta data that flexibility often comes at the cost of performance. Where we (Spectrum). There are a number of Spectrum subclasses to capture the identified performance hot-spots we implemented solutions that meta data associated with particular types of spectra. For example, allow either flexible or high performance code to be written, while in MsnSpectrum contains meta data such as scan number and retention non performance critical code we kept flexibility as a major criterion. time and ConsensusSpectrum captures meta data that is associated Additional design aims were to make the API not only easy to use, but with a consensus spectrum such as the ids of the spectra from which also hard to misuse as well as prompt to fail whenever there are errors the consensus was built, and the structure of the peptide/glycan that [31]. the consensus spectrum represents. The peak list associated with each O. Horlacher et al. / Journal of Proteomics 129 (2015) 63–70 65
spectrum is stored in subclasses of PeakList. There are 5 different fla- vours of PeakList. This allows the m/z values to be stored as either 64 or 32 bit numbers and the intensity as 64 bit, 32 bit or as a constant. To provide a common interface for all the subclasses, the Spectrum class both wraps and implements PeakList. The only time that the precision needs to be taken into account is when the spectrum is created; thereafter it is just an instance of Spectrum.Inadditiontothe m/z and intensity values of peaks, PeakList can also store one or more annotations for each peak. MzJava has three built-in peak annotations, PepFragAnnotation, GlycanFragAnnotation and LibPeakAnnotation. PepFragAnnotation and GlycanFragAnnotation are used to annotate peaks with peptide or glycan fragments and contain information such as the charge and ion type of the fragment. LibPeakAnnotation is used to annotate peaks in LibrarySpectrum with consensus statistics. We make use of generics to specify the type of annotation that a PeakList contains. The ms-core package also contains classes to match spectra and score these matches. The first step in most spectra matching workflows is to remove noise or to extract features from mass spectra by applying one or more filters or transformers to the peaks in the peak list. MzJava provides filters to bin and centroid raw spec- tra (BinnedSpectrumFilter, CentroidFilter) and to select peaks based on their m/z range or peak annotation (MzRangeFilter, FragmentAnnotationFilter). Further it includes processors that sim- plify spectra and retain only the n most intense peaks of the entire spectrum or remove peaks that fall below a given threshold (ThresholdFilter). In order to adapt to varying noise levels other fil- ters only keep a peak if it is among the n most intense peaks in a local bin or sliding window (NPeaksFilter, NPeaksPerBinFilter, NPeaksPerSlidingWindowFilter). With the exception of the m/z affine transformer all transformers operate on peak intensities and replace the original values by their log-, square root-, affine- and arcsinh-transformed values (LogTransformer, Log10Transformer, AffineTransformer, ArcsinhTransformer). MzJava also has classes to normalize the total or maximal ion current in a spectrum (IonCurrentNormalizer, UnitVectorNormalizer, NthPeakNormalizer, RankNormalizer, HighestPeakPerBinNormalizer). Peak processing is a performance hotspot, since it may be applied to Fig. 1. A) This cycle illustrates the overall development process of MzJava based on unit fl fi millions of spectra, but also needs to be exible so that it is possible to tests (JUnit) and subsequent checks to produce robust code. This gure is an adaptation fi of an image from the blog http://technicalknowledgeabstracts.blogspot.ch/2014/02/tdd- experiment with different lters and transformers. To achieve both of test-driven-development-in-practice.html B) This cycle provides details of tasks under- these requirements, peaks can be processed by applying a taken in the cycle described in A). PeakProcessorChain to a PeakList.APeakProcesorChain is a linked list of objects that implement the PeakProcessor interface. A PeakList is processed by pushing the m/z, intensity and annotations of each peak through the chain of PeakProcessor. This way, stateless peak processors such as Log10Transformer do not have to copy the m/z or intensity arrays. Peak processors that have state, i.e. where the result for each peak depends on the other peaks present in the peak list, can extend DelayedPeakProcessor, which reuses m/z and intensity arrays to improve performance. Custom peak processors can easily be created by extend- ing AbstractPeakProcessor or DelayedPeakProcessor.Foracomprehensive list of peak processors see Table 1. MzJava has a number of spectrum–spectrum scoring functions such as the normalized dot product (NdpSimFunc), Pearson's correlation coefficient (PearsonsSimFunc) and shared peak count (SharedPeakSimFunc). For a comprehensive list see Supplementary Table S1. Custom scoring functions can easily be created by implementing AbstractSimFunc. AbstractSimFunc takes care of peak alignment and supports plugging in peak filters and transformers. The io-core package provides readers and writers for commonly used spectra and peptide spectrum matches (PSM) file formats. Spectra are modelled by the MsnSpectrum class, which contains the most important meta data that the formats support. The PSM readers Fig. 2. This chart represents the 3 modules composing MzJava. The core module contains split the information into two parts, a part to identify the spectrum the code for MS data handling and processing. The proteomics and glycomics modules fi contain the code specific for peptides/proteins and glycans, respectively. io: input–output, (SpectrumIdenti er) and another that contains the peptide match mol: molecules, ms: mass spectrometry, utils: utilities, stats: statistics. information (PeptideMatch). This division reflects the diversity in 66 O. Horlacher et al. / Journal of Proteomics 129 (2015) 63–70
Table 1 Peak Processors.
Class Description
Filters BinnedSpectrumFilter Bins a spectrum. Peaks will still be stored as a PeakList, but the mz values will be regularly spaced on the centres of the bins CentroidFilter Selects all peaks that are local maxima and replaces their mass/intensity values by the centroid values calculated within a neighbourhood FragmentAnnotationFilter Select peaks with annotations that fulfil a certain condition LibraryMergePeakFilter Cluster peaks with similar m/z values MzRangeFilter Select or exclude peaks in certain m/z ranges. NPeaksFilter Selects the N most intense peaks NPeaksPerBinFilter Retains the top N peaks for every fixed m/z bin NPeaksPerSlidingWindowFilter Retains the top N peaks in a sliding m/z window ThresholdFilter Removes any peaks that have an intensity that is smaller than a threshold
Transformers AffineTransformer Performs affine transformation on the intensities ArcsinhTransformer Performs first an affine transformation of peak intensities followed by a arcsinh transformation. This transformation was shown to stabilize the intensity variance. ContrastEnhancingTransformer For each peak, the average of all intensities within an m/z window centred at the peak is subtracted. This can be used for an efficient calculation of the Sequest XCorr InverseArcsinhTransformer Performs the inverse of the ArcsinhTransformer Log10Transformer Takes log10 values of all (intensities + 1) LogTransformer Takes log values of all (intensities + 1) MzAffineTransformer Performs affine transformation on the m/z values SqrtTransformer Square root transforms all intensity values
Normalizers IonCurrentNormalizer Sum of all peak intensities is set to 1 UnitVectorNormalizer Euclidian norm of peak intensities is set to 1 NthPeakNormalizer Peak intensities are divided by the intensity of the N-th highest peak RankNormalizer Replaces peak intensities by 1 − r / N, where N is the number of peaks and r the rank of the peak intensity (r = 0:highest, …,r = N-1:lowest)
HighestPeakPerBinNormalizer Normalizes peak intensities by dividing the m/z range into bins and scaling the peaks in each bin such that the most intense peaks in each bin have the same value
the spectrum descriptors in the different PSM file formats. A list of 3.1. Glycomics and proteomics PSM readers and their capabilities can be found in Table 2.The readers have been designed to be easily customized through object To support proteomics and glycomics applications MzJava has data composition and inheritance. Object composition is used where structures to store, manipulate and generate theoretical spectra for there is a lot of variability in the format, for example the mgf title proteins, peptides and glycans. Peptides are represented by the Peptide or sptxt comment tag. Inheritance is used where data has been class which holds an amino acid sequence and one or more post added or removed from the file format. For example the mgf parser translational modifications for each residue or terminus. Modifications can be customized to read new tags or additional information from are defined in the code either by their chemical composition or mass. the peaks by overriding the parseUnknownTag or handlePeakLine They can also be retrieved from the UnimodModificationResolver which method. Setting a PeakProcessorChain on spectrum readers allows wraps an xml dump of the UniMod database [39].Peptidescanbe the peaks to be transformed and filtered while reading. The list of generated by digesting a Protein instance using ProteinDigester and spectrum readers and their attributes can be found in Supplementa- associated classes. ProteinDigester can be configured to output peptides ry Table S2 and S3. that have fixed and variable modifications. Support for the most
Table 2 PSM readers.
MaxQuantPsmReader MODaPsmReader MzIdentMlReader PepXmlReader ProteinPilotPsmReader
File type MaxQuant csv MODa PSI mzIdentML ISB pepXML ProteinPilot csv
Spectrum name ✓✓✓✓✓ Scan numbers ✓✓✓✓x Retention times ✓ x ✓ x ✓ Precursor neutral mass ✓✓✓✓✓ Precursor intensity xxx✓ x Assumed charge ✓✓✓✓✓ Index ✓✓✓✓✓ Spectrum source file x ✓✓ xx Rank ✓✓✓✓✓ Peptide Sequence ✓✓✓✓✓ Number matched ions xxx✓ x Total number of ions ✓ xx ✓ x Mass difference ✓✓✓✓✓ Missed cleavages ✓ xx ✓ x Rejected xx✓✓x Modifications ✓✓✓✓✓ Neutral peptide mass x ✓ xxx Protein ✓✓✓✓✓ O. Horlacher et al. / Journal of Proteomics 129 (2015) 63–70 67 important proteases is built-in and custom proteases can easily be to provide the time evolution of the scores. The results are displayed on added. Theoretical peptide spectra can be constructed by using the a webpage (https://glycoproteome.isb-sib.ch/sonar). A subset of the PeptideFragmenter class. PeptideFragmenter can be configured to pro- code quality scores for some analysed Java libraries (BioJava, ms-data- duce spectra that contain backbone and neutral loss fragment peaks. core-api, jmzml, and Compomics) is displayed in Table 3. The scores To generate peaks for custom fragments an implementation of the are: the number of lines of code, number of classes, code complexity PeptidePeakGenerator interface can be added to the PeptideFragmenter. and issues per class, documentation content, test coverage and level of In MzJava polysaccharides are represented by the Glycan class, which package tangle. Code complexity is a score calculated by SonarQube, holds a directed acyclic graph of Monosaccharide and Substituent nodes which evaluates how large the methods are and how many nested that are connected by GlycosidicLinkage edges. Glycan objects can be code blocks they contain. Low complexity indicates clear code, which created by reading glycan structures in the GlycoCT format [40] using is less prone to error. Code issues are potential weak points in the the GlycoCTReader class. The GlycanFragmenter class can be used to code pointed out by SonarQube. MzJava compares well to the other generate theoretical glycan spectra [21].BydefaulttheGlycanFragmenter libraries, and it has the highest test coverage and the lowest package returns glycosidic, cross ring and neutral loss fragments, however custom tangle index. fragments can be generated by implementing the GlycanPeakGenerator interface. The user can specify the list of Monosaccharides and Substituents 4. Discussions in a configuration file. Glycan classes rely on recognised standard descrip- tion of monosaccharides and full structures [22]. 4.1. Code example MS/MS protein database search
3.2. Hadoop and Spark support Fig. 3 shows the code required to perform a database search using the normalized dot product as a scoring function. The components A brief description of the Hadoop [23] and Spark [24] frameworks that are required for a database search are: a database of theoretical can be found in the Supplementary Material Section S1. The efficiency spectra, a reader to read the experimental query spectra, a similarity of the serialization can have a large impact on the performance of a function to calculate the score and a writer to write the results. First a distributed application. MzJava provides an efficient serialization PeptideFragmenter is created that generates b and y ion backbone mechanism for MzJava data classes that is based on Apache Avro peaks that have an intensity of 50 (line 3) and water loss peak for b (https://avro.apache.org/). Adapter classes are provided to integrate and y ion fragments that contain a S, T, D or E amino acid (line 4 to 6). the MzJava serialization with the mechanisms that are used by Hadoop The generated water loss peaks have an intensity of 10. The fragmenter and Spark. This allows MzJava to be used within these frameworks is then used while building a PeptideSpectrumDB that creates tryptic directly without requiring any additional serialization code. Hadoop peptides with 6 to 60 amino acids from a fasta file containing protein se- provides a pluggable API allowing different serialization frameworks quences (line 9 to 13). Spectra are read from a mgf file using a to be used. The MzJavaSerialization class implements this API to allow MgfReader that pre-processes the raw spectra on the fly (line 15 to MzJava data objects to be automatically serialized by Hadoop. 20). The pre-processing is specified by providing a PeakProcessorChain. Spark uses native Java or Kryo (https://github.com/EsotericSoftware/ In this example the processor chain will process the raw spectra by kryo) serialization. MzJava interoperates with Spark by providing a bridge centroiding (line 17), removing noise by keeping the most intense 2 between its classes and Kryo. In order to automatically serialize MzJava peaks in a 10 m/z sliding window (line 18) and then square root data objects Spark needs to be configured to use Kryo serialization and transforming the intensities (line 19). The normalized dot product sim- to use MzJavaKryoRegistrator for the registration of serializable MzJava ilarity function is initialized to require at least 5 peaks and use a toler- classes. The MzJava serialization can be extended to handle new classes ance of 0.02 Da when aligning peaks (line 22) and a result writer is by implementing the AvroReader and AvroWriter interfaces and tagging built (line 23). The query spectra are then read one at a time (while them with the ServiceProvider interface. MzJavaSerialization and loop line 24) and the similarity score is calculated (line 30) for each da- MzJavaKryoRegistrator then automatically use the reader and writer to se- tabase spectrum whose precursor m/z is within tolerance of the query rialize the new class. spectrum precursor mass (for loop line 28). This example could be ex- tended to calculate a PSM score that is more sophisticated by providing 3.3. Code quality a similarity function that is more appropriate.
At the time of writing the MzJava library contains 26,793 lines of 4.2. Code example Spark code in 529 classes. Since comparable libraries such as BioJava or Compomics have different purposes it appears rather difficult to bench- The next code example illustrates how to use MzJava in the Apache mark performance. However since SonarQube calculates code quality Spark framework in order to search query spectra against a spectrum scores, which are independent of the specific purpose of a library, library (Fig. 4). First we need to configure Spark to use Kryo and MzJava these scores can be used to compare different projects. Thus we run serialization to serialize MzJava classes automatically (lines 2 and 3). SonarQube on some Java libraries and this process is regularly repeated Then the library spectra are read from a.sptxt file and these spectra are
Table 3 Code Quality Scores.
MzJava BioJava ms-data-core-api jmzml Compomics
1.1.0 4.0.0-SNAPSHOT 0.11.27-SNAPSHOT 1.7.2-SNAPSHOT 3.49.7
Lines of code 26,793 100,734 14,243 6382 89,866 Classes 529 1024 126 144 523 Public documented API (%) 61.0% 48.5% 55.6% 70.6% 79.3% Duplicated lines (%) 3.1% 5.3% 4.4% 3.8% 7.3% Complexity/class 12.5 22.2 32.3 8.7 39.4 Issues/class 1 12 5 2 21 Test coverage 85.0% 36.2% 34.3% 43.4% 11.6% Package tangle index 0.0% 17.7% 6.4% 4.5% 23.7% 68 O. Horlacher et al. / Journal of Proteomics 129 (2015) 63–70
Fig. 3. MzJava code example for a MS/MS protein database search. made available to all nodes on the computer cluster (lines 6–8). Line 10 similarity score. In order to calculate the score of a match between defines a PairFunction instance to map a query spectrum to a list of query and library spectrum, a normalized dot product similarity score-peptide pairs; one pair for each library spectrum that is within function is created with a fragment tolerance of 0.02 Da (line 20) and tolerance of the query spectrum precursor mass and has a high an instance of DefaultSpectrumLibrary (line 21), which retrieves all
Fig. 4. MzJava code example for a spectrum library search using the Spark framework. O. Horlacher et al. / Journal of Proteomics 129 (2015) 63–70 69 broadcasted library spectra within 20 ppm of the query spectrum pre- Conflict of interests cursor mass (line 35). For each retrieved library spectrum, the similarity to the query spectrum is calculated and if this similarity is larger than All authors declare that they have no conflicting interests. 0.6, the library spectrum will be added to the list of results (lines 30–32). On line 41 Spark reads the query spectra from a Hadoop se- Acknowledgments quence file into a Spark RDD, and partitions these spectra across the nodes of the cluster. Finally, Spark sends the library search function to This work has been and is supported by the Swiss National Science all nodes of the cluster and applies it to the library spectra on these Foundation [SNSF 315230 130830, 31003A 141215 and CRSII3 nodes. Line 43 saves the results to a Spark object file. Supplementary 136282] and EU (FP7-PEOPLE-2012-ITN #316929). Figure S1 shows an example of how to use MzJava within the Hadoop map-reduce framework. Appendix A. Supplementary data 4.3. Code example glycomics Supplementary data to this article can be found online at http://dx. Supplementary Figure S2A describes how to build a glycan molecule doi.org/10.1016/j.jprot.2015.06.013. using the MzJava Glycan.Builder class. In this example, the N-glycan core, – – – – – i.e. Man(a1 3)[Man(a1 6)] Man(b1 4)GlcNAc(b1 4)GlcNAc(b1 ,is References built. First, monosaccharides and substitutes are read from a configura- tion file. The glucose root node is added and a NAcetyl is attached to the [1] R. Aebersold, M. Mann, Mass spectrometry-based proteomics, Nature 422 (2003) 198–207. second carbon of glucose. Then one more GlcNAc and three mannoses [2] M. Wuhrer, Glycomics using mass spectrometry, Glycoconj. J. 30 (2013) 11–22. are added to obtain the N-glycan core structure. In Supplementary [3] Y. Perez-Riverol, R. Wang, H. Hermjakob, M. Müller, V. Vesada, J.A. Vizcaíno, Open Figure S2B we describe a code example for the fragmentation of source libraries and frameworks for mass spectrometry based proteomics: a developer's perspective, Biochim. Biophys. Acta Protein Proteomics 1844 (2014) GalNAc(b1-?)Gal. The glycan structure is read from a GlycoCT encoded 63–76. string using the GlycoCTReader class. Then the ion types and fragment [4] A. Keller, A.I. Nesvizhskii, E. Kolker, R. Aebersold, Empirical statistical model to esti- types are set. A GlycanFragmenter object is created with the maximal mate the accuracy of peptide identifications made by MS/MS and database search, – numbers of glycosidic and crossring fragments set to 1. Finally a glycan Anal. Chem. 74 (2002) 5383 5392. [5] O. Kohlbacher, K. Reinert, C. Gröpl, E. Lange, N. Pfeifer, O. Schulz-Trieglaff, M. Sturm, fragment spectrum of singly charged fragments is constructed. TOPP—the OpenMS proteomics pipeline, Bioinformatics 23 (2007) e191–e197. [6] M. Sturm, A. Bertsch, C. Gröpl, A. Hildebrandt, R. Hussong, E. Lange, N. Pfeifer, O. — 4.4. Code quality Schulz-Trieglaff, A. Zerck, K. Reinert, O. Kohlbacher, OpenMS an open-source software framework for mass spectrometry, BMC Bioinf. 9 (2008) 163. [7] D. Kessner, M. Chambers, R. Burke, D. Agus, P. Mallick, ProteoWizard: open source The SonarQube results show that the MzJava code quality compares software for rapid proteomics tools development, Bioinformatics 24 (2008) favourably to that of other widely used open source projects. We have 2534–2536. [8] C. Steinbeck, Y. Han, S. Kuhn, O. Horlacher, E. Luttmann, E. Willighagen, The Chem- found that using SonarQube is an effective method for discovering istry Development Kit (CDK): an open-source java library for chemo- and bioinfor- problems in the MzJava code before they lead to bugs in research matics, J. Chem. Inf. Comput. Sci. 43 (2003) 493–500. code. We also use SonarQube to track test coverage, technical debt [9] A. Prlić, A. Yates, S.E. Bliven, P.W. Rose, J. Jacobsen, P.V. Troshin, M. Chapman, J. Gao, C.H. Koh, S. Foisy, R. Holland, G. Rimša, M.L. Heuer, H. Brandstätter–Müller, P.E. and documentation coverage. The good test coverage and low technical Bourne, S. Willis, BioJava: an open-source framework for bioinformatics in 2012, debt of MzJava makes the code more robust and maintainable. High test Bioinformatics 28 (2012) 2693–2695. coverage means that errors introduced during development are more [10] M. Vaudel, H. Barsnes, F.S. Berven, A. Sickmann, L. Martens, SearchGUI: an open- source graphical user interface for simultaneous OMSSA and X!Tandem searches, likely to make some test fail and can therefore be detected. The docu- Proteomics 11 (2011) 996–999. mentation in MzJava is fairly available and should keep on growing. [11] M. Vaudel, J.M. Burkhart, R.P. Zahedi, E. Oveland, F.S. Berven, A. Sickmann, L. Martens, H. Barsnes, PeptideShaker enables reanalysis of MS-derived proteomics – 5. Conclusions data sets, Nat. Biotechnol. 33 (2015) 22 24. [12] H. Barsnes, M. Vaudel, N. Colaert, K. Helsens, A. Sickmann, F.S. Berven, L. Martens, compomics-utilities: an open-source Java library for computational proteomics, MzJava provides a well-engineered and well-tested API with BMC Bioinf. 12 (2011) 70. building blocks commonly required to write software processing mass [13] F. Gluck, C. Hoogland, P. Antinori, X. Robin, F. Nikitin, A. Zufferey, C. Pasquarello, V. Fétaud, L. Dayon, M. Müller, F. Lisacek, L. Geiser, D. Hochstrasser, J.-C. Sanchez, A. spectrometry data in research project. This speeds up the development Scherl, EasyProt — an easy-to-use graphical platform for proteomics data analysis, of new mass spectrometry software, allows bioinformaticians to focus J. Proteomics 79 (2013) 146–160. on writing new algorithms and innovative solutions and helps avoiding [14] E. Ahrné, F. Nikitin, F. Lisacek, M. Müller, QuickMod: a tool for open modification spectrum library searches, J. Proteome Res. 10 (2011) 2913–2921. boilerplate code. In terms of code quality it compares favourably to [15] H. Pak, F. Nikitin, F. Gluck, F. Lisacek, A. Scherl, M. Muller, Clustering and filtering other open source Java projects. Besides classes to deal with proteomics tandem mass spectra acquired in data-independent mode, J. Am. Soc. Mass data, MzJava also provides new functionality to fragment glycans and Spectrom. 24 (2013) 1862–1871. [16] Y. Perez-Riverol, J. Uszkoreit, A. Sanchez, T. Ternent, N. del Toro, H. Hermjakob, J.A. annotate glycan spectra. Interfaces to Hadoop MapReduce and Apache Vizcaíno, R. Wang, ms-data-core-api: an open-source, metadata-oriented library Spark facilitate large scale computing and parallelization of the code for computational proteomics, Bioinformatics (2015) btv250. often required in systems biology applications. [17] E.K. Nelson, B. Piehler, J. Eckels, A. Rauch, M. Bellew, P. Hussey, S. Ramsay, C. Nathe, K. Lum, K. Krouse, D. Stearns, B. Connolly, T. Skillman, M. Igra, LabKey Server: an open source platform for scientific data integration, analysis and collaboration, Abbreviations BMC Bioinf. 12 (2011) 71. API Application Programming Interface [18] A. Bauch, I. Adamczyk, P. Buczek, F.-J. Elmer, K. Enimanev, P. Glyzewski, M. Kohler, T. MS mass spectrometry Pylak, A. Quandt, C. Ramakrishnan, C. Beisel, L. Malmström, R. Aebersold, B. Rinn, openBIS: a flexible framework for managing and analyzing complex data in biology LC liquid chromatography research, BMC Bioinf. 12 (2011) 468. [19] J. Häkkinen, G. Vincic, O. Månsson, K. Wårell, F. Levander, The proteios software Availability environment: an extensible multiuser platform for management and analysis of proteomics data, J. Proteome Res. 8 (2009) 3037–3043. [20] D. Damerell, A. Ceroni, K. Maass, R. Ranzinger, A. Dell, S.M. Haslam, Annotation of MzJava is an open-source project distributed under the AGPL v3.0 Glycomics MS and MS/MS Spectra Using the GlycoWorkbench Software Tool, in: licence. MzJava requires Java 1.7 or higher. Binaries, source code and T. Lütteke, M. Frank (Eds.), Methods in Molecular Biology, vol 1273: Glycoinformatics, Springer Science+Business Media, New York 2015, pp. 3–15. documentation can be downloaded from http://mzjava.expasy.org and [21] B. Domon, C.E. Costello, A systematic nomenclature for carbohydrate fragmentations https://bitbucket.org/sib-pig/mzjava. in FAB-MS/MS spectra of glycoconjugates, Glycoconj. J. 5 (1988) 397–409. 70 O. Horlacher et al. / Journal of Proteomics 129 (2015) 63–70
[22] M.P. Campbell, R. Ranzinger, T. Lütteke, J. Mariethoz, C.A. Hayes, J. Zhang, Y. Akune, [31] J. Bloch, How to design a good API and why it matters, OOPSLA 062006. K.F. Aoki-Kinoshita, D. Damerell, G. Carta, W.S. York, S.M. Haslam, H. Narimatsu, P.M. [32] J.T. Dudley, A.J. Butte, A quick guide for developing effective bioinformatics Rudd, N.G. Karlsson, N.H. Packer, F. Lisacek, Toolboxes for a standardised and programming skills, PLoS Comput. Biol. 5 (2009) e1000589. systematic study of glycans, BMC Bioinf. 15 (2014) S9. [33] G.K. Sandve, A. Nekrutenko, J. Taylor, E. Hovig, Ten simple rules for reproducible [23] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, computational research, PLoS Comput. Biol. 9 (2013) e10003285. Commun. ACM 51 (2008) 107–113. [34] G. Wilson, D.A. Aruliah, C.T. Brown, N.P. Chue Hong, M. Davis, R.T. Guy, S.H.D. [24] M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, I. Stoica, Spark: cluster comput- Haddock, K.D. Huff, I.M. Mitchell, M.D. Plumbley, B. Waugh, E.P. White, P. Wilson, ing with working sets, Proc. 2nd USENIX Conf. Hot Top. Cloud Comput, 2010. Best practices for scientific computing, PLoS Biol. 12 (2014) e1001745. [25] R.C. Taylor, An overview of the Hadoop/MapReduce/HBase framework and its cur- [35] F. da Veiga Leprevost, V.C. Barbosa, E.L. Francisco, Y. Perez-Riverol, P.C. Carvalho, On rent applications in bioinformatics, BMC Bioinf. 11 (2010) S1. best practices in the development of bioinformatics software, Bioinform. Comput. [26] B. Pratt, J.J. Howbert, N.I. Tasman, E.J. Nilsson, MR-Tandem: parallel X!Tandem using Biol. 5 (2014) 199. Hadoop MapReduce on Amazon Web Services, Bioinformatics 28 (2012) 136–137. [36] K. Beck, Test-driven Development, Addison-Wesley Professional, 2003. [27] A. Kalyanaraman, W.R. Cannon, B. Latt, D.J. Baxter, MapReduce implementation of a [37] P.M. Duvall, S. Matyas and A, Glover, Continuous Integration, Improving Software hybrid spectral library-database search method for large-scale peptide identifica- Quality and Reducing Risk, Addison-Wesley Professional, 2007. tion, Bioinformatics 27 (2011) 3072–3073. [38] R.G. Côté, F. Reisinger, L. Martens, jmzML, an open-source Java API for mzML, the PSI [28] C.-L. Hung, G.-J. Hua, Cloud computing for protein-ligand binding site comparison, standard for MS data, Proteomics 10 (2010) 1332–1335. Biomed. Res. Int. 2013 (2013) 170356. [39] D.M. Creasy, J.S. Cottrell, Unimod: protein modifications for mass spectrometry, [29] M.S. Wiewiórka, A. Messina, A. Pacholewska, S. Maffioletti, P. Gawrysiak, M.J. Proteomics 4 (2004) 1534–1536. Okoniewski, SparkSeq: fast, scalable and cloud-ready tool for the interactive [40] S. Herget, R. Ranzinger, K. Maass, C.W.v.d. Lieth, GlycoCT—a unifying sequence genomic data analysis with nucleotide precision, Bioinformatics 30 (2014) format for carbohydrates, Carbohydr. Res. 343 (2008) 2162–2171. 2652–2653. [30] J. Freeman, N. Vladimirov, T. Kawashima, Y. Mu, N.J. Sofroniew, D.V. Bennett, J. Rosen, C.-T. Yang, L.L. Looger, M.B. Ahrens, Mapping brain activity at scale with cluster computing, Nat. Methods 11 (2014) 941–950. Article
Cite This: Anal. Chem. 2017, 89, 10932-10940 pubs.acs.org/ac
Glycoforest 1.0 † ‡ § † ‡ † ‡ † ‡ Oliver Horlacher, , Chunsheng Jin, Davide Alocci, , Julien Mariethoz, , Markus Müller, , § † ‡ Niclas G. Karlsson, and Frederique Lisacek*, , † Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva, 1211, Switzerland ‡ University of Geneva, Geneva, 1211, Switzerland § Glyco Inflammatory Group, Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, SE405 30, Sweden
*S Supporting Information
ABSTRACT: Tandem mass spectrometry, when combined with liquid chromatography and applied to complex mixtures, produces large amounts of raw data, which needs to be analyzed to identify molecular structures. This technique is widely used, particularly in glycomics. Due to a lack of high throughput glycan sequencing software, glycan spectra are predominantly se- quenced manually. A challenge for writing glycan-sequencing software is that there is no direct template that can be used to infer structures detectable in an organism. To help alleviate this bottleneck, we present Glycoforest 1.0, a partial de novo algorithm for sequencing glycan structures based on MS/MS spectra. Glycoforest was tested on two data sets (human gastric and salmon mucosa O-linked glycomes) for which MS/MS spectra were annotated manually. Glycoforest generated the human validated structure for 92% of test cases. The correct structure was found as the best scoring match for 70% and among the top 3 matches for 83% of test cases. In addition, the Glycoforest algorithm detected glycan structures from MS/MS spectra missing a manual annotation. In total 1532 MS/MS previously unannotated spectra were annotated by Glycoforest. A portion containing 521 spectra was manually checked confirming that Glycoforest annotated an additional 50 MS/MS spectra overlooked during manual annotation.
lycomics is a 21st century endeavor that is gaining generated. It appears then that glycomics needs the automation G momentum1 and importance in biological knowledge.2 of analytical methods to enable large-scale studies. An Glycans, in the form of both polysaccharides and glycoconju- increasing range of software tools is available for analyzing gates, are the most abundant class of biomolecules and are MS/MS data produced in glycomics (see Li et al.4 recent book Downloaded via UNIV OF GENEVA on July 30, 2018 at 15:14:01 (UTC). increasingly recognized as being implicated in a wide range of chapter or Maxwell et al.5 for review), but very few were biological processes. Despite this ubiquitous presence in living defined to be operational in high throughput settings. Indeed, systems, glycomics has lagged behind genomics and proteo- automating the identification of glycans from MS/MS spectra is
See https://pubs.acs.org/sharingguidelines for options on how to legitimately share published articles. mics, in particular, because, until recently, high throughput challenging because there is no direct template, such as the methods for the large scale characterization of the branching genome in proteomics, which can be used to generate a structures of complex carbohydrates, especially while con- database of all glycan structures in a given organism. Instead, jugated to the carrier protein, have been missing. In fact, the possible structures have to be inferred from enzyme activity, or only way to generate oligosaccharide structural data is by de novo algorithms have to be used to assign the spectra. As Li 4 biochemical analysis, but full and detailed characterization of et al. had comprehensively covered the range of existing tools, glycan structures presented by a native protein is a challenging we nonetheless mention the recent progress since Glyco- 6 analytical process that requires sensitive, quantitative, and Workbench, one of the most popular software designed to robust technologies to determine monosaccharide composition, speed up and ease the manual interpretation of MS/MS data. linkages, and branching sequences. Nowadays, tandem mass GlycoWorkbench evaluates a set of user-selected structures spectrometry (MS/MS) is the most established analytical upon establishing how well their theoretical fragment masses technique used to address the challenges of glycomics as it match the real MS/MS peak list, but it is very low-throughput offers high levels of sensitivity and can handle complex as it only processes one spectrum at a time. Such a drawback is mixtures.3 Robust analytical technology combined with strong bioinformatics support is common to most −omics applica- Received: July 14, 2017 tions. Such a combination is designed to extract as much Accepted: September 13, 2017 information as possible from the information-rich data sets Published: September 13, 2017
© 2017 American Chemical Society 10932 DOI: 10.1021/acs.analchem.7b02754 Anal. Chem. 2017, 89, 10932−10940 Analytical Chemistry Article tackled in Glycomics Elucidation and Annotation Tool fragment spectra was proposed in glycomics by Kameyama et (GELATO), a component of the GRITS toolbox (http:// al.18 The experimentally determined MS/MS spectral library www.grits-toolbox.org). This software was designed to be able anticipated is now stored in UniCarb-DB.19 to handle combined MS and MS/MS glycomic files but is Spectra libraries rely on two core principles beyond lacking a module for LC-MS and LC-MS/MS peak mainstream approaches: (1) matching unidentified spectra to identification. Automation of glycomic MS/MS can be further spectra in a spectral library or to unidentified spectra is more increased with the use of Smart Annotation Enhancement efficient than matching to theoretical spectra based on Graph (SAGE),7 a user behavior learning method. SAGE scores reference sequences and (2) consensus interpretation of sets select or reject annotations based on prior structure assign- of related spectra is more reliable than identification of one ments made by a user. spectrum at a time. This type of method relies on spectrum Despite this recent progress, an LC-MS/MS database search clustering and primarily on the definition of similarity. Despite strategy similar to the most common identification approach in some attempts to refine this definition and sharpen spectral proteomics still cannot be easily transposed to glycomics. This matches (e.g., Spec2spec and Tremolo),20,21 the normalized is mainly due to the absence of a comprehensive and well- dot product remains a widely used, reliable, and simple curated set of experimentally derived glycan structures, measure. In this study, we have transposed this approach in the notwithstanding much-improved efforts for collecting − context of glycomics. So far, only one tool implements a similar data,8 10 but also to the lack of subtle scoring schemes. The method, SweetNET,22 which is applied to solving glycopeptide first database search strategy was implemented in GlycosidIQ, structures. which generates a theoretical peak list for each structure in a In this Article, we introduce Glycoforest 1.0, a partial de novo database by computing all its theoretical fragments.11 A scoring algorithm for sequencing glycan structures based on MS/MS function determines the best match between the theoretical spectra. Glycoforest first clusters all MS/MS spectra in each peak lists and the MS/MS spectra. This approach described in individual LC-MS/MS run to collect spectra that belong to the 2004 was limited by the availability of reliable data at the time. same glycan, generating high-quality consensus spectra. Glycan In fact, recent efforts have mostly focused on glycoproteomics structures are then assigned to the consensus spectra using a considering glycans as a source of mass shifts between partial de novo glycan sequencing algorithm. The sequencing glycosylated and nonglycosylated peptides. This approach is algorithm makes use of a similarity measure that was previously difficult due to the unmanageable expansion of the search defined for finding unknown modifications in peptides, space. For example, one of the latest methods tackling large- otherwise called “open modification search”.23,24 The same 12 scale issues is GlycoMasterDB. It uses higher energy collision principle is applied to search a library of reference spectra to dissociation (HCD) spectra to identify glycan structures/ find references that share structural similarities with the query masses and electron transfer dissociation (ETD) to identify the but differ in glycan composition. The score from the “open peptide sequence by matching against a database of glycans and modification search” and the glycan structures obtained from peptides, respectively. The glycan database is constructed from the reference spectra are then used to generate and score glycan GlycomeDB. An alternative option to database search is to spectra matches. This algorithm is partially de novo because 13 adopt a de novo strategy for glycans and/or glycopeptides, but reference spectra for which the structure is known are required such approaches have so far not been extensively validated. In as a starting point for generating structures that are not present all cases, false discovery rate (FDR) calculations are not in the reference set. The sequencing results, including query straightforward and more often than not missing for glycan spectra for which no structure was found, are visualized using a fi 25 identi cations. Finally, scoring schemes are not easy to spectral network that shows the potential structural relation- determine either, since result validation is not fool proof and ships between consensus spectra. Glycoforest was evaluated sometimes glycobiologist dependent. against manually annotated results obtained from data sets of ffi These examples highlight the objective di culty of human gastric (16 raw files) and Atlantic salmon (20 raw files) implementing analytical pipelines. Nevertheless, the annotation mucin-type O-glycan MS/MS spectra. of fragment spectra comprises a series of tedious and repetitive steps whose automation is straightforward and can result in a ■ MATERIALS AND METHODS substantial decrease of the time needed for sequencing a structure. In that respect, recent progress in proteomics and Raw Data. The data used in this study were obtained from metabolomics can benefit glycomics. Small molecule identi- human gastric and Atlantic salmon mucin-type O-glycan MS/ fication has long been relying on the use of spectral MS data set. Both data sets are manually annotated and have libraries,14,15 and in proteomics, a significant number of articles been previously described.26,27 The raw data was downloaded also promote the use of direct spectral matching and the from Swestore (http://srm.swegrid.se/). Briefly, O-glycans construction of spectral libraries to support reliable peptide were released from mucins, and the reduced glycans were identification and quantification, especially in high-throughput analyzed by liquid-chromatography tandem mass spectrometry settings.16,17 Computationally, using spectral libraries is also (LC-MS/MS) using a porous graphitized carbon particle much more efficient as it skips the construction of theoretical column. The eluted O-glycans were detected using an LTQ spectra. The obvious downside is the potential incompleteness mass spectrometer (Thermo Scientific, San Jose, CA, US) in of the library and its subsequent limited coverage. Nonetheless, negative ion mode. The gastric mucin data set consists of 16 the constant improvement of instrument accuracy, sensitivity, raw mass spectrometry files (.raw) and one Excel file that and accumulation of knowledge in glycobiology has contributed contains a summary of the manually sequenced glycans and to the generation of libraries with good coverage. Thus, we their associated retention times in each run. The Atlantic chose to associate automation with the establishment of glycan salmon data set contains 20 raw files as well as a summary Excel MS/MS spectral libraries. The idea of assigning spectra through file of the retention times. This amounts to 36 raw files that matching a library consisting of experimentally determined contain 99 937 glycan and nonglycan MS/MS spectra.
10933 DOI: 10.1021/acs.analchem.7b02754 Anal. Chem. 2017, 89, 10932−10940 Analytical Chemistry Article
Annotation of Known Structures. The manually and a consensus spectrum was built from each cluster. Thus, annotated O-glycan structures that were used to assess the each consensus spectrum represents at least one glycan Glycoforest algorithm were obtained from the summary Excel structure. The aim of clustering is to collect all MS/MS spectra files of the respective papers.26,27 Each row of the retention from the same structure without generating clusters that are time table contains a glycan sequence and the retention time of contaminated with spectra from a different structure. This can that sequence for each run. For each structure, the elution be difficult due to the coelution of isomeric structures. To peaks that contain spectra from that structure were identified account for this, clustering parameters were chosen to prevent using the retention time listed for each run. Once the correct the generation of clusters that contain spectra from more than elution peak was identified, the MS/MS spectra that belong to one structure. The downside of this choice is the possible that peak were annotated with the structure. In cases of generation of multiple clusters for a single structure. However, coelution, each MS/MS spectrum was annotated with the the merging of redundant clusters is easier to address than the structure that contributed more ion current based on the shape splitting of ambiguous clusters. Like other MS/MS clustering of the elution peak. algorithms,31 Glycoforest uses agglomerative hierarchical Methods. Glycoforest 1.0 is a Java program written using clustering based on spectra similarity. However, clustering in the MzJava software library28 and Apache Spark29 (http:// Glycoforest differs from other clustering algorithms on two key spark.apache.org). The spectra data set was sequentially points: (i) the clusters are constrained using within run processed by going through three main steps (Figure 1): (1) retention times; (ii) instead of setting a similarity threshold for preparing data for sequencing; (2) glycan structure sequencing; the definition of a cluster, a statistical model of the clusters (3) spectral network generation. controls and terminates the clustering process when the model with the minimum Bayesian Information Criterion (BIC)32 is found. Briefly, to make clustering more efficient, spectra of each run were grouped on the basis of precursor charge state, retention time, and precursor m/z value. Each of the groups was then clustered. During the clustering, Glycoforest keeps track of the cluster assignment of raw MS/MS spectra so that the consensus spectra can be generated from the unprocessed MS/MS spectra Figure 1. Main steps of the Glycoforest workflow for assigning of each cluster. Further details of the clustering algorithm are glycans. In step 1, raw MS/MS data is clustered and consensus spectra provided in the Supporting Information. are generated. In step 2, the sequencing algorithm is used to assign a Finally, consensus spectra were created by combining all structure to each consensus spectrum. In step 3, a spectral network is peaks from the raw MS/MS spectra that belong to a cluster into built to visualize the relationship between the reference, assigned, and one spectrum. Peaks with similar m/z values (±0.3 Da) were unassigned spectra. grouped, and each group was replaced by a single centroid peak. During centroiding, the number of peaks that were used Step 1 collates the raw MS/MS spectra in each run using a to generate the centroid was recorded to keep track of the clustering algorithm to generate consensus spectra. Step 2 uses number of MS/MS spectra in which the peak occurred the sequencing algorithm to assign structures to each consensus (spectrum count). The intensity of the centroid peak was spectrum. Step 3 then creates a spectral network to visualize the calculated by summing the intensity of all the peaks that were structural relationships that are available to the sequencing merged and taking the square root. Once the consensus spectra algorithm. had been generated, isotopic consensus spectra were removed Preparing Data for Sequencing. The raw MS/MS using the isotope removal algorithm described in the spectra were first prepared for sequencing by data conversion, Supporting Information. filtering, and then clustering to build consensus spectra. The 36 Test Data and Reference Selection. All consensus spectra raw files were centroided and converted to mzXML using that include spectra that were manually annotated with an O- MSConvert.30 The resulting mzXML files were then converted glycan structure (as described in the Materials and Methods) to Hadoop files (http://hadoop.apache.org) using the MzJava constitute what we call the test set. The Glycoforest sequencing library.28 To reduce the number of low quality and nonglycan algorithm requires as reference a set of high-quality spectra that spectra, the following rules were applied to select MS/MS are unambiguously annotated. This set of spectra is called the spectra: contain at least 10 peaks, have a retention time reference set. Consensus spectra were included in the reference set between 10 and 30 min, have a precursor m/z that is within if they had one structure annotation and at least one peak for tolerance (±0.3) of a glycan composition with at least one each glycosidic bond in the structure. The ion types that were HexNAc. considered when determining if a bond is present were B, C, Y, Only the MS/MS spectra meeting the criteria above were and Z ions with all charges smaller than or equal to the considered for clustering. Details on how the glycan consensus precursor charge. We assumed that all manually compositions were generated can be found in the Supporting annotated structures (usually short glycans) are correct in this Information. study. Building Consensus Spectra. It is generally accepted that Glycan Sequencing. We define here “open similarity” to combining MS/MS spectra into consensus spectra improves be the similarity of two spectra, that have different precursor the interpretation of spectral matches. Due to the large number masses and therefore have different glycan compositions. More of MS/MS spectra collected in this work, we employed a precisely, the open similarity accounts for the m/z shift of a clustering strategy to collect spectra that belong together subset of the peaks due to the addition, removal, or substitution followed by applying a merging procedure to create consensus of monosaccharides (Hex, Fuc HexNAc, Neu5Ac, Neu5Gc, spectra. Practically, MS/MS spectra in each run were clustered Kdn) and sulfate (S) as summarized in Table S-1. The
10934 DOI: 10.1021/acs.analchem.7b02754 Anal. Chem. 2017, 89, 10932−10940 Analytical Chemistry Article similarity was calculated using the function previously defined Step 1: For each query spectrum, the algorithm goes through to find unknown modifications in peptides.23,24 The Glyco- the list of reference spectra and checks if the mass difference forest sequencing algorithm exploits the fact that the open between the reference spectrum and query spectrum matches a similarity of two consensus spectra (S1 and S2) correlates with transition mass. If it does, the transitions for the mass are the structural similarity of the associated glycans (G1 and G2). retrieved and a reference spectrum transition pair is kept for The composition difference, which was inferred from the mass each transition. For example, a transition mass of 16 Da has two difference between S1 and S2, combined with open similarity transitions associated, substitution of Fuc for Hex and s was used to generate and score hypothesis about structural substitution of NeuAc for NeuGc. The open similarity ( o) changes (structure transition) between G1 and G2. between the reference and query spectrum is then calculated, as The structure transitions that were considered were: described by Horlacher et al.24 See the Supporting Information addition, insertion, removal, and substitution, which are listed for a brief description. Figure 2 shows an example of the in Table S-1. For example, given that the structure G1 for sequencing steps with a query spectrum that has a precursor m z fi spectrum S1 is known (e.g., S1 was obtained from the reference mass of / 878 (NeuAc1Hex1HexNAc2) that matches ve set), the mass difference is Δ162 Da (+Hex), and the open reference spectra within the transition tolerance. similarity between S1 and S2 is high, it is likely that the Step 2: Stepping through all reference spectra transition pairs structure of G2 is based on G1 with the addition of a Hex and applying the transition to the reference structures produces residue. A set of potential candidate structures for G2 can then a list of glycan candidates. Depending on the structure, this be obtained by generating all structures that have Hex added to generates 0 to n candidates. In the example in Figure 2, three G1. However, the candidate set is not guaranteed to contain candidates are generated from the first reference spectrum G2. This is because a high open similarity does not guarantee (NeuAc1Hex1HexNAc1) by adding one HexNAc. that the transition is the only structural difference between G1 Step 3: A score is calculated for each candidate using eq 1. and G2. In fact, it is possible that G2 is not derived from G1 cs s w s w (1) (e.g., G1 is linear and G2 is branched). To increase the =×oott +× s likelihood of generating G2 as a candidate, Glycoforest where o is the open similarity score between the reference and s generates a set of candidates from multiple transitions to S2 query spectrum, t is the similarity of the query spectrum and w w (Figure 2). The candidate sets, along with the open similarity the theoretical spectrum of the candidate, and o and t are weights assigned to the score components. The weights were set to 0.5 to keep the candidate score (cs) between 0 and 1. The s theoretical spectra to calculate t were generated from the candidate using B, C, Y, and Z ions. In the example of Figure 2, the scores of the three candidates generated from the first reference spectrum are 0.70, 0.56, and 0.30, respectively. Step 4: Candidates are grouped by structure and sorted by score. Step 5: The GSM score is then calculated for each structure group using eq 2. 7 1 1 sGSM=++cs 1 cs 2 cs3 9 9 9 (2) cs cs cs where 1, 2, and 3 are the candidates with the three highest candidate score. If there are only one or two candidates, the cs of the missing scores is set to 0. The final GSM result list is then sorted by GSM score. In the example shown in Figure 2, six candidate glycans are assigned to the query spectrum with Figure 2. Outline of the glycan sequencing algorithm. Green ovals scores ranging from 0.23 to 0.75. represent reference spectra. The colored lines show the progress of Spectral Network Generation. To visualize the structural each candidate structure through the algorithm. Step 1 of the fi dependencies of the reference and query spectra, a spectral algorithm nds reference spectra transition pairs. Step 2 generates network was generated. In this network, a node is a consensus glycan candidates using the reference structure and their associated spectrum, which can have associated structure annotations transitions. For easy visualization, the residue that was altered is shown in orange. In step 3, a score is calculated for each candidate. In step 4, (from manual and/or Glycoforest sequencing). Each edge in candidates are grouped by structure and sorted by score. Step 5 the network represents a structure transition (Table S-1) produces the final GSM list by calculating a GSM score from the top associated with the corresponding open similarity score of the three candidates in each structure group. two connected nodes. Practically, the spectral network was constructed by first creating a node for each consensus spectrum. Edges were then created between each pair of scores, are then used by the sequencing algorithm to generate a nodes when: (i) the two nodes have the same precursor charge scored and ranked list of glycan spectrum matches (GSM) for and (ii) the mass difference between the two nodes is within S2. All the GSMs in the result list are potential candidates for tolerance of a structure transition. The open similarity for each the structure of G2 with higher scoring GSMs being better edge was calculated as described above. candidates than lower scoring GSMs. Measuring Sequencing Performance. Comparing The sequencing algorithm generates a sorted list of GSMs for Structures. A graph isomorphism algorithm was used to each query spectrum. The algorithm is outlined in Figure 2 and determine if a GSM structure is the same as the structure with works as follows: which the consensus spectrum was manually annotated. The
10935 DOI: 10.1021/acs.analchem.7b02754 Anal. Chem. 2017, 89, 10932−10940 Analytical Chemistry Article graph isomorphism algorithm checks the number and type of residues, the location of the reducing end, and which residues are connected to each other. However, the anomericity and linkage of glycosidic bonds are not taken into consideration. The graph isomorphism algorithm that was used is from JGraphT (v 0.9.2) (https://github.com/jgrapht/jgrapht). For consensus spectra that were annotated with more than one structure, the generated structure was checked against all annotated structures. If the generated structure matched one of the annotated structures, the result was judged to be correct. Cross Validation. The performance of the glycan sequencing algorithm was measured using cross validation. Each of the 36 runs was evaluated independently. The query spectra were obtained from the run that is under evaluation and the reference spectra were selected from the remaining 35 runs. Figure 3. A treemap showing the consensus spectra and their annotations. In total, the clustering produced 4203 consensus spectra, The overall result was obtained by combining the results for of which 1866 have manual annotation and 2337 do not. Of the each run as shown in Tables S-2 and S-3. consensus with manual annotation, 1849 (purple) are annotated with Fragment Delta. When comparing GSMs to annotated one structure and 17, with two structures. The 207 (green) high- structures, we observed that in many cases where the GSM with quality consensus spectra that were used as references are a subset of the highest score was not correct, it was nonetheless very the consensus with one annotation. Of the remaining consensus similar to the annotated structure. To quantify how much the spectra that do not have manual annotation, Glycoforest found GSMs proposed structure differs from the correct one and to count for 1532 consensus spectra (cyan) but could not find GSMs for 805 the number of spectra for a given similarity, we calculate a (gray) spectra. The GSMs for 521 consensus spectra were checked “fragment delta” which is the number of fragments that differ manually. 195 of these were previously found; 153 were isotopes; 72 between the correct and proposed structures. The fragment were nonglycan spectra; 51 were peeling products; 50 consensus fi spectra contained structures that were overlooked or wrongly assigned delta was calculated by rst making theoretical spectra for both during the initial manual annotation. the proposed and correct structures using only one ion type (Y ions), followed by counting the number of ions that do not occur in both spectra and dividing by two. See Figure S-6 for an the performance of Glycoforest. ac7b02754_si_004.zip con- example. tains the results of known structures that were assigned to new Interpretation. To evaluate the results, we recorded whether spectra and the results for structures that were overlooked the correct structure was present in the GSM list, the position during previous manual annotation. of the GSM with the correct structure in the GSM list, and the The sequencing algorithm accuracy was evaluated using the fragment delta of the highest scoring GSM. To visualize the test set containing 1866 manually annotated spectra according dependency of the score with the ranking of the correct to the following criteria: (1) Did Glycoforest generate a GSM structure, a histogram was generated by (i) binning the results list that was not empty? (2) Was the GSM with the correct according to scores and (ii) recording the frequency of the rank structure in the result list? (3) What was the rank of the GSM of the correct structure in each bin. The frequency for the 5th with the correct structure and what was the relationship to 51st ranking positions was collected together for easier between the rank of the correct GSM and the highest GSM visualization. Results for which the correct structure was not in score in the list? The result lists for query spectra that were “ ” the GSM list were recorded in the category labeled absent . empty were only counted for criteria 1. For 99% of the query The histogram for the dependency of the score with fragment spectra, Glycoforest generated a GSM list that contains one or delta was generated as above except the fragment delta was more GSMs (criteria 1). It performed especially well on singly recorded for each bin instead of the ranking position. charged ([M − H]−) spectra, where a result was generated for every query spectrum. The accuracy was slightly worse (96%) ■ RESULTS AND DISCUSSION on doubly charged spectra ([M − 2H]2−). Of the nonempty A total of 99 937 MS/MS spectra derived from 36 raw files result lists, 92% contain the GSM with the correct structure were used to evaluate Glycoforest. The preprocessing and (criteria 2). Again, the performance on singly charged spectra clustering reduced this to 4203 consensus MS/MS spectra. was higher (98%) than for doubly charged spectra (68%). Figure 3 shows a treemap of the consensus spectra and their Table A in Figure 4 shows the rank of the correct GSM in the annotations. 1866 of the consensus spectra have manual result list (criteria 3). Overall, 70% of the result lists have the annotations and 2337 do not. Of the consensus with manual correct GSM in the first position. This splits into 78% for singly annotation, 1849 are annotated with one structure and 17, with charged and 40% for doubly charged. When taking the top two structures. Glycoforest was run to assign structures to the three GSMs into account, this increases to 83% (singly and remaining consensus spectra that had not been annotated. doubly charged), 90% (singly charged), and 53% (doubly Glycoforest found GSMs for 1532 queries. Part of these GSMs charged). The relationship between the rank of the correct (521 queries) were checked manually. Among them, 195 of the GSM and the score of the top-ranking GSM is shown in plot A GSMs were previously found; 153 were isotopes; 72 were of Figure 4. The plot shows that as the score decreases the nonglycan spectra; 51 were peeling products; 50 consensus fraction of GSM lists with a wrong structure in the first position spectra contained structures that had been overlooked and increases, especially for doubly charged spectra. A large wrongly assigned in the initial manual annotation. All results are contributor to the lower accuracy of doubly charged results is provided as zipped Excel sheets. ac7b02754_si_003.zip the small number of doubly charged reference spectra. This contains the results of the structures that were used to assess shortage resulted in the correct GSM not being generated for
10936 DOI: 10.1021/acs.analchem.7b02754 Anal. Chem. 2017, 89, 10932−10940 Analytical Chemistry Article
Figure 4. Evaluation of the sequencing algorithm using the test set. (A) The relationship between the GSMs’ score and the rank of the correct structure. Table A shows the percentage of correct results for a given rank with cumulative percentages in brackets. Plot A shows that as the score decreases the fraction of result lists with a wrong structure in the first position increases, especially for doubly charged spectra. (B) The relationship between the score and the fragment delta of the top-ranking GSM. A small fragment delta means that the difference between the proposed and correct structure is small. For example, a fragment delta equal to 1 could be due to a reversal of two adjacent residues or the shift of a branching residue by one position. Table B shows the percentage of correct results for a given fragment delta. Plot B shows that there is a correlation between the fragment delta and GSM score. As the GSM score decreases, the proportion of results with a fragment delta greater than 0 increases, and the size of the fragment delta also increases.
32% of the doubly charged query spectra. If only considering proposed structure into the correct structure. As shown in the doubly charged query spectra where the correct structure is Figure 4B, there is a correlation between the fragment delta and present in the GSM list, the accuracy increases to 60% for the GSM scores. As the GSM score decreases, the proportion of top-ranking GSM and 78% for the top three GSMs. While results with a fragment delta greater than 0 increases, and the these results are still not as good as the singly charged results, size of the fragment delta also increases. 70% of the results have they nonetheless show that the scoring worked reliably where a fragment delta equal to 0, which means that the top-ranked Glycoforest was able to generate the correct structure. GSM structures are correct. 13% have a fragment delta that is We found that for many of the results where the first GSM equal to 1 implying that the difference between the GSM was wrong they were nonetheless very similar to the correct structure and the correct structure is due to a reversal of two structure. To quantify this, the fragment delta between the adjacent residues or the shift of a branching residue by one proposed structure and correct structure was computed (Figure position (for an example, see Figure S-7). A further 6.5% of the 4B). The fragment delta was chosen because it is a results have a fragment delta equal to 2. The structural spectroscopic similarity that correlates with structural similarity differences for these GSMs include a larger shift of a branching and therefore with the transformation required to change the residue or the shift of a residue by 2 positions (Figure S-7). The
10937 DOI: 10.1021/acs.analchem.7b02754 Anal. Chem. 2017, 89, 10932−10940 Analytical Chemistry Article remaining 9.6% of the results have a fragment delta between 3 and 6, indicating a larger structural difference between these GSMs and the correct structure. Splitting the results into singly and doubly charged shows a fragment delta smaller than 3 in 96% of the singly charged spectra. This suggests that these spectra have at most two residues in the wrong place. Only 66% of the doubly charged spectra have a fragment delta that is less than 3. To explore how much the individual parameter of the scoring function (eq 1) contributed to the final score, we repeated the s s evaluation using only t and only o. The resulting accuracies, for singly and doubly charged spectra combined, along with the accuracy obtained from eq 1, are shown in Table 1. The best Figure 5. Spectral network of the consensus and query spectra. The accuracy is obtained by combining the two parameters. Only node colors show the source of the structure annotation, and the node using s reduces the accuracy by 18% while only using s size is proportional to the precursor mass of the consensus spectrum. t o Green nodes are reference spectra that have high-quality structure decreases it by 13%. annotations derived from manual glycan sequencing; purple nodes are a nonreference spectra with a structure annotation derived from manual Table 1. Score Component Accuracies sequencing; cyan nodes have structure annotations that were − − − 2− generated by Glycoforest; gray nodes do not have any structure score component top ranking [M H] and [M 2H] annotations. The network on the left shows singly charged spectra, and s s t and o 70% the network on the right shows doubly charged spectra. A high s t 58% resolution scalable vector graphic (svg) of this figure is available in the s o 63% Supporting Information. a Accuracies that are obtained depend on which scores are included in the GSM score. possible by rerunning Glycoforest after curating the newly assigned spectra. Glycoforest can then select reference spectra Because the test data contains samples from both fish and from the cyan nodes and possibly infer a structure for the gray human samples, we repeated the evaluation for the accuracy of nodes. This can be repeated until no more spectra can be the top-ranking GSM using query spectra from one organism assigned. In the end, Glycoforest is “cumulative” in the sense with reference spectra from the other. This gave an accuracy of that it can be used like a database to allow new results to benefit 61% for gastric query spectra with fish references and 73% for from previous results. fish queries with gastric references. These results are within the As well as showing how much of the spectra data set has variation that is seen with the cross-validation (Tables S-2 and been annotated, the spectral network will be an excellent tool to S-3). This suggests that Glycoforest can be used to sequence visualize the connection between structures. If combined with glycans from samples for which there are no structures in the biosynthesis pathway information, it may become a practical reference set. tool to investigate the alteration of glycosylation due to the Addition and subtraction are the structure transitions that deficiency of enzymes involved in glycosylation. When nodes mimic biological synthesis. To investigate the impact of the and edges are grouped according to epitopes such as blood nonbiological transitions (insertion and substitution), we type, Lewis type, charged, and neutral, it can be used to evaluated the accuracy of the top-ranking GSMs using only compare different glycan profiles from different samples. It can addition and removal and only insertion and substitution. also be used to display complicated glycan profiles in defined Leaving out insertion and substitution does not change the sample sets and compare across sample sets. accuracy but increases the number of queries for which the Current Limitations of Glycoforest 1.0. Despite the use result list is empty from 15 to 43 (Figure S-8). of an isotope-removing algorithm to increase the accuracy, To visualize the structural dependencies of the reference and Glycoforest sometimes assigns erroneous structures, for query spectra, a spectral network was generated (Figure 5). The example, structures containing two Fuc (292.286 Da) instead nodes in the network represent consensus spectra, and the of one NeuAc (291.10 Da) or HexNAc and Kdn (453.400 Da) edges represent potential structural relationships between the instead of Hex and NeuAc (453.15 Da). This is because the nodes based on open spectra similarity. Edges connect nodes same mass may match different glycan compositions within a that have a mass difference that is within tolerance of a tolerance of ±0.3 Da, especially with consensus spectra derived structure transition. The open similarity of the consensus from isotopic peaks. Assignments to isotope spectra can be spectra that are connected is shown using the width of the edge, reduced by looking for the B and Y ions containing NeuAc or and the color of the edge indicates the annotation type of the using a better isotope-filtering algorithm. Nonetheless, the source node. compositions generated by Glycoforest are helpful. A part of The spectral network of singly and doubly charged spectra previously annotated Atlantic salmon O-glycan with a (Figure 5) shows a large connected component as well as 18 composition of NeuAc2Hex1HexNAc1 (Jin et al.26) turned nodes that are not connected. The colors of the nodes make it out to be NeuAc1Kdn1HexNAc2 (Figure S-9). easy to see that the spectra with known structures, the green As described in the Methods, candidates are generated by and purple nodes, clustered together. The newly assigned rewriting structures using the transitions that connect the spectra (cyan) surround the spectra with known structure. The reference and query spectra. To generate the correct structure, spectra that do not yet have a structure form a layer connected at least one reference spectrum annotated with the correct to the spectra with known structures via the newly assigned structure is required. If no such reference spectra are present, ones. This makes structure assignment to the gray nodes the correct GSM is never generated which results in reduced
10938 DOI: 10.1021/acs.analchem.7b02754 Anal. Chem. 2017, 89, 10932−10940 Analytical Chemistry Article accuracy. This was the case for 1.8% of singly charged spectra ing the open similarity score; cross validation; fragment and 32% of doubly charged spectra (Figure 4A). To work deltas; comparison of structure transitions; corrected around this problem, reference spectra can be accumulated structure; N-glycan test data (PDF) iteratively by alternating between automatic sequencing High resolution scalable vector graphic of Figure 5 (ZIP) followed by manual annotation. These newly annotated Results of the structures that were used to assess the structures then become candidates for inclusion as reference performance of Glycoforest (ZIP) spectra, which may increase the accuracy of the results. Results of known structures that were assigned to new Alternatively, additional reference spectra could be obtained spectra and the results for structures that were from a database such as UniCarb-DB.19 overlooked during previous manual annotation (ZIP) The scoring algorithm did not place the GSM with the correct structure in the top position for 22% of the singly ■ AUTHOR INFORMATION charged spectra and 60% of the doubly charged spectra. This Corresponding Author could be due to poor spectra or peak assignment ambiguity. *E-mail: [email protected]. Fax: +41 22 379 58 58. Ambiguity was shown to occur if isomeric fragmentation ions were misinterpreted by Glycoforest 1.0, e.g., taking cross-ring ORCID − − 0,2 Oliver Horlacher: 0000-0003-1826-4509 fragment ions [M H] 220 ( XNeuAc) as a glycosidic fragment (ZM‑HexNAc). An approach to remove these ambiguities Notes is to improve the calculation of the similarity between the The authors declare no competing financial interest. theoretical and experimental spectra of the GSM structure. Currently, the theoretical spectrum is generated using B, C, Y, ■ ACKNOWLEDGMENTS and Z ions and setting all the peaks to the same intensity. The This work has been supported by EU-FP7 GastricGlycoExplor- similarity is then calculated using the normalized dot product. A er ITN under grant agreement no. 316929. more sophisticated scoring algorithm that can make use of more information in the spectrum, such as peak intensity, ■ REFERENCES should increase the score of correct structures and improve (1) Hart, G. W.; Copeland, R. J. Cell 2010, 143 (5), 672−676. their ranks. (2) Pinho, S. S.; Reis, C. A. Nat. Rev. Cancer 2015, 15 (9), 540−555. In addition to the negative-mode O-glycan MS/MS spectra (3) Thaysen-Andersen, M.; Packer, N. H. Biochim. Biophys. Acta, presented in this study, we also have preliminary results Proteins Proteomics 2014, 1844 (9), 1437−1452. (4) Li, F.; Glinskii, O. V.; Glinsky, V. V. Proteomics 2013, 13 (2), showing that Glycoforest can process and assign structures to − O-glycans in positive-ion mode and N-glycans in negative-ion 341 354. mode (see the Supporting Information). We believe that the (5) Maxwell, E.; Tan, Y.; Tan, Y.; Hu, H.; Benson, G.; Aizikov, K.; Conley, S.; Staples, G. O.; Slysz, G. W.; Smith, R. D.; Zaia, J. PLoS One Glycoforest algorithm could also be used on glycosphingolipids 2012, 7 (9), e45474. and labeled glycan spectra data. (6) Ceroni, A.; Maass, K.; Geyer, H.; Geyer, R.; Dell, A.; Haslam, S. M. J. Proteome Res. 2008, 7 (4), 1650−1659. ■ CONCLUSION (7) AlJadda, K.; Ranzinger, R.; Porterfield, M.; Weatherly, B.; We tested the partial de novo algorithm that is used by Korayem, M.; Miller, J. A.; Rasheed, K.; Kochut, K. J.; York, W. S. 2015, http://arxiv.org/abs/1512.08451. Glycoforest on a large MS/MS data set and showed that, on (8) Ranzinger, R.; Herget, S.; von der Lieth, C.-W.; Frank, M. Nucleic this test set, the algorithm is able to generate the correct Acids Res. 2011, 39, D373−D376. structure for 92% of query spectra and place the GSM with the (9) Campbell, M. P.; Peterson, R.; Mariethoz, J.; Gasteiger, E.; correct structure in the top rank in 70% of cases. This makes Akune, Y.; Aoki-Kinoshita, K. F.; Lisacek, F.; Packer, N. H. Nucleic the current implementation of Glycoforest an efficient support Acids Res. 2014, 42, D215−D221. to manual annotation. The largest improvement in accuracy (10) Aoki-Kinoshita, K.; Agravat, S.; Aoki, N. P.; Arpinar, S.; (up to 22%) can be obtained by improving the rank of the Cummings, R. D.; Fujita, A.; Fujita, N.; Hart, G. M.; Haslam, S. M.; correct GSM. This can be achieved by refining the scoring Kawasaki, T.; Matsubara, M.; Moreman, K. W.; Okuda, S.; Pierce, M.; Ranzinger, R.; Shikanai, T.; Shinmachi, D.; Solovieva, E.; Suzuki, Y.; function to make better use of the information contained in Nucleic peak intensities as well as including linkages in the scoring Tsuchiya, S.; Yamada, I.; York, W. S.; Zaia, J.; Narimatsu, H. Acids Res. 2016, 44 (D1), D1237−D1242. function and the reported structures. The spectral network that (11) Joshi, H. J.; Harrison, M. J.; Schulz, B. L.; Cooper, C. a.; Packer, was used by Glycoforest provides a way to visualize the N. H.; Karlsson, N. G. Proteomics 2004, 4 (6), 1650−1664. connection between structures, which is especially valuable (12) He, L.; Xin, L.; Shan, B.; Lajoie, G. A.; Ma, B. J. Proteome Res. when applied to large data sets. The Glycoforest code is 2014, 13 (9), 3881−3895. available on bitbucket (https://bitbucket.org/sib-pig/ (13) Sun, W.; Lajoie, G. A.; Ma, B.; Zhang, K. In Bioinformatics glycoforest-public) and a web interface will be made available Research and Applications; Harrison, R., Li, Y., Mandoiu,̆ I., Eds.; in 2017 at http://glycoforest.expasy.org as part of the tool Springer International Publishing: Cham, 2015; Vol. 9096, pp 320− collection on http://www.expasy.org/glycomics. 330. (14) Stein, S. Anal. Chem. 2012, 84 (17), 7274−7282. ■ ASSOCIATED CONTENT (15) Horai, H.; Arita, M.; Kanaya, S.; Nihei, Y.; Ikeda, T.; Suwa, K.; Ojima, Y.; Tanaka, K.; Tanaka, S.; Aoshima, K.; Oda, Y.; Kakazu, Y.; *S Supporting Information Kusano, M.; Tohge, T.; Matsuda, F.; Sawada, Y.; Hirai, M. Y.; The Supporting Information is available free of charge on the Nakanishi, H.; Ikeda, K.; Akimoto, N.; Maoka, T.; Takahashi, H.; Ara, ACS Publications website at DOI: 10.1021/acs.anal- T.; Sakurai, N.; Suzuki, H.; Shibata, D.; Neumann, S.; Iida, T.; Tanaka, chem.7b02754. K.; Funatsu, K.; Matsuura, F.; Soga, T.; Taguchi, R.; Saito, K.; Nishioka, T. J. Mass Spectrom. 2010, 45 (7), 703−714. Generating glycan compositions; clustering algorithm; (16) Lam, H.; Aebersold, R. Methods 2011, 54 (4), 424−431. isotope removal algorithm; structure transitions; calculat- (17) Griss, J. Proteomics 2016, 16 (5), 729−740.
10939 DOI: 10.1021/acs.analchem.7b02754 Anal. Chem. 2017, 89, 10932−10940 Analytical Chemistry Article
(18) Kameyama, A.; Kikuchi, N.; Nakaya, S.; Ito, H.; Sato, T.; Shikanai, T.; Takahashi, Y.; Takahashi, K.; Narimatsu, H. Anal. Chem. 2005, 77 (15), 4719−4725. (19) Hayes, C. A.; Karlsson, N. G.; Struwe, W. B.; Lisacek, F.; Rudd, P. M.; Packer, N. H.; Campbell, M. P. Bioinformatics 2011, 27 (9), 1343−1344. (20) Yen, C.-Y.; Houel, S.; Ahn, N. G.; Old, W. M. Mol. Cell. Proteomics 2011, 10 (7), M111.007666. (21) Wang, M.; Bandeira, N. J. Proteome Res. 2013, 12 (9), 3944− 3951. (22) Nasir, W.; Toledo, A. G.; Noborn, F.; Nilsson, J.; Wang, M.; Bandeira, N.; Larson, G. J. Proteome Res. 2016, 15 (8), 2826−2840. (23) Na, S.; Bandeira, N.; Paek, E. Mol. Cell. Proteomics 2012, 11 (4), M111.010199. (24) Horlacher, O.; Lisacek, F.; Müller, M. J. Proteome Res. 2016, 15 (3), 721−731. (25) Guthals, A.; Watrous, J. D.; Dorrestein, P. C.; Bandeira, N. Mol. BioSyst. 2012, 8 (10), 2535−2544. (26) Jin, C.; Padra, J. T.; Sundell, K.; Sundh, H.; Karlsson, N. G.; Linden,́ S. K. J. Proteome Res. 2015, 14, 3239−3251. (27) Jin, C.; Kenny, D. T.; Skoog, E. C.; Padra, M.; Adamczyk, B.; Vitizeva, V.; Thorell, A.; Venkatakrishnan, V.; Linden,́ S. K.; Karlsson, N. G. Mol. Cell. Proteomics 2017, 16 (5), 743−758. (28) Horlacher, O.; Nikitin, F.; Alocci, D.; Mariethoz, J.; Müller, M.; Lisacek, F. J. Proteomics 2015, 129,63−70. (29) Zaharia, M.; Franklin, M. J.; Ghodsi, A.; Gonzalez, J.; Shenker, S.; Stoica, I.; Xin, R. S.; Wendell, P.; Das, T.; Armbrust, M.; Dave, A.; Meng, X.; Rosen, J.; Venkataraman, S. Commun. ACM 2016, 59 (11), 56−65. (30) Kessner, D.; Chambers, M.; Burke, R.; Agus, D.; Mallick, P. Bioinformatics 2008, 24 (21), 2534−2536. (31) Frank, A. M.; Bandeira, N.; Shen, Z.; Tanner, S.; Briggs, S. P.; Smith, R. D.; Pevzner, P. A. J. Proteome Res. 2008, 7 (1), 113−122. (32) Schwarz, G. Annals of Statistics 1978, 6, 461−464.
10940 DOI: 10.1021/acs.analchem.7b02754 Anal. Chem. 2017, 89, 10932−10940 207
Appendix B
Technical Notes
1 Technical Note 2 SugarSketcher: quick and intuitive online glycan 3 drawing
4 Davide Alocci 1,2, Philip Toukach3, Renaud Costa4, Nicolas Hory4, Julien Mariethoz1,2, Frédérique 5 Lisacek1,2,5* 6 1 Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva 1211, Switzerland; 7 [email protected], [email protected], [email protected] 8 2 Computer Science Department, University of Geneva, Switzerland 9 3 Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Laboratory of Complex and Nano- 10 scaled Catalysts, Moscow, Russia; [email protected] 11 4 Polytech Nice Sophia, Campus SophiaTech, 06903 Sophia-Antipolis, France; [email protected], 12 [email protected] 13 5 Section of Biology, University of Geneva, Switzerland 14 15 16 * Correspondence: [email protected]; Tel.: +41-22-379-5050 17 Received: date; Accepted: date; Published: date
18 Abstract: SugarSketcher is an intuitive and fast Javascript interface module for online drawing of 19 glycan structures in the popular SNFG notation and exporting them to various commonly used 20 formats encoding carbohydrate sequences (e.g., GlycoCT) or quality images (e.g., svg). It does not 21 require a backend server or any specific browser plugins and can be integrated in any web project 22 of glycoinformatics. SugarSketcher allows drawing glycans both for glycobiologists and non-expert 23 users. The 'quick mode' allows a newcomer to build up a glycan structure having only a limited 24 knowledge in carbohydrate chemistry. On the contrary, the 'normal mode' integrates advanced 25 options which enable glycobiologists to tailor complex carbohydrate structures. The source code is 26 freely available on GitHub and glycoinformaticians are encouraged to participate in the 27 development process while users are invited to test a prototype available at 28 https://glycoproteome.expasy.org/sugarsketcher and send feedback.
29 Keywords: carbohydrate; 2D structure; software. 30
31 1. Introduction 32 It is generally admitted that drawing glycans using a chemical notation can be at least 33 cumbersome, if not a challenge. This issue was addressed very early in glycochemistry [1], and 34 several groups have proposed symbolic nomenclatures to ease the representation of complex 35 carbohydrates. Although these representations have evolved from the original idea to several 36 visualization schemes reviewed in [2], the most compelling ones consist of a series of geometrical 37 shapes that symbolise monosaccharide units connected with lines specifying glycosidic linkages. A 38 recent reappraisal of this representation finally rallied a wide community of glycoscientists who have 39 settled on the usage of the Symbol Nomenclature for Glycans [3]. As a result of the dissemination of 40 this symbolic notation, a variety of software applications for drawing glycans has been developed to 41 fulfil mainly two distinct purposes. A drawing interface may be useful to query databases or to input 42 structures for further analysis, modelling or prediction of properties. Conversely, the extent of glycan 43 encoding formats often requires the translation of commonly used formats into images. The intrinsic 44 user-friendliness of the symbolic nomenclature not only meets the glycoscientists’ needs but also 45 simplifies access to glycoscience for non-specialists.
Molecules 2018, 23, x; doi: FOR PEER REVIEW www.mdpi.com/journal/molecules Molecules 2018, 23, x FOR PEER REVIEW 2 of 9
46 Over the past decades, the variety of glycan representations used in chemistry, glyco-chemistry 47 and glycobiology gave rise to a series of editing tools. We however, limit our coverage to those that 48 meet three requirements: 1) web-based, 2) freely accessible, and 3) exporting structures to standard 49 encoding formats [2]. KegDraw can be considered as the earliest on-line graphical glycan editor. It 50 was designed to perform a similarity search in the KEGG databases where the original CarbBank [4] 51 is integrated [5]. It was a Java application that needed to be installed and could produce low- 52 resolution images where text labels of monosaccharides were connected by lines. GlycanBuilder [6] 53 is a more recent java applet developed during the EuroCarb project [7]. This tool provides an interface 54 to assemble glycan structures using the graphic visualisation scheme proposed by the Consortium of 55 Functional Glycomics (CFG) and described in [8]. GlycanBuilder was upgraded to work in a web 56 environment, but it needs to be installed and connected with a server. Moreover, recent security 57 upgrades of all major browsers seriously challenged the usage of Java applets. This web-based 58 implementation usually involves fairly long time-lags during processing. However, this drawback 59 can be avoided as demonstrated with the glycan structure builder of the GlycoViewer platform [9] a 60 web interface for drawing glycans that pioneered in terms of design, usability and speed. A subjective 61 weak point of this tool is the drag-and-drop implementation that allows computer users to draw a 62 glycan structure easily with the help of a mouse. However, this may become tiresome if many and 63 large structures are drawn on a touch-based device such as a tablet or a smartphone. Also, 64 monosaccharides are displayed in text format like KegDraw neglecting the advantage brought by a 65 symbolic notation. GlycoViewer, like GlycanBuilder, is composed of a client interface and a server 66 written in Ruby on Rails. The drag-and-drop feature is also used in Glycano, a software for drawing 67 glycans entirely written in JavaScript (http://glycano.cs.uct.ac.za). Glycano is browser-independent 68 and does not require a server. The interface uses SNFG symbols, though not in the proper colour 69 scheme, and lacks the option of positioning of monosaccharides according to their linkage. It is 70 designed for trained chemists and glycobiologists precluding access for non-experts. DrawGlycan- 71 SNFG[10] is the most recent published tool to draw glycan structures. It produces high quality and 72 SNFG-compliant depiction of glycan structures but data input is limited to IUPAC linear encoding 73 [11]. Users cannot draw a structure interactively, but only use a string encoding to generate images. 74 In summary, tools developed in the last decade, are either incompatible with SNFG or rely on 75 complicated and/or slow interfaces or have other format or usability limitations. From a technical 76 viewpoint, these tools require a continuous connection with a server to support consistent and fast 77 drawing. From a scientific viewpoint, we believe glycan drawing should be democratised. With this 78 in mind, we have developed SugarSketcher, an intuitive and fast interface to draw glycans online. 79 This tool is entirely built in JavaScript and does not need a connection to any server. It is supported 80 by the major browsers and is fully compatible with the SNFG nomenclature. The interface has been 81 streamlined to accommodate expert and non-expert usage. In particular, a “quick mode” allows users 82 with limited knowledge on glycans to build up a structure quickly while the “normal mode” offers a 83 broader range of options regarding the structural features of complex carbohydrates. Its beta-version 84 is currently implemented as another graphic interface for searching structures in CSDB, the 85 Carbohydrate Structure Database [12]. The GlycoCT [13] export feature allows every software project 86 or a database supporting GlycoCT to translate the SugarSketcher user input into the project native 87 notation and to further process the constructed glycan sequences, including piping data to the search 88 engine. Furthermore, the code is destined to be shared and hopefully improved by the community of 89 glycoinformaticians. A prototype of SugarSketcher is currently included in the tool collection of 90 Glycomics@ExPASy [14] as a standalone application. It can be accessed at 91 https://glycoproteome.expasy.org/sugarsketcher while the code is available on GitHub at 92 https://github.com/alodavide/sugarSketcher. 93
94 2. Results 95 SugarSketcher is divided in two main components: the core Javascript library and the D3.js 96 (https://d3js.org) based interface. This division provides two main advantages: 1) the core library can
Molecules 2018, 23, x FOR PEER REVIEW 3 of 9
97 be used standalone or integrated in other web applications that handle information about glycan 98 structures; 2) the interface can be modified by collaborators without changing the underlying core 99 library. In the “Materials and Methods” section we present how the two components have been built 100 using JavaScript and a set of libraries rich in visualisation components. 101 The interface of SugarSketcher has been designed to address the increasing popularity of glycans 102 among scientists with only basic knowledge on carbohydrates. Without knowledge of chemistry 103 behind each monosaccharide, average users can quickly draw glycan and derivative structures using 104 the "quick mode". In this case, the interface presents twelve monosaccharides most commonly found 105 in mammalian glycans as a reflection of the bias in structural data production observed in [15] for 106 example. This limited set of monosaccharides is used as Lego bricks to sequentially build a structure. 107 Each time a new monosaccharide is added to a structure, the user needs to input only the anomericity 108 of the linkage and the attachment position of the monosaccharide. To depict monosaccharides and 109 glycans, SugarSketcher uses the SNFG icons. 110 Positioning monosaccharides basing on the acceptor linkage only can end up in overlays. For 111 example, this situation occurs when two nodes coming from different branches should be in the same 112 place. We have created a grid system that tracks whether a position is free or already occupied by a 113 monosaccharide. At present, positioning follows the option of depicting the monosaccharide linkages 114 with embedded type and anomericity [16]. Figure 1a shows a galactosylated and sialylated N-glycan 115 core drawn with the “quick mode” option. 116 SugarSketcher provides a "normal mode" for glycochemists and glycobiologists. After switching 117 off the "quick mode", a user gains access to a more sophisticated menu. Now, the addition of a 118 monosaccharide requires knowledge of stereochemistry and ring types, and not only acceptor, but 119 also a donor linking position must be specified to create a glycosidic linkage. In the "normal mode", 120 the user is able to decorate monosaccharides with substituents and can add repeating units. In 121 addition to the delete function available in the "quick mode", an experienced user can modify each 122 monosaccharide using the update button. Figure 1b shows an example of a bacterial LPS drawn in 123 the “normal mode” following the drawing procedure summarised in Figure 2.
124
125 Figure 1. (a) Screenshot of the SugarSketcher interface after the completion of an N-glycan core carried 126 out in the “quick mode”. The upper menu shows a selection of 12 monosaccharides more frequently 127 observed in the composition of mammalian glycans. Monosaccharides are positioned following the 128 option proposed in [16]. Linkage is indicated by the bond angle whereas anomericity by solid (β) or 129 dashed (α) lines. (b) Screenshot of the SugarSketcher interface after the completion of a glycan carried 130 out in the “normal mode”. The top menu is shows a much broader range of possible monosaccharides. 131 The same positioning procedure applies.
132 133 The overall drawing process is a succession of feature selection steps that ends with the display 134 of an SNFG icon. This process is repeated as many times as the target structure contains building 135 blocks. For the realization of the Figure 1b example, the user is first invited to “Add Node” in the top 136 menu. Mousing over this task reveals two options: monosaccharide and substituent. In the example
Molecules 2018, 23, x FOR PEER REVIEW 4 of 9
137 the first entity is a monosaccharide (L-glycero-D-manno-heptose) which is therefore the selected 138 option. Clicking on it prompts the display of a first array of geometrical shapes. In the example, 139 clicking on the hexagon then prompts a second array of colors represented as drops. Clicking on the 140 green drop will result in moving to the next step which is the sequential selection of options for 141 anomericity, ring and linkage characteristics. Selecting optional values at each step (fully described 142 in Supplementary Tables 1-5) will lead to the placement of the corresponding monosaccharide (alpha- 143 linked L-glycero-D-manno-heptose) in main space of the interface. 144
145 146 Figure 2. Drawing process of an SNFG icon in the “normal mode”. The first task is to select between 147 a monosaccharide and a substituent. If the choice is “monosaccharide”, next, the user will select the 148 geometrical and color attributes of a monosaccharide in the SNFG nomenclature from (1) an array of 149 shapes and (2) an array of colors (framed at the top of this figure). If the choice is “substituent”, 150 multiple buttons are displayed for selection (framed at the bottom of this figure). Then the user 151 successively selects optional values (shown below each box) to specify further the characteristics of 152 the ring and the linkage (anomericity, carbon acceptor and donor shown in boxes). This attribute 153 selection process results in the placement of the corresponding monosaccharide/substituent on the 154 screen. It is then repeated as many times as the targeted structure contains monosaccharides/ 155 substituents.
156
157 3. Discussion 158 We encourage glycoinformatics project holders to integrate SugarSketcher as an alternative 159 structure input tool. Currently it has been incorporated in CSDB [10]. Ultimately it is also destined to 160 become the graphic interface for querying databases and running software from the 161 glycomics@ExPASy collection. 162 At the moment, SugarSketcher supports the depiction of glycan structures following the 163 standard imposed by SNFG. However, there are various features that we would like to implement in 164 the nearest future: 165 ● On the fly Copy-Paste of the glycan image 166 ● Import structure via the URL 167 ● Automatic adjustment of parameters where it is possible to detect their chemically forbidden 168 combinations
Molecules 2018, 23, x FOR PEER REVIEW 5 of 9
169 Several users have asked for the possibility of copying the glycan image from SugarSketcher to 170 other applications via the clipboard. The choice of using SVG as the primary image format precludes 171 this possibility since a general-purpose software does not support it. Thus, it is necessary to develop 172 a converter that can automatically save the glycan depiction in a more usable format like PNG. 173 However, it requires a server side-component because modern browsers cannot run a converter. 174 Another feature in development is the import of a glycan from the GlycoCT or IUPAC string in 175 the URL parameter, so that the picture gets automatically created. 176 As SugarSketcher is developed primarily for non-glycobiologists, one of the main features we 177 are working on is the possibility of automatic preselection of various parameters where it is possible 178 to infer them from the user input. For example, the linking position in a donor residue (sometimes 179 referred to as “anomeric carbon”) can be automatically detected based on the residue type (aldose vs. 180 ketose). We provide in supplementary material, the non-exhaustive list of parameters/rules to be 181 introduced in the next version in order to control the consistency of users’ input.
182 4. Materials and Methods 183 SugarSketcher is divided into two parts, the library and the interface. The library is a collection 184 of javascript files which get compressed into a single file. The Interface is composed of 9 files plus the 185 index.html and two CSS files (which can be merged in one). In the end, the complete SugarSketcher 186 needs 13 files. 187 188 4.1. The core library 189 The main assumption behind the core library is that each glycan can be represented as a graph. 190 This concept has already been applied by users of MzJava [17] from which the core library is inspired. 191 To avoid reinventing the wheel, the Graph class has been taken from Sigma.js, an established 192 JavaScript library (http://sigmajs.org) which allows the integration of graphs in a web environment. 193 The Graph class, together with GraphNode and GraphEdge classes, form the general data 194 structure that can handle a general graph. Since a glycan structure holds specific chemical 195 information for each node and edge, the basic data structure is extended in the glycomics package to 196 encapsulate glycan-specific information. For example, GraphNode is extended by Monosaccharide 197 and Substituent classes. The Glycan class allows the creation of saccharide objects by connecting 198 monosaccharides and substituents with glycosidic and substituent linkages, respectively. 199 To avoid inputting wrong data, the core library provides a collection of dictionaries, one for each 200 glycan-specific entity. These dictionaries comprehend anomericity, ring type, isomer, etc. [see Tables 201 1-5.] The user is required to pick an entry from the dictionary blocking possible arbitrary inputs. At 202 the level of nodes, the library provides dictionaries for MonosaccaridesTypes, SubstituentTypes, 203 RingTypes and Anomericity. In addition, edges are defined from donor and acceptor linkage 204 positions. The data structure is detailed in Figure 3 where the relationships between entities are 205 detailed.
Molecules 2018, 23, x FOR PEER REVIEW 6 of 9
206 207
208 Figure 3. Entity relationships in the SugarSketcher data structure. The icon legend describes the color 209 code for boxes.
210 211 Since glycan structures are not always fully defined, the core library has been designed to handle 212 fuzziness in both edges (linkage position) and nodes (monosacchride type, its anomeric, absolute and 213 ring size configurations). The user can input monosaccharides with multiple alternative connections 214 on each carbon. Repeating parts of structure are handled by the repeating unit class and can be added 215 to the glycan structure as single nodes. A collection of glycan structures is already encoded in the 216 library and can be directly used to create sugar objects. 217 The input-output section of the library allows the import and export of glycan sequences. Since 218 we are mainly using GlycoCT encoding [13] across the tools in Glycomics@ExPASy [14], the first 219 implemented parser/writer implements the GlycoCT standard. Parsers and Writers are completely 220 decoupled from the data structure allowing the implementation of adapters for any glycan encoding 221 format. The way the library is built and the Apache licence 2.0 (https://www.apache.org) allow any 222 research group to contribute with their specific import and export adapters. 223 The core library has been designed to facilitate external contributions and encourage further 224 extension. Unless the Graph class, the rest of the code follow the ECMAScript6 (ES6) standard and 225 comes with unit test. The project includes resources for minification and transpilation to 226 ECMAScript5 (ES5). 227 228 4.2. The interface 229 SugarSketcher runs in two different modes: “quick mode” and “normal mode” the application 230 of which is illustrated in the “results” section. In either mode, a collection of pre-built structures can 231 be used as a template amenable to extension. This selection currently mirrors the trend in over- 232 representation of animal N- and O-linked carbohydrate moieties in glycoproteins in recent databases 233 and in the literature (except for CSDB [12]). All N- and O-linked core structures reported in [8] are 234 listed. Additionally, a shortlist of glycan epitopes is provided. For example, to draw a di-antennary 235 core fucosylated N-linked structure, an N-linked core fucosylated template can be loaded and the 236 first antenna added manually. Then this antenna is selected and with the copy-paste functionality 237 available on right click, the second antenna can be pasted to complete the structure. Mistakes can be 238 corrected with the “Delete” button that has been enabled to prune the tree-like structure.
Molecules 2018, 23, x FOR PEER REVIEW 7 of 9
239 Once the structure is completed, the depiction can be downloaded in classical Portable Network 240 Graphics (PNG) or in high resolution Scalable Vector Graphics (SVG) formats. As an alternative to 241 images, SugarSketcher provides an export of glycan structures to the GlycoCT machine readable 242 format [13]. Since several glycoinformatic tools provide the export in GlycoCT, SugarSketcher has an 243 internal engine for parsing and displaying GlycoCT encoded structures. In that respect, GlycoCT 244 encoded structures can also be imported and further modified in the SugarSketcher interface. 245 The interface is built using HTML5, CSS3 and the JavaScript library D3.js. (V3). To handle all 246 glycan information, the interface is connected with the core library (see section 2.1). SugarSketcher 247 works with any browser that supports JavaScript (up to version ES6) and can be integrated in any 248 website. In addition, it can be combined with web-based glycoinformatics tools, that accept GlycoCT 249 as encoding format. For example, GlycoDigest [18] that simulates the digestion of glycans by 250 exoglycosidases accepts the structure input in GlycoCT format. The same is true for a number of 251 modern glycan databases, such as GlyTouCan [19] or CSDB [12]. 252
253 5. Conclusions 254 To conclude, we are aware that SugarSketcher still shows weaknesses that are listed and further 255 documented in the Supplementary Materials file. Nonetheless, we are actively attending to these 256 items and invite users and developers to participate in the GitHub issue tracker to send feedback and 257 report bugs.
258 Supplementary Materials: The following are available online at www.mdpi.com/xxx/s1. 259 Author Contributions: Conceptualization, D.A., J.M. and F.L.; methodology, D.A. and F.L.; software, D.A., N.H., 260 R.C. and J.M.; validation, D.A., P.T., J.M. and F.L.; writing—original draft preparation, D.A. and F.L.; writing— 261 review and editing, P.T. and F.L.; supervision, J.M and F.L.; funding acquisition, P.T. and F.L. 262 Funding: This work was supported by the European Union FP7 Innovative Training Network [grant number 263 316929], IB-Carb (http://ibcarb.com/) and by the Swiss Federal Government through the State Secretariat for 264 Education, Research and Innovation SERI. ExPASy is maintained by the web team of the Swiss Institute of 265 Bioinformatics and hosted at the Vital-IT Competency Center. Architectural and interface design testing was 266 supported by Russian Science Foundation, grant 18-14-00098. 267 Acknowledgments: We thank Claire Doherty and Prof Sabine Flitsch for making this work fit the scope of the 268 IB-Carb network and Elisabeth Gasteiger for helping with integration in the ExPASy server. 269 Conflicts of Interest: None.
270 Appendix A 271 The appendix contains details of the current procedure for importing the software, the full 272 description of dictionaries in tables, lists of shortcomings and known bugs as well as suggestions for 273 introducing consistency rules. All is saved in a Supplementary file. 274
Molecules 2018, 23, x FOR PEER REVIEW 8 of 9
275 References
276 1. Li, E.; Tabas, I.; Kornfeld, S. The synthesis of complex-type oligosaccharides. I. Structure of the lipid- 277 linked oligosaccharide precursor of the complex-type oligosaccharides of the vesicular stomatitis virus 278 G protein. J. Biol. Chem. 1978, 253, 7762–7770.
279 2. Campbell, M. P.; Ranzinger, R.; Lütteke, T.; Mariethoz, J.; Hayes, C. A.; Zhang, J.; Akune, Y.; Aoki- 280 Kinoshita, K. F.; Damerell, D.; Carta, G.; York, W. S.; Haslam, S. M.; Narimatsu, H.; Rudd, P. M.; 281 Karlsson, N. G.; Packer, N. H.; Lisacek, F. Toolboxes for a standardised and systematic study of 282 glycans. BMC Bioinformatics 2014, 15, S9, doi:10.1186/1471-2105-15-S1-S9.
283 3. Varki, A.; Cummings, R. D.; Aebi, M.; Packer, N. H.; Seeberger, P. H.; Esko, J. D.; Stanley, P.; Hart, G.; 284 Darvill, A.; Kinoshita, T.; Prestegard, J. J.; Schnaar, R. L.; Freeze, H. H.; Marth, J. D.; Bertozzi, C. R.; 285 Etzler, M. E.; Frank, M.; Vliegenthart, J. F.; Lütteke, T.; Perez, S.; Bolton, E.; Rudd, P.; Paulson, J.; 286 Kanehisa, M.; Toukach, P.; Aoki-Kinoshita, K. F.; Dell, A.; Narimatsu, H.; York, W.; Taniguchi, N.; 287 Kornfeld, S. Symbol Nomenclature for Graphical Representations of Glycans. Glycobiology 2015, 25, 288 1323–1324, doi:10.1093/glycob/cwv091.
289 4. Doubet, S.; Bock, K.; Smith, D.; Darvill, A.; Albersheim, P. The Complex Carbohydrate Structure 290 Database. Trends Biochem. Sci. 1989, 14, 475–477.
291 5. Hashimoto, K.; Goto, S.; Kawano, S.; Aoki-Kinoshita, K. F.; Ueda, N.; Hamajima, M.; Kawasaki, T.; 292 Kanehisa, M. KEGG as a glycome informatics resource. Glycobiology 2006, 16, 63R-70R, 293 doi:10.1093/glycob/cwj010.
294 6. Ceroni, A.; Dell, A.; Haslam, S. M. The GlycanBuilder: a fast, intuitive and flexible software tool for 295 building and displaying glycan structures. Source Code for Biology and Medicine 2007, 2, 3, 296 doi:10.1186/1751-0473-2-3.
297 7. von der Lieth, C.-W.; Freire, A. A.; Blank, D.; Campbell, M. P.; Ceroni, A.; Damerell, D. R.; Dell, A.; 298 Dwek, R. A.; Ernst, B.; Fogh, R.; Frank, M.; Geyer, H.; Geyer, R.; Harrison, M. J.; Henrick, K.; Herget, 299 S.; Hull, W. E.; Ionides, J.; Joshi, H. J.; Kamerling, J. P.; Leeflang, B. R.; Lutteke, T.; Lundborg, M.; Maass, 300 K.; Merry, A.; Ranzinger, R.; Rosen, J.; Royle, L.; Rudd, P. M.; Schloissnig, S.; Stenutz, R.; Vranken, W. 301 F.; Widmalm, G.; Haslam, S. M. EUROCarbDB: An open-access platform for glycoinformatics. 302 Glycobiology 2011, 21, 493–502, doi:10.1093/glycob/cwq188.
303 8. Essentials of Glycobiology; Varki, A., Ed.; 2nd ed.; Cold Spring Harbor Laboratory Press: Cold Spring 304 Harbor, N.Y, 2009; ISBN 978-0-87969-770-9.
305 9. Joshi, H. J.; von der Lieth, C.-W.; Packer, N. H.; Wilkins, M. R. GlycoViewer: a tool for visual summary 306 and comparative analysis of the glycome. Nucleic Acids Research 2010, 38, W667–W670, 307 doi:10.1093/nar/gkq446.
308 10. Cheng, K.; Zhou, Y.; Neelamegham, S. DrawGlycan-SNFG: a robust tool to render glycans and 309 glycopeptides with fragmentation information. Glycobiology 2016, doi:10.1093/glycob/cww115.
310 11. Sharon, N. IUPAC-IUB Joint Commission on Biochemical Nomenclature (JCBN). Nomenclature of 311 glycoproteins, glycopeptides and peptidoglycans: JCBN recommendations 1985. Glycoconjugate 312 Journal 1986, 3, 123–133, doi:10.1007/BF01049370.
313 12. Toukach, P. V.; Egorova, K. S. Carbohydrate structure database merged from bacterial, archaeal, plant 314 and fungal parts. Nucleic Acids Research 2016, 44, D1229–D1236, doi:10.1093/nar/gkv840.
315 13. Herget, S.; Ranzinger, R.; Maass, K.; Lieth, C.-W. v. d. GlycoCT—a unifying sequence format for 316 carbohydrates. Carbohydrate Research 2008, 343, 2162–2171, doi:10.1016/j.carres.2008.03.011.
Molecules 2018, 23, x FOR PEER REVIEW 9 of 9
317 14. Mariethoz, J.; Alocci, D.; Gastaldello, A.; Horlacher, O.; Gasteiger, E.; Rojas-Macias, M.; Karlsson, N. 318 G.; Packer, N.; Lisacek, F. Glycomics@ExPASy: Bridging the gap. Molecular & Cellular Proteomics 319 2018, mcp.RA118.000799, doi:10.1074/mcp.RA118.000799.
320 15. Herget, S.; Toukach, P. V.; Ranzinger, R.; Hull, W. E.; Knirel, Y. A.; von der Lieth, C.-W. Statistical 321 analysis of the Bacterial Carbohydrate Structure Data Base (BCSDB): Characteristics and diversity of 322 bacterial carbohydrates in comparison with mammalian glycans. BMC Structural Biology 2008, 8, 35, 323 doi:10.1186/1472-6807-8-35.
324 16. Harvey, D. J.; Merry, A. H.; Royle, L.; Campbell, M. P.; Dwek, R. A.; Rudd, P. M. Proposal for a 325 standard system for drawing structural diagrams of N- and O-linked carbohydrates and related 326 compounds. Proteomics 2009, 9, 3796–3801, doi:10.1002/pmic.200900096.
327 17. Horlacher, O.; Nikitin, F.; Alocci, D.; Mariethoz, J.; Müller, M.; Lisacek, F. MzJava: An open source 328 library for mass spectrometry data processing. Journal of Proteomics 2015, 129, 63–70, 329 doi:10.1016/j.jprot.2015.06.013.
330 18. Gotz, L.; Abrahams, J. L.; Mariethoz, J.; Rudd, P. M.; Karlsson, N. G.; Packer, N. H.; Campbell, M. P.; 331 Lisacek, F. GlycoDigest: a tool for the targeted use of exoglycosidase digestions in glycan structure 332 determination. Bioinformatics 2014, 30, 3131–3133, doi:10.1093/bioinformatics/btu425.
333 19. Tiemeyer, M.; Aoki, K.; Paulson, J.; Cummings, R. D.; York, W. S.; Karlsson, N. G.; Lisacek, F.; Packer, 334 N. H.; Campbell, M. P.; Aoki, N. P.; Fujita, A.; Matsubara, M.; Shinmachi, D.; Tsuchiya, S.; Yamada, I.; 335 Pierce, M.; Ranzinger, R.; Narimatsu, H.; Aoki-Kinoshita, K. F. GlyTouCan: an accessible glycan 336 structure repository. Glycobiology 2017, 1–5, doi:10.1093/glycob/cwx066.
337 © 2018 by the authors. Submitted for possible open access publication under the terms and 338 conditions of the Creative Commons Attribution (CC BY) license 339 (http://creativecommons.org/licenses/by/4.0/).
217
Bibliography
Aebersold, Ruedi et al. (2018). “How many human proteoforms are there?” en. In: Nat. Chem. Biol. 14.3, pp. 206–214. Afgan, Enis et al. (2016). “The Galaxy platform for accessible, reproducible and col- laborative biomedical analyses: 2016 update”. In: Nucleic Acids Res. 44.W1, W3– W10. Akune, Yukie et al. (2010). “The RINGS resource for glycome informatics analysis and data mining on the Web”. en. In: OMICS 14.4, pp. 475–486. Allende, Christian, Erik Sohn, and Cedric Little (2015). “Treelink: data integration, clustering and visualization of phylogenetic trees”. en. In: BMC Bioinformatics 16.1, p. 414. Anugraham, Merrina et al. (2014). “Specific Glycosylation of Membrane Proteins in Epithelial Ovarian Cancer Cell Lines: Glycan Structures Reflect Gene Expression and DNA Methylation Status”. en. In: Mol. Cell. Proteomics 13.9, pp. 2213–2232. Aoki, Kiyoko F et al. (2004). “KCaM (KEGG Carbohydrate Matcher): a software tool for analyzing the structures of carbohydrate sugar chains”. en. In: Nucleic Acids Res. 32.Web Server issue, W267–72. Aoki-Kinoshita, Kiyoko et al. (2016). “GlyTouCan 1.0–The international glycan struc- ture repository”. en. In: Nucleic Acids Res. 44.D1, pp. D1237–42.
Banin, Ehud et al. (2002). “A Novel Linear Code R Nomenclature for Complex Car- bohydrates”. en. In: Trends Glycosci. Glycotechnol. 14.77, pp. 127–137. Bard, Jonathan B L and Seung Y Rhee (2004). “Ontologies in biology: design, appli- cations and future challenges”. en. In: Nat. Rev. Genet. 5.3, p. 213. Barsnes, Harald et al. (2009). “PRIDE Converter: making proteomics data-sharing easy”. en. In: Nat. Biotechnol. 27.7, p. 598. 218 BIBLIOGRAPHY
Bateman, Alex et al. (2017). “UniProt: the universal protein knowledgebase”. In: Nu- cleic Acids Res. 45.D1, pp. D158–D169. Battle, Robert and Edward Benson (2008). “Bridging the semantic Web and Web 2.0 with Representational State Transfer (REST)”. In: Web Semantics: Science, Services and Agents on the World Wide Web 6.1, pp. 61–69. Benson, Dennis A et al. (2018). “GenBank”. In: Nucleic Acids Res. 46.D1, pp. D41–D47. Bern, Marshall, Yong J Kil, and Christopher Becker (2012). “Byonic: advanced pep- tide and protein identification software”. en. In: Curr. Protoc. Bioinformatics Chap- ter 13, Unit13.20. Bohne-Lang, A et al. (2001). “LINUCS: linear notation for unique description of car- bohydrate sequences”. en. In: Carbohydr. Res. 336.1, pp. 1–11. Brazma, Alvis et al. (2001). “Minimum information about a microarray experiment (MIAME)—toward standards for microarray data”. en. In: Nat. Genet. 29.4, p. 365. Buneman, Peter, Sanjeev Khanna, and Tan Wang-Chiew (2001). “Why and Where: A Characterization of Data Provenance”. In: Lecture Notes in Computer Science, pp. 316–330. Campbell, Matthew P et al. (2014). “Toolboxes for a standardised and systematic study of glycans”. en. In: BMC Bioinformatics 15.1, S9. Chibucos, Marcus C et al. (2017). “The Evidence and Conclusion Ontology (ECO): Supporting GO Annotations”. en. In: Methods Mol. Biol. 1446, pp. 245–259. Cook, Charles E et al. (2018). “The European Bioinformatics Institute in 2017: data coordination and integration”. en. In: Nucleic Acids Res. 46.D1, pp. D21–D29. Cooper, Catherine A et al. (2003). “GlycoSuiteDB: a curated relational database of glycoprotein glycan structures and their biological sources. 2003 update”. en. In: Nucleic Acids Res. 31.1, pp. 511–513. Creasy, David M and John S Cottrell (2004). “Unimod: Protein modifications for mass spectrometry”. en. In: Proteomics 4.6, pp. 1534–1536. Cummings, Richard D (2009). “The repertoire of glycan determinants in the human glycome”. en. In: Mol. Biosyst. 5.10, pp. 1087–1104. Doubet, S and P Albersheim (1992). “CarbBank”. en. In: Glycobiology 2.6, p. 505. Doubet, S et al. (1989). “The Complex Carbohydrate Structure Database”. en. In: Trends Biochem. Sci. 14.12, pp. 475–477. BIBLIOGRAPHY 219
Dudhe, Anil and S S Sherekar (2014). “Performance Analysis of SOAP and RESTful Mobile Web Services in Cloud Environment”. In: IJCA Special Issue on Recent Trends in Information Security RTINFOSEC.1, pp. 1–4. Fellers, Ryan T et al. (2015). “ProSight Lite: graphical software to analyze top-down mass spectrometry data”. en. In: Proteomics 15.7, pp. 1235–1238. Fillbrunn, Alexander et al. (2017). “KNIME for reproducible cross-domain analysis of life science data”. en. In: J. Biotechnol. 261, pp. 149–156. Gaitatzes, Athanasios et al. (2018). “Genome U-Plot: a whole genome visualization”. In: Bioinformatics 34.10, pp. 1629–1634. Gaudet, Pascale et al. (2015). “The neXtProt knowledgebase on human proteins: cur- rent status”. en. In: Nucleic Acids Res. 43.Database issue, pp. D764–70. Goble, Carole and Robert Stevens (2008). “State of the nation in data integration for bioinformatics”. en. In: J. Biomed. Inform. 41.5, pp. 687–693. Gomez-Cabrero, David et al. (2014). “Data integration in the era of omics: current and future challenges”. en. In: BMC Syst. Biol. 8 Suppl 2, p. I1. Gotz, Lou et al. (2014). “GlycoDigest: a tool for the targeted use of exoglycosidase di- gestions in glycan structure determination”. en. In: Bioinformatics 30.21, pp. 3131– 3133. Greenwood, Mark et al. (2003). “Provenance of e-Science Experiments-experience from Bioinformatics”. In: Halevy, Alon Y (2001). “Answering queries using views: A survey”. In: VLDB J. 10.4, pp. 270–294. Hamid, Jemila S et al. (2009). “Data integration in genetics and genomics: methods and challenges”. en. In: Hum. Genomics Proteomics 2009. Harvey, David J et al. (2011). “Symbol nomenclature for representing glycan struc- tures: Extension to cover different carbohydrate types”. en. In: Proteomics 11.22, pp. 4291–4295. Herget, S et al. (2008). “GlycoCT-a unifying sequence format for carbohydrates”. en. In: Carbohydr. Res. 343.12, pp. 2162–2171. Hermjakob, Henning (2006). “The HUPO proteomics standards initiative–overcoming the fragmentation of proteomics data”. en. In: Proteomics 6 Suppl 2, pp. 34–38. 220 BIBLIOGRAPHY
History of the World Wide Web - Wikipedia. https : / / en . wikipedia . org / wiki / History_of_the_World_Wide_Web. Accessed: 2018-6-13. Hoehndorf, Robert, Paul N Schofield, and Georgios V Gkoutos (2015). “The role of ontologies in biological and biomedical research: a functional perspective”. In: Brief. Bioinform. 16.6, pp. 1069–1080. Hosoda, Masae, Yukie Akune, and Kiyoko F Aoki-Kinoshita (2017). “Development and application of an algorithm to compute weighted multiple glycan alignments”. en. In: Bioinformatics 33.9, pp. 1317–1323. Hsiao, Shun-Wen et al. (2011). “A Secure Proxy-Based Cross-Domain Communica- tion for Web Mashups”. In: 2011 IEEE Ninth European Conference on Web Services. HTML - Wikipedia. https://en.wikipedia.org/wiki/HTML. Accessed: 2018-6-14. Joshi, Hiren J et al. (2010). “GlycoViewer: a tool for visual summary and comparative analysis of the glycome”. en. In: Nucleic Acids Res. 38.Web Server issue, W667–70. Joshi, Hiren J et al. (2018). “GlycoDomainViewer: a bioinformatics tool for contextual exploration of glycoproteomes”. en. In: Glycobiology 28.3, pp. 131–136. Jupp, Simon et al. (2014). “The EBI RDF platform: linked open data for the life sci- ences”. In: Bioinformatics 30.9, pp. 1338–1339. Karim, Md Rezaul et al. (2017). “Improving data workflow systems with cloud ser- vices and use of open data for bioinformatics research”. In: Brief. Bioinform. Kelso, Janet, Robert Hoehndorf, and Kay Prüfer (2010). “Ontologies in Biology”. en. In: Theory and Applications of Ontology: Computer Applications. Springer, Dordrecht, pp. 347–371. Khoury, George A, Richard C Baliban, and Christodoulos A Floudas (2011). “Proteome- wide post-translational modification statistics: frequency analysis and curation of the swiss-prot database”. en. In: Sci. Rep. 1, p. 90. Kikuchi, Norihiro et al. (2005). “The carbohydrate sequence markup language (Ca- bosML): an XML description of carbohydrate structures”. en. In: Bioinformatics 21.8, pp. 1717–1718. Kolarich, Daniel et al. (2013). “The Minimum Information Required for a Glycomics Experiment (MIRAGE) Project: Improving the Standards for Reporting Mass-spectrometry- based Glycoanalytic Data”. en. In: Mol. Cell. Proteomics 12.4, pp. 991–995. BIBLIOGRAPHY 221
Komatsoulis, George A et al. (2008). “caCORE version 3: Implementation of a model driven, service-oriented architecture for semantic interoperability”. en. In: J. Biomed. Inform. 41.1, pp. 106–123. Konishi, Yoshitsugu and Kiyoko F Aoki-Kinoshita (2012). “The GlycomeAtlas tool for visualizing and querying glycome data”. en. In: Bioinformatics 28.21, pp. 2849– 2850. Lane, Lydie et al. (2012). “neXtProt: a knowledge platform for human proteins”. In: Nucleic Acids Res. 40.D1, pp. D76–D83. Lapatas, Vasileios et al. (2015). “Data integration in biological research: an overview”. en. In: J. Biol. Res. 22.1, p. 9. Lauc, Gordan et al. (2016). “Mechanisms of disease: The human N-glycome”. In: Biochimica et Biophysica Acta (BBA) - General Subjects 1860.8, pp. 1574–1582. Launay, G et al. (2015). “MatrixDB, the extracellular matrix interaction database: up- dated content, a new navigator and expanded functionalities”. en. In: Nucleic Acids Res. 43.Database issue, pp. D321–7. Le Pendu, Jacques, Kristina Nyström, and Nathalie Ruvoën-Clouet (2014). “Host– pathogen co-evolution and glycan interactions”. In: Curr. Opin. Virol. 7, pp. 88– 94. Leinonen, Rasko et al. (2011). “The European Nucleotide Archive”. In: Nucleic Acids Res. 39.suppl_1, pp. D28–D31. Leser, Ulf and Felix Naumann (2007). Informationsintegration: Architekturen und Meth- oden zur Integration verteilter und heterogener Datenquellen. de. Lieth, Claus-Wilhelm von der et al. (2011). “EUROCarbDB: An open-access platform for glycoinformatics”. en. In: Glycobiology 21.4, pp. 493–502. Lisacek, Frederique et al. (2016). “Databases and Associated Tools for Glycomics and Glycoproteomics”. In: Methods in Molecular Biology, pp. 235–264. Liu, Gang et al. (2017a). “A Comprehensive, Open-source Platform for Mass Spectrometry- based Glycoproteomics Data Analysis”. en. In: Mol. Cell. Proteomics 16.11, pp. 2032– 2047. Liu, Yan et al. (2017b). “The minimum information required for a glycomics experi- ment (MIRAGE) project: improving the standards for reporting glycan microarray- based data”. In: Glycobiology 27.4, pp. 280–284. 222 BIBLIOGRAPHY
Luscombe, N M, D Greenbaum, and M Gerstein (2001). “What is bioinformatics? A proposed definition and overview of the field”. en. In: Methods Inf. Med. 40.4, pp. 346–358. Lütteke, Thomas (2008). “Web resources for the glycoscientist”. en. In: Chembiochem 9.13, pp. 2155–2160. Lütteke, Thomas et al. (2006). “GLYCOSCIENCES.de: an Internet portal to support glycomics and glycobiology research”. In: Glycobiology 16.5, 71R–81R. Manzoni, Claudia et al. (2018). “Genome, transcriptome and proteome: the rise of omics data and their integration in biomedical sciences”. In: Brief. Bioinform. 19.2, pp. 286–302. Mariethoz, Julien et al. (2014). “SugarBindDB, a Resource of Pathogen Lectin-Glycan Interactions”. In: Glycoscience: Biology and Medicine, pp. 1–6. Masseroli, Marco et al. (2014). “Integrated Bio-Search: challenges and trends for the integration, search and comprehensive processing of biological information”. en. In: BMC Bioinformatics 15.1, S2. Matsubara, Masaaki et al. (2017). “WURCS 2.0 Update To Encapsulate Ambiguous Carbohydrate Structures”. en. In: J. Chem. Inf. Model. 57.4, pp. 632–637. Munevar, Steven (2017). “Unlocking Big Data for better health”. en. In: Nat. Biotech- nol. 35.7, pp. 684–686. Oinn, Tom et al. (2004). “Taverna: a tool for the composition and enactment of bioin- formatics workflows”. In: Bioinformatics 20.17, pp. 3045–3054. Okuda, Shujiro, Hiromi Nakao, and Toshisuke Kawasaki (2017). “GlycoEpitope”. en. In: A Practical Guide to Using Glycomics Databases. Springer, Tokyo, pp. 227–245. Packer, Nicolle H et al. (2008). “Frontiers in glycomics: bioinformatics and biomark- ers in disease. An NIH white paper prepared from discussions by the focus groups at a workshop on the NIH campus, Bethesda MD (September 11-13, 2006)”. en. In: Proteomics 8.1, pp. 8–20. Patwardhan, Ardan (2017). “Trends in the Electron Microscopy Data Bank (EMDB)”. en. In: Acta Crystallogr D Struct Biol 73.Pt 6, pp. 503–508. Pérez, Serge et al. (2015). “Glyco3D: a portal for structural glycosciences”. en. In: Methods Mol. Biol. 1273, pp. 241–258. BIBLIOGRAPHY 223
Pérez, Serge et al. (2016). “Glyco3D: A Suite of Interlinked Databases of 3D Struc- tures of Complex Carbohydrates, Lectins, Antibodies, and Glycosyltransferases”. In: A Practical Guide to Using Glycomics Databases, pp. 133–161. Ponniah, Paulraj (2004). Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals. en. John Wiley & Sons. Prasad, T S Keshava et al. (2017). “Integrating transcriptomic and proteomic data for accurate assembly and annotation of genomes”. en. In: Genome Res. 27.1, pp. 133– 144. Quesada-Martínez, Manuel et al. (2017). “Preliminary Analysis of the OBO Foundry Ontologies and Their Evolution Using OQuaRE”. en. In: Stud. Health Technol. In- form. 235, pp. 426–430. Ranzinger, René et al. (2008). “GlycomeDB – integration of open-access carbohydrate structure databases”. In: BMC Bioinformatics 9.1, p. 384. Ranzinger, Rene et al. (2015). “GlycoRDF: an ontology to standardize glycomics data in RDF”. en. In: Bioinformatics 31.6, pp. 919–925. Redaschi, Nicole and UniProt Consortium (2009). “UniProt in RDF: Tackling Data Integration and Distributed Annotation with the Semantic Web”. In: Nature Pre- cedings. Sahoo, Satya S et al. (2005). “GLYDE-an expressive XML standard for the represen- tation of glycan structure”. en. In: Carbohydr. Res. 340.18, pp. 2802–2807. Sahoo, Satya S et al. (2006). “Knowledge modeling and its application in life sci- ences”. In: Proceedings of the 15th international conference on World Wide Web - WWW ’06. Salas-Zárate, María del Pilar et al. (2015). “Analyzing best practices on Web devel- opment frameworks: The lift approach”. In: Science of Computer Programming 102, pp. 1–19. Shannon, Paul et al. (2003). “Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks”. en. In: Genome Res. 13.11, pp. 2498– 2504. Shoaib, Muhammad, Adnan Ahmad Ansari, and Sung-Min Ahn (2017). “cMapper: gene-centric connectivity mapper for EBI-RDF platform”. In: Bioinformatics 33.2, pp. 266–271. 224 BIBLIOGRAPHY
Silva, Jose Cleydson F et al. (2017). “Geminivirus data warehouse: a database en- riched with machine learning approaches”. en. In: BMC Bioinformatics 18.1, p. 240. Smith, Barry et al. (2007). “The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration”. en. In: Nat. Biotechnol. 25.11, pp. 1251–1255. Stelzer, Gil et al. (2016). “The GeneCards Suite: From Gene Data Mining to Disease Genome Sequence Analyses”. In: Current Protocols in Bioinformatics, pp. 1.30.1– 1.30.33. Stevens, R et al. (2000). “TAMBIS: transparent access to multiple bioinformatics in- formation sources”. en. In: Bioinformatics 16.2, pp. 184–185. Struwe, Weston B et al. (2016). “The minimum information required for a glycomics experiment (MIRAGE) project: sample preparation guidelines for reliable report- ing of glycomics datasets”. In: Glycobiology 26.9, pp. 907–910. Tanabe, Mao and Minoru Kanehisa (2012). “Using the KEGG Database Resource”. In: Current Protocols in Bioinformatics. Taylor, Chris F et al. (2007). “The minimum information about a proteomics experi- ment (MIAPE)”. en. In: Nat. Biotechnol. 25.8, p. 887. Terrapon, Nicolas et al. (2017). “The CAZy Database/the Carbohydrate-Active En- zyme (CAZy) Database: Principles and Usage Guidelines”. en. In: A Practical Guide to Using Glycomics Databases. Springer, Tokyo, pp. 117–131. Tiemeyer, Michael et al. (2017). “GlyTouCan: an accessible glycan structure reposi- tory”. en. In: Glycobiology 27.10, pp. 915–919. Toukach, Philip V (2011). “Bacterial carbohydrate structure database 3: principles and realization”. en. In: J. Chem. Inf. Model. 51.1, pp. 159–170. Turk, Žiga (2006). “Construction informatics: Definition and ontology”. In: Advanced Engineering Informatics 20.2, pp. 187–199. Varki, Ajit et al., eds. (2010). Essentials of Glycobiology. Cold Spring Harbor (NY): Cold Spring Harbor Laboratory Press. Varki, Ajit et al. (2015). “Symbol Nomenclature for Graphical Representations of Gly- cans”. en. In: Glycobiology 25.12, pp. 1323–1324. Vizcaíno, Juan Antonio et al. (2016). “2016 update of the PRIDE database and its related tools”. In: Nucleic Acids Res. 44.D1, pp. D447–D456. BIBLIOGRAPHY 225
W3C Semantic Web Activity Homepage. https://www.w3.org/2001/sw/. Accessed: 2018-6-1. webcomponents.org - Discuss & share web components. https://www.webcomponents. org/introduction. Accessed: 2018-6-14. Weidman, Scott and Thomas Arrison (2010). Steps Toward Large-Scale Data Integra- tion in the Sciences: Summary of a Workshop. Washington (DC): National Academies Press (US). York, William S et al. (2014). “MIRAGE: The minimum information required for a glycomics experiment”. In: Glycobiology 24.5, pp. 402–406. Yu, Xinwei et al. (2016). “Profiling IgG N-glycans as potential biomarker of chrono- logical and biological ages: A community-based study in a Han Chinese popula- tion”. In: Medicine 95.28, e4112. Zerbino, Daniel R et al. (2018). “Ensembl 2018”. In: Nucleic Acids Res. 46.D1, pp. D754– D761. Zhang, Hongen (2016). “Overview of Sequence Data Formats”. en. In: Methods Mol. Biol. 1418, pp. 3–17. Ziegler, Patrick and Klaus R Dittrich (2004). “Three Decades of Data Integration — all Problems Solved?” In: IFIP International Federation for Information Processing, pp. 3–12.