
Thesis

An interactive and integrative view of glycobiology

ALOCCI, Davide

Abstract

Glycosylation is one of the most abundant post-translational modifications (PTM) and the primary cause of microheterogeneity in proteins (glycoforms). It consists of the addition of a sugar moiety on the surface of a protein, and it has been suggested that over 50% of mammalian cellular proteins are typically glycosylated. Despite its importance, glycobiology is still lacking concrete bioinformatics support which, in the last two decades, has boosted other major ’omics’ like proteomics, transcriptomics and genomics. Additionally, the inherent complexity of carbohydrates and the use of many orthogonal experimental techniques to elucidate a structure have slowed down the annotation of novel glycans, leaving glycobiology fragmented internally. This thesis aims at the creation of a collection of integrative, explorative and knowledge-based tools to reconstitute the puzzle of biological evidence produced by different glycomics experiments and, at the same time, fill the gap with other ’omics’.

Reference

ALOCCI, Davide. An interactive and integrative view of glycobiology. Thèse de doctorat : Univ. Genève, 2018, no. Sc. 5280

URN : urn:nbn:ch:unige-1122950 DOI : 10.13097/archive-ouverte/unige:112295

Available at: http://archive-ouverte.unige.ch/unige:112295


UNIVERSITÉ DE GENÈVE FACULTÉ DES SCIENCES

Département d’informatique

Dr Frédérique Lisacek

An Interactive and Integrative view of Glycobiology

THÈSE

Présentée à la Faculté des sciences de l’Université de Genève pour obtenir le grade de Docteur ès sciences, mention bioinformatique

Par Davide Alocci de Orbetello (Italie)

Thèse N° 5280

GENÈVE 2018


Résumé en Français

La glycosylation est l’une des modifications post-traductionnelles (PTM) les plus abondantes ainsi que la première cause de la microhétérogénéité des protéines (glycoformes). Elle consiste en l’adjonction d’une molécule de sucre (glycane) à la surface d’une protéine, et il a été suggéré que plus de la moitié des protéines cellulaires des mammifères sont glycosylées.

Malgré son importance, la glycobiologie reste isolée des autres domaines de la biologie et les ressources bioinformatiques nécessaires à son essor sont aussi éparpillées, peu standardisées et rarement intégrées. Pourtant, au cours des deux dernières décennies, la bioinformatique et l’intégration de données ont joué un rôle crucial dans les autres disciplines ‘-omiques’ comme la protéomique, la transcriptomique et la génomique. De plus, la complexité intrinsèque des glycanes ainsi que le recours à une large panoplie de techniques expérimentales orthogonales pour élucider leurs structures ralentissent la découverte de nouveaux glycanes et contribuent à la fragmentation interne de la glycobiologie.

Ce travail de recherche vise à la création d’un ensemble d’outils pour reconstituer le puzzle des différentes sources de données expérimentales produites en glycomique ainsi qu’à rétablir la liaison avec les autres ressources ‘-omiques’. Afin de caractériser l’objectif propre de chaque outil, nous les avons classés en trois catégories : intégratif, exploratoire et basé sur la connaissance.

Les outils de nature intégrative ont été conçus pour combiner des informations que l’on trouve dans différentes bases de données et qui ne peuvent pas être intégrées par un simple hyperlien. Cette catégorie est dominée par deux outils principaux : GlyS3 et EpitopeXtractor. Le premier recherche des sous-structures de glycanes, le deuxième définit les épitopes de glycanes contenus dans un ou plusieurs glycanes complets sur la base d’un ensemble de référence.

Les outils de nature exploratoire permettent à l’utilisateur d’analyser des données à travers une approche graphique et interactive. Le premier logiciel de cette catégorie est PepSweetener, qui facilite l’annotation manuelle des glycopeptides. Le deuxième est Glynsight qui, grâce à une interface web dédiée, permet de visualiser et de comparer des profils de glycanes. Enfin, nous avons aussi repris la ressource Glydin’, qui permet de naviguer dans un réseau de glyco-épitopes pour mettre en valeur la parenté éventuelle de ligands dans des données d’interactions sucre-protéine.

La dernière catégorie concerne les outils basés sur la connaissance. La plateforme centrale, appelée GlyConnect, est un tableau de bord intégré pour l’analyse de données glycomiques et glycoprotéomiques.

En conclusion, tous les outils présentés dans cette thèse font partie de l’initiative Glycomics@ExPASy, qui vise à diffuser l’utilisation de la bioinformatique en glycobiologie et à mettre l’accent sur la relation entre la glycobiologie et la bioinformatique à partir des protéines.

Abstract

Glycosylation is one of the most abundant post-translational modifications (PTM) and the primary cause of microheterogeneity in proteins (glycoforms). It consists of the addition of a sugar moiety on the surface of a protein, and it has been suggested that over 50% of mammalian cellular proteins are typically glycosylated.

Despite its importance, glycobiology is still lacking concrete bioinformatics support which, in the last two decades, has boosted other major ’omics’ like proteomics, transcriptomics and genomics. Additionally, the inherent complexity of carbohydrates and the use of many orthogonal experimental techniques to elucidate a structure have slowed down the annotation of novel glycans leaving glycobiology fragmented internally. This thesis aims at the creation of a collection of tools to reconstitute the puzzle of biological evidence produced by different glycomics experiments and, at the same time, fill the gap with other ’omics’. To characterise the aim of each tool, we have created three different categories: integrative, explorative and knowledge-based.

Integrative tools have been designed to combine specific pieces of information which are stored in different databases and cannot be integrated by a simple connection. The category is populated by two different tools: GlyS3 and EpitopeXtractor. The first performs glycan substructure search whereas the second defines which glycan epitopes are contained in one or more glycans based on a curated collection. Explorative tools enable the user to perform data analysis with a graphical and interactive approach, speeding up the analysis process. The first tool of this category is PepSweetener which facilitates the manual annotation of glycopeptides. Then, we have Glynsight which proposes a way to visualise and compare glycan profiles with a dedicated web-interface. To conclude, we mention Glydin’, a network visualisation of almost 600 glycan epitopes where connections exemplify substructure relationships. The last category regards the knowledge-based tools and is filled by GlyConnect, an integrated dashboard for glycomics and glycoproteomics analysis. All tools presented in this thesis are part of the Glycomics@ExPASy initiative which aims at spreading the use of bioinformatics in glycobiology and stressing the relation between glycobiology and protein-oriented bioinformatics.

Acknowledgements

Firstly, I would like to express my sincere gratitude to my advisor Dr Frédérique Lisacek for the continuous support of my PhD study and related research, and for her patience and motivation. She always supported my ideas, promoting the exploration of new technologies.

Besides my advisor, I would like to thank the rest of my thesis committee: Prof. Bastien Chopard, Prof. Amos Bairoch, and Prof. Serge Perez, for their insightful comments and encouragement.

I would like to thank my fellow labmates for the stimulating discussions during coffee breaks and for all the fun we have had in the last four years.

Additionally, I would like to thank the Swiss Institute of Bioinformatics and all the partners of the Gastric Glyco Explorer for giving me the opportunity to reach the end of this PhD.

Finally, I would like to thank my family and my friends for supporting me spiritually throughout the writing of this thesis and my life in general.


Contents

Résumé en Français

Abstract

Acknowledgements

Contents

1 Introduction
1.1 Bioinformatics
1.2 Data Integration
1.2.1 Data discovery
1.2.2 Data exploitation
1.2.3 Data provenance
1.3 Data integration strategies
1.3.1 Service oriented architecture
1.3.2 Link integration
1.3.3 Data warehousing
1.3.4 View integration
1.3.5 Model-driven service oriented architecture
1.3.6 Integration applications
1.3.7 Workflows
1.3.8 Mashups
1.3.9 The Semantic Web and RDF
1.4 Data integration in bioinformatics
1.4.1 Standards
1.4.2 Ontologies
1.4.3 Formats and reporting guidelines
1.4.4 Identifiers
1.4.5 Visualisation
1.5 Glycomics
1.6 Data integration in glycomics
1.6.1 Glycan formats
1.6.2 Glycan graphical representations
1.6.3 Glycan composition format
1.6.4 Identifiers
1.6.5 Reporting Guidelines
1.6.6 Ontologies
1.6.7 Visualisation
1.7 Web development
1.7.1 Web Components
1.8 Objectives and Thesis Overview

2 Glycomics@ExPASy
2.1 Overview
2.2 Concluding remarks

3 Integrative tools
3.1 Overview
3.2 Concluding remarks

4 Explorative tools
4.1 Overview
4.2 Concluding remarks

5 Knowledge-based tools
5.1 Overview
5.2 Concluding remarks

6 Discussion
6.1 Achievements
6.1.1 Reassemble the glycoprotein puzzle
6.1.2 Bridging glycoproteins and carbohydrate-binding proteins
6.1.3 Providing tools for data analysis in glycomics
6.1.4 Standards and community feedbacks
6.2 Technical discussion
6.2.1 Service Oriented Architecture
6.2.2 Resource description framework
6.2.3 Web components

7 Conclusion
7.1 GlyConnect
7.1.1 Data provenance
7.1.2 Semi-automated curation pipeline
7.1.3 Cross-references
7.2 GlyS3
7.2.1 Ontology refinement
7.3 PepSweetener
7.3.1 Dataset automatic update
7.4 Glynsight
7.4.1 Multiple dataset analysis
7.5 Glydin’
7.5.1 Data enhancement and interface improvements
7.6 Final thoughts

A Other Co-author papers

B Technical Notes

Bibliography

Chapter 1

Introduction

In the last two years, humanity has created more data than in its whole history (Munevar, 2017). In particular, the life sciences have seen an explosion of new data due to the spread of high-throughput technologies. For instance, recent advances in single-cell and cellular imaging have generated a considerable amount of new high-dimensional data (Patwardhan, 2017), (Leinonen et al., 2011). According to Cook et al., the European Bioinformatics Institute (EBI) doubled its data storage in less than two years, reaching 120 petabytes (PB) in 2017 (Cook et al., 2018). To put this number in perspective, at QCon San Francisco 2016 the Netflix data warehouse was reported to store 60 PB of content, and in 2014 Facebook was generating 4 PB of new data per day.

Figure 1.1 shows the growth of EBI data storage divided into different data types: sequence, mass spectrometry and microarray. Since the scale on the Y-axis is logarithmic, the growth of all data types is exponential, and total storage may exceed an exabyte of data by 2022 (Cook et al., 2018).

FIGURE 1.1: Data growth at the European Bioinformatics Institute (EMBL-EBI), divided by type: nucleotide sequence (orange), mass spectrometry (blue) and microarray (brown). Courtesy of “The European Bioinformatics Institute in 2017: data coordination and integration” (Cook et al., 2018).

The increasing availability and growth rate of biological and biomedical information provides an opportunity for personalised medicine. However, collecting information is not enough to understand and control the biological mechanisms underlying a living system. In "Data integration in the era of omics: current and future challenges", Gomez-Cabrero et al. describe research in life science as a two-step workflow: 1) identify the components that make up a living system; 2) understand the interactions between all components that make the system alive (Gomez-Cabrero et al., 2014). Collecting more and more data, at a reduced cost, is a way to catalogue the elements of life and fulfil the first step of the workflow. Nevertheless, to accomplish the second step, data sources must be integrated under mathematical and relational models which can describe the relationships among components. Besides the production of biological data, which is delegated to experimental methods, all the other milestones, such as data collection, storage, integration and analysis, fall under the umbrella of bioinformatics. Bioinformaticians tackle a wide range of problems, from data production to data analysis, building methods and software applications to process data and ultimately understand biological systems.

The current situation is depicted in Figure 1.2, where proteomics, transcriptomics and genomics are at the core of biology and medicine whereas lipidomics, metabolomics and glycomics are seldom taken into account. Bioinformatics has played an essential role in the integration of genomics, proteomics and transcriptomics (Manzoni et al., 2018). The development of an integrated approach to assemble and annotate newly sequenced genomes from transcriptome and proteome data is just one example of the importance of bioinformatics (Prasad et al., 2017). However, other issues like the extent of proteoforms (Aebersold et al., 2018) or that of the metabolome still hamper data integration, leaving the puzzle incomplete. Furthermore, several other -omics fields such as lipidomics or glycomics are remote in this picture, mainly because of technical and knowledge gaps. This is particularly the case for glycomics, which remains internally fragmented and lacks solid bioinformatics support. Only the integration of all the ’omics’ in Figure 1.2 will lead to an in-silico simulation of a living cell.

FIGURE 1.2: A prospect of the integration of ‘omics’ research in Biology and Medicine. Courtesy of “Databases and Associated Tools for Glycomics and Glycoproteomics” (Lisacek et al., 2016).

As mentioned above, the identification of proteoforms (each individual molecular form of an expressed protein) has become a topic of concern for understanding biological processes further. A broad range of sources of biological variation alters the amino acid sequence that results from the conceptual translation of a nucleotide sequence. Among these, post-translational modifications are prevalent, and in particular glycosylation, which is considered the most common in terms of putative sites (Khoury, Baliban, and Floudas, 2011). It consists of the addition of a carbohydrate to selected residues. Details of this process are provided in the glycomics section.

This thesis focuses on the means to boost the usage of bioinformatics in glycobiology, which encompasses glycomics, and, as a side effect, to integrate glycosylation data with other more advanced “omics”, making a step towards systems biology as hinted at in Figure 1.2. In this context, several web tools have been developed, and they are presented in this thesis.

We first illustrate some fundamental principles of bioinformatics related to data integration. Then, we introduce glycomics and explain our approach for integrating glycomics data. Finally, we present the basic principles of web development which we have applied.

1.1 Bioinformatics

The application of computational techniques to organise and analyse information on biological macromolecules is defined as bioinformatics (Luscombe, Greenbaum, and Gerstein, 2001). This relatively new discipline applies the principles of information technology and computer science to biological data. Computers can indeed handle large quantities of data and perform large-scale analyses in a limited amount of time. The aim of bioinformatics is generally two-fold. First, bioinformaticians organise information in a way that allows it to be retrieved quickly; this aim is typically associated with the creation of databases such as UniProt (Bateman et al., 2017), which stores all the information about known proteins. The second aim is to develop software applications which support data analysis; examples include the many tools that support genome assembly and the comparison of sequenced genomes.

In this thesis, we have addressed both aims of bioinformatics by developing three different classes of software applications: integrative, knowledge-based and explorative tools. The first and second classes are directly connected with the first aim of bioinformatics, whereas explorative tools support data analysis, which is related to the second aim. However, all classes of tools deal with the problems of data integration and data visualisation. Thus, in the following section, we present the problem of data integration with a focus on bioinformatics. Data visualisation is introduced as part of the data integration process.

1.2 Data Integration

Data integration is one of the main challenges in bioinformatics. Although the term "data integration" often appears in research, especially in life science, a consolidated definition is still missing. We report here several definitions of "data integration" which will help in understanding the central concept and how it evolved.

In the beginning, the "data integration" problem was restricted to database research. Ziegler and Dittrich describe the integration problem as aiming at "combining selected systems so that they form a unified new whole and give users the illusion of interacting with one single information system" (Ziegler and Dittrich, 2004). A higher-level definition comes from Leser and Naumann, who define "data integration" as "a redundancy-free representation of information from a collection of data sources with overlapping contents" (Leser and Naumann, 2007). More recently, Gomez-Cabrero et al. describe "data integration as the use of multiple sources of information (or data) to provide a better understanding of a system/situation/association/etc" (Gomez-Cabrero et al., 2014).

Out of these three options, the last definition is the most general and complete. The authors emphasise the use of several sources to discover new insights that are hidden when looking at a single source. We propose to adopt this definition of "data integration" and consider the two main challenges associated with it: 1) finding relevant data sources (data discovery); 2) using collected data to produce new findings (data exploitation) (Weidman and Arrison, 2010). Additionally, we tackle the problem of data provenance, which arises once data sources are integrated and concerns tracing the origin of each piece of information.

1.2.1 Data discovery

The data discovery challenge entails the search for relevant data sources to describe a system. Nowadays, finding data sources for a specific problem is easy; however, finding a relevant source for a system of interest is becoming more and more challenging. The ease of publishing data through the Web has contributed to an explosion of online resources and publicly accessible datasets. Furthermore, the picture becomes more and more fragmented as each sub-discipline provides its own data representation. In biology, the problem arises not only from aiming to combine different omics but also from reconciling information even within the same field. For example, in glycomics several different formats are available to describe glycan structures. We list the main three below:

• IUPAC format, which is regulated by the International Union of Pure and Applied Chemistry

• GlycoCT format which has been developed under the EUROCarbDB project

• WURCS format, which was developed in Japan and is used by the structure repository GlyTouCan.

The creation of more and more heterogeneous data sources leads to what has been called "A loose federation of bio-nation" (Goble and Stevens, 2008). In this context, the use of standards and community guidelines, which are tackled in the "Data integration in bioinformatics" section, is the only way to stop data fragmentation, smoothing the way towards efficient data discovery.

1.2.2 Data exploitation

Data exploitation is the process of exploring a set of data sources to provide new insights. Before starting data exploitation, scientists must have a complete overview of each dataset, including the units of measure and more subtle aspects such as environmental conditions, equipment calibrations, preprocessing algorithms, etc. (Weidman and Arrison, 2010). Having in-depth knowledge of the data is the only way to avoid the "garbage-in garbage-out" phenomenon that could affect the outcome of data exploitation.

Once each dataset has been fully clarified, researchers can focus on developing methodologies to analyse the data. Methodologies can vary according to the different data types. In this regard, we distinguish between "similar" and "heterogeneous" types. Following Hamid et al., we consider data to be of "similar type" when they are produced by the same underlying source, for example when they are all gene expression datasets. If multiple data sources, such as gene expression and protein quantification datasets, are taken into account, we refer to the data as being of "heterogeneous type" (Hamid et al., 2009). Although researchers are developing more and more hybrid methodologies for data integration, a set of general techniques will be presented in the next section.

The creation of data explorative tools is the last but not least step of the data exploitation process. The development of user-friendly interfaces to navigate the outcome of data exploitation can determine the success of the whole process. Although scientists usually put most of their attention on methodologies and algorithms, the design of custom visualisations can be time-consuming, especially due to data heterogeneity and data volume.

1.2.3 Data provenance

With the integration of more and more data sources, understanding the derivation of each piece of information becomes an issue. This is referred to as the data provenance issue in the literature. As in the case of data integration, data provenance has different definitions according to the field. In the context of database systems, Buneman et al. describe it as "the description of the origins of a piece of data and the process by which it arrived in a Database" (Buneman, Khanna, and Wang-Chiew, 2001). However, data provenance is not only related to the data but also to the processes that led to the creation of the data. For this reason, Greenwood et al. (Greenwood et al., 2003) define data provenance as part of metadata which records “the process of biological experiments for e-Science, the purpose and results of experiments as well as annotations and notes about experiments by scientists”.

For our purposes, we characterise data provenance as the record of the origin and all transformations that every dataset has undergone. It is quite a central question, since the corresponding solutions enable users to trace back to the original data. Additionally, this concern for tracing information back provides the means to understand the different workflows that have been applied to integrate each dataset. For example, if a user is interested in removing all datasets produced from mouse, he/she can use the data provenance metadata to exclude these datasets and use the rest of the information to perform new analyses (Masseroli et al., 2014). Data provenance is also essential in asserting the quality of a dataset, especially in fields where information is treated manually and the quality can vary according to the curator. In this thesis, data integration issues have prevailed and, only towards the end, as we started considering high-throughput data, did the problem of data provenance become apparent. Consequently, we discuss this topic in the conclusion of the thesis.

1.3 Data integration strategies

In the last 20 years, software developers have explored a wide variety of methodologies for data integration. Each of these methodologies has a reference technology and uses one or more touch-points. A touch-point is the logical connection between different data sources which makes integration possible, for example data values, names, ontology terms, keywords, etc. In this section we list a few popular approaches that have been identified by Goble et al. These methods show different levels of integration, ranging from lightweight solutions to heavyweight mechanisms (Figure 1.3).

FIGURE 1.3: Schema of the different data integration strategies placed according to their level of integration. From left to right we go from heavyweight to lightweight integration level.

1.3.1 Service oriented architecture

Service oriented architecture (SOA) is a way to integrate different data sources which can be accessed through a programmatic interface. The beauty of SOA resides in the loose coupling among different resources. The implementation of each resource is entirely masked behind its interface: as long as the interface remains the same, each resource provider can change and expand its resource without causing problems for the rest of the architecture. Usually, SOA relies on technologies like the Simple Object Access Protocol (SOAP) and Representational State Transfer (REST), which allow the creation of web services (Dudhe and Sherekar, 2014). Due to its simplicity, as opposed to SOAP's verbosity, REST has emerged as a de facto standard for service design in Web 2.0 applications (Battle and Benson, 2008).

Data sources can be accessed through programmatic interfaces, which eradicates the problem of user simulation and screen-scraping. However, the poor stability of web services and the lack of documentation represent the major issues for SOA and can make the corresponding data impossible to use.
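To make the idea concrete, the minimal sketch below shows how a REST interface hides the provider's implementation behind a simple URL pattern and a JSON payload. It uses the widely available Python requests library; the base URL, the /glycan/{accession} route and the returned fields are hypothetical and do not correspond to any actual service.

```python
# Minimal SOA/REST client sketch: only the URL pattern and the JSON shape are
# agreed upon; the provider can change its internal implementation freely.
# The endpoint and fields below are hypothetical.
import requests

BASE_URL = "https://glyco.example.org/api"  # hypothetical service root


def fetch_glycan(accession: str) -> dict:
    """Retrieve one glycan record through the programmatic interface."""
    response = requests.get(f"{BASE_URL}/glycan/{accession}", timeout=10)
    response.raise_for_status()  # surfaces the common case of an unstable service
    return response.json()


if __name__ == "__main__":
    record = fetch_glycan("G01614ZM")
    print(record.get("composition"))
```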

1.3.2 Link integration

FIGURE 1.4: The cross-reference section of a UniProt entry, which represents an example of link integration.

Link integration is a technique where an entry from a data source is directly cross-linked to another entry from a different data source. Because many data sources are websites and, subsequently, entries are web pages, link integration can also be called hyperlink integration. Users can navigate among different data sources using the hyperlinks present in each web page. The UniProt "Cross-references" section, available in each entry, represents an example of link integration. In this section, a list of links connects the entry to sequence databases, 3D structure databases, etc. (Figure 1.4). Link integration is broadly used in bioinformatics and can be seen as a lightweight integration technique. This method relies on service provider agreement, and it is vulnerable to name clashes, updates and ambiguous cases (Goble and Stevens, 2008).
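As a small illustration of how fragile this coupling is, the sketch below generates hyperlinks from stored identifiers and per-database URL templates: if a provider changes its URL scheme, or identifiers clash, every generated link silently breaks. The templates follow publicly documented URL patterns but should be treated as illustrative, and the entry content is a toy example.

```python
# Link integration sketch: an entry stores only foreign identifiers; hyperlinks
# are generated from URL templates, so any change on the provider side breaks them.
URL_TEMPLATES = {
    "PDB": "https://www.rcsb.org/structure/{id}",
    "GlyTouCan": "https://glytoucan.org/Structures/Glycans/{id}",
}

# Toy cross-reference section of a single, hypothetical protein entry.
entry_cross_refs = {"PDB": ["1ABC"], "GlyTouCan": ["G01614ZM"]}

for database, identifiers in entry_cross_refs.items():
    for identifier in identifiers:
        print(database, "->", URL_TEMPLATES[database].format(id=identifier))
```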

1.3.3 Data warehousing

FIGURE 1.5: Data warehouse model of BI360 by Solver (https://www.solverglobal.com/it-it/products/data-warehouse).

The data warehousing technique has its roots in the corporate world. All the data produced within an organisation is extracted, transformed and loaded (ETL) into a new general data model (Figure 1.5). In this combined shape, data can be analysed to provide useful strategic information for business intelligence (Ponniah, 2004). Data is stored and queried as a monolithic, integrated resource which relies on a predefined schema. In contrast to the previous methods, data warehousing represents a heavyweight mechanism for data integration. Due to the high initial costs of implementing a data warehouse and the fixed model, which can hardly change with time, this technique has failed to last in life science applications. The method is well suited to companies which have control over data production but becomes particularly unsafe when data is produced by third parties who may change their model unpredictably at any time. When one or more data sources cannot be synchronized with the data warehouse, the only solution is to redesign the underlying data model from scratch, which is costly. A recent example of a data warehouse in bioinformatics is Geminivirus.org (Silva et al., 2017).
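The sketch below illustrates the ETL shape of a warehouse in miniature, assuming two hypothetical sources with different field names: rows are extracted, transformed into one fixed schema and loaded into a local SQLite table. The identifiers are placeholders; a real warehouse differs mainly in scale and in the cost of changing the predefined schema.

```python
# Miniature ETL sketch: extract rows from heterogeneous sources, transform them
# into one predefined warehouse schema, and load them into SQLite.
import sqlite3

source_a = [{"prot": "P02763", "glycan": "glycan_1"}]              # e.g. a CSV export
source_b = [{"protein_ac": "P01857", "structure_id": "glycan_2"}]  # e.g. an API dump


def transform(row: dict) -> tuple:
    """Map source-specific field names onto the fixed warehouse schema."""
    protein = row.get("prot") or row.get("protein_ac")
    glycan = row.get("glycan") or row.get("structure_id")
    return (protein, glycan)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE glycosylation (protein TEXT, glycan TEXT)")
conn.executemany(
    "INSERT INTO glycosylation VALUES (?, ?)",
    [transform(row) for row in source_a + source_b],
)
print(conn.execute("SELECT * FROM glycosylation").fetchall())
```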

1.3.4 View integration

View integration is based on the same concept as data warehousing without providing a monolithic integrated resource. In this methodology, data is kept within the sources and integrated on the fly to provide a fresh integrated view. Users have the illusion of querying a unique resource but, in the background, data is pulled from the several sources using ad-hoc drivers (Halevy, 2001). The mediated schema of the view is defined at the beginning, as in data warehousing, but drivers can be adapted to support changes in the data sources. However, drivers tend to grow with time, making maintenance more and more complicated. Additionally, the overall performance can be an issue since, in view integration, the query speed is limited by the slowest source of information. TAMBIS (Stevens et al., 2000) can be considered the first example of view integration in bioinformatics. This software application was able to perform tasks using several data sources and analytical tools. TAMBIS used a model of knowledge built from a list of concepts and their relationships. However, the tool is no longer maintained.
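A minimal sketch of this mediator pattern follows, with two hypothetical in-memory "drivers" standing in for real databases: the user calls one mediated query function, the drivers pull data at query time, and the response is only as fast as the slowest driver.

```python
# View integration sketch: data stays in the sources; per-source drivers are
# called at query time and their results are merged into one fresh view.
def driver_source_a(protein: str) -> list:
    return [{"protein": protein, "glycan": "glycan_1", "source": "A"}]


def driver_source_b(protein: str) -> list:
    return [{"protein": protein, "glycan": "glycan_2", "source": "B"}]


DRIVERS = [driver_source_a, driver_source_b]


def mediated_query(protein: str) -> list:
    """The integrated 'view' the user sees; latency is bounded by the slowest driver."""
    results = []
    for driver in DRIVERS:
        results.extend(driver(protein))
    return results


print(mediated_query("P02763"))
```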

1.3.5 Model-driven service oriented architecture

The model-driven service oriented architecture represents a hybrid technique which combines SOA and view integration. Usually, this methodology is used by notable projects which are able to define a common data model for a particular context. In this way, data producers can join the infrastructure only if they are fully compliant with the predefined model. One example of model-driven SOA is caCORE, a software infrastructure which allows the creation of "interoperable biomedical information systems" (Komatsoulis et al., 2008). Using caCORE, scientists can access (syntactic interoperability) as well as understand (semantic interoperability) the data once it has been retrieved. caCORE has been used to create the cancer Biomedical Informatics Grid (caBIG), which consists of a data grid system where scientists can publish and exchange cancer-related data. In 2012, caBIG was followed by the National Cancer Informatics Program.

1.3.6 Integration applications

Integration applications are special tools designed to integrate data in a single application domain. Contrary to view integration and data warehousing, integration applications are far from being general integration systems. Software developers tailor the application according to the needs of a specific sub-field. In this way, the application is well suited to a specific application domain, but it cannot be transposed to another field. Due to their custom implementation, integration applications are usually a mix of several data integration methodologies. An excellent example of this technique is the EnsEMBL genome database project (http://www.ensembl.org/) (Zerbino et al., 2018). EnsEMBL is a genome browser built on top of an open source framework which can store, analyse and visualise large genomes. It provides access to an automatic annotation of the human genome sequence which is based on several external data sources.

1.3.7 Workflows

FIGURE 1.6: An example of an Apache Taverna workflow for proteomics, taken from http://ms-utils.org/Taverna/. In particular, this workflow identifies peptides from mass spectrometry data using X!Tandem and validates assignments with PeptideProphet.

In data integration, a workflow describes the set of transformations which are applied to the data. A workflow can be built using a set of custom scripts or by taking advantage of one of the available workflow management systems. When software like KNIME (Fillbrunn et al., 2017), Apache Taverna (Oinn et al., 2004) or Galaxy (Afgan et al., 2016) is used, researchers can perform in silico experiments without becoming either software developers or scripting language experts (Karim et al., 2017).

Contrary to data warehousing and view integration, the integration process is defined by a series of transformations which are publicly exposed. Figure 1.6 describes a workflow for the identification of peptides in mass spectrometry data: light blue boxes are file inputs whereas brown boxes are running scripts like PeptideProphet. Workflows can cope with unreliable data sources, and they can be adapted to face changes in data production. However, they are not the solution to every problem, since the design of a workflow can be hard and its quality is strictly bound to the data sources it integrates.
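In its simplest form, a workflow is just an ordered chain of transformation steps applied to the data; workflow managers such as Taverna, KNIME or Galaxy wrap this same idea with provenance tracking, scheduling and a graphical editor. The toy steps below are illustrative placeholders, not a real proteomics pipeline.

```python
# Workflow sketch: an explicit, publicly visible chain of transformations.
def parse_input(raw: str) -> list:
    """Drop comments and empty lines from a toy text input."""
    return [line for line in raw.splitlines() if line and not line.startswith("#")]


def filter_quality(records: list) -> list:
    """Discard records flagged as low quality."""
    return [record for record in records if not record.endswith("LOW")]


def annotate(records: list) -> list:
    """Attach a placeholder annotation to each surviving record."""
    return [f"{record}\tannotated" for record in records]


WORKFLOW = [parse_input, filter_quality, annotate]


def run(workflow, data):
    """Apply each step in order; intermediate results could be logged for provenance."""
    for step in workflow:
        data = step(data)
    return data


print(run(WORKFLOW, "# header\nid1 OK\nid2 LOW"))
```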

1.3.8 Mashups

FIGURE 1.7: An example of a mashup extracted from “A Secure Proxy-Based Cross-Domain Communication for Web Mashups” (Hsiao et al., 2011). The housing data feed is integrated with Google Maps to show the position of each entry.

All the methodologies proposed, except for link integration and workflows, require robust database and programming expertise. For some techniques, changing the data model or adding new data sources represents a significant issue. For this reason, with the rise of Web 2.0, mashups have emerged. A mashup is a web page or web application where a collection of data sources is combined to create a new service. To create a mashup, we identify possible data sources that can be integrated to produce a novel functionality. Figure 1.7 shows a mashup created by overlaying Craigslist apartment and housing listings (on the right) onto Google Maps.

Mashups provide a light integration which is closer to aggregation than to integration. However, they produce useful lightweight applications which can be built quickly with limited expertise. Mashup tools like Pipes (https://www.pipes.digital/) use a graphical workflow editor to bridge web resources easily. Data visualisation tools like Google Maps and Google Earth offer the possibility to display and combine georeferenced data on this principle as well.

1.3.9 The Semantic Web and RDF

The semantic web, also called the web of data, "provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries" (W3C Semantic Web Activity Homepage). It is based on the Resource Description Framework (RDF), which consists of a standard model for data interchange on the web. The graph-based structure provided by RDF is responsive to change and supports semantic descriptions of data. All data in RDF are in the form of triples, i.e., statements always composed of a subject, a predicate and an object. Each of these resources is identified by a Uniform Resource Identifier (URI) or a blank node; the latter identifies anonymous resources which can only be subjects or objects. Referring to data with the same URI across several data sources, when these are semantically equal, allows the creation of touch points. As mentioned earlier, touch points connect multiple knowledge graphs and allow the creation of integrated resources. To store data in an RDF endpoint, it is necessary to draft an ontology, which can be described using the RDF Schema (RDFS) or the Web Ontology Language (OWL). Additionally, ontologies are helpful for understanding the graph structure of resources and for linking resources which have shared information. Once data are stored in the triple store, SPARQL, a SQL-like query language, is used to retrieve data from each resource. Using federated queries, SPARQL interacts with and retrieves data from multiple connected data sources. The logic is masked from the user, who interacts with one single interface. To conclude, publishing data using RDF allows connecting different data sources that have one or multiple touch-points. The ultimate goal of the semantic web is to have a single web of connected endpoints which can be queried as a single resource. Cmapper, for example, is a web application that allows navigating all EBI RDF endpoints at once using the gene name as a touch point (Shoaib, Ansari, and Ahn, 2017).
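The sketch below shows the triple model and a SPARQL query on a tiny local graph, using the rdflib Python library (assumed installed). The namespace, property names and attached values are illustrative only and are not taken from any real glycomics ontology; only the UniProt URI pattern follows a published convention.

```python
# RDF sketch: data as (subject, predicate, object) triples, queried with SPARQL.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/glyco/")  # illustrative namespace

g = Graph()
g.add((EX.glycan_1, EX.hasComposition, Literal("Hex2 HexNAc1 NeuAc2")))
g.add((EX.glycan_1, EX.attachedTo, URIRef("http://purl.uniprot.org/uniprot/P02763")))

# The same SELECT pattern could be sent to a remote or federated SPARQL endpoint.
results = g.query("""
    PREFIX ex: <http://example.org/glyco/>
    SELECT ?glycan ?protein WHERE { ?glycan ex:attachedTo ?protein . }
""")
for row in results:
    print(row.glycan, "is attached to", row.protein)
```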

1.4 Data integration in bioinformatics

All data integration techniques presented in the previous sections need touch points to be implemented. In bioinformatics, a diversity of efforts has been carried out to provide standards and, as a matter of fact, touch points across data sources. In this section, we identify some areas of study which are crucial to enforce standardisation and encourage data integration.

1.4.1 Standards

In life science, where data can be represented in many ways, widely adopted standards provide the only ground for data exchange and data integration. To show why standards are so relevant, we take the example of the amino acid naming convention. The 20 amino acids have standard names that are recognised worldwide. Additionally, each amino acid has a one-letter and a three-letter code that are used by biologists around the globe. If a European biologist talks at a conference in Asia about the S amino acid, everyone in the audience understands it is about Serine.

Nowadays, many initiatives for developing standards are arising. We took a list of the most famous from a paper published by Lapatas et al., which is available in Table 1.1 (Lapatas et al., 2015). We will not explore each of these initiatives, but we provide URLs for more information.

To conclude, we want to stress the importance of standards for data sharing. Standards facilitate data re-use, limiting the work needed to integrate different data sources and the waste of potential datasets.

1.4.2 Ontologies

In the last twenty years, several ontologies have been created in the biological and biomedical fields (Bard and Rhee, 2004) (Hoehndorf, Schofield, and Gkoutos, 2015) (Kelso, Hoehndorf, and Prüfer, 2010). In philosophy, an ontology describes “what exists”, whereas, in life science, it represents what exists in a specific context, for example diseases (Turk, 2006). To be more precise, an ontology is defined as a collection of concepts and relationships used to characterise an area of concern. To consolidate and coordinate the rapid spread of ontologies, in 2001 Ashburner and Lewis established The Open Biomedical Ontology (OBO) consortium. The OBO ontologies form the basis of the OBO Foundry, "a collaborative experiment based on the voluntary acceptance by its participants of an evolving set of principles that extend those of the original OBO" (Smith et al., 2007). As stated on its website, the OBO Foundry “is a collective of ontology developers that are committed to collaboration and adherence to shared principles”, and its mission “is to develop a family of interoperable ontologies that are both logically well-formed and scientifically accurate”. The OBO Foundry contains ten member ontologies (June 2018), such as the Gene Ontology (GO), and more than a hundred candidate ontologies. To become a member, a candidate ontology has to be developed using OBO's shared principles and validated by OBO members (Quesada-Martínez et al., 2017). The growing use of OBO ontologies allows connecting more and more datasets using techniques like the semantic web. This enhances data integration and fosters the creation of the web of data.

• OBO (The Open Biological and Biomedical Ontologies): Establish a set of principles for ontology development to create a suite of orthogonal interoperable reference ontologies in the biomedical domain. URL: www.obofoundry.org; PMID: 17989687
• CDISC (Clinical Data Interchange Standards Consortium): Establish standards to support the acquisition, exchange, submission and archive of clinical research data and metadata. URL: www.cdisc.org; PMID: 23833735
• HUPO-PSI (Human Proteome Organisation - Proteomics Standards Initiative): Defines community standards for data representation in proteomics to facilitate data comparison, exchange and verification. URL: www.psidev.info; PMID: 16901219
• GAGH (Global Alliance for Genomics and Health): Create interoperable approaches to catalyze projects that will help unlock the great potential of genomic data. URL: genomicsandhealth.org; PMID: 24896853
• COMBINE (Computational Modeling in Biology): Coordinate the development of the various community standards and formats for computational models. URL: co.mbine.org; PMID: 25759811
• MSI (Metabolomics Standards Initiative): Define community-agreed reporting standards, which provide a clear description of the biological system studied and all components of metabolomics studies. URL: msi-workgroups.sourceforge.net; PMID: 17687353
• RDA (Research Data Alliance): Builds the social and technical bridges that enable open sharing of data across multiple scientific disciplines. URL: rd-alliance.org

TABLE 1.1: List of data standard initiatives. Courtesy of “Data integration in biological research: an overview” (Lapatas et al., 2015).

1.4.3 Formats and reporting guidelines

As data is increasingly generated by high-throughput techniques, computers equipped with appropriate software are required to store and analyse the information produced. In this scenario, data formats play a critical role as they provide instructions for storing data in a file. However, the scarcity of well-designed data formats gives birth to many different standards that hamper data exchange and data integration. Therefore, bioinformaticians are forced to build converters, spending more time cleaning data than analysing them. Currently, to store Next Generation Sequencing (NGS) data, there are six common file formats: FASTQ, FASTA, SAM/BAM, GFF/GTF, BED, and VCF (Zhang, 2016). To ease and foster data sharing, converters between different data formats have been developed in genomics, proteomics, etc. In this context, a good example is provided by the PRoteomics IDEntifications (PRIDE) project (Vizcaíno et al., 2016). PRIDE is a centralised repository for proteomics data which includes protein and peptide identifications as well as post-translational modifications. Since mass spectrometry data can be encoded in several formats, the PRIDE development team has developed PRIDE Converter. This tool converts a large number of mass spectrometry encoding formats into a valid PRIDE XML ready for submission to the PRIDE repository (Barsnes et al., 2009). With the gradual maturation of "omics", a huge step forward has been made by adopting clear guidelines for describing and depositing datasets. In this context, the first example of a concrete guideline is represented by The Minimum Information About a Microarray Experiment (MIAME) (Brazma et al., 2001). MIAME defines "the minimum information required to ensure that microarray data can be easily interpreted and that results derived from its analysis can be independently verified". The use of MIAME facilitates the creation of public repositories as well as the development of data analysis tools. Following the example of MIAME, in 2007 Taylor et al. published The minimum information about a proteomics experiment (MIAPE) (Taylor et al., 2007). Nowadays, proteomics guidelines are defined by the HUPO Proteomics Standards Initiative (Hermjakob, 2006), which additionally proposes data formats and controlled vocabularies (http://www.psidev.info). In general, these guidelines focus on defining the content and the structure of the information necessary to describe a specific experiment.

Although they do not provide any technical solution for storing data, some of them suggest standard file formats. To conclude, the use of minimum information guidelines together with suggested data formats enhances data integration and the reusability of datasets.

1.4.4 Identifiers

An identifier is a short string of characters which identifies a data entry. For example, UniProt (Bateman et al., 2017) uses accession numbers, i.e. stable identifiers, to identify entries. When two or more entries are merged, all the accession numbers are kept: one becomes the "Primary (citable) accession number" whereas the others become "Secondary accession numbers". To avoid any source of uncertainty, one accession number cannot refer to multiple proteins. In life science, information is spread across multiple databases, and each of them has developed its own identifiers. This leads to a multitude of identifiers describing the same biological concept. However, to facilitate data integration, databases have cross-referenced their entries with external resources (see Chapter 1 - Link integration). In UniProt, for example, each protein has a cross-reference section which contains all external identifiers related to the same protein (3D structures, protein family, genome annotation, etc.) (Figure 1.4). If all life science databases used the same identifier to characterise a biological concept, data integration would be facilitated. However, the use of identifiers from established databases, like UniProt or GenBank, in research papers is already a sign of progress. In genomics, for example, most journals oblige researchers to deposit newly obtained DNA and amino acid sequences in a public sequence repository (DDBJ/ENA/GenBank - INSDC) as part of the publication process (Benson et al., 2018). Although this is happening in advanced fields like genomics and proteomics, disciplines like glycomics are lagging behind.
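As a small example of how stable identifiers help integration, the snippet below validates candidate strings against the accession-number pattern documented by UniProt; the regular expression mirrors that documentation but should be checked against the current release before any real use.

```python
# Identifier validation sketch: accept only strings shaped like UniProt accessions.
import re

UNIPROT_AC = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

for candidate in ["P02763", "A0A024R161", "not-an-accession"]:
    print(candidate, bool(UNIPROT_AC.match(candidate)))
```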

1.4.5 Visualisation

In biology, data visualisation is an essential part of the research process. Scientists have always relied on different visualisation means to communicate experimental results. Some domains of biology, like phylogeny (Allende, Sohn, and Little, 2015) and pathway analysis (Tanabe and Kanehisa, 2012), have created specific visualisations that are nowadays considered a standard (e.g. phylogenetic trees). The spread of high-throughput technologies has complicated the panorama. The increasing quantity of data and the integration of heterogeneous information have created new challenges for visualisation experts. For example, the rise of next generation sequencing and the resulting availability of genome data have prompted the need for new custom visualisations to show sequence alignments, expression patterns or entire genomes (Gaitatzes et al., 2018). Data visualisation is also becoming a crucial resource in the integration of multiple resources. General purpose tools are available to overlay data from different data sources. An example is Cytoscape (Shannon et al., 2003), an open source software platform for visualising interaction networks and biological pathways. Cytoscape gives the possibility to overlay networks with gene expression profiles, annotations and other quantitative and qualitative data.

1.5 Glycomics

Carbohydrates are a class of molecules utterly crucial for the assembly of complex multicellular organisms, which requires interactions among cells. Besides their mediator role in cell-cell, cell-matrix and cell-molecule interactions, glycans play an essential function in host-pathogen interactions. This is not surprising since all types of cells in nature are covered by a glycocalyx, a sugar shell that can reach a thickness of 100 nm at the apical border of some epithelial cells (Figure 1.8) (Le Pendu, Nyström, and Ruvoën-Clouet, 2014).

Cell membranes are built from several molecules, many of which carry a set of carbohydrates (glycoconjugates). Each of these sugars consists of a single unit (monosaccharide) or multiple units (oligosaccharide). In this thesis, we follow the rule set by the Essentials of Glycobiology, which uses the term "glycan" to "refer to any form of mono-, oligo-, polysaccharide, either free or covalently attached to another molecule".

FIGURE 1.8: The bacterium Bacillus subtilis. The glycocalyx is the “grass” which covers the cellular membrane. Courtesy of Allon Weiner, The Weizmann Institute of Science, Rehovot, Israel. 2006.

Glycans are linear or branched molecules where each building block is linked to the next by a glycosidic bond. The units which compose glycans are called monosaccharides, and they are identified as units since they cannot be hydrolysed to a simpler form (Varki et al., 2010). A list of all available monosaccharides is presented in Figure 1.9; however, each biological system can only use a subset of those.

FIGURE 1.9: The SNFG encoding table for monosaccharides. Each row represents a class of monosaccharide, represented by one specific shape (in white). Each column identifies a type of monosaccharide; for example, glucose and all its modifications are coloured in blue. This table contains all monosaccharides found in Nature and has been extracted from “Symbol Nomenclature for Graphical Representation of Glycans” (Varki et al., 2015).

Depending on the stereochemistry of the two building blocks, the glycosidic linkage is defined as an α-linkage (same stereochemistry) or a β-linkage (different stereochemistry). A variation in the linkage type can have a significant impact on the structural properties and biological functions of oligosaccharides with the same composition. This is evident for starch and cellulose, which share the same composition but have different types of linkages (α1-4 for starch, β1-4 for cellulose). Besides the linkage type, building blocks have multiple attachment points, leading to an extensive collection of possible glycan structural configurations. Elucidating glycan structures is crucial to determine whether they can be recognized by a glycan-binding protein (GBP). These proteins scan glycan structures searching for a specific region, defined as a glycan determinant. We can imagine this process as the lock-and-key mechanism introduced by Emil Fischer in 1894 (Cummings, 2009). Blood group antigens (A, B and O) are the first example of glycan determinants, and they give an idea of the importance of glycobiology (Figure 1.10).

Glycans can occur free in body fluids or be attached to other molecules. The combination of one or more glycan(s) with a molecule is named a glycoconjugate. In particular, when sugars are attached to proteins, we define the whole as a glycoprotein, whereas the combination of carbohydrates and lipids is called a glycolipid. In this thesis, we focus on glycans and glycoproteins as the start of a work which will eventually evolve towards glycolipids and proteoglycans.

FIGURE 1.10: The ABO group antigens represented in SNFG format.

Branched glycans attached to proteins are divided into two main groups: N-linked and O-linked. In the former group, glycans are attached to an asparagine which has to be located in a specific consensus sequence, usually Asn-X-Ser or Asn-X-Thr (in rare cases Asn-X-Cys). In the latter group, sugars are attached to serine, threonine, tyrosine, hydroxylysine or hydroxyproline (Lauc et al., 2016). N-linked glycans have a common pentasaccharide core (three Mannose and two N-Acetylglucosamine), and they can be classified into three groups: high-mannose, complex and hybrid (Figure 1.11, N-glycan). On the contrary, O-linked glycans have eight types of cores, which start with an N-Acetylgalactosamine (Figure 1.11, O-glycan).
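Because the N-linked consensus sequence is so simple, candidate sites can be located directly in a protein sequence, as in the short sketch below. It assumes the commonly applied additional constraint that X is any residue except proline, and it only reports potential sequons; whether a site is actually occupied must be established experimentally.

```python
# Sequon scan sketch: find Asn-X-Ser/Thr motifs (X != Pro) in a protein sequence.
import re


def find_sequons(sequence: str) -> list:
    """Return 1-based positions of the Asn in each N-X-[S/T] sequon."""
    return [match.start() + 1 for match in re.finditer(r"(?=N[^P][ST])", sequence)]


print(find_sequons("MKNWTSANASNPT"))  # toy sequence -> [3, 8]; N-P-T is rejected
```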

The collection of all glycan structures expressed in a specific biological system is defined as the glycome (Packer et al., 2008). In analogy with genomics and proteomics, the systematic study of the glycome is called glycomics. Unlike proteins, glycans cannot be directly predicted from a DNA template. This, together with the technical difficulty of correctly determining their complex and branched structures, is the reason why glycomics is lagging far behind other "omics". A typical glycomics experiment releases and catalogues all the glycan structures present in a cell, tissue, etc. using techniques like mass spectrometry. In a more sophisticated version of this experiment, glycans are analysed while still attached to trypsin-digested peptides (glycoproteomics experiments). In addition to the technical difficulties in elucidating complete structures, glycans are studied separately from their environment, losing all the information about the actual landscape (Varki et al., 2010). Although qualitative techniques are improving, the quantitative aspect remains somewhat untouched, since quantitative experiments are still bound to composition rather than structure elucidation.

FIGURE 1.11: The N-glycan section shows the three types of N-linked glycans: high-mannose, complex and hybrid. The O-glycan section presents the eight types of cores available. All structures follow the SNFG standard explained in Figure 1.9.

1.6 Data integration in glycomics

In this section, we explore what has been done in glycobiology to ease data integration within glycomics and with other "omics".

1.6.1 Glycan formats

Glycans are inherently more complex than nucleic acids and proteins, so defining a format to store the molecular information correctly is not a trivial problem. The complexity of glycans resides in their branched structure and in the collection of available building blocks. In contrast with nucleic acids and proteins, which are made of 4 and 20 building blocks respectively, glycans can be built with up to 72 different monosaccharides (Figure 1.9). Additionally, information about monosaccharide anomericity, residue modifications and substitutions, glycosidic linkages and possible structure ambiguities must be taken into account. Without commenting on the different nomenclatures available to represent each monosaccharide, we focus here on formats to encode a glycan structure in a file. In Figure 1.12 we show an O-glycan (GlyConnect ID 2641) encoded in several different formats (Campbell et al., 2014).

FIGURE 1.12: An O-glycan (GlyConnect ID 2641 and GlyTouCan accession number G01614ZM) encoded with seven different glycan formats. (A) LINUCS sequence format as used in GLYCOSCIENCES.de. (B) BCSDB sequence encoding. (C) CarbBank sequence format. (D) GlycoCT sequence format as used in GlycomeDB and UniCarbDB. (E) KCF format used in the KEGG database. (F) LinearCode as used in the CFG database. (G) WURCS format used in GlyTouCan. Adapted from "Toolboxes for a standardised and systematic study of glycans" (Campbell et al., 2014).

Glycan encoding methods can be grouped into three sets according to the technique used to store the data. The first group reduces glycan tree-like structures into linear sequences using strict rules for sorting branches. This group contains the Web3 Unique Representation of Carbohydrate Structures (WURCS) (Matsubara et al., 2017), the Bacterial Carbohydrate Structure Database sequence format (Toukach, 2011), LinearCode® (Banin et al., 2002), LINUCS (Bohne-Lang et al., 2001) and the CarbBank multilinear representation (Doubet and Albersheim, 1992). The second group uses connection tables; here we find GlycoCT (Herget et al., 2008) and KCF (Aoki et al., 2004). The third group relies on XML encoding and contains CabosML (Kikuchi et al., 2005) and GLYDE (Sahoo et al., 2005), which are not shown in Figure 1.12. The use of several formats has made data integration across different sources almost impossible. To tackle this problem, bioinformaticians have developed a collection of tools to parse and translate glycan structures across different encoding systems. Four main projects have focused on data integration and tool development: RINGS (Akune et al., 2010), GLYCOSCIENCES.DE (Lütteke et al., 2006), EUROCarbDB (Lieth et al., 2011) and GlycomeDB (Ranzinger et al., 2008), which has now been replaced by GlyTouCan (Tiemeyer et al., 2017). In recent years, several databases such as Glyco3D (Pérez et al., 2015; Pérez et al., 2016), MatrixDB (Launay et al., 2015), UniLectin and the Glycomics@ExPASy initiative have adopted GlycoCT as a standard encoding format.
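The need for strict branch-sorting rules can be illustrated with a toy serialiser: a branched structure must always produce the same linear string regardless of the order in which its branches are listed. The tree representation and notation below are ad hoc for illustration and follow none of the formats named above.

```python
# Toy canonical serialisation of a branched, glycan-like tree:
# a node is (residue_name, [children]); branches are sorted before joining so
# that equivalent trees always yield the same linear string.
def serialise(node) -> str:
    name, children = node
    if not children:
        return name
    branches = sorted(serialise(child) for child in children)
    return name + "(" + ",".join(branches) + ")"


core = ("GalNAc", [("Gal", [("NeuAc", [])]), ("NeuAc", [])])
core_reordered = ("GalNAc", [("NeuAc", []), ("Gal", [("NeuAc", [])])])
print(serialise(core))                               # GalNAc(Gal(NeuAc),NeuAc)
print(serialise(core) == serialise(core_reordered))  # True: one canonical form
```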

1.6.2 Glycan graphical representations

FIGURE 1.13: Graphical representations of the O-glycan in figure 1.12. (A) Chemical representation. (B) Oxford cartoon representation. (C) SNFG representation. The legend shows the differences in monosaccharide encoding between Oxford and SNFG. Adapted from "Toolboxes for a standardised and systematic study of glycans" (Campbell et al., 2014).

Data formats presented in the previous paragraph are essential for storing information and exchanging data across software applications, but they are not suitable for humans. Consequently, glycobiologists have proposed different graphical representations where symbols or chemical structures replace monosaccharides. In figure 1.13, we collect a set of depictions of the O-glycan presented in figure 1.12. Each of these representations has some peculiarities. The chemical depiction (Figure 1.13A) is used by researchers who are interested in glycan synthesis or who use NMR to elucidate glycan structures. The Oxford nomenclature (Harvey et al., 2011) (Figure 1.13B) defines linkages using angles and encodes monosaccharide anomericity with dashed or solid lines.

The nomenclature used by the first edition of "Essentials of Glycobiology" encodes monosaccharides using shapes and colours and is the most user-friendly (Figure 1.13C). With the publication of the third edition of "Essentials of Glycobiology", Varki et al. made an effort to define a standard nomenclature called the Symbol Nomenclature for Glycans (SNFG), which should replace previous graphical representations (Varki et al., 2015) (Figure 1.13C).

1.6.3 Glycan composition format

In many cases, experimental techniques can only distinguish glycan compositions, losing the information about the 2D structure. A glycan composition, indeed, reports the amounts of the different monosaccharides present in the glycan without saying anything about their position in space. An example is H2N1S2, which refers to a glycan with 2 Hex, 1 HexNAc and 2 NeuAc. Sometimes, in addition to the composition, it is possible to extract information about the number of antennas, the presence of a bisecting HexNAc and the amount of galactose. In this case, researchers use a nomenclature called Oxford, which allows these additional data to be encoded together with the composition. This nomenclature should not be confused with the Oxford graphical representation for glycan structures described in the previous paragraph. In Oxford nomenclature, A2G2S2 refers to a glycan which has 2 antennas, 2 galactoses and 2 NeuAc, resulting in a composition of 5 Hex, 4 HexNAc and 2 NeuAc. However, Oxford nomenclature works only with N-glycans and there are many dialects currently in use.
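As a simple illustration of the condensed composition notation above, the following TypeScript sketch expands a string such as H2N1S2 into explicit monosaccharide counts. The letter-to-monosaccharide mapping is an assumption based on the example in the text (H = Hex, N = HexNAc, S = NeuAc), with F = dHex added hypothetically; real datasets use several dialects, so a production parser would need to be configurable.

```typescript
// Illustrative mapping from one-letter codes to monosaccharide classes.
// Only H, N and S appear in the example above; F is an assumed extension,
// and other dialects may use different letters entirely.
const CODES: Record<string, string> = {
  H: "Hex",
  N: "HexNAc",
  S: "NeuAc",
  F: "dHex",
};

// Parse a condensed composition string such as "H2N1S2" into counts,
// e.g. { Hex: 2, HexNAc: 1, NeuAc: 2 }.
function parseComposition(composition: string): Record<string, number> {
  const counts: Record<string, number> = {};
  const pattern = /([A-Z])(\d+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(composition)) !== null) {
    const [, code, count] = match;
    const name = CODES[code];
    if (!name) {
      throw new Error(`Unknown monosaccharide code: ${code}`);
    }
    counts[name] = (counts[name] ?? 0) + Number(count);
  }
  return counts;
}

console.log(parseComposition("H2N1S2"));
// -> { Hex: 2, HexNAc: 1, NeuAc: 2 }
```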

1.6.4 Identifiers

The community acceptance of unique identifiers for genes and proteins has facilitated data integration and data exchange, boosting research in genomics and proteomics. At the same time, some initiatives have taken place to fill this gap in glycomics. In 1989, the Complex Carbohydrate Structure Database, also known as CarbBank (Doubet et al., 1989), was the first attempt to establish unique identifiers for glycan structures. However, in the late 1990s, the CarbBank project ceased when its funding ended. Consequently, different initiatives like CFG and KEGG tried to continue the CarbBank mission, leading to a multiplicity of identifiers for each glycan structure. From 2013, however, researchers in glycomics revived the call for a glycan structure registry which would ease data sharing and increase data integration among different platforms. Following this pressing wish, in 2015, Aoki-Kinoshita et al. (Aoki-Kinoshita et al., 2016) unveiled GlyTouCan, a central registry for glycan structures. GlyTouCan allows users to deposit glycan structures in exchange for a unique identifier. Since different glycomics techniques cannot fully elucidate glycan structures, GlyTouCan accepts submissions of incomplete structures. Recently, the glycomics community has strongly endorsed the work of Aoki-Kinoshita to secure the stability of the registry, which has become a crucial resource in the glycomics panorama. Every scientist in glycomics is encouraged to submit glycan structures to the registry and to use the unique identifiers in reports and manuscripts (Tiemeyer et al., 2017).

1.6.5 Reporting Guidelines

The innate complexity of glycan structures often requires orthogonal experimental techniques to be elucidated. In this scenario, each experiment solves only a part of the puzzle, and only the integration of multiple data sources leads to an accurate annotation. For this reason, experimental guidelines are necessary to publish datasets that can be easily interpreted, evaluated and reproduced by other glycoscientists. Since 2011, experts in the fields of glycobiology, glycoanalytics and glycoinformatics have been working together, under the patronage of the Beilstein Institute, to define the Minimum Information Required for A Glycomics Experiment (MIRAGE) (York et al., 2014). This initiative follows the better-known MIAME and MIAPE initiatives, which we have already discussed. At the time of writing, MIRAGE has already published guidelines for sample preparation (Struwe et al., 2016), mass spectrometry analysis (Kolarich et al., 2013) and glycan microarray analysis (Liu et al., 2017b).

Additionally, a guideline for liquid chromatography analysis is in preparation.

1.6.6 Ontologies

The combination and integration of multiple experimental and knowledge-based sources are essential to define the role of glycans. However, data are spread across different databases which act as "disconnected islands" (Lütteke, 2008). In this situation, ontologies provide a painless way to interconnect resources within glycomics and with other "omics". As a first attempt, Sahoo et al. generated GlycO, "a glycoproteomics domain ontology for modelling the structure and functions of glycans, enzymes and pathways" (Sahoo et al., 2006). GlycO has a strong focus on the biosynthesis of complex glycan structures and their relationships with proteins, enzymes and other biochemical entities.

FIGURE 1.14: The diagram of the core classes of the GlycoRDF ontology. Grey and white boxes respectively identify classes and subclasses. Courtesy of "GlycoRDF: an ontology to standardize glycomics data in RDF" (Ranzinger et al., 2015).

More recently, Ranzinger et al. developed GlycoRDF (Ranzinger et al., 2015). Contrary to GlycO, GlycoRDF has been designed with the precise goal of integrating all the information available in glycomics resources while limiting the development of multiple RDF dialects. A detailed diagram of GlycoRDF is available in figure 1.14.
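To give an idea of how an RDF-based resource can be interrogated in practice, the sketch below sends a generic query over the standard SPARQL 1.1 HTTP protocol from TypeScript. The endpoint URL and the glycan URI are placeholders invented for this example, and no GlycoRDF-specific predicates are assumed; it only shows the general mechanism by which RDF lets independent resources be queried in a uniform way.

```typescript
// Query an RDF store via the SPARQL 1.1 protocol (query string parameter,
// JSON results). Endpoint and resource URI below are hypothetical.
const ENDPOINT = "https://example.org/sparql";

const query = `
  SELECT ?property ?value
  WHERE {
    <http://example.org/glycan/G01614ZM> ?property ?value .
  }
  LIMIT 20
`;

async function runQuery(): Promise<void> {
  const url = `${ENDPOINT}?query=${encodeURIComponent(query)}`;
  const response = await fetch(url, {
    headers: { Accept: "application/sparql-results+json" },
  });
  const data = await response.json();
  // The standard result format lists variable bindings row by row.
  for (const row of data.results.bindings) {
    console.log(row.property.value, "=", row.value.value);
  }
}

runQuery().catch(console.error);
```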

1.6.7 Visualisation

Despite its importance, data visualisation is still a challenge in glycomics. In the last decade, some initiatives have pushed the development of visual tools to improve some aspects of glycan identification and quantification. Glycoviewer (Joshi et al., 2010) is the first example of a data visualisation tool which allows glycoscientists to visualise, summarise and compare different glycomes. GlycomeAtlas (Konishi and Aoki-Kinoshita, 2012) provides an interactive interface for exploring data produced by the Consortium for Functional Glycomics (CFG). To conclude, as stated on its website, GlycoDomainViewer (Joshi et al., 2018) is a visual "integrative tool for glycoproteomics that enables global analysis of the interplay between protein sequences, glycosites, types of glycosylation, and local protein fold / domain and other PTM context". GlycoDomainViewer integrates experimental data as well as knowledge-based data sources, presenting the most extensive collection of information to explore the possible effect of glycosylation on a protein. Despite the availability of these and other visual tools, the majority of glycoscientists are still using general-purpose applications like Excel to publish experimental results. Therefore, results are hardcoded in figures or text, and data are stored in tables which populate the supplemental material section. Additionally, integrative tools like GlycoDomainViewer, although very useful, are usually developed taking into account the needs of a specific research group, limiting the possibility of reaching out to the entire community.

1.7 Web development

As discussed in the previous paragraphs, the integration of multiple data sources is critical to obtain a complete overview of a living system. Then, once data are correctly combined, it is essential to develop software applications which allow the exploration of the integrated data sources and the design of new hypotheses. Thus, besides data integration, software development represents another critical step towards an in-silico simulation of a living cell.

Software development has many other aspects to be taken into account. One central question that comes along with the start of a new project is usually "Should I develop a desktop tool or a web application?". The answer to this question is crucial, as the development process can be entirely different. The two main advantages of desktop applications are security and performance. Data are manipulated within the computer, which can be detached from the network. Since most security issues come from the internet, desktop applications are more secure than web tools. Also, the use of machine-optimised code and the locality of the data increase the performance of desktop applications.

In recent years, the situation has changed and promoted the development of web applications over desktop ones. Security issues are still there, but the spread of technologies like HTTPS and SSL has provided a new security shield against fraudulent actions. On the performance side, the growing availability of cloud computing services at a reduced cost has given web applications virtually endless computing power. For example, software like Hadoop, built for handling Big Data, can be plugged into a SOA and piloted via a web interface. In this scenario, the user-friendliness and the possibility of updating contents remotely, introduced by web apps, are critical elements for developing a community-wide application that can be used by anyone, even the less tech-savvy user. Another advantage of web tools is the possibility of using them regardless of the location (they only need an internet connection) and the platform, which can range from a smartphone to a personal computer.

Web development is a broad term which implies the construction of a website. From the start of the Web in 1991 (History of the World Wide Web - Wikipedia), websites have evolved from a collection of basic HTML pages to sophisticated applications built on top of multiple frameworks. A framework is a high-level solution which allows the reuse of software pieces. It also ensures the quality of the code and guarantees the same behaviour across multiple browsers and multiple platforms (Salas-Zárate et al., 2015). The use of web frameworks has sped up the development of new websites which, nowadays, fulfil high-quality standards. There are plenty of web frameworks on the market; therefore, choosing the right one has become a hard task. Web developers spend more and more time benchmarking different solutions, since making a wrong choice would lead to a waste of energy and time. Each framework requires a period of study to learn its specific syntax and logic. Furthermore, some frameworks have a short life, leaving the developer without support and updates. In this ecosystem of frameworks, which expands every day, we are going to focus on web components, one of the principal technologies used for developing the web tools presented in this work.

1.7.1 Web Components

To fully understand the power of web components, we need to introduce the concepts of HTML and DOM. HTML stands for HyperText Markup Language and is the standard markup language for building web pages. Together with Cascading Style Sheets (CSS) and JavaScript, it forms the cornerstone technologies behind the web (HTML - Wikipedia). The DOM, which stands for Document Object Model, is an application programming interface (API) which converts HTML, XHTML and XML into a tree structure where each node is an object and represents a part of the web page.
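The tree-of-objects idea can be made concrete with a minimal sketch using the standard DOM API in TypeScript (a browser environment is assumed, where the global `document` object is available; the list items are arbitrary example strings):

```typescript
// The DOM exposes the page as a tree of objects that can be inspected
// and modified programmatically.
const list = document.createElement("ul");          // new element node
["GlyConnect", "GlyTouCan", "UniLectin"].forEach((name) => {
  const item = document.createElement("li");        // child node
  item.textContent = name;
  list.appendChild(item);                           // attach to the tree
});
document.body.appendChild(list);                    // insert into the page

// Walking the tree: every element node has children, a parent and siblings.
console.log(list.children.length);                  // -> 3
console.log(list.firstElementChild?.textContent);   // -> "GlyConnect"
```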

Web components are reusable interface widgets which are built with web technologies. In other words, the web components directives allow developers to create new custom, reusable HTML tags in web pages and web apps. All the custom components built following the web components directives operate across modern browsers and work in synergy with any JavaScript library or framework that supports HTML (webcomponents.org - Discuss & share web components). Web components are standardised by the W3C, the international standards organisation for the World Wide Web, and they have been added to the specifications of HTML and the DOM. Web components consist of four different technologies:

• Custom elements: A collection of JavaScript APIs that allow the user to define new custom HTML tags and their behaviour.

• Shadow DOM: A collection of JavaScript APIs to attach a private and encapsulated DOM to a custom element. Using the shadow DOM, the code of each component becomes private and cannot be accessed from other components. This technology removes the possibility of collisions between different parts of the document.

• HTML templates: It introduces the custom tags