Thesis Reference

Thesis

Analysis of large biological data: metabolic network modularization and prediction of N-terminal acetylation

CHARPILLOZ, Christophe

Abstract

Over the last decades, advances in biotechnology have made it possible to gather a huge amount of biological data. These data range from genome composition to the chemical interactions occurring in the cell. Such a large amount of information requires complex algorithms to reveal how it is organized and thereby to understand the underlying biology. Metabolism forms a class of very complex data, and the graphs that represent it are composed of thousands of nodes and edges. In this thesis we propose an approach to modularize such networks in order to reveal their internal organization. We analyzed red blood cell networks corresponding to pathological states, and the in silico results obtained were corroborated by known in vitro analyses. In the second part of the thesis we describe a learning method that analyzes thousands of sequences from the UniProt database to predict N-alpha-terminal acetylation. This is done by automatically discovering discriminant motifs that are combined in the manner of a binary decision tree. Prediction performance on N-alpha-terminal acetylation is higher than that of the other published classifiers.

Reference

CHARPILLOZ, Christophe. Analysis of large biological data: metabolic network modularization and prediction of N-terminal acetylation. Thèse de doctorat : Univ. Genève, 2015, no. Sc. 4883

URN : urn:nbn:ch:unige-860463
DOI : 10.13097/archive-ouverte/unige:86046

Available at: http://archive-ouverte.unige.ch/unige:86046

Disclaimer: layout of this document may differ from the published version.
UNIVERSITÉ DE GENÈVE, FACULTÉ DES SCIENCES
Département d’informatique
Professeur Bastien Chopard

Analysis of Large Biological Data: Metabolic Network Modularization and Prediction of N-Terminal Acetylation

THESIS presented to the Faculté des sciences of the Université de Genève for the degree of Docteur ès sciences, specialization in computer science (mention sciences informatiques)

by Christophe CHARPILLOZ of Bévilard (BE)

Thesis No. 4883

GENÈVE
Atelier de reproduction Uni-Mail
2015

ACKNOWLEDGEMENTS

I would like to begin by thanking my thesis supervisor, Bastien Chopard, for giving me the opportunity to pursue a doctorate in the Scientific and Parallel Computing Group (SPC). His curiosity and interest in computational science allowed me to explore the field of in silico metabolism analysis freely and rigorously. His encouragement and support helped me complete this work under the best possible conditions.

I also warmly thank Jean-Luc Falcone for his supervision and for all the help he gave me. His advice, ranging from biology to scientific writing, allowed me to bring this work to completion.

I also express my gratitude to the members of the jury: to Anne-Lise Veuthey for her expertise in proteomics and for her suggestions and remarks on my work, and to Alexandre Masselot for likewise agreeing to put his skills as a (bio)informatician at the service of evaluating the quality of my work.

I thank Alexandros Kalousis for sharing his expertise in machine learning and data mining. His assistance and teaching in these areas were of great help. I am also grateful to Felix Kwok, Martin Jakob Gander and Pierre-Alain Cherix.
Indeed, thanks to their mathematical expertise and their kindness, an entire section of this manuscript could be completed.

This work would not have been possible without the support of many people, starting with those with whom I spent almost all of my years at the SPC. Thank you to Orestis Pileas Malaspinas, whose scientific and friendly support, as well as his experience in supervising academic work, were of very great help to me. Many thanks also to my colleague and friend Daniel Walter Lagrava Sandoval, whose company was much appreciated and helped make my academic journey stimulating and fun. I also thank Xavier Meyer, whose conversations allowed me to approach my work with more calm and serenity.

I of course do not forget my colleagues at the SPC and the members of the computer science department with whom I shared many good moments and who also put up with my mood swings. Thank you to Alexandre, Andrea, Aziza, Gregor, Jonas, Kae, Mohamed, Pablo, Pierre, Reto and Yann. Some of them have become friends with whom I hope to stay in touch well beyond this doctoral work. Thank you also to everyone not mentioned here with whom I interacted during all these years.

Finally, my immense gratitude goes to my mother and my father, who encouraged and supported me constantly and unfailingly from the beginning to the end of this work. Without them, this work would certainly never have been completed.

ABSTRACT

Biotechnology has made it possible to gather a huge amount of biological data. These data range from the nucleotides that compose the genome to the chemical interactions occurring between molecules in the cell. Some of these data can be interpreted by experts, but others require complex algorithms in order to extract knowledge. The development of such algorithms is now a major research field in computational biology (or bioinformatics).
In this work we develop such approaches to analyze two types of biological data, stoichiometric models and protein sequences, to discover how these data are structured or organized and thereby understand the underlying biology in silico.

Chapters 1 and 2 introduce the basic concepts needed to understand this manuscript. The first chapter introduces basic molecular biology, giving the reader an intuition for the objects represented by the data extracted from biological databases. The second chapter describes the models used to represent metabolism, or metabolic networks, mathematically, namely stoichiometric matrices and graphs.

Chapter 3 tackles the problem of computing extreme pathways. An algorithm based on network reduction and hierarchical computation of the extreme pathways is described in detail. To implement our algorithm, the concept of meta-reaction is introduced. A meta-reaction groups a subset of chemical reactions that are connected by their substrates or products in the network. It summarizes the subset, or subsystem, only by its inputs and outputs, thus ignoring the intermediate metabolites and reducing the size of the network. Meta-reactions are built so as to respect the stoichiometry of the encapsulated subsystems. Experiments that assess the efficiency of the reduction and of the hierarchical computation are also described in this chapter, which ends with a new approach for detecting intractable systems by considering the reduced network with the meta-reactions.

Chapter 4 describes a metric based on the extreme pathways that measures the similarity between chemical reactions in a metabolic network. This metric allows clustering algorithms to be used to detect functional modules in the network.
As the definition of the proposed metric requires the complete enumeration of the extreme pathways, an approximation of the metric is proposed. To assess the quality of the detected modules, we applied the approach to human erythrocyte metabolism. A quantitative experiment that detects pairs of co-expressed genes was also carried out. It produces a score for our modules and thus allows our metric to be compared with other approaches.

We also propose a supervised learning method to predict initiator methionine cleavage and Nα-terminal acetylation. Chapter 5 is therefore a reminder on supervised learning; it also reviews the existing approaches for detecting Nα-terminal acetylation. Chapter 6 then describes the criteria used to retrieve the proteomic datasets, which serve as the learning and test datasets for our model.

Our model is described and evaluated in Chapters 7 and 8. It is based on a combination of discriminant motifs arranged in the manner of a binary decision tree. A discriminant motif selects a protein according to the level of detection of the motif in the protein’s primary structure. We call our model a motifs-tree. Such a tree recursively splits a set of proteins into two subsets: one that undergoes a given post-translational modification and one that does not. To select the motifs that make up the nodes of the decision tree, an evolutionary algorithm was used to explore the space of all variable-size motifs. Our model is then compared to the state of the art. Our automatically built model achieves scores on par with the expert-designed state of the art. Moreover, it was able to detect subtle features to correctly identify acetylated sequences that have not been detected by experts (e.g. the proteins acetylated by NatB and NatC). The model was also used to explain initiator methionine cleavage and Nα-terminal acetylation in H. sapiens.
This was done successfully for initiator methionine cleavage but with less success for Nα-terminal acetylation. Indeed, for the latter, the wide range of acetylated proteins makes the model difficult to analyze.

Chapter 9 is the final chapter. It contains a general conclusion about this work and briefly addresses the problem of validating bioinformatics approaches. We also highlight the growing role of computer science in biology.

RÉSUMÉ

Biotechnology has made it possible to collect large quantities of biological data, ranging from the nucleotide sequences that make up the genome to the chemical interactions between the molecules necessary for cellular life.
