Thesis

Analysis of large biological data: metabolic network modularization and prediction of N-terminal acetylation

CHARPILLOZ, Christophe

Abstract

Over the last decades, advances in biotechnology have made it possible to gather a huge amount of biological data. These data range from genome composition to the chemical interactions occurring in the cell. Such a huge amount of information requires complex algorithms to reveal how it is organized and thereby understand the underlying biology. Metabolism forms a class of very complex data, and the graphs that represent it are composed of thousands of nodes and edges. In this thesis we propose an approach to modularize such networks to reveal their internal organization. We analyzed red blood cell networks corresponding to pathological states, and the in silico results were corroborated by known in vitro analyses. In the second part of the thesis we describe a learning method that analyzes thousands of sequences from the UniProt database to predict N-alpha-terminal acetylation. This is done by automatically discovering discriminant motifs that are combined in the manner of a binary decision tree. Prediction performance on N-alpha-terminal acetylation is higher than that of the other published classifiers.

Reference

CHARPILLOZ, Christophe. Analysis of large biological data: metabolic network modularization and prediction of N-terminal acetylation. Thèse de doctorat : Univ. Genève, 2015, no. Sc. 4883

URN : urn:nbn:ch:unige-860463 DOI : 10.13097/archive-ouverte/unige:86046

Available at: http://archive-ouverte.unige.ch/unige:86046


UNIVERSITÉ DE GENÈVE
FACULTÉ DES SCIENCES
Département d’informatique
Professor Bastien Chopard

Analysis of Large Biological Data: Metabolic Network Modularization and Prediction of N-Terminal Acetylation

THESIS

presented to the Faculty of Science of the University of Geneva to obtain the degree of Doctor of Science (Docteur ès sciences), with a specialization in computer science

by Christophe CHARPILLOZ, from Bévilard (BE)

Thesis No. 4883

GENEVA, Atelier de reproduction Uni-Mail, 2015

ACKNOWLEDGEMENTS (REMERCIEMENTS)

I would like to begin by thanking my thesis supervisor, Bastien Chopard, for giving me the opportunity to complete a doctorate in the Scientific and Parallel Computing Group (SPC). His curiosity and interest in the computational sciences allowed me to explore the field of in silico metabolism analysis freely and rigorously. His encouragement and support helped me finish this work under the best possible conditions.

I also warmly thank Jean-Luc Falcone for his supervision and for all the help he gave me. His advice, ranging from biology to scientific writing, allowed me to carry this work through.

I also express my gratitude to the members of the jury: to Anne-Lise Veuthey for her expertise in proteomics and for her suggestions and remarks on my work, and to Alexandre Masselot for having also agreed to put his skills as a (bio)informatician to use in evaluating the quality of my work.

I thank Alexandros Kalousis for sharing his expertise in machine learning and data mining. His assistance and teaching in these fields were of great help.

I am also grateful to Felix Kwok, Martin Jakob Gander and Pierre-Alain Cherix. Thanks to their mathematical know-how and their kindness, a complete section of this manuscript could be written.

This work would not have been possible without the support of many people, starting with those with whom I spent almost all of my years at the SPC. Thanks to Orestis Pileas Malaspinas, whose scientific and friendly support, as well as his experience in supervising academic work, were of very great help to me. Many thanks also to my colleague and friend Daniel Walter Lagrava Sandoval, whose company was greatly appreciated and helped make my academic journey stimulating and fun. I also thank Xavier Meyer, whose conversations allowed me to approach my work with more calm and serenity.

I obviously do not forget my colleagues at the SPC and the members of the computer science department, with whom I shared many good moments and who also put up with my mood swings. Thanks to Alexandre, Andrea, Aziza, Gregor, Jonas, Kae, Mohamed, Pablo, Pierre, Reto and Yann. Some of them have become friends with whom I hope to stay in touch well beyond this doctoral work. Thanks also to everyone not mentioned here with whom I interacted during all these years.

Finally, immense gratitude to my mother and my father, who encouraged and supported me constantly and unfailingly from the beginning to the end of this work. Without them, this work would certainly never have been completed.


ABSTRACT

Biotechnology has made it possible to gather a huge amount of biological data. These data range from the nucleotides that compose the genome to the chemical interactions occurring between molecules in the cell. Some of these data can be interpreted by experts, but others require complex algorithms in order to extract knowledge. The development of such algorithms is now a major research field in bioinformatics. In this work we develop such approaches to analyze two types of biological data, stoichiometric models and protein sequences, to discover how these data are structured or organized and thereby understand the underlying biology in silico.

Chapters 1 and 2 introduce the basic concepts needed to understand this manuscript. The first chapter introduces basic molecular biology, giving the reader an intuition of the objects represented by the data extracted from biological databases. The second chapter describes the models used to represent the metabolism, or metabolic network, mathematically: stoichiometric matrices and graphs.

Chapter 3 tackles the problem of extreme pathways computation. An algorithm based on network reduction and hierarchical computation of the extreme pathways is described in detail. To implement our algorithm, the concept of meta-reaction is introduced. A meta-reaction groups a subset of chemical reactions connected by their substrates or products in the network. It summarizes the subset, or subsystem, only by its inputs and outputs, ignoring the intermediate metabolites and thereby reducing the size of the network. The meta-reactions are built with respect to the stoichiometry of the encapsulated subsystems. Experiments assessing the efficiency of the reduction and of the hierarchical computation are also described in this chapter, which ends with the description of a new approach for detecting intractable systems by considering the network reduced with meta-reactions.

Chapter 4 describes a metric based on the extreme pathways that measures the similarity between chemical reactions in a metabolic network. This metric allows clustering algorithms to be used to detect functional modules in the network. As the definition of the proposed metric requires the complete enumeration of the extreme pathways, an approximation of the metric is proposed. To assess the quality of the detected modules, we applied the approach to the human erythrocyte metabolism. A quantitative experiment that detects pairs of co-expressed genes was also carried out. It produces a score for our modules and thus allows our metric to be compared with other approaches.

We also propose a supervised learning method to predict initiator methionine cleavage and Nα-terminal acetylation. Chapter 5 is therefore a reminder on supervised learning; it also reviews the existing approaches for detecting Nα-terminal acetylation. Chapter 6 describes the criteria used to retrieve the proteomic datasets that serve as learning and test datasets for our model. Our model is described and evaluated in Chapters 7 and 8. It is based on a combination of discriminant motifs in the manner of a binary decision tree. A discriminant motif selects a protein according to the level of detection of the motif in the protein’s primary structure. We called our model the motifs-tree. Such a tree recursively splits a set of proteins into two subsets: one that undergoes a given post-translational modification and one that does not. To select the motifs that compose the nodes of the decision tree, an evolutionary algorithm was used to explore the space of all variable-size motifs. Our model is then compared to the state of the art. Our automatically built model provides scores on par with the experts’ state of the art. Moreover, it was able to detect subtle features that correctly identify acetylated sequences not detected by experts (e.g. the proteins acetylated by NatB and NatC). The model was also used to explain initiator methionine cleavage and Nα-terminal acetylation in H. sapiens. This was done successfully for initiator methionine cleavage but with less success for Nα-terminal acetylation: for the latter, the wide range of acetylated proteins makes the model difficult to analyze.

Chapter 9 is the final chapter; it contains a general conclusion about this work and briefly addresses the problem of validating bioinformatics approaches. We also highlight the growing role of computer science in biology.

SUMMARY (RÉSUMÉ)

Biotechnology has made it possible to collect large quantities of biological data, ranging from the nucleotide sequences that make up the genome to the chemical interactions between the molecules needed for cellular life. While some of these data can easily be analyzed or interpreted by experts, others, for reasons of size or complexity, require algorithmic approaches in order to exploit them and extract knowledge from them as effectively as possible. Developing such approaches to understand these data is a major challenge in bioinformatics and an active research field. In this work we contributed in this direction by proposing approaches to analyze two types of biological data: stoichiometric models and protein sequences. The common goal is to analyze large biological databases in an attempt to discover how these data are biologically structured (in a broad sense), in order to try to understand the underlying biology in silico.

The first chapter is an introduction to the biological objects to which the analytical approaches described in this manuscript apply. The basic notions of molecular biology are presented, without any claim to being an in-depth or exhaustive reference. In the second chapter, another brief introduction describes two models for representing the metabolism (or metabolic network) mathematically: stoichiometric matrices and graphs.

After these introductions, we deal with the in silico analysis of the metabolism. Stoichiometric models and graphs make it possible to represent the known set of chemical reactions occurring in the cell. These models give scientists mathematical objects that can be manipulated and processed to extract knowledge, such as the analysis of metabolic pathways or functional modules.

The third chapter addresses the problem of computing extreme pathways in a metabolic network, an extreme pathway being a mathematical description of a metabolic pathway. The problem of enumerating these extreme pathways is generally intractable (that is, the solution is not found in a sufficiently short time) when applied to genome-scale metabolic networks. We therefore propose an approach to compute the extreme pathways hierarchically, with the aim of treating a set of problems on smaller networks (that is, containing few chemical reactions), which are therefore potentially simpler. The idea presented recursively replaces subsets of chemical reactions connected to one another by metabolites with meta-reactions. A meta-reaction thus represents a subsystem and summarizes it only by the metabolites it consumes and produces, abstracting away the intermediate metabolites, while respecting the stoichiometry imposed by the subsystem. Using meta-reactions reduces the size of the graph (or metabolic network), in the hope of obtaining a system for which the computation of the extreme pathways is simplified. A detailed description of the algorithm is provided, together with a formalization in terms of matrix operations. The chapter then deals with the performance issues of using meta-reactions and ends with the presentation of a new approach to identify intractable systems by means of the reduction of the graph through meta-reactions.

The fourth chapter describes a metric that exploits the extreme pathways to compute a similarity between the chemical reactions that make up the graph (or metabolic network). This similarity is used to perform clustering in order to detect functional modules within the cell. As the construction of the metric requires the complete computation of the extreme pathways, a solution is proposed to approximate this similarity without having to compute the complete set of extreme pathways of the metabolic network. To validate the detection of functional modules, we applied it to the metabolism of human erythrocytes. The results allowed the extraction of modules that are already well identified, qualitatively confirming the proposed approach. Enzymopathies were then simulated to evaluate their consequences on the detected modules. This made it possible to correctly deduce certain alterations of the erythrocyte metabolism due to some of the pathologies studied. Finally, a quantitative validation was applied to detect pairs of co-expressed genes in a metabolic network. This allows a score measuring the quality of the detected modules to be computed and therefore our approach to be compared with those already published in the literature.

The second part is devoted to the analysis of protein sequences. Databases such as UniProtKB contain hundreds of thousands of annotated sequences. These annotations provide a wealth of information about the sequences, ranging from the gene encoding the polypeptide to the organism from which it was identified. In this manuscript, we describe how these data were exploited to automatically build a model that predicts two co-translational modifications: initiator methionine cleavage and N-terminal acetylation. The fifth chapter is a brief reminder on supervised learning and its applications to the problem of predicting N-terminal acetylation. Post-translational modifications, for their part, are introduced in the first chapter.

The sixth chapter deals with the construction of the datasets required by supervised learning algorithms. Criteria for selecting the appropriate sequences were therefore developed. As these are far from trivial, a chapter is dedicated to them.

The seventh chapter is devoted to the model chosen to predict the two post-translational modifications considered. We opted for a binary decision tree combining discriminant motifs. A discriminant motif selects a protein if the protein contains the motif at a sufficiently significant level. In this work, this model was named motifs-tree. The purpose of such a decision tree is to recursively split a set of proteins into two subsets: one containing the proteins that undergo a modification, the other those that do not. The computational difficulty lies in the search for the motifs that make up the nodes of the tree. To this end, an evolutionary algorithm was used to search the space of variable-size motifs for the most discriminant motifs at each level of the tree.

In the eighth chapter, the results obtained by the motifs-trees are presented and compared with the state of the art at the time (August 2015, when this manuscript was written). It is shown that an automatic method yields performance equivalent to that obtained by experts in proteomics. In some particular cases, the model presented is able to detect subtle features in the sequences that were not detected by the experts (such as the proteins acetylated by NatB or NatC). The models produced were also used to understand which features are necessary for these enzymatic modifications to take place in humans. This was accomplished successfully for initiator methionine cleavage and less effectively for N-terminal acetylation, given the complexity of the model produced and the great variety of proteins acetylated at their N-terminal end.

The last chapter is a general conclusion and a short essay briefly addressing the problems of validating the approaches proposed in certain areas of bioinformatics, as well as the importance of computer science in biology.


CONTENTS

Acknowledgements (Remerciements)
Abstract
Summary (Résumé)
Contents
List of Figures
List of Tables

Part I  Basic molecular biology
1 Biological concepts
  1.1 Proteins
  1.2 Genes
  1.3 Enzymes
  1.4 Post-translational modifications
    1.4.1 Initiator methionine cleavage
    1.4.2 Nα-terminal acetylation
  1.5 The metabolism

Part II  Metabolic network analysis
2 Metabolic network models
  2.1 Stoichiometric modeling
    2.1.1 Metabolic pathways
  2.2 Graph modeling
  2.3 Metabolic network reconstruction
3 Hierarchical computation of extreme pathways
  3.1 Overview of the approach
  3.2 Simplifying metabolic networks
  3.3 Meta-reaction
  3.4 Metabolic subnetworks
  3.5 Packing the metabolic network
  3.6 Extreme pathways unpacking
    3.6.1 The straightforward case
    3.6.2 The shared case
    3.6.3 The encapsulated case
    3.6.4 Last independence check
  3.7 Matrix description of the complete algorithm
    3.7.1 The constraints matrix
    3.7.2 Example
  3.8 Efficiency of the network packing
    3.8.1 Random subnetwork packing
    3.8.2 Hierarchical packing
    3.8.3 Comparison of the results
  3.9 Performance of hierarchical extreme pathways computation
  3.10 Detection of intractable systems
  3.11 Conclusion and perspective
4 Module detection in metabolic network
  4.1 Motivation
  4.2 An extreme pathways similarity measure
    4.2.1 Example of extreme pathways similarity
  4.3 The ε-graph
  4.4 Hierarchical clustering
  4.5 Computation of the distance
    4.5.1 Approximation
  4.6 Red blood cell functional modules analysis
    4.6.1 Glucose-6-phosphate dehydrogenase deficiency
    4.6.2 Pyruvate kinase deficiency
  4.7 Cluster analysis of the E. coli metabolism
    4.7.1 Detection of intra-operonic pairs of genes in E. coli
    4.7.2 Exploring genes pairs in E. coli
  4.8 Conclusion

Part III  Sequence analysis
5 Background in post-translational modifications classification
  5.1 Classification
  5.2 Prediction of Nα-terminal acetylation
6 Proteins datasets
  6.1 General criteria
  6.2 Nα-terminal acetylation criterion
  6.3 Non-Nα-terminal acetylation criteria
  6.4 Quality of the datasets
  6.5 Datasets composition
  6.6 Conclusion
7 Motifs-trees
  7.1 Motivation
  7.2 Sequence motif
    7.2.1 Aligned motif
  7.3 Tokens
    7.3.1 Any amino acid
    7.3.2 Fixed amino acid
    7.3.3 Included or excluded amino acids
    7.3.4 Amino acid physicochemical similarity
  7.4 Motif search by genetic algorithm
    7.4.1 Individual
    7.4.2 Genetic operators
    7.4.3 Fitness computation
  7.5 Motifs-tree: motif combination
    7.5.1 Motifs-tree growth
    7.5.2 Motifs-tree pruning
8 Motifs-trees performances and proteomic analysis for H. sapiens
  8.1 Initiator methionine cleavage
    8.1.1 Parameters selection and classification performance
    8.1.2 Human MetAPs specificity analysis
  8.2 N-terminal acetylation
    8.2.1 Classification performance
    8.2.2 NatB and NatC potential substrates
    8.2.3 Can a motifs-tree learn like Martinez et al.?
  8.3 Human NATs specificity analysis
    8.3.1 The root motif
    8.3.2 The second motif
    8.3.3 UniProtKB release 2015_07
  8.4 Ensemble learning
    8.4.1 Motifs forest
    8.4.2 Classification performances of the motifs forest
  8.5 Conclusion and perspective

Part IV  Conclusion
Conclusion & perspective

Bibliography
Curriculum vitae


LIST OF FIGURES

Figure 1: Structure of an amino acid.
Figure 2: Peptide bond formation between two amino acids.
Figure 3: Representation of a polypeptide.
Figure 4: The steps involved in the biosynthesis.
Figure 5: Example of a small partial regulatory network.
Figure 6: Illustration of the feedback inhibition.
Figure 7: N-terminal acetylation by N-terminal acetyltransferases.
Figure 8: Illustration of the three stages in the metabolism.
Figure 9: The Kyoto encyclopedia of genes and genomes metabolism map.
Figure 10: A simple linear pathway.
Figure 11: Example of high dimensional cone in the fluxes-space.
Figure 12: Example of a simple system and its two extreme pathways.
Figure 13: Example of a directed and undirected graph.
Figure 14: A directed bipartite stoichiometric graph.
Figure 15: Transformation of a bipartite metabolic network into a reactions network and a compounds network.
Figure 16: A schematic view of the XML KEGG files.
Figure 17: Four cases of reactions which will never be part of an extreme pathway.
Figure 18: Derivation of a meta-reaction from a system of chemical equations.
Figure 19: Derivation of two meta-reactions from a system of chemical equations.
Figure 20: Derivation of two meta-reactions from a system of chemical equations.
Figure 21: A network composed of four exchange fluxes.
Figure 22: Packing of a network with meta-reactions.
Figure 23: Wrong extreme pathways matrix reconstruction.
Figure 24: Metabolic sub-network improperly extracted from the network.
Figure 25: Packing of metabolic network with cycle.
Figure 26: The metabolic network and its division.
Figure 27: The chosen metabolic subnetworks.
Figure 28: The packed network.
Figure 29: Fowlkes-Mallows index between all pairs of 50 randomly packed E. coli networks.
Figure 30: Distribution of the ratios of the rejected partition during a compression step.
Figure 31: Binary split of the vertices of a graph.
Figure 32: The pruning process and the selection process of the partitions.
Figure 33: Conversion of one metabolite into an external metabolite.
Figure 34: Sizes of the sample networks used to assess the performance of hierarchical packing.
Figure 35: Plot of the computation times on the uncompressed networks versus the compressed networks.
Figure 36: Percentage of 24-hour timeouts as a function of the number of vertices.
Figure 37: Plot of the computation times on the sample networks versus the number of vertices composing the samples.
Figure 38: Plot of the log of the degrees distribution for the compounds and reactions of the 10 easy networks.
Figure 39: Empirical cumulative function and log of the degrees distribution for the reactions.
Figure 40: Empirical cumulative function and the log of the distribution of the common compounds degrees in the K12 networks.
Figure 41: Two-sample Kolmogorov-Smirnov tests for all pairs of networks having similar sizes.
Figure 42: A toy metabolic network.
Figure 43: Extreme pathways of the toy metabolic network.
Figure 44: The two clusters and three clusters produced by the spectral clustering algorithm.
Figure 45: Hierarchical clustering of the toy metabolic network using UPGMA.
Figure 46: Measure of the quality of the extreme pathways distance approximations.
Figure 47: Network representation of the erythrocyte’s metabolism.
Figure 48: Undirected hierarchical clustering of the erythrocyte metabolism and the resulting modules.
Figure 49: Separation of the non-oxidative PPP undirected module into two directed modules.
Figure 50: Directed hierarchical clustering of the erythrocyte metabolism and the resulting modules.
Figure 51: Directed hierarchical clustering of a healthy and a G6PD deficient erythrocyte.
Figure 52: Directed hierarchical clustering of a G6PD deficient erythrocyte and the resulting modules.
Figure 53: Directed hierarchical clustering of a healthy and a PK deficient erythrocyte.
Figure 54: Directed functional modules of a PK deficient erythrocyte.
Figure 55: Distribution of extreme pathways distances in E. coli for all pairs of reactions.
Figure 56: Hierarchical clustering of the reactions in the E. coli metabolism.
Figure 57: Cluster isolated after a cut in the dendrogram in the E. coli metabolism.
Figure 58: Performance of the intra-operonic pairs detection in E. coli through the hierarchy.
Figure 59: Receiver operating characteristic curve for the detection of intra-operonic pairs in E. coli.
Figure 60: Comparison of three different linkage criteria for the detection of intra-operonic pairs in E. coli.
Figure 61: Extracted subnetwork for gpp, spoT, ppx and relA.
Figure 62: Sequence logo for the initiator methionine cleavage of the 2012 dataset.
Figure 63: Sequence logo for the initiator methionine cleavage of the 2015 dataset.
Figure 64: Sequence logo for the Nα-terminal acetylation of the 2012 dataset.
Figure 65: Sequence logo for the Nα-terminal acetylation of the 2015 dataset.
Figure 66: The crossover operators.
Figure 67: The mutation operators.
Figure 68: Example of bloat in alignments with a motif containing bloat and a motif without bloat.
Figure 69: A graphical representation of a motifs-tree.
Figure 70: The motifs-tree predicting the initiator methionine cleavage from the H. sapiens 2012 dataset.
Figure 71: The motif score profile difference for H. sapiens initiator methionine cleavage root node.
Figure 72: The motif score profile difference for H. sapiens initiator methionine cleavage node at depth one.
Figure 73: The motif score profile difference for H. sapiens initiator methionine cleavage node at depth two.
Figure 74: The motifs-tree for the prediction of Nα-terminal acetylation in H. sapiens.
Figure 75: Manually pruned motifs-tree for Nα-terminal acetylation prediction in H. sapiens.
Figure 76: The average scores difference and histograms of aligned positions of the root motif of the motifs-tree predicting Nα-terminal acetylation in H. sapiens.
Figure 77: Average score difference and histograms of aligned position for the second motif in the simplified motifs-tree for the prediction of Nα-terminal acetylation in H. sapiens.
Figure 78: Construction of an ensemble classifier based on decision trees.


LIST OF TABLES

Table 1: List of functions accomplished by proteins.
Table 2: Names and abbreviations of the 22 amino acids.
Table 3: The protein’s structure levels.
Table 4: The genetic code.
Table 5: The six major classes of enzymes.
Table 6: List of some common post-translational modifications.
Table 7: Supposed substrate specificities of the six N-terminal acetyltransferases.
Table 8: The result obtained by applying the simplification algorithm to reconstructed networks of H. sapiens and E. coli.
Table 9: The criterion to decide the type of exchange flux.
Table 10: Size of the reconstructed E. coli network.
Table 11: Compression of the reconstructed E. coli metabolic network.
Table 12: Packing efficiency on the reconstructed E. coli metabolic network.
Table 13: Identification of tractable systems with the Kolmogorov-Smirnov statistic.
Table 14: Partitions produced by spectral clustering of the toy network.
Table 15: The percentage of network components represented by a subnetwork.
Table 16: The standard deviations for each pair of parameters used in the subnetworks sampling.
Table 17: List of chemical abbreviations used in the human erythrocyte metabolic network.
Table 18: List of enzyme abbreviations used in the human erythrocyte metabolic network.
Table 19: Outliers in the hierarchy for the considered human red blood cell states.
Table 20: Confusion matrix for the genes pairs.
Table 21: The area under the curve for the detection of intra-operonic pairs in E. coli.
Table 22: Discovered pairs in the E. coli hierarchical clustering.
Table 23: Pairs of genes encoding proteins interacting with ppGpp.
Table 24: Pairs of genes that encode PLP dependent proteins.
Table 25: Criteria used to build the Nα-acetylated and the non Nα-acetylated datasets.
Table 26: Criteria for the protein existence in UniProtKB.
Table 27: Number of sequences and content of the different datasets extracted from the two releases of UniProtKB (2012_07 and 2015_07).
Table 28: Hydropathy index (KYTJ820101) from the AAIndex1.
Table 29: Summarized list of the types of tokens used to build a motif.
Table 30: Numbers of possible tokens.
Table 31: Parameters used to build the motifs-trees.
Table 32: Results obtained by outer and inner cross-validation.
Table 33: Results assessing the quality of the initiator methionine cleavage prediction on the 2012 dataset.
Table 34: Cross-validated results for the Nα-terminal acetylation prediction to select the N-terminus length on the 2012 datasets.
Table 35: McNemar’s tests to assess difference between classifiers.
Table 36: Performance for Nα-terminal acetylation prediction by TermiNator3 on the 2012 datasets.
Table 37: Cross-validated scores obtained by Eukaryota classifiers versus TermiNator3 on NatB or NatC proposed substrates.
Table 38: Predictions of proteins with known Nats using the Terminus H. sapiens classifier.
Table 39: Performance of the motifs-trees when the models are built on the complete 2012 datasets to predict Nα-terminal acetylation.
Table 40: Results assessing the quality of the algorithm in predicting Nα-terminal acetylation with a reduced dataset.
Table 41: Motif used in the H. sapiens motifs-tree for Nα-terminal acetylation prediction.
Table 42: Description of the physico-chemical property token used in the H. sapiens Nα-terminal acetylation motifs-tree.
Table 43: Amino acid frequency for the second residue in the H. sapiens proteome.
Table 44: Simplified rules of the second motif in the simplified motifs-tree for Nα-terminal acetylation in H. sapiens.
Table 45: Lysine, proline and arginine scores when aligned on the physico-chemical property tokens of the motif at depth two.
Table 46: Simplified rules of the second motif in the simplified motifs-tree for Nα-terminal acetylation in H. sapiens.
Table 47: 10-fold cross-validated results for Nα-terminal acetylation prediction with the motifs-trees on the 2015 version of the datasets.
Table 48: Results assessing TermiNator3 quality of prediction for Nα-terminal acetylation.
Table 49: Cross-validated performances of the ensemble learning method on the 2012 datasets.

Part I

BASIC MOLECULAR BIOLOGY

1 BIOLOGICAL CONCEPTS

In this chapter we introduce the basic biological knowledge necessary to understand the biological objects studied in this work, namely metabolic networks, proteins and their chemical modifications. This chapter does not pretend to describe those biological objects exhaustively, but it should give the unfamiliar reader a clearer picture of what metabolism, proteins and gene regulation are. Unless otherwise specified, the biological processes described take place in eukaryotic cells. The content of sections 1.1 and 1.3 is inspired by [Berg et al., 2002]; this reference may not be cited again in those sections.

1.1 proteins

Proteins are large biological molecules synthesized by the cell. They are formidable molecular machines that accomplish functions in virtually every process within the cell, like food digestion or immunity. These diverse roles in the organism can be accomplished because of the variety in shapes and sizes of proteins. Indeed, the structure or shape of a protein defines its function, and studying it teaches biologists and biochemists how the protein works. Proteins are also key structural components of biological materials like cartilage, hair or spider silk. The remarkable scope of their functions is exemplified in table 1.

Chemically, proteins are built from a set of 22 genetically encoded amino acids (table 2); 21 of them occur in Eukaryota, pyrrolysine being known only from some archaea and bacteria. Those amino acids form the basic structural units of the protein and come in different shapes, sizes and chemical properties. In other words, amino acids are the building blocks of the protein. More precisely, an amino acid is a molecule consisting of a central carbon, called the α carbon, that bonds to an amino group (–NH2), a carboxyl group (–COOH), a hydrogen atom and a variable lateral chain, called the R group. All amino acids share this common structure (figure 1) and they differ only by the lateral chain (or R group). It is this chain that confers its variability to the amino acid. The amino acids have the important property of being able to bind to one another. The amino group of an amino acid can react with the carboxyl group of another one and form a covalent bond called a peptide bond (figure 2). Several amino acids can bind and form a linear chain called a polypeptide or protein. In such a chain an amino acid is called a residue. The residues of a polypeptide are read from the amino group (N-terminal end) to the carboxyl group (C-terminal end) (figure 3).

Figure 1: Structure of an amino acid.


Table 1: List of functions accomplished by proteins. The list may not be exhaustive.

Type | Function | Example
Structure | Structural proteins create and maintain biological structures by giving shape to cells, tissues and organs | Collagen is a fibrous protein found, for example, in bones, tendons and cartilage
Transport | Transport proteins can bind to substances (small molecules or ions) and transport them from one location to another | Hemoglobin transports oxygen within the erythrocytes (red blood cells) from lungs to tissues
Defense | Some proteins play an active role in cell protection | Antibodies or immunoglobulins are proteins synthesized as a response to a foreign substance (virus, bacteria or parasite)
Regulation | Proteins are signal substances (hormones) or receptors | Insulin is a hormone produced by pancreatic cells, regulating glucose metabolism
Catalysis | Enzymes are catalysts accelerating biochemical reactions up to 10^16 times in comparison to the non-catalyzed reaction | Trypsin is a serine protease, an enzyme cleaving peptide bonds
Motion | Some proteins provide motion capability to cells, like in the cellular division or muscle contraction | Actin and myosin form the contractile system of the muscle cells
Storage | Storage proteins work as tanks for essential substances | Ferritin binds iron and allows the storage of this essential metal

Table 2: Names and abbreviations of the 22 amino acids directly encoded for protein synthesis by the genetic code. Pyrrolysine is known only from some archaea and bacteria, not from Eukaryota.

Name | Abbrev. | Name | Abbrev.
Alanine | Ala, A | Glycine | Gly, G
Arginine | Arg, R | Histidine | His, H
Asparagine | Asn, N | Isoleucine | Ile, I
Aspartic acid | Asp, D | Leucine | Leu, L
Cysteine | Cys, C | Lysine | Lys, K
Glutamic acid | Glu, E | Methionine | Met, M
Glutamine | Gln, Q | Phenylalanine | Phe, F
Proline | Pro, P | Serine | Ser, S
Threonine | Thr, T | Tryptophan | Trp, W
Tyrosine | Tyr, Y | Valine | Val, V
Pyrrolysine | Pyl, O | Selenocysteine | Sec, U


Figure 2: Peptide bond formation between two amino acids: the amine group reacts with the carboxyl group releasing a water molecule.

Figure 3: Representation of a polypeptide with the localization of the N-terminal end and the C-terminal end. Usually a polypeptide is read from the N-terminal end to the C-terminal end.
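As a small illustration of the reading convention of figure 3, the following Python sketch expands a one-letter sequence into the three-letter residue names of table 2, reports the N-terminal and C-terminal residues and counts the peptide bonds. The function name `describe_polypeptide` and the example sequence are hypothetical and serve only as an illustration.

```python
# One-letter to three-letter codes, copied from table 2.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "E": "Glu", "Q": "Gln", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "O": "Pyl", "U": "Sec",
}


def describe_polypeptide(sequence):
    """Describe a non-empty polypeptide written from N- to C-terminus."""
    residues = [THREE_LETTER[letter] for letter in sequence.upper()]
    return {
        "residues": "-".join(residues),
        "n_terminal": residues[0],
        "c_terminal": residues[-1],
        # A linear chain of n residues is held by n - 1 peptide bonds,
        # each formed with the release of one water molecule (figure 2).
        "peptide_bonds": len(residues) - 1,
    }


info = describe_polypeptide("MSTK")  # hypothetical tetrapeptide
print(info["residues"])              # Met-Ser-Thr-Lys
print(info["n_terminal"], info["c_terminal"], info["peptide_bonds"])
```

The dictionary lookup fails on any letter that is not in table 2, which is a deliberate choice for such a sketch: an unknown residue code is reported instead of being silently skipped.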

The sequence of amino acids that composes a protein is called the primary structure. This structure, or sequence, is defined by the nucleotide sequence of the gene encoding the protein. We recall that the gene is the molecular unit of heredity of a living organism and is a segment of a deoxyribonucleic acid (DNA) molecule. Proteins are biosynthesized from those DNA molecules. Roughly, the steps of the biosynthesis are the following:
1. The gene is first transcribed into a precursor messenger RNA (pre-mRNA) molecule in the cell nucleus.
2. The pre-mRNA is then processed into a messenger RNA (mRNA) by removing the introns (the splicing). The introns are the parts of the gene that do not encode amino acids of the synthesized protein.
3. The next step is the translation, where the mRNA is decoded by a molecular machine, called the ribosome, to produce an amino acid chain. Proteins are always synthesized from the N-terminus to the C-terminus.
Figure 4 illustrates these steps with the different molecules involved in the biosynthesis. Proteins differ from one another mainly by their primary structure. This primary structure induces the shape and size of the protein in the cell: we say that the protein folds into a complex three-dimensional structure. Four levels of protein structure are usually distinguished (primary, secondary, tertiary and quaternary); they are listed and described in table 3, together with the intermediate super-secondary level.

1.2 genes

Genes are the molecular units of heredity in living organisms. Their function is to encode proteins, which perform the necessary functions in the cell (see section 1.1). A gene forms a strand of deoxyribonucleic acid (DNA). This strand is itself part of a longer strand of DNA that twists with another one to form a DNA double helix. Such a very long DNA double helix forms a chromosome; hence a chromosome can encode up to thousands of genes. A molecule of DNA is a polymer composed of repeating units called nucleotides. Each nucleotide is composed of a nitrogenous base, a five-carbon sugar (a deoxyribose) and one phosphate group. Four nucleotides, or bases, form the building blocks of DNA: adenine (A), thymine (T), cytosine (C) and guanine (G). Those nucleotides are often abbreviated by the letter given in parentheses.


[Diagram: DNA, (+) sense RNA, (-) sense RNA and protein, linked by DNA replication, transcription, reverse transcription, RNA replication, translation and direct translation; solid arrows show the general flow, dashed arrows the unusual flow.]

Figure 4: The steps involved in the biosynthesis. The arrows indicate the information flows from one molecule to another. The dashed arrows represent unusual flows (e.g. reverse transcription in viruses, direct translation in cell-free systems). The (-) sense represents the antisense RNA (asRNA), which is complementary to an mRNA. The (+) sense represents the mRNA transcribed in the cell.

Table 3: The protein’s structure levels and their description.

Structure | Description
Primary | The linear amino acid sequence of the polypeptide chain.
Secondary | Local structures stabilized by hydrogen bonds between the chain peptide groups. There are two main types of structure, the α-helix and the β-strand or β-sheet.
Super secondary | Compact three-dimensional structure of several adjacent secondary structures, like β-hairpins, α-helix hairpins and β-α-β motifs.
Tertiary | The spatial positions of the secondary structures relative to one another, generally stabilized by nonlocal interactions (in other words, the overall shape of a single protein).
Quaternary | The structure formed by several proteins, called protein subunits, functioning as a single complex.


They are able to bond in pairs by forming hydrogen bonds. These bonds hold together the two strands of the DNA helix. The pairing occurs between A and T and between G and C. To pass the information from a gene to a protein, several steps are needed. The first step is the transcription of the DNA into messenger ribonucleic acid (mRNA), a molecule very similar to DNA. The main difference from DNA is that RNA uses uracil (U) instead of thymine. Moreover, RNA is mainly a single-stranded molecule. However, it can form double-stranded regions by complementary base pairing, as it does in transfer RNA (tRNA). The mRNA is thus composed of the bases complementary to a DNA strand. Transcription consists of four steps:
1. The pre-initiation and initiation are the events that allow the molecular machine called RNA polymerase to bind to the DNA. This machine produces the primary transcript RNA. At first, the RNA polymerase does not bind directly to the DNA but rather to proteins called transcription factors. Those transcription factors bind to a region of DNA, called the promoter, during the pre-initiation step and facilitate the binding of the RNA polymerase.
2. The promoter clearance is the transition from initiation to transcript elongation. During this intermediate phase, the contact with the initiation factors is lost and a stable association with the nascent transcript is established.
3. The elongation is when a strand of the DNA is used as a template for the RNA synthesis. The RNA polymerase traverses the DNA and the RNA is assembled by using the base pairings A–U and G–C.
4. The termination is the end of the transcription. This step is not yet well understood and is therefore not described in this thesis.
The mRNA serves as a template for protein biosynthesis in the translation process. It is formed of codons, which are triplets of nucleotides. These codons form the template read by the ribosome, a large molecular machine that links amino acids together in the order specified by the mRNA. The first codon is called the start codon and is very often AUG. It corresponds to a methionine when translated into an amino acid. Each of the following codons encodes one of the twenty-two amino acids. There are also stop codons that mark the end of the template. The mapping between codons and amino acids is the genetic code. This code is highly similar among all organisms. As an example, the complete genetic code of E. coli is provided in table 4. The translation process consists of four steps:
1. The initiation: the ribosome assembles around the target mRNA and the first tRNA is attached to the start codon. Roughly, a tRNA is a small strand of folded RNA that is linked to an amino acid. It also carries an anticodon, that is three bases that can pair with a codon of the mRNA.
2. The elongation: the tRNA transfers the amino acid corresponding to the next codon.
3. The translocation: the ribosome moves to the next mRNA codon.
4. The termination: the ribosome releases the polypeptide when a stop codon is reached.
Some genes are constitutive, that is to say continually transcribed. Other genes are facultative and are only transcribed when needed. Indeed the gene expression, the name given to the process


Table 4: The genetic code mapping a triplet of ribonucleotides to an amino acid. The methionine codon (AUG) also acts as the start codon. The Amber, Ochre and Opal codons are the stop codons. This table maps only to twenty amino acids, but it has been recently discovered that the UGA codon maps to a selenocysteine when the selenocysteine insertion sequence element is present during the translation. The UAG codon is translated into pyrrolysine in a similar way [Rother and Krzycki, 2010, Papp et al., 2007, Zhang et al., 2005].

The genetic code
UUU Phe | UCU Ser | UAU Tyr | UGU Cys
UUC Phe | UCC Ser | UAC Tyr | UGC Cys
UUA Leu | UCA Ser | UAA Ochre | UGA Opal
UUG Leu | UCG Ser | UAG Amber | UGG Trp
CUU Leu | CCU Pro | CAU His | CGU Arg
CUC Leu | CCC Pro | CAC His | CGC Arg
CUA Leu | CCA Pro | CAA Gln | CGA Arg
CUG Leu | CCG Pro | CAG Gln | CGG Arg
AUU Ile | ACU Thr | AAU Asn | AGU Ser
AUC Ile | ACC Thr | AAC Asn | AGC Ser
AUA Ile | ACA Thr | AAA Lys | AGA Arg
AUG Met | ACG Thr | AAG Lys | AGG Arg
GUU Val | GCU Ala | GAU Asp | GGU Gly
GUC Val | GCC Ala | GAC Asp | GGC Gly
GUA Val | GCA Ala | GAA Glu | GGA Gly
GUG Val | GCG Ala | GAG Glu | GGG Gly
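The use of table 4 can be sketched in Python as follows. The sketch transcribes a DNA template strand into an mRNA with the A–U and G–C pairings, then translates the mRNA from the first AUG to the first stop codon. It is a simplified illustration and not a model of the cellular machinery: it ignores splicing and the recoding of UGA and UAG into selenocysteine and pyrrolysine. The names `transcribe` and `translate` and the template sequence are hypothetical.

```python
from itertools import product

# Genetic code of table 4 in one-letter codes, codons enumerated with the
# bases in the order U, C, A, G; "*" marks the Amber, Ochre and Opal stops.
BASES = "UCAG"
ONE_LETTER = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), ONE_LETTER)
}

# Pairing used during elongation: template DNA base -> RNA base.
PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5):
    """Build the mRNA (5'->3') from a template strand written 3'->5'."""
    return "".join(PAIRING[base] for base in template_3_to_5)


def translate(mrna):
    """Translate from the first AUG up to the first stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = GENETIC_CODE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


template = "TACAGGCTTATT"  # hypothetical template strand, written 3'->5'
mrna = transcribe(template)
print(mrna)             # AUGUCCGAAUAA
print(translate(mrna))  # MSE
```

Here the codons AUG, UCC and GAA give Met, Ser and Glu, and UAA (Ochre) ends the translation.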

by which the information in a gene is used to build a functional product (usually a protein), is subject to regulation. Gene regulation is a complex process that is not yet fully understood. In this thesis we only roughly describe the process of transcriptional regulation, which is the way the cell regulates the copying of DNA into an RNA molecule. More precisely, we focus on the regulation through transcription factors. Transcription factors are proteins that bind to specific DNA regions (e.g. a promoter) to regulate the expression of a gene. A transcription factor can act as an activator or an inhibitor by recruiting or repressing the RNA polymerase. Usually several transcription factors must bind to the DNA to recruit other factors and the RNA polymerase. Interestingly, only a small subset of the genome, that is to say of the complete set of DNA, may encode transcription factors (≈ 2000, i.e. 7% of the human set of proteins, or proteome). But as they function in groups, the combinatorial use of this subset could mean that each gene is uniquely regulated [Brivanlou and Darnell, 2002]. Post-translational modifications (i.e. chemical modifications of the protein) also orchestrate all transcription factor functions, including subcellular localization, protein stability, protein-protein interactions (e.g. with cofactors) and transcriptional activities [Filtz et al., 2014, Tootle and Rebay, 2005]. This combinatorial use of transcription factors gives rise to complex interactions between these regulators. Such interactions can be modeled by a so-called gene regulatory network, which represents the interactions between the regulated DNA regions and other molecules (like proteins). These interactions can be direct (e.g. the binding of a transcription factor activating the transcription), but not only. Indeed, some transcription factors may activate or inhibit the transcription of another transcription factor or of a protein (an


enzyme) responsible for a post-translational modification having an effect on another transcription factor. These kinds of interactions, and many more, make up the complex regulatory network. A small example of a network representing the crp regulation in E. coli is provided in figure 5. The reconstruction and study of these networks is an active subject of research in systems biology [Lee et al., 2002, Teichmann and Babu, 2004, Hecker et al., 2009]. It is also interesting to note that a promoter does not always regulate one unique gene, but can control the expression of several genes, called an operon. An operon is a cluster of genes controlled by a single promoter. That is to say, when the transcription process starts, all genes in the operon are transcribed into mRNAs. These mRNAs are then translated together (polycistronic mRNA) or are trans-spliced, thus producing monocistronic mRNAs which are translated independently. An mRNA is monocistronic when it contains the information of only one polypeptide. A polycistronic mRNA encodes several polypeptides (footnote 1). An operon is composed of three components:
1. A promoter (see the previous description in the text).
2. An operator, which is a DNA region between the promoter and the genes where a repressor binds and obstructs the RNA polymerase.
3. The genes that are co-regulated by the operon.
Operons are found in prokaryotes and eukaryotes, and the expression of eukaryotic operons usually leads to the transcription of monocistronic mRNAs (as opposed to prokaryotic operons, which lead to polycistronic mRNAs) [Blumenthal, 2004].

[Diagram nodes: β-D-fructofuranose-1-P, Cra, PhoB, Crp-cAMP, Fis, ppGpp, DksA, IHF, NsrR.]

Figure 5: Example of a small partial regulatory network. This diagram illustrates the crp regulation in E. coli and shows how the σ-factor (σ70) is recruited. This factor is needed to initiate the transcription. Pointed arrows represent activation, diamond arrows represent activation or inhibition and blunt arrows represent inhibition. This diagram shows that Cra, Crp and Fis have a direct regulating effect on the σ-factor. Cra activates the transcription initiation, Fis inhibits it and Crp-cAMP activates or inhibits it. The network also shows other activating and inhibiting effects, e.g. β-D-fructofuranose-1-P inhibits Cra and Fis activates Crp. All these interactions form the regulatory network.
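The interactions named in the caption of figure 5 can be written down as a small signed directed graph, a common data structure for gene regulatory networks. The following Python sketch is only an illustration: it contains the five interactions stated in the caption (the other nodes of the diagram are omitted because their edges are not described in the text), and the node label `sigma70` as well as the function names are hypothetical choices.

```python
# Signed edges (regulator, target, effect) read from the caption of figure 5.
# "+" is activation, "-" is inhibition, "+/-" is activation or inhibition.
EDGES = [
    ("Cra", "sigma70", "+"),
    ("Fis", "sigma70", "-"),
    ("Crp-cAMP", "sigma70", "+/-"),
    ("beta-D-fructofuranose-1-P", "Cra", "-"),
    ("Fis", "Crp", "+"),
]


def regulators_of(target, edges=EDGES):
    """Return {regulator: effect} for the direct regulators of `target`."""
    return {source: effect for source, dest, effect in edges if dest == target}


def upstream_of(target, edges=EDGES):
    """Return every node with a directed path (direct or not) to `target`."""
    found, stack = set(), [target]
    while stack:
        node = stack.pop()
        for source, dest, _ in edges:
            if dest == node and source not in found:
                found.add(source)
                stack.append(source)
    return found


print(regulators_of("sigma70"))
print(sorted(upstream_of("sigma70")))
```

The second query returns β-D-fructofuranose-1-P in addition to the three direct regulators, which illustrates the indirect interactions mentioned in the text: a molecule acting on Cra also acts, through Cra, on the transcription initiation.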

1. More precisely, a polycistronic mRNA contains several open reading frames (ORFs). The description of an ORF is not provided in this document; the reader can find good descriptions in [Lodish et al., 2000, Berg et al., 2002].


Table 5: The six major classes of enzymes and the reaction’s types. The table is partially taken from [Berg et al., 2002].

EC | Class | Type of reaction | Example
1 | Oxidoreductases | Oxidation-reduction | Lactate dehydrogenase
2 | Transferases | Group transfer | N-acetyltransferase
3 | Hydrolases | Hydrolysis reactions (transfer of functional groups to water) | Methionine aminopeptidase
4 | Lyases | Addition or removal of groups to form double bonds | Fumarase
5 | Isomerases | Isomerization (intramolecular group transfer) | Triose phosphate isomerase
6 | Ligases | Ligation of two substrates at the expense of ATP hydrolysis | Aminoacyl-tRNA synthetase

One can cite several other regulation processes, like histone rearrangement, the action of transcriptional enhancers [Levine, 2010] and other post-transcriptional regulation strategies. The gene expression processes are very complex and may not be fully understood. We will not provide a description, even summarized, of these regulation strategies, as it goes far beyond the scope of this thesis.

1.3 enzymes

Enzymes are biological molecules acting as the catalysts of biological systems. The majority of enzymes are proteins (footnote 2) and increase reaction rates by factors of a million or more. Several impressive and extreme examples are found in Radzicka and Wolfenden [1995]: the orotidine 5'-phosphate decarboxylase enhances the rate of reaction by a factor of 10^17 (non-enzymatic half-life, defined in footnote 3: 7.8 · 10^7 years) and the staphylococcal nuclease by a factor of 10^14 (non-enzymatic half-life: 1.3 · 10^5 years). Hence, without such catalysts, most biological reactions would not proceed at a rate able to sustain life in the cell. Enzymes are also highly specific and usually catalyze a single chemical reaction or a set of closely related reactions. They are classified based on the chemical reactions they catalyze (table 5) and there are currently 4867 active enzyme classes [Schomburg et al., 2012]. Enzymes are usually much larger than their substrates and only a few amino acids play a role in the catalysis. Those few amino acids compose the catalytic site, located next to binding sites where residues orient the substrates. The catalytic and binding sites compose the enzyme’s active site. It should be noted that in some enzymes no amino acids are directly involved in catalysis. Rather, cofactors bind to the enzyme and take part themselves in the catalytic reaction.
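The classification of table 5 lends itself to a simple lookup. The Python sketch below assumes the usual dotted notation of EC numbers, in which the first field is the class number of the EC column of table 5; the function name `enzyme_class` and the input string are illustrative choices.

```python
# The six major enzyme classes of table 5, keyed by the EC class number.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def enzyme_class(ec_number):
    """Return the major class of an EC number such as 'EC 3.4.11.18'."""
    first_field = ec_number.replace("EC", "").strip().split(".")[0]
    return EC_CLASSES[int(first_field)]


# Illustrative input: an EC number whose first field is 3 (a hydrolase,
# the class of methionine aminopeptidase in table 5).
print(enzyme_class("EC 3.4.11.18"))  # Hydrolases
```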

2. Also catalytic RNA molecules or ribozymes (ribonucleic acid enzymes) are capable of catalyzing specific biochemical reactions [Kruger et al., 1982].
3. The half-life (denoted as t1/2) is a description of how fast a reaction is occurring. It is the time for half of the reactant initially present to decompose.
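To give an order of magnitude for the figures quoted from Radzicka and Wolfenden, the following Python sketch converts a non-enzymatic half-life and a rate enhancement into the corresponding catalyzed half-life. It rests on two assumptions that are not stated in the text: the reaction follows first-order kinetics, so that k = ln 2 / t1/2, and the enhancement factor is the ratio of the two first-order rate constants. The function names are hypothetical.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def rate_constant(half_life_s):
    """First-order rate constant (1/s) from a half-life in seconds."""
    return math.log(2) / half_life_s


def catalyzed_half_life(uncatalyzed_half_life_years, enhancement):
    """Half-life in seconds once the rate constant is multiplied by `enhancement`."""
    k_non = rate_constant(uncatalyzed_half_life_years * SECONDS_PER_YEAR)
    k_cat = k_non * enhancement
    return math.log(2) / k_cat


# Orotidine 5'-phosphate decarboxylase: 7.8e7 years, enhancement 1e17.
t_ompdc = catalyzed_half_life(7.8e7, 1e17)
# Staphylococcal nuclease: 1.3e5 years, enhancement 1e14.
t_nuclease = catalyzed_half_life(1.3e5, 1e14)
print(f"{t_ompdc * 1000:.0f} ms, {t_nuclease * 1000:.0f} ms")
```

Under these assumptions, a reaction with a half-life of 78 million years is brought down to a half-life of about 25 milliseconds, which illustrates why most biological reactions could not sustain life without catalysis.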


[Diagram: a → b → c → d, with a dashed arrow from d back to the first reaction.]

Figure 6: Illustration of the feedback inhibition in a chain of three enzymatic reactions (partially reproduced from Berg et al. [2002]). The chemical “a” is transformed by enzymes to the final product “d”. Each arrow represents a reaction catalyzed by an enzyme. The dashed arrow represents the feedback inhibition of the first enzyme where the chemical “d” binds to the enzyme. The binding is reversible in order to allow the conversion from “a” into “d”.

Enzyme activity can be regulated by several strategies. For example, enzymes can be inhibited by their final product (feedback inhibition, figure 6) through a reversible allosteric interaction. Some enzymes change conformation when they interact with other molecules, thus modifying their activity. This mechanism is called allostery. For example, an effector molecule binds to a site other than the active site. This binding often results in a conformational change of the protein, thus altering the catalytic activity of the enzyme. An effector molecule enhancing the enzyme’s activity is called an allosteric activator. Conversely, a molecule decreasing the activity is called an allosteric inhibitor. Many enzymes are synthesized in an inactive form (called a zymogen) and are activated by a digestive enzyme that cleaves them. The cleavage induces a conformational change that produces the active form of the enzyme. Such activation is called proteolytic activation. Covalent modification is another mechanism of regulation. It consists in the attachment (mostly reversible) of a chemical group that modifies the properties of the enzyme (e.g. by blocking the substrate binding to the active site). Finally, as enzymes are proteins, they are also subject to the regulation of gene expression.
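The effect of the feedback inhibition of figure 6 can be illustrated with a toy simulation. The Python sketch below is a strong simplification and not a model taken from this thesis: each reaction is assumed to follow mass-action kinetics with the same rate constant `k`, the inhibition is assumed to divide the rate of the first reaction by (1 + d / ki), and the equations are integrated with a plain Euler scheme. All names and parameter values are hypothetical.

```python
def simulate(ki=None, k=1.0, a0=10.0, dt=0.001, n_steps=5000):
    """Euler integration of the chain a -> b -> c -> d.

    If `ki` is given, the product d inhibits the first reaction: the rate
    of a -> b is divided by (1 + d / ki). Returns the final (a, b, c, d).
    """
    a, b, c, d = a0, 0.0, 0.0, 0.0
    for _ in range(n_steps):
        inhibition = 1.0 if ki is None else 1.0 + d / ki
        v1 = k * a / inhibition  # first reaction, slowed down by d
        v2 = k * b
        v3 = k * c
        a -= v1 * dt
        b += (v1 - v2) * dt
        c += (v2 - v3) * dt
        d += v3 * dt
    return a, b, c, d


free = simulate()             # no feedback
inhibited = simulate(ki=1.0)  # d inhibits the first enzyme
print(f"d without feedback: {free[3]:.2f}")
print(f"d with feedback:    {inhibited[3]:.2f}")
```

With the feedback, the accumulation of “d” slows down its own production, so less “d” is present at the end of the simulated period while more “a” remains unconverted.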

1.4 post-translational modifications

Post-translational modifications are covalent modifications occurring during or after protein biosynthesis. These modifications, or processing, change the properties of a protein by the attachment of functional groups (e.g. Nα-terminal acetylation), changes of the chemical nature (e.g. deamidation), cleavage of one or more residues (e.g. initiator methionine cleavage), or structural changes. For example, they can determine:
- The activation of a protein, as seen with the proteolytic activation of a zymogen in section 1.3.
- The localization and turnover, as will be illustrated with the description of Nα-terminal acetylation in section 1.4.2.
- The structure, like the disulfide bonds between two cysteines that stabilize the folded form of a protein by holding two portions of the protein together.
The post-translational modifications broaden the diversity of the functional groups of the 22 standard amino acids and produce diverse forms of proteins that cannot be derived directly from their genes alone [Walsh, 2006, Schwartz et al., 2009]. Since the mature form of a protein cannot be inferred only from its genes, the knowledge of a protein’s post-translational modifications helps to understand the roles, the possible interactions, or the activity of a protein. In this work we focused on two post-translational modifications: the initiator methionine cleavage and the Nα-terminal acetylation of eukaryotic proteins. Those post-translational modifications are introduced in the next sections (see 1.4.1 and 1.4.2). The content of this introduction may not be sufficient to understand them in detail. The interested reader is invited to read [Berg et al., 2002, chap. 3, 5 and 8] and [Lodish et al., 2000, chap. 3, 4, 6 and 18].

Table 6: List of some common post-translational modifications (PTM) and their functions. This list is non-exhaustive and is partially taken from Mann and Jensen [2003].

PTM type | Function | Notes
Phosphorylation | Activation/inactivation of enzyme activity, modulation of molecular interactions, signaling | Reversible
Acetylation | Protein stability, protection of N terminus, regulation of protein-DNA interactions |
Methylation | Regulation of gene expression |
Acylation | Cellular localization and targeting signals, membrane tethering, mediator of protein-protein interactions |
Glycosylation | Excreted proteins, cell-cell recognition/signaling, regulatory functions | Reversible
Hydroxyproline | Protein stability and protein-ligand interactions |
Sulfation | Modulator of protein-protein and receptor-ligand interactions |
Disulfide bond formation | Intra and intermolecular crosslink, protein stability |
Deamidation | Possible regulator of protein-ligand and protein-protein interactions |
Pyroglutamic acid | Protein stability, blocked N terminus |
Ubiquitination | Destruction signal |
Nitration of tyrosine | Oxidative damage during inflammation |

1.4.1 Initiator methionine cleavage

A ribonucleic acid (RNA) molecule is a chain of ribonucleotides joined by covalent bonds. Each nucleotide carries one of the following bases: adenine, cytosine, guanine or uracil (respectively symbolized by A, C, G and U). During the translation step an mRNA molecule is decoded by reading the nucleotides by triplets, or codons. The translated messenger RNA usually starts with an AUG codon, which corresponds to a methionine in all genetic codes (footnote 4). Hence the first residue of a newly synthesized protein is a methionine [Meinnel et al., 1993, Nakamoto, 2009]. However, this methionine is usually removed in a process called N-terminal methionine cleavage. For any given proteome, this occurs for between 50% and 70% of the proteins [Meinnel et al., 1993, Frottin et al., 2006]. The initiator methionine cleavage seems to be conserved in all organisms and the rules driving the cleavage seem to be similar in those organisms. The enzyme catalyzing the initiator methionine cleavage is the methionine aminopeptidase (MAP), which processes the nascent protein during its biosynthesis as soon as the first residues are assembled by the ribosome [Arfin and Bradshaw, 1988]. The process is thus cotranslational (it occurs when ∼15 residues have been assembled) and also irreversible. Two classes of MAPs have been identified and both can be expressed in the same organism (e.g. in Eukaryota): MAP Ib and MAP IIb [Bradshaw et al., 1998]. It has recently been suggested that the activity of initiator methionine cleavage is controlled rather than constitutive [Giglione et al., 2004]. While in Eubacteria both MAP genes are essential, their essentiality in Eukaryota is less clear. In S. cerevisiae the disruption of both genes is lethal, suggesting that cytoplasmic initiator methionine cleavage is essential in lower Eukaryota. In higher Eukaryota the essentiality of the initiator methionine cleavage is unknown, but data suggest that the disruption of MAP2 genes causes an abnormal development phenotype in Drosophila and that MAP2 inhibition causes apoptosis in malignant human cells. In addition, the blocking of MAP2 activity with a specific inhibitor results in the interruption of the G1 phase (footnote 5). Recent data strongly suggest that initiator methionine cleavage is involved in controlling the protein half-life. However, despite these discoveries, the role of initiator methionine cleavage is poorly understood [Giglione et al., 2004]. The proteomics of initiator methionine cleavage is well characterized and the substrate specificity of MAP has been studied. The rule is that cleavage occurs if the side chain of the amino acid following the initiator methionine is small enough (Ala, Cys, Pro, Ser, Thr and Val). This rule suggests that the process is correlated with the length and the gyration radius of the residue’s side chain [Frottin et al., 2006]. A more detailed analysis of the proteomics of the initiator methionine cleavage is provided in chapter 8 to illustrate the efficiency of the approach presented in this thesis.
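The simple side-chain rule stated above can be written as a few lines of Python. This sketch only encodes the rule as given in the text, with exactly the residues listed there; it is not the prediction method developed in this thesis, and the function names and example sequences are hypothetical.

```python
# Residues that allow the cleavage when they follow the initiator
# methionine, as listed in the text (Ala, Cys, Pro, Ser, Thr and Val).
SMALL_RESIDUES = {"A", "C", "P", "S", "T", "V"}


def initiator_methionine_cleaved(sequence):
    """Apply the side-chain size rule to a sequence read from the N-terminus."""
    return (
        len(sequence) >= 2
        and sequence[0] == "M"
        and sequence[1] in SMALL_RESIDUES
    )


def mature_n_terminus(sequence):
    """Return the sequence after the (possible) initiator methionine cleavage."""
    return sequence[1:] if initiator_methionine_cleaved(sequence) else sequence


print(mature_n_terminus("MSTK"))  # STK: serine is small, Met is removed
print(mature_n_terminus("MKLV"))  # MKLV: lysine is not in the list
```

The set of residues is kept as a separate constant so that it can be adjusted if a different list of small residues is used.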

1.4.2 Nα–terminal acetylation

The first Nα-acetylated protein was discovered in 1958 [Narita, 1958], and it has since been reported that acetylation is a very common modification, occurring on ≈ 50% of the yeast proteins and ≈ 80% of the human proteins [Brown and Roberts, 1976, Arnesen et al., 2009, Polevoda et al., 2009]. This irreversible process is one of the most common covalent modifications. The Nα-terminal acetylation is a cotranslational modification and occurs when between 25 and 70 residues of a nascent protein have emerged from the ribosome [Strous et al., 1974, Pestana and Pitot, 1975]. The process is catalyzed by N-acetyltransferases (NAT) and consists in transferring the

4. Although non-AUG codons may be used to initiate the translation in Eukaryota and in Prokaryota.
5. The growth 1/gap 1 phase is the first of four phases of the cell cycle that takes place in eukaryotic cell division.


Figure 7: The N-terminal acetyltransferases (NATs) use Ac-CoA to catalyze the acetylation of α-amino groups on N-terminal residues. This reaction produces a CoA and an N-terminal acetylated polypeptide. There is another case of peptide acetylation, on the ε-amino group of internal lysine side chains. That acetylation is catalyzed by histone acetyltransferases (HATs) and internal lysine acetyltransferases (KATs). The acetyl functional groups are drawn in orange.

acetyl group from acetyl-coenzyme A (Ac-CoA) to the α-amino group of the first amino acid of the protein [Gautschi et al., 2003, Pestana and Pitot, 1975, Polevoda and Sherman, 2003, Polevoda et al., 2008, 2009]. NATs catalyze this transfer and the reaction releases a coenzyme A (CoA) and an N-terminal acetylated polypeptide (figure 7). Three NATs have been identified and are thought to be responsible for the majority of the N-terminal acetylation events, accounting for ∼85% of the acetylated proteins [Arnesen et al., 2009]. Those NATs are named NatA, NatB and NatC. The rest is believed to be catalyzed by three other NATs, named NatD, NatE and NatF. Thus, at the time of writing of this thesis, six NATs have been identified. However, the NATs’ substrates are still not well known (or only partially known). The first three NATs (NatA, B, C) are well conserved from yeast to humans and are characterized based on their supposed substrate specificity [Polevoda et al., 1999]. NatA is the enzyme having the most supposed substrates (six) and the only one acetylating a non-methionine residue at the start of the polypeptide (i.e. after the initiator methionine cleavage) [Polevoda and Sherman, 2003]. NatB and NatC acetylate the methionine first residue and may be associated with the ribosome. NatB seems to prefer polar substrates and NatC hydrophobic substrates. The other three NATs (NatD, E, F) may not be as widespread among organisms as the first three. For example, NatD activity was described in yeast, where it seems to acetylate only the N-termini of histones H2A and H4 (Ser-Gly-Gly and Ser-Gly-Arg), but no such activity has been noticed in higher eukaryotes [Hole et al., 2011]. For NatE, only in vitro activity has been described in human, and evidence of in vivo activity is lacking [Evjenth et al., 2009]. NatF seems to be responsible for the increase in the occurrence of Nα-acetylated proteins from lower to higher eukaryotes. Because NatF has only been found in higher Eukaryota, its presence could explain the higher rate of acetylation observed in those organisms. Indeed, NatF shares substrates with the previously described NATs [Van Damme et al., 2011].


Table 7: Supposed substrate specificities of the six NATs. Note that NatF has substrates overlapping with those of NatC and NatE: the overlap is total with NatC and partial with NatE (Met-Lys- and Met-Leu-). The acetylation site is always the first residue in the substrate. The number of residues describing each substrate (Num. res.) and the number of different substrates per NAT (Diff. res.) are also provided.

NAT  Substrates                                        Num. res.  Diff. res.
A    Ser-, Ala-, Gly-, Thr-, Val-, Cys-                1          6
B    Met-Glu-, Met-Asp-, Met-Asn-, Met-Gln-            2          4
C    Met-Leu-, Met-Ile-, Met-Trp-, Met-Phe-            2          4
D    Ser-Gly-Gly-, Ser-Gly-Arg-                        3          2
E    Met-Ala-, Met-Lys-, Met-Leu-, Met-Met-            2          4
F    Met-Lys-, Met-Leu-, Met-Ile-, Met-Trp-, Met-Phe-  2          5

The role of the Nα-terminal acetylation is unclear, but some insights into its physiological role have been proposed [Hollebeke et al., 2012]. Several studies have reported that naturally occurring Nα-acetylated proteins are not degraded by the ubiquitin system at a significant rate, while their non-acetylated counterparts from other species are good substrates for ubiquitination. This suggests that one function of Nα-terminal acetylation is to protect proteins from degradation [Hershko et al., 1984, Arfin and Bradshaw, 1988, Hwang et al., 2010]. The human NatA was reported to be important for cell survival and growth in various types of cancer. Depletion of the human NatA induces p53-dependent apoptosis and p53-independent growth inhibition. These effects may be induced by human NatA knockdown and could be exploited for use in combinatorial chemotherapy. This indicates that several mechanisms of growth inhibition and sensitivity of cells to apoptotic signals are connected with the Nα-terminal acetylation of proteins [Gromyko et al., 2010, Caroline et al., 2011]. The Nα-terminal acetylation of yeast ribosomal proteins also affects protein synthesis. It was found that NatA and NatB deletions cause a defect in the 80S ribosome assembly. Thus ribosomal protein Nα-terminal acetylation is necessary to maintain the ribosome's protein synthesis function [Kamita et al., 2011]. It also seems that the Nα-terminal acetylation is an early step in the cellular sorting of nascent polypeptides. A study has shown that when the N-terminal secretory signal sequence is acetylated, the secretion to the endoplasmic reticulum is inhibited and the protein stays in the cytosol [Forte et al., 2011]. The acetylation may interfere with the binding of the signal recognition particle (SRP) to the signal sequence. Indeed, both processes occur as the peptide emerges from the ribosome and may be in competition.
The binding of SRP leads to the slowing of protein synthesis. Then SRP targets the ribosome and the nascent polypeptide complex to the translocon in the endoplasmic reticulum membrane.

Regarding the role and function of Nα-terminal acetylation, good reviews are provided in [Hollebeke et al., 2012, Arnesen, 2011]. Furthermore, the Ph.D. thesis of Liszczak provides a very good review of the NATs and Nα-terminal acetylation in a single document [Liszczak, 2013].


1.5 The metabolism

While there is no indisputable definition of life, all living organisms share the following attributes: adaptation, growth, organization, energy transformation, reproduction, and metabolism. In this section we focus on the last attribute, the metabolism. The metabolism is the set of chemical reactions taking place in a living organism to sustain life. Although some of those reactions take place outside the cell, like the transport of small molecules between cells or the breakdown of food (i.e. digestion), the majority happen in the cell and form the intermediate metabolism. Unless specified, the term metabolism is a substitute for intermediate metabolism. This set of chemical reactions is commonly separated into two main categories: catabolism and anabolism. It should be noted that this separation is artificial. The reactions of each category are interconnected through the molecules processed or produced by these reactions.

The catabolic reactions break down molecules into smaller molecules to release energy. In animals the set of catabolic reactions extracting energy from foodstuffs can be grouped into three main stages (chapter 14 of Berg et al. [2002] and figure 8).

1. Large organic molecules (e.g. proteins or lipids) are digested into smaller components outside cells.
2. The smaller molecules from the previous step are transported into the cell and converted to even smaller molecules. Mainly sugars, fatty acids and several amino acids are converted into acetyl-CoA. This stage generates some energy (i.e. ATP).
3. The acetyl group on the CoA is oxidized to water and carbon dioxide in the Krebs cycle and electron transport chain. The stored energy is released by reducing the coenzyme nicotinamide adenine dinucleotide (NAD+ into NADH).

Anabolism is the set of reactions taking part in the biosynthesis of cellular components from precursors of low molecular weight. These reactions use the energy released by the catabolism. Anabolism also involves three basic stages.

1. The production of precursors (e.g. monosaccharides, nucleotides and amino acids).
2. Their activation into reactive forms using energy from ATP.
3. The assembly of the precursors into complex molecules (such as polysaccharides, nucleic acids and proteins).

The vast majority of these reactions (anabolic and catabolic) are catalyzed by enzymes and linked together by the chemical compounds they transform. All these interactions form a so-called metabolic network. For example, one reaction converts the chemical “a” into a chemical “b” and another reaction converts the chemical “b” into a chemical “c”, thus forming a chain. Figure 9 illustrates a non-organism-specific metabolism, or complete (see footnote 6) metabolic network. Such a network will be described more precisely in chapter 2. Chained reactions can be grouped into metabolic pathways. In a pathway an initial, or input, chemical is modified by a sequence of enzymatic reactions

6. Complete means the network is built based on the best knowledge at the time of the writing of this thesis.


[Figure 8 diagram. Stage 1: fats, polysaccharides and proteins are broken down into fatty acids and glycerol, glucose and other sugars, and amino acids. Stage 2: these are converted into acetyl-CoA. Stage 3: the Krebs cycle (CoA, 2 CO2, 8 e-) and oxidative phosphorylation (O2, H2O, ATP).]

Figure 8: Illustration of the three stages (taken from Berg et al. [2002]). Stage 1: digestion of large organic molecules. Stage 2: Transport in the cell and conversion into acetyl-CoA. Stage 3: Krebs cycle, electron transport chain, reduction of NAD+ and release of ATP.


Figure 9: Illustration taken from the Kyoto Encyclopedia of Genes and Genomes (KEGG) representing the complex network produced by the metabolic reactions [Kanehisa and Goto, 2000]. A dot represents a chemical and an edge (or link) between dots indicates that a chemical can be converted to the linked chemical. The colors represent a classification of the reactions by purpose (e.g. amino acid biosynthesis or degradation). This separation is not necessarily physical (e.g. different organelles) but rather logical or historical. This figure does not represent the metabolism of a specific organism, but is an artificial aggregation of the most common reactions and their main reactants.


[Figure 10 diagram. (a) A linear chain over the chemicals a, b, c and d catalyzed by enzymes 1, 2 and 3. (b) A branched pathway over the chemicals a to h catalyzed by enzymes 1 to 5.]

Figure 10: A simple linear pathway formed by three chained reactions (a). This pathway converts the chemical “a” into the chemical “d”. A more complicated pathway with several branches and input/output chemicals (b). This pathway converts the chemicals “a” and “g” into the chemical “e” along with the production of “f”.

into one or several products, or outputs, through several intermediates. In such a sequence, the product of one enzyme is the substrate for the next enzyme in the sequence. A pathway is not necessarily a simple chain of reactions, but can be composed of several inputs, several outputs and several branches. Figure 10 illustrates a simple and a more complex pathway. The pathways in figure 10 flow in one direction. This is often the case in the cell, as the conditions are usually thermodynamically favorable to one given direction, even when the reactions are reversible. This means that in some cases, the synthesis and the degradation of components do not follow the same pathway (i.e. one unique pathway, but with the flow reversed). For example, pyruvate is produced at the end of glycolysis, and pathways beginning with pyruvate lead to the synthesis of several amino acids, like valine and leucine. However, to be degraded, valine and leucine undergo transamination, oxidative decarboxylation and dehydrogenation to produce acetoacetyl-CoA or acetyl-CoA and enter the Krebs cycle. Hence the degradation of those amino acids is not the reverse of their synthesis. However this is not always the case. A famous example is glycolysis, where glucose is transformed into pyruvate. Gluconeogenesis, which is the production of glucose from pyruvate, is largely the reverse of glycolysis. In brief, the metabolic pathways can be viewed as the set of paths in the metabolism that transform a set of given input chemicals into a set of other chemicals.
The paths describe which intermediary chemicals (or metabolites) are needed to achieve the transformations carried out by the pathways.


Part II

METABOLIC NETWORK ANALYSIS

2 METABOLIC NETWORK MODELS

As soon as we start studying in silico metabolic networks, we need a mathematical representation of this object. This chapter introduces two models of the metabolism: the stoichiometric model and the graph model. The purpose is to provide the knowledge necessary to understand how the metabolism and its pathways can be represented mathematically.

2.1 Stoichiometric modeling

The metabolism can be seen as a large system of interconnected chemical equations. A chemical equation describes the relative quantities of reactants and products in a chemical reaction. Those relative quantities define the stoichiometry of the reaction. For example, in the duodenum, hydrochloric acid or gastric acid (HCl) is neutralized by sodium bicarbonate (NaHCO3). This is described by the following equation:

HCl + NaHCO3 −−→ NaCl + H2CO3 (1)

Hence the acid is neutralized, yielding carbonic acid and sodium chloride (i.e. salt). In equation 1 the stoichiometric coefficients are not shown because they are all equal to one. Chemical equations follow the law of mass conservation: the total mass of the reactants equals the total mass of the products. Thus in equation 1 the mass is conserved: on the left- and right-hand sides of the equation there are two H, one Na, one Cl and one CO3. An example with a non-trivial stoichiometry is the combustion of propane (C3H8):

C3H8 + 5 O2 −−→ 4 H2O + 3 CO2 (2)

where a molecule of propane and five molecules of dioxygen are burnt into four molecules of water and three of carbon dioxide. In this equation too the total mass is conserved: the 10 oxygen atoms of the five dioxygen molecules (10 = 5 · 2) end up in water (4) and carbon dioxide (6 = 3 · 2) after the combustion. Such equations can be arranged in a system composed of several equations. For example, the carbonic acid in the duodenum is further transformed into carbon dioxide and water, thus producing the system:

HCl + NaHCO3 −−→ NaCl + H2CO3 (Neutralization)
H2CO3 −−→ H2O + CO2 (Anhydration) (3)

Again, when not specified, the stoichiometric coefficients are equal to one. This system describes how hydrochloric acid and sodium bicarbonate are converted into salt, water and carbon dioxide. Such a system can be rearranged in a matrix of size 6 × 2 with the rows corresponding to the chemicals and the columns corresponding to the reactions:

\[
\begin{array}{l|rr}
 & \text{Neutr.} & \text{Anhydr.} \\ \hline
\mathrm{HCl} & -1 & 0 \\
\mathrm{NaHCO_3} & -1 & 0 \\
\mathrm{NaCl} & 1 & 0 \\
\mathrm{H_2CO_3} & 1 & -1 \\
\mathrm{H_2O} & 0 & 1 \\
\mathrm{CO_2} & 0 & 1
\end{array}
\tag{4}
\]

When a chemical is consumed by a reaction, its stoichiometric coefficient is negative, and when it is produced, the coefficient is positive. A zero coefficient corresponds to a chemical that is not part of the reaction. Such a matrix can thus be used to represent the metabolism. Each column of the matrix corresponds to a reaction that occurs in the cell and each row to a chemical transformed by those reactions. Usually those matrices are sparse and have more columns than rows. The sparsity arises because each reaction transforms only a few chemicals, and the non-square form because several reactions share common chemicals (e.g. water or carbon dioxide).
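As an illustration, the construction of the stoichiometric matrix of equation 4 from the two duodenum reactions can be sketched in a few lines of Python. This is a minimal sketch using only the standard library; the function names `stoichiometric_matrix` and `atom_balance`, as well as the hand-typed elemental compositions, are our own choices for this example and not part of any tool described in this thesis. The second function checks the law of mass conservation by verifying that the net change of every element is zero in every reaction.

```python
# Reactions of equation 3, written as {chemical: stoichiometric coefficient};
# negative = consumed, positive = produced.
reactions = {
    "Neutralization": {"HCl": -1, "NaHCO3": -1, "NaCl": 1, "H2CO3": 1},
    "Anhydration": {"H2CO3": -1, "H2O": 1, "CO2": 1},
}
chemicals = ["HCl", "NaHCO3", "NaCl", "H2CO3", "H2O", "CO2"]


def stoichiometric_matrix(reactions, chemicals):
    """Rows follow `chemicals`, columns follow the order of `reactions`."""
    return [[coeffs.get(chem, 0) for coeffs in reactions.values()]
            for chem in chemicals]


# Elemental composition of each chemical (typed in by hand for this example).
composition = {
    "HCl": {"H": 1, "Cl": 1},
    "NaHCO3": {"Na": 1, "H": 1, "C": 1, "O": 3},
    "NaCl": {"Na": 1, "Cl": 1},
    "H2CO3": {"H": 2, "C": 1, "O": 3},
    "H2O": {"H": 2, "O": 1},
    "CO2": {"C": 1, "O": 2},
}


def atom_balance(S, chemicals, composition):
    """Net change of each element per reaction; all zeros means mass is conserved."""
    elements = sorted({e for comp in composition.values() for e in comp})
    n_reactions = len(S[0])
    return {
        e: [sum(composition[chem].get(e, 0) * S[i][j]
                for i, chem in enumerate(chemicals))
            for j in range(n_reactions)]
        for e in elements
    }


S = stoichiometric_matrix(reactions, chemicals)
for name, row in zip(chemicals, S):
    print(f"{name:7s}", row)
print(atom_balance(S, chemicals, composition))
```

The printed rows reproduce the matrix of equation 4, and every element has a net change of zero in both reactions.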

The stoichiometric matrix is also a fundamental representation for dynamic modeling of the metabolic process. Let us consider the simple chemical transformation:

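\[
\xrightarrow{\;v_x\;} x \xrightarrow{\;v_1\;} y \xrightarrow{\;v_y\;}
\qquad\text{or}\qquad
\begin{aligned}
 &\longrightarrow x && \text{(in)} \\
x &\longrightarrow y && (r_1) \\
y &\longrightarrow && \text{(out)}
\end{aligned}
\tag{5}
\]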

We first see that some reactions have a chemical on only one side of the equation, e.g. in and out in equation 5. Those reactions act as input and/or output of the system. In this trivial system the chemical x enters at rate vx and is transformed at rate v1 by reaction r1 into chemical y. The chemical y exits the system at rate vy. The input and output rates are called the exchange fluxes and the internal rates (like v1) are called internal fluxes. The mass balance of x and y can be written with the following system of differential equations:

\[
\begin{aligned}
\frac{dx}{dt} &= v_x(t) - v_1(t) \\
\frac{dy}{dt} &= v_1(t) - v_y(t)
\end{aligned}
\tag{6}
\]

where dx/dt and the rates vi can both be expressed in mol · L⁻¹ · s⁻¹. This system states that the change of concentration of the chemical x is equal to the rate of production of x minus the rate of consumption of x in the system. Those rates are respectively described by vx(t) and v1(t). The same applies to the chemical y. This system can be rewritten in matrix form:

\[
\begin{pmatrix} \dfrac{dx}{dt} \\[1ex] \dfrac{dy}{dt} \end{pmatrix}
=
\begin{pmatrix} -1 & 1 & 0 \\ 1 & 0 & -1 \end{pmatrix}
\begin{pmatrix} v_1(t) \\ v_x(t) \\ v_y(t) \end{pmatrix}.
\tag{7}
\]
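The matrix form of equation 7 can be evaluated numerically to see when a chemical accumulates. The short sketch below does so for two hypothetical constant rate vectors; the function name `mass_balance` and the numerical rates are assumptions made only for this illustration.

```python
# Stoichiometric matrix of equation 7: rows are the chemicals (x, y),
# columns are the rates in the order (v1, vx, vy).
S = [[-1, 1, 0],
     [1, 0, -1]]


def mass_balance(S, v):
    """Return dc/dt = S · v as a list, one entry per chemical."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]


# Hypothetical constant rates (arbitrary units), chosen only for illustration.
balanced = [1.0, 1.0, 1.0]    # v1 = vx = vy: nothing accumulates
unbalanced = [1.0, 2.0, 1.0]  # x enters faster than it is transformed

print(mass_balance(S, balanced))
print(mass_balance(S, unbalanced))
```

With equal rates both derivatives vanish, whereas with vx = 2 and v1 = 1 the chemical x accumulates (dx/dt = 1) while y stays constant.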


In its general form, the left-hand side of equation 7 is a vector in $\mathbb{R}^m$ representing the concentrations of the m chemicals or metabolites, and the right-hand side is the product of the stoichiometric matrix in $\mathbb{R}^{m \times n}$ and the vector of reaction rates for the n reactions in the chemical system. Thus the system can be expressed with the stoichiometric matrix S:

\[
\frac{dc(t)}{dt} = S \cdot v(t)
\tag{8}
\]

where c(t) is the vector of metabolite concentrations and v(t) is the vector of reaction rates. When the cell functions under a specific genetic and biochemical background and no event changing this program occurs (like a stress response or cell division), we can suppose that the cell will neither accumulate nor deplete chemicals. This is because, compared to regulatory events in the cell, the metabolism involves fast reactions. This difference in time scale allows us to assume a quasi-steady state, and when applied to equation 8 the problem becomes:

\[
0 = S \cdot r
\tag{9}
\]

which is called the metabolite balancing equation. In real biological networks the number of reactions n is much larger than the number of metabolites m, hence the system is underdetermined and there is an infinite number of solutions r satisfying equation 9. Obviously the trivial solution r = 0 is of no interest as there is no net flow of matter. All solutions of equation 9 are part of a vector space called the null-space or kernel of S:

\[
\operatorname{null}(S) = \ker(S) = \{ x \in \mathbb{R}^n : Sx = 0 \}.
\tag{10}
\]

The rank-nullity theorem states that the dimension of the null-space (or nullity) is:

\[
\operatorname{nullity}(S) = n - \operatorname{rank}(S)
\tag{11}
\]

where rank(S) is the size of the largest collection of linearly independent columns of S. So the kernel can be described by an n × k matrix K whose columns form a basis of the null-space, where k = n − rank(S). Hence any flux vector r satisfying equation 9 can be expressed as a linear combination of the column vectors of K:

\[
r = \sum_{i=1}^{k} K_i a_i, \qquad a \in \mathbb{R}^k,
\tag{12}
\]

where Ki is the ith column of K. In addition to the quasi-steady state (0 = S · r), thermodynamic conditions constrain the flux rates of internal reactions to non-negative values:

\[
0 \le r_i.
\tag{13}
\]
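To make the kernel of equations 10 to 12 concrete, the sketch below computes a basis of the null-space by Gaussian elimination with exact fractions and applies it to the small matrix of equation 7. It is a plain standard-library illustration, not the method used for large networks; the function name `null_space` and its return convention (basis vectors and rank) are our own assumptions.

```python
from fractions import Fraction


def null_space(S):
    """Return (basis, rank): basis vectors of {r : S r = 0} and the rank of S."""
    rows = [[Fraction(x) for x in row] for row in S]
    n = len(rows[0])
    pivots = []  # pivot column of each reduced row
    r = 0
    for col in range(n):
        # Find a row at or below r with a non-zero entry in this column.
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        rows[r] = [x / rows[r][col] for x in rows[r]]
        # Eliminate this column from every other row (reduced row echelon form).
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivots.append(col)
        r += 1
        if r == len(rows):
            break
    # One basis vector per free (non-pivot) column.
    basis = []
    for free in (c for c in range(n) if c not in pivots):
        vec = [Fraction(0)] * n
        vec[free] = Fraction(1)
        for row_idx, p in enumerate(pivots):
            vec[p] = -rows[row_idx][free]
        basis.append(vec)
    return basis, len(pivots)


# Matrix of equation 7, columns in the order (v1, vx, vy).
S = [[-1, 1, 0],
     [1, 0, -1]]
K, rank = null_space(S)
print(rank, [[int(x) for x in vec] for vec in K])
```

For this matrix the rank is 2, so the nullity is 3 − 2 = 1 and the kernel is spanned by the single vector (1, 1, 1): at steady state the three rates v1, vx and vy must be equal.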

It should be noted that every reversible reaction can be split into two irreversible reactions, one being the opposite of the other. For example:

r1 : 2 a ←−→ 1 b

is split into:

r1′ : 2 a −−→ 1 b
r1″ : 1 b −−→ 2 a


Figure 11: Example of high dimensional cone in the fluxes-space for three reactions r1, r2 and r3. The pi are called the extreme rays and are sufficient to define the shape of the cone.

where r1′ (forward) and r1″ (backward) are the irreversible counterparts of the reversible reaction r1. Hence the constraint in equation 13 is always valid. For exchange fluxes, the constraints depend on whether it is a source, a sink (i.e. production or consumption) or both for a given metabolite. These constraints can be described as:

\[
\alpha_j \le r_j \le \beta_j.
\tag{14}
\]

If it is a source αj is set to negative infinity and βj to zero:

\[
-\infty \le r_j \le 0,
\tag{15}
\]

if it is a sink αj is set to zero and βj to positive infinity:

\[
0 \le r_j \le \infty,
\tag{16}
\]

and if it is both, it is left unconstrained:

\[
-\infty \le r_j \le \infty.
\tag{17}
\]
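The splitting of reversible reactions and the exchange-flux bounds of equations 14 to 17 can be sketched as follows. The helper names (`split_reversible`, `EXCHANGE_BOUNDS`, `within_bounds`) and the naming of the split columns are hypothetical and serve only to illustrate the conventions stated above.

```python
import math


def split_reversible(S, reversible):
    """Replace each reversible column by a forward and a backward copy.

    `reversible[j]` tells whether reaction j (column j of S) is reversible.
    Returns the new matrix and the names of its columns.
    """
    m, n = len(S), len(S[0])
    columns, names = [], []
    for j in range(n):
        col = [S[i][j] for i in range(m)]
        if reversible[j]:
            columns.append(col)                # forward reaction
            names.append(f"r{j + 1}'")
            columns.append([-x for x in col])  # backward reaction
            names.append(f"r{j + 1}''")
        else:
            columns.append(col)
            names.append(f"r{j + 1}")
    S_split = [[col[i] for col in columns] for i in range(m)]
    return S_split, names


# Bounds (alpha_j, beta_j) of equations 15 to 17, using the sign convention
# of the text: a source has a non-positive rate, a sink a non-negative one.
EXCHANGE_BOUNDS = {
    "source": (-math.inf, 0.0),
    "sink": (0.0, math.inf),
    "both": (-math.inf, math.inf),
}


def within_bounds(rate, kind):
    """Check alpha_j <= r_j <= beta_j for an exchange flux of the given kind."""
    alpha, beta = EXCHANGE_BOUNDS[kind]
    return alpha <= rate <= beta


# The example r1 : 2 a <-> 1 b, with rows (a, b).
S = [[-2],
     [1]]
S_split, names = split_reversible(S, [True])
print(names, S_split)
```

After the split every internal rate can be required to be non-negative, since a net backward flux through r1 is carried by the backward copy of the reaction.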

With those constraints the set of rates or fluxes is defined by the following convex polyhedral cone:

\[
R = \{ r \in \mathbb{R}^n : 0 = S \cdot r,\; 0 \le B r \}
\tag{18}
\]

corresponding to the intersection of the null space of S and the positive closed half-spaces of the reactions. This set describes the flux distributions under which a system can operate and thus represents the capabilities of a network. An illustration of such a set of rates (i.e. the polyhedral cone) in $\mathbb{R}^3$ is shown in figure 11.


2.1.1 Metabolic pathways

The flux distributions in equation 18 describe any steady state in which a given biological system can operate. However, to analyze a metabolic network we are not interested in the entire cone, but rather in its minimal functional pathways. A pathway may be described as a route of connected reactions in the network from particular starting substrates to given products. More formally, it can be represented by a vector p in $\mathbb{R}^n$ that lies in the flux cone (see the pi in figure 11) and whose positive entries correspond to the reactions used in the pathway, the other entries being zero. Such a pathway is said to be minimal if it is non-decomposable, that is, if it cannot be formed by a linear combination of other pathways. If the vector p satisfies the following conditions:

1. the quasi-steady state: S · p = 0;
2. all rates are in the forward direction: pi ≥ 0;
3. there exist no vectors vi such that S · vi = 0 and $p = \sum_{i}^{k} \alpha_i v_i$,

the vector is called an elementary flux mode. The last condition is called non-decomposability or systemic independence. Moreover, if p also satisfies the following conditions:

- all reversible reactions are split into two irreversible reactions, thus augmenting the null-space dimension;
- it is impossible to write p as a positive linear combination of other pathways,

p is called an extreme pathway. Those extreme pathways can be arranged in a matrix P whose ith column is the ith extreme pathway pi. Both concepts are very close to each other [Papin et al., 2004b, Klamt and Stelling, 2003]. The extreme pathways can be divided into three categories:

- Type I extreme pathways represent the primary metabolic pathways and describe the conversion from input exchange fluxes to output exchange fluxes through intermediary products.
- Type II extreme pathways are similar to type I but only describe the conversion of the so-called currency metabolites. Currency metabolites (like H2O, ATP or NADH) form an artificial set of metabolites that are present in so many reactions that they are considered to be part of an infinite pool to which the cell has access. Those pathways are of no interest for the analysis because we do not want to study exchanges in the pool.
- Type III extreme pathways represent reversible cycles that are mainly due to reversible reactions. These pathways show no activity in the exchange fluxes.

Thus we are mainly interested in studying the capabilities of the system through its type I extreme pathways. Therefore type II and type III pathways are discarded from the matrix P. The remaining pi are pathways in the system that represent routes in the metabolic network. For example, the system in figure 12 (a) is composed of two extreme pathways, $p_1 = (1, 0, 1, -1, 0)^t$ and $p_2 = (0, 1, 0, 1, -1)^t$, assuming that the flux order is (v1, v2, b1, b2, b3) (figure 12 (b)). Those two extreme pathways can be linearly combined to produce any other, more complex steady-state route (figure 12 (c)). The combination of extreme pathways is a combination of the extreme rays of the flux cone that lies in the flux cone, like the one illustrated in figure 11 (the cone in figure 11 is not the cone generated by the extreme pathways of the system in figure 12).
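
As a small worked instance of such a combination, taking the two vectors and the flux order (v1, v2, b1, b2, b3) exactly as stated above, the sum of the two extreme pathways is:

\[
q = p_1 + p_2 = (1, 0, 1, -1, 0)^t + (0, 1, 0, 1, -1)^t = (1, 1, 1, 0, -1)^t .
\]

The entries for b2 cancel, so in q the exchange flux b2 carries no net flux while both internal reactions v1 and v2 are active. More generally, since S · p1 = 0 and S · p2 = 0, any combination α1 p1 + α2 p2 with α1, α2 ≥ 0 again satisfies the quasi-steady state.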


Figure 12: Example of a simple system (a) and its two extreme pathways p1 and p2 (b). An example of a linear combination of the extreme pathways, q = p1 + p2 (c). The combined pathway q stays in the flux cone, such as the one described in figure 11.


Computing extreme pathways amounts to finding the extreme rays of a polyhedral cone given its constraint representation. An extreme ray enumeration algorithm takes the constraint representation of the cone:

\[
R = \{ r \in \mathbb{R}^n : 0 = S \cdot r,\; 0 \le B r \}
\tag{19}
\]

and converts it into the form:

\[
R = \{ r \in \mathbb{R}^n : r = P \cdot c,\; 0 \le c \},
\tag{20}
\]

where the columns of the generating matrix P form the extreme rays. The double description method is such an algorithm [Fukuda and Prodon, 1996]. The two most relevant variants of the double description method [Jevremovic et al., 2008, Terzer, 2009] are the canonical basis approach [Schuster et al., 2002a] and the nullspace approach [Urbanczik and Wagner, 2005b, Gagneur and Klamt, 2004, Wagner, 2004]. Unfortunately the complexity of the double description method is poorly understood, and the time complexity of extreme ray enumeration is still an open question in computational geometry [Terzer, 2009]. Indeed, the number of intermediary rays during the computation can grow exponentially compared to the final number of extreme rays. Moreover, the algorithm is known to be very sensitive to constraint ordering [Fukuda and Prodon, 1996, Terzer and Stelling, 2008]. Even with the improvements that have been made in this field (see also Terzer and Stelling [2008], Lee et al. [2004]), the computation for a genome-scale network is still intractable and the size of the pathway set grows exponentially with the network size [Terzer et al., 2009]. Moreover, the number of extreme pathways in a metabolic network can so far only be estimated [Yeung et al., 2007]. In their paper the authors proposed an estimate of the number of extreme pathways which falls within a factor of 10 of the true number. They found that the number of extreme pathways is related to the degree and the clustering coefficient of the active reactions in the network. To assess the quality of their estimate they were obviously limited by the availability of metabolic networks whose extreme pathways could be computed. Nevertheless, their estimates for large networks seem to corroborate previous estimates for the E. coli and H. sapiens networks.

Metabolic pathways are often used to simplify the description and analysis of metabolic networks. Indeed, the metabolism is usually not studied or taught as a whole, but rather as separate metabolic pathways, such as glycolysis, the Krebs cycle or the urea cycle. Even if pathway computation on large metabolic networks is not yet achievable, extreme pathway analysis has been successfully applied to simpler networks to assess the potential of pathway analysis. For example, pathway computations for H. influenzae [Papin et al., 2002] and H. pylori [Price et al., 2002] were conducted and allowed the study of pathway redundancy and of the number of internal states in amino acid production. They also allowed the discovery of system-wide properties in those organisms, thus providing a better understanding of the way they function. Singular value decomposition has also been applied to the extreme pathway matrix of the human red blood cell [Price et al., 2003]. This made it possible to define eigenpathways that correspond to metabolic splits regulated in the cell.


Figure 13: Example of a directed (a) and undirected graph (b). Metabolic networks are usually represented as directed graphs.

2.2 Graph modeling

Another convenient tool to represent a metabolic network is the graph. We briefly recall that a graph G is a mathematical object composed of two sets V and E. Some pairs of objects in V are connected by a link, and those pairs compose the set E. An object in V is called a vertex and a pair in E is called an edge. The edges can be directed, which means that the order in the pair is important. A directed edge (u, v) means that the vertex u is linked to the vertex v, but not that v is linked to u. For an undirected edge we say that u and v are linked together. The graph is a very convenient object to represent a complex system, either graphically or by a matrix called the adjacency matrix. Figure 13 represents two graphs, one directed and one undirected, with their adjacency matrices. A metabolic network can be represented by a graph in the following manner: the vertices are the reactions and the chemicals. An edge only links a reaction to a chemical or a chemical to a reaction. The graph is directed: if a reaction r transforms the chemical ci into the chemical cj, there is an edge from ci to r and an edge from r to cj. Such a graph is called a bipartite graph, because the vertex set V can be divided into two subsets R and C such that V = R ∪ C and R ∩ C = ∅, with R the subset of reactions and C the subset of chemicals, and an edge always links an element of C to an element of R, or the opposite. As the graph is directed we speak of a bidigraph (bipartite directed graph). For example, consider the following stoichiometric matrix S:

\[
S =
\begin{array}{l|rrr}
 & r_1 & r_2 & r_3 \\ \hline
a & -2 & 0 & 0 \\
b & 1 & -2 & 0 \\
c & 3 & -1 & 0 \\
d & 0 & 1 & -1 \\
e & 0 & 0 & 2
\end{array}
\tag{21}
\]

It represents a chemical system that can be modeled by a graph. The set of vertices is the union of the set of chemicals C = {a, b, c, d, e} and the set of reactions R = {r1, r2, r3}. The set of edges are pairs composed of a chemical and a reaction: E = {(a, r1), (r1, b), (r1, c), (b, r2), (c, r2), (r2, d), (d, r3), (r3, e)}. The graph corresponding to the matrix 21 is illustrated in figure 14 along with its adjacency matrix.
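The passage from the stoichiometric matrix 21 to the edge set E can be sketched in Python as follows. This is only an illustration with a hypothetical function name (`bipartite_edges`); it follows the rule stated above: a negative coefficient gives an edge from the chemical to the reaction, a positive one an edge from the reaction to the chemical.

```python
chemicals = ["a", "b", "c", "d", "e"]
reactions = ["r1", "r2", "r3"]

# Stoichiometric matrix of equation 21 (rows: chemicals, columns: reactions).
S = [[-2, 0, 0],
     [1, -2, 0],
     [3, -1, 0],
     [0, 1, -1],
     [0, 0, 2]]


def bipartite_edges(S, chemicals, reactions):
    """Directed edges of the bipartite graph encoded by S."""
    edges = []
    for j, reaction in enumerate(reactions):
        for i, chem in enumerate(chemicals):
            if S[i][j] < 0:      # chemical consumed by the reaction
                edges.append((chem, reaction))
            elif S[i][j] > 0:    # chemical produced by the reaction
                edges.append((reaction, chem))
    return edges


E = bipartite_edges(S, chemicals, reactions)
print(E)
```

Note that this representation drops the stoichiometric coefficients; if they are needed, they could be kept as edge weights.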


Figure 14: The directed bipartite graph corresponding to the stoichiometric matrix of equation 21, in its adjacency matrix and graphical representations.

It should also be noted that sometimes unipartite versions of the bipartite network are used for analysis. The unipartite version of a bipartite graph is a graph in which only one of the two vertex subsets is kept (e.g. only the compounds in a metabolic network), and those vertices are linked by an edge if they share a common neighbor in the bipartite graph. However, this may produce absurd networks, as it can show inaccurate relations between the objects, as illustrated in figure 15. Indeed, in this figure, the compound network states that the chemical b can be converted into the chemical c. However, this is inaccurate because to produce the chemical d, both the chemicals b and c are needed. Such graphs have been studied either to characterize system-wide properties or to try to discover important components in the network [Barabasi and Oltvai, 2004]. The analyses range from the study of graph properties (e.g. scale-free structure) [Jeong et al., 2000, Strogatz, 2001], the distribution of graph measures (e.g. degree or centrality) and the definition and characterization of metabolites (e.g. as hubs, peripheral or target vertices) [Guimera and Amaral, 2005, Guimerà et al., 2007] to graph clustering [Schuster et al., 2002b, Gagneur et al., 2003, Ravasz et al., 2002]. However, several studies on the topological analysis of metabolic networks should be taken with great care. Indeed, in their review, Lima-Mendez and van Helden pointed out flaws in some publications. They question the claims that metabolic networks follow a power-law distribution, are scale-free [Khanin and Wit, 2006], are small-world (also in Arita [2004] for E. coli), are tolerant to random attacks and are vulnerable to targeted attacks. The preferential attachment model of network growth is also questioned [Lima-Mendez and van Helden, 2009].
Even if those studies produce interesting results that allow a better understanding of the organization of such networks, those graph analytical tools lack relevance regarding the analysis of pathways [Terzer and Stelling, 2008]. Indeed, paths in a substrate-level network do not always correspond to pathways [Arita, 2004].
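
The unipartite projection discussed in this section can be sketched as follows, here applied to the edge set E of the matrix 21 rather than to the networks of figure 15. The function name `compound_projection` is hypothetical, and the projection is taken as undirected, following the definition given above (two compounds are linked if they share a neighboring reaction).

```python
from itertools import combinations

reactions = ["r1", "r2", "r3"]

# Edge set E of the bipartite graph of the matrix 21.
edges = [("a", "r1"), ("r1", "b"), ("r1", "c"), ("b", "r2"),
         ("c", "r2"), ("r2", "d"), ("d", "r3"), ("r3", "e")]


def compound_projection(edges, reactions):
    """Undirected compound network: two compounds are linked when they
    are both neighbors of the same reaction in the bipartite graph."""
    neighbors = {r: set() for r in reactions}
    for u, v in edges:
        if u in neighbors:       # edge reaction -> compound
            neighbors[u].add(v)
        else:                    # edge compound -> reaction
            neighbors[v].add(u)
    links = set()
    for compounds in neighbors.values():
        for u, v in combinations(sorted(compounds), 2):
            links.add((u, v))
    return sorted(links)


print(compound_projection(edges, reactions))
```

The projection contains a link between b and c although no reaction converts one into the other: they are merely produced together by r1 and consumed together by r2. This is the kind of misleading relation described above.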

2.3 Metabolic network reconstruction

A metabolic network can be seen as the set of biochemical reactions that may occur in a given organism at some time. Hence, to reconstruct a metabolic network, one needs to gather the information about all the biochemical reactions present. There are several ways to obtain that information, and they are summarized in the following list (taken from Bernhard [2006]):

1. Biochemists can isolate an enzyme from an organism and demonstrate its function. Such information is not available for all organisms. For example, E. coli has been extensively studied, but this is not the case for all organisms.


[Figure 15 diagram. Panels: bipartite network A, reactions network A, compounds network, bipartite network B, reactions network B.]

Figure 15: Transformation of a bipartite metabolic network (bipartite network A) into unipartite reactions (reactions network A) and compounds networks. This leads to an inaccurate picture of the metabolism. For example, in the compound network the chemical b can be converted into chemical c. However, this is inaccurate because to produce the chemical d, the chemicals b and c are needed. In the reaction network the reaction r1 is linked to the reaction r2. Again this is an inaccurate picture because r2 cannot produce its chemical (i.e. d) without the product of r4 (i.e. b). Moreover the unipartite representation is ambiguous, as the bipartite network B produces the same compounds network as the bipartite network A. However, it produces a different reactions network (reactions network B).


2. The function of open reading frames can be assigned based on DNA or protein sequence homology. This is also a strong evidence that reaction is present in an organism. 3. The are also physiological evidences that the cell is able to produce some chemical components in order to achieve an observed function. Hence some unobserved reaction must be present to fulfill the function. The process is called gap analysis. This list is ordered from the strongest to the weakest evidence level. Several databases are available to fetch data that allow the reconstruction of metabolic network. The most complete at the beginning of this thesis was the Kyoto Encyclopedia of Genes and Genomes (KEGG) [Kanehisa and Goto, 2000]. It is a knowledge base of gene functions for all the completely sequenced genomes and some partial genomes. There is also functional information is stored as pathways which regroup cellular processes such as metabolism or signal transduction. This for a variety of organisms. The BRaunschweig ENzyme DAtabase (BRENDA) is database of enzymes and metabolic information based on primary literature [Schomburg et al., 2004]. EcoCyc which is a specific organism database built on the genome sequence of E. coli K-12 MG1655. The functions of individual gene products is based on information found in the experimental literature [Keseler et al., 2013]. We can also cite reactome which is curated and peer reviewed pathway database that provides intuitive tools for visualization, interpretation and analysis of pathway knowledge [Croft et al., 2014, Milacic et al., 2012]. Lastly there is also the knowledge base of Biochemically, Genetically and Genomically struc- tured genome-scale metabolic network reconstructions (BiGG 1). It integrates several published genome-scale metabolic networks into one resource with standard nomenclature which allows components to be compared across different organisms [Schellenberger et al., 2010]. 
This list is obviously not exhaustive: a dozen or so other databases exist for pathways, enzymes, proteins and specific organisms. However many of these databases link to each other and share information. A more complete list is provided in chapter 3 of Bernhard [2006].

For this work we mainly rely on KEGG. At the beginning of this study, KEGG was freely available and it was possible to fetch the complete database (as several files in XML format). This was very convenient because it allowed us to easily fetch new versions of the database and obtain the most recent network for any of the organisms it contains. So we wrote a tool that parses the database and creates a graph representing one of the KEGG metabolic networks. The process was the following. First the chemical compounds, reactions, enzymes and genes files, and a list of pathways files were fetched. A reaction was described by the transformed chemicals (their ids) and the stoichiometry. From a chemical id it was possible to fetch the details of the compound (e.g. formula, name, mass). The EC number of the enzyme that catalyzes the reaction was also specified, which allowed us to find the enzyme in the enzymes file and get the list of ids of the genes whose products form the enzyme. Then, in the genes file, the gene id gave access to information on the gene, such as its name and synonyms. With all these, a complete description of the reaction was available. However it should be noted that the genes were not provided as a rule. For example, let's assume that an enzyme requires two different subunits to function and that those subunits are encoded by two genes: g1 and g2. In KEGG it is not specified that the enzyme requires the two genes to function (i.e. like a logical and). It is stored in the same way as an enzyme that is encoded by two different genes, each gene product catalyzing the reaction (i.e. like a logical or). Finally a pathway is a set of reactions, and the reversibility information of the reactions is stored in the pathway. The set of all pathways of an organism composes the metabolic network of the organism. Figure 16 provides a schematic view of the files and how they are linked (see footnote 2). This is valid for any organism in KEGG. It should also be noted that the compounds, enzymes and reactions XML files are not organism specific; only the pathways and genes files are.

[Figure 16 diagram. Boxes and their fields: Pathway (Reaction ID, Reversibility); Reaction (Reaction ID, Compound ID, Stoichiometry, Enzyme EC); Enzyme (Enzyme EC, Gene ID); Compound (Compound ID, Formula); Gene (Gene ID, Name, Synonyms).]

Figure 16: A schematic view of the XML files and how they are linked. An arrow indicates in which file the information is fetched. The number indicates the cardinality. For example, an enzyme in the enzymes file can contain n genes but each gene id corresponds to one gene in the genes file. This representation is somewhat similar to a relational database, where the XML files play the role of tables. However this figure does not represent a formal table relationship diagram.

1. At the time of writing of this manuscript, BiGG 2 was a beta release.

2. For the reader familiar with relational databases, the figure does not represent a formal table relationship diagram.
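The links between the files described above can be sketched with a few record types. This is only an illustration: the class names, field names and sample identifiers below are hypothetical and reproduce neither the actual KEGG XML schema nor the tool written for this thesis. The sketch follows the chain reaction, enzyme, gene, and shows that only a flat list of gene ids is available, without any and/or rule.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    gene_id: str
    name: str
    synonyms: list = field(default_factory=list)


@dataclass
class Enzyme:
    ec: str
    gene_ids: list  # flat list only: no and/or rule between the genes


@dataclass
class Reaction:
    reaction_id: str
    stoichiometry: dict  # compound id -> coefficient (<0 substrate, >0 product)
    enzyme_ecs: list


def describe_reaction(reaction, enzymes, genes):
    """Follow the reaction -> enzyme -> gene links and collect the gene names."""
    names = []
    for ec in reaction.enzyme_ecs:
        for gene_id in enzymes[ec].gene_ids:
            names.append(genes[gene_id].name)
    return names


# Hypothetical toy data: one enzyme whose two subunits are encoded by g1 and g2.
genes = {"g1": Gene("g1", "subunitA"), "g2": Gene("g2", "subunitB")}
enzymes = {"EC_X": Enzyme("EC_X", ["g1", "g2"])}
reaction = Reaction("r1", {"a": -1, "b": 2}, ["EC_X"])

gene_names = describe_reaction(reaction, enzymes, genes)
print(gene_names)
```

From the returned list alone it is impossible to tell whether both gene products are required or whether either one suffices, which is the limitation noted in the text.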

3 HIERARCHICAL COMPUTATION OF EXTREME PATHWAYS

Enumerating the extreme rays of the flux cone as defined by the extreme pathways problem is a hard task. Currently the best approaches are based on the double description method [Motzkin et al., 1953, Fukuda and Prodon, 1996] and for now there exists no better approach to solve this problem, although improvements have been made for the computation of elementary modes [Wagner, 2004, Urbanczik and Wagner, 2005b, Terzer and Stelling, 2006, 2008], which is a similar concept. In this chapter we describe a methodology to compute extreme pathways without tackling the ray enumeration problem itself. The idea is to preprocess the network in order to simplify the enumeration problem. This is done by building a metabolic network with fewer reactions and metabolites whose extreme pathways are equivalent to those of the unprocessed metabolic network. The goal is to remove as much data as possible to improve the computation time of the extreme pathways. The preprocessing proposed in this chapter is called network packing. The computation of the extreme pathways of a packed network followed by the reconstruction of the solution of the unprocessed network is called extreme pathways unpacking.

Preprocessing a metabolic network in order to simplify the computation of generating sets is not new. Indeed the computation of the generating sets (extreme pathways or elementary modes) of a genome-scale network is not feasible in a reasonable time. Hence strategies to reduce the computation time have been developed to allow pathway computation on large networks. Two approaches are used to ease the computation: the identification of reactions having null rates before the computation, and the division of the network into independent subnetworks or modules. The identification of reactions having null rates is already well documented in the case of elementary modes [Gagneur and Klamt, 2004] and is applicable to extreme pathways. It is briefly explained at the beginning of the chapter.

The division of the network has been tackled several times. For the computation of extreme pathways it has been introduced and applied in [Schilling et al., 2000, Price et al., 2002], and it has been studied several times for elementary modes [Schuster et al., 2002b, Dandekar et al., 2003, Wagner and Urbanczik, 2005, Urbanczik and Wagner, 2005a], which is a closely related concept. Another similar concept is that of elementary flux patterns, but it only indicates which reactions of a subnetwork are part of an elementary mode of the whole network, without providing the rates (or a multiple of the rates) [Kaleta et al., 2009]. However, as stated by Kaleta et al., in many studies of elementary modes this information is sufficient [Nuno et al., 1997, Gagneur and Klamt, 2004]. Another approach is a method that classifies metabolites as external with the objective of minimizing the number of elementary modes [Dandekar et al., 2003]. Two main approaches derive from the use of subnetworks. The first one is to find a way to compute all the elementary modes from the subnetworks, and the second one focuses on the analysis of the subnetworks (or modules) in the context of the whole system (that is to say, studying pathways that are also part of the whole system).


However it seems that a hierarchical computation of the extreme pathways has not been studied and described in detail (implementation, formalization and experimentation). Hence the goal of this chapter is to fill this gap. The chapter starts with an overview of the approach, followed by a quick review of network simplification. Then meta-reactions are introduced; they are the main concept used to pack and unpack a metabolic network. Next a method to compute the extreme pathways in a hierarchical manner is detailed, along with a mathematical description of the algorithm in terms of matrix operations. Finally the chapter ends with a discussion of the advantages and drawbacks of the algorithm.

3.1 overview of the approach

Our approach combines the following ideas: (i) splitting the metabolic network into subnetworks, (ii) solving each subnetwork as an independent system and (iii) combining the independent solutions in order to recover the solution of the original metabolic network (i.e. the network before the splitting). As stated in the introduction of this chapter, these ideas have already been tackled or suggested as potential improvements for the computation of extreme pathways or elementary modes. When a subnetwork is considered as an independent system, it can be represented as a black box transforming its exchange metabolites. Those exchange metabolites (the inputs and outputs) are defined by the extreme pathways of the subsystem. The idea is to completely replace a subnetwork by the minimum number of reactions transforming its inputs into its outputs. This produces a new metabolic network with some of its components abstracted into black boxes. This new metabolic network can also be subject to subnetwork replacement, which adds a new level of abstraction. In particular, a reaction that is already the black box of a given subnetwork can itself be part of a subnetwork to be replaced. The newly produced network can be seen as a packed (or compressed) version of the original network, and the result is a hierarchical metabolic network. Then the extreme pathways of the new network are computed, and with this packed solution it is possible to compute the extreme pathways of the original network (the one without packing) by combining the solutions of each level of the subsystems. Therefore the approach is named hierarchical extreme pathway computation.

3.2 simplifying metabolic networks

Before the network packing operation, the network is simplified by pruning reactions that cannot be part of an extreme pathway. This means that, in the extreme pathways matrix, the column representing such a reaction is always the zero vector. Some of those reactions are easily identifiable by graph topology analysis and others require the kernel matrix of the stoichiometry matrix (ker(S) = K). Regarding the kernel matrix, reactions corresponding to a zero vector in K can be removed from the metabolic network: they will never be part of any ray of the flux cone. The analysis of K does not identify all the useless reactions. Fortunately the remaining ones can be detected by a simple graph analysis of the network (or of its stoichiometry matrix).

(a) A source metabolite. (b) A sink metabolite. (c) A useless reverse reaction. (d) The loop dead-end.

Figure 17: Four cases of reactions which will never be part of an extreme pathway. A source and a sink metabolite are shown in figure 17a and figure 17b respectively (the metabolite a). In figure 17c the reverse reaction R3 is bound to a type III extreme pathway, as its product (i.e. a) is only transformed by reactions that produce its substrate (i.e. c). In figure 17d there is a loop dead-end, as the chemicals a and b are only consumed and produced by a reaction and its reverse. However this last case is a combination of a useless reverse and sinks (or accumulation) for the chemicals a and b.

With graph analysis, the most easily identifiable reactions to prune are those producing a chemical that is never consumed by any reaction (in graph theory such a vertex is named a sink vertex, figure 17b) and those using a substrate that is never produced by any reaction (a source vertex, figure 17a). In this case those chemicals are respectively only accumulated or spontaneously created by the system. This is clearly incompatible with any steady state, hence those reactions will never be part of any generating set. So the reactions producing or consuming those metabolites will certainly have null rates in the extreme pathways matrix.

Useless reverse reactions can also be present in a network, that is to say reverse reactions that only generate a type III extreme pathway. Reverse is used here in a more general meaning than for a reversible reaction. Indeed it encompasses a reaction creating a loop with more than one irreversible reaction (figure 17c). In this case the reaction is also removed from the network. Finally there is the case of a reversible reaction consuming and producing (with its reverse) sink chemicals (or source chemicals). However this case is a combination of a useless reverse and sinks (or accumulation) (figure 17d). It can be identified by applying the pruning iteratively. First one of the useless reverses is detected and removed, leaving a sink or a source. Then those metabolites are detected by a second application of the pruning. Ideally the pruning operations are applied until no more components (reactions or chemicals) are removed from the network.

This short description of the detection of useless reactions is almost on par with the definition of redundancies described in Gagneur and Klamt [2004]. The difference is the case of the uniquely produced or consumed chemicals. This case has not been implemented because we rather use meta-reactions (section 3.3 of the current chapter) in the packing operation (section 3.5 of the current chapter). To illustrate the efficiency of the pruning, it has been applied to the networks of H. sapiens and E. coli reconstructed from the KEGG database. The reductions obtained are summarized in table 8.
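The iterative removal of reactions attached to source or sink metabolites can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: all reactions are taken as irreversible, the network is a plain dictionary, and the function name `prune`, the toy network and the set `external` (compounds carried by exchange fluxes) are hypothetical. The kernel-based test and the useless reverse case are not covered.

```python
def prune(reactions, external):
    """Iteratively drop reactions touching a source or sink metabolite.

    reactions: dict name -> dict compound -> coefficient
               (negative = consumed, positive = produced); all irreversible.
    external:  set of compounds exchanged with the outside (never pruned).
    """
    reactions = dict(reactions)
    changed = True
    while changed:
        changed = False
        produced = {c for st in reactions.values() for c, k in st.items() if k > 0}
        consumed = {c for st in reactions.values() for c, k in st.items() if k < 0}
        for name, st in list(reactions.items()):
            for c, k in st.items():
                if c in external:
                    continue
                # sink: produced but never consumed; source: consumed but never produced
                if (k > 0 and c not in consumed) or (k < 0 and c not in produced):
                    del reactions[name]
                    changed = True
                    break
    return reactions


# Hypothetical toy network: a and d are exchanged with the outside.
network = {
    "r1": {"a": -1, "b": 1},
    "r2": {"b": -1, "d": 1},
    "r3": {"b": -1, "x": 1},  # x becomes a sink once r5 is removed
    "r4": {"y": -1, "b": 1},  # y is a source
    "r5": {"x": -1, "z": 1},  # z is a sink
}
pruned = prune(network, external={"a", "d"})
print(sorted(pruned))
```

In this toy case r4 and r5 are removed in the first pass and r3 only in the second one, which illustrates why the pruning has to be repeated until the network no longer changes.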

3.3 meta-reaction

Table 8: The results obtained by applying the simplification algorithm to the reconstructed networks of H. sapiens and E. coli. The reconstructed type indicates that the network is reconstructed from all the data available in the database (KEGG). The pruned type indicates that the simplification algorithm has been applied to the reconstructed network.

Organism     Type            Reactions   Fluxes   Compounds   Edges
H. sapiens   Reconstructed   1428        18       1079        5948
H. sapiens   Pruned          503         16       254         2074
E. coli      Reconstructed   2056        18       1541        8992
E. coli      Pruned          513         16       275         2121

An extreme pathway is a vector indicating at which rate each chemical reaction in the pathway converts its substrates into products while the concentrations of the chemicals remain steady over time. An extreme pathway can also be viewed as a single chemical process whose substrates and products are the compounds transported by the non-null exchange fluxes of the system, the substrates being those with negative rates and the products those with positive rates. The reactions with non-null internal rates are encapsulated, or embedded, in this new chemical reaction, hence the name meta-reaction. Such an abstraction is common in (bio)chemistry and a good example is the nucleophilic addition on a ketone to produce a tertiary alcohol:

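\[
\mathrm{R{-}MgBr} + \mathrm{R'C({=}O)R''} \longrightarrow \mathrm{RR'R''C{-}OMgBr} \xrightarrow{\text{acid hydrolysis}} \mathrm{RR'R''C{-}OH} \tag{22}
\]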

The previous equation shows a ketone (R'C(=O)R'') and a Grignard reagent (R–Mg–X, X being a halogen like Br) producing a tertiary alcohol (RR'R''COH). This reaction is known as the Grignard reaction. It is usually seen as a single process in which alkyl-, vinyl- or aryl-magnesium halides add to the carbonyl group of an aldehyde or ketone. However in the previous case, a study of the mechanism shows that there is a six-membered ring transition state:

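[Scheme (23): the same reaction drawn with its mechanism. The ketone and the Grignard reagent (R–MgBr) pass through a six-membered ring transition state involving Mg, Br, R and O, giving the O–MgBr intermediate that acid hydrolysis converts into the alcohol (HO–).] (23)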

Thus the two descriptions are equivalent views of the same process, but one summarizes the whole process by only considering the inputs and the outputs of the system. So a reaction whose stoichiometry is defined by the rates (or a multiple of the rates) of the exchange fluxes of an extreme pathway mimics the behavior of the system while remaining consistent with the steady state and mass conservation constraints of the whole system. A meta-reaction is therefore a black box having the same behavior as a subnetwork.

In this section we illustrate the construction of meta-reactions with three examples. The first one is a case of a single meta-reaction replacing a simple linear chain of reactions. It has the advantage of being very intuitive. The next ones illustrate the construction of several independent meta-reactions and the case of a unique meta-reaction mimicking the behavior of dependent reactions.

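This is illustrated in the first example and in figure 18 with the following chemical system:

\[
\begin{aligned}
F_a &: \text{(in)} \longrightarrow a \\
r_1 &: 1\,a \longrightarrow 2\,b \\
r_2 &: 1\,b \longrightarrow 4\,c \\
F_c &: c \longrightarrow \text{(out)}
\end{aligned}
\tag{24}
\]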

where we have a simple chain of two reactions r1 and r2. The only product (b) of the first reaction (r1) is the only substrate of the second reaction (r2). Such a simple system can be abstracted as a unique chemical reaction converting the chemical a into a chemical c:

\[
r_I : \alpha\,a \longrightarrow \gamma\,c \tag{25}
\]

where α and γ are the new stoichiometric coefficients to be defined. Obviously the stoichiometric coefficients of r1 for a and of r2 for c cannot be used directly, as this would produce the reaction 1 a → 4 c, which contradicts mass conservation: one mole of a implies the production of more than four moles of c, so mass would be lost with respect to the system. Indeed, from one mole of a the reaction r1 produces two moles of b, which the reaction r2 converts into eight moles of c. Hence the correct quantities summarizing the system of chemical equations are 1 a → 8 c (α = 1 and γ = 8 in the reaction 25). In this case the stoichiometry does not lose or create mass and is consistent with the system it encapsulates. Intuitively, if boundaries are added to the previous system and exchanges across those boundaries are allowed for a and c, the coefficients correspond to the external fluxes of the only extreme pathway of the system. Figure 18 illustrates the previous system of chemical equations (equations 24 and figure 18a) along with the network derived from the equations (figure 18c) and its extreme pathway (figure 18b). In conclusion a meta-reaction is defined by the external fluxes of one of the extreme pathways passing through the encapsulated reactions.
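As a worked check of these coefficients, the steady state of the chain can be written out explicitly. The symbols v1 and v2 (rates of r1 and r2) and f_a and f_c (rates of the exchange fluxes Fa and Fc) are introduced here for illustration, and the sign convention is an assumption chosen to agree with figure 18b: an exchange flux is counted positive when the compound leaves the system.

```latex
\begin{aligned}
\text{a:}\quad & -v_1 - f_a = 0 \\
\text{b:}\quad & 2\,v_1 - v_2 = 0 \\
\text{c:}\quad & 4\,v_2 - f_c = 0
\end{aligned}
\qquad\Longrightarrow\qquad
v_1 = 1,\quad v_2 = 2,\quad f_a = -1,\quad f_c = 8 .
```

The exchange rates (−1, 8) give the meta-reaction 1 a → 8 c, and the full vector (1, 2, −1, 8) is the extreme pathway p1 of figure 18b.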

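The second example is illustrated in figure 19 where two reactions share a common substrate (denoted by b):

\[
\begin{aligned}
F_a &: \text{(in)} \longrightarrow a \\
r_1 &: 2\,a \longrightarrow 1\,b \\
r_2 &: 2\,b \longrightarrow 1\,c \\
r_3 &: 1\,b \longrightarrow 2\,d \\
F_c &: c \longrightarrow \text{(out)} \\
F_d &: d \longrightarrow \text{(out)}
\end{aligned}
\tag{26}
\]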


(a) The system of chemical equations representing the system (without the exchange reactions defining the boundaries):

r1 : 1 a −−→ 2 b
r2 : 1 b −−→ 4 c

(b) The extreme pathway when the compounds a and c can respectively enter and exit the system:

\[
P = \bordermatrix{ & R_1 & R_2 & F_a & F_c \cr p_1 & 1 & 2 & -1 & 8 }
\]

(c) On the left a graphical representation of the network derived from the system of chemical equations (a) with a and c as input (Fa) or output (Fc). On the right the meta-reaction derived from the extreme pathway p1 computed from the network (b).

Figure 18: Derivation of a meta-reaction from the extreme pathways of a simple metabolic network.

(a) The system of chemical equations representing the system (without the exchange reactions defining the boundaries):

r1 : 2 a −−→ 1 b
r2 : 2 b −−→ 1 c
r3 : 1 b −−→ 2 d

(b) The extreme pathways when the compounds a, c and d can enter or exit the system:

\[
P = \bordermatrix{ & R_1 & R_2 & R_3 & F_a & F_c & F_d \cr p_1 & 2 & 1 & 0 & -4 & 1 & 0 \cr p_2 & 1 & 0 & 1 & -2 & 0 & 2 }
\]

(c) On the left a graphical representation of the network derived from the system of chemical equations with a, c and d as input (Fa) or output (Fc, Fd). On the right the meta-reactions derived from the extreme pathways computed from the network.

Figure 19: Derivation of two meta-reactions from the extreme pathways of a metabolic network based on a system of chemical equations.

An erroneous meta-reaction derived from this example would be the reaction

\[
\alpha\,a \longrightarrow \gamma\,c + \delta\,d \tag{27}
\]

but this meta-reaction is not equivalent to the system 26, as the system does not impose that the consumption of a implies the production of both c and d. Indeed the mass of a converted by r1 can be converted into c or d by r2 or r3 respectively. So two meta-reactions need to be derived to mimic the behavior of the system 26, and this is confirmed by the extreme pathways of the system (figure 19b) when a, c and d act as inputs or outputs of the system. Each extreme pathway corresponds to a meta-reaction and leads either to the production of c or to the production of d from a. In this case, if one or both meta-reactions convert a, whatever the rate is, the result stays consistent with the steady state and mass conservation of the original system or network. Figure 19 shows the system of chemical equations (figure 19a), the extreme pathways (figure 19b), and the network and meta-reactions (figure 19c) of this example.
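The two extreme pathways of figure 19b can be checked numerically against the system 26. The short sketch below only verifies the steady state of given rates; it does not compute extreme pathways. The dictionary layout and the function name `net_production` are illustrative assumptions.

```python
# Internal reactions of the system 26 (negative = consumed, positive = produced).
reactions = {
    "r1": {"a": -2, "b": 1},
    "r2": {"b": -2, "c": 1},
    "r3": {"b": -1, "d": 2},
}


def net_production(rates):
    """Net amount of each compound made (+) or used (-) by the internal reactions."""
    net = {}
    for name, rate in rates.items():
        for compound, coeff in reactions[name].items():
            net[compound] = net.get(compound, 0) + rate * coeff
    return net


# Internal rates of p1 and p2 as read from figure 19b.
p1 = net_production({"r1": 2, "r2": 1, "r3": 0})
p2 = net_production({"r1": 1, "r2": 0, "r3": 1})
print(p1)
print(p2)
```

In both cases the internal compound b is balanced (net 0), and the remaining entries match the exchange columns of figure 19b: (−4, 1, 0) for p1 and (−2, 0, 2) for p2, i.e. the meta-reactions 4 a → 1 c and 2 a → 2 d.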


(a) The system of chemical equations representing the system (without the exchange reactions defining the boundaries):

r1 : 2 a −−→ 2 b + 1 c
r2 : 1 b −−→ 2 d
r3 : 2 c −−→ 1 e

(b) The extreme pathway when the compounds a, d and e can enter or exit the system:

\[
P = \bordermatrix{ & R_1 & R_2 & R_3 & F_a & F_d & F_e \cr p_1 & 2 & 4 & 1 & -4 & 8 & 1 }
\]

(c) On the left a graphical representation of the network derived from the system of chemical equations with a, d and e as input (Fa) or output (Fd, Fe). On the right the meta-reaction derived from the extreme pathway computed from the network.

Figure 20: Derivation of a meta-reaction from the extreme pathway of a metabolic network based on a system of chemical equations.

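The last example illustrates the case where a meta-reaction of the form

\[
\alpha\,x \longrightarrow \beta\,y + \gamma\,z \tag{28}
\]

is created (figure 20a). In this case a reaction produces two compounds (denoted by b and c), each one acting as the substrate of another reaction:

\[
\begin{aligned}
F_a &: \text{(in)} \longrightarrow a \\
r_1 &: 2\,a \longrightarrow 2\,b + 1\,c \\
r_2 &: 1\,b \longrightarrow 2\,d \\
r_3 &: 2\,c \longrightarrow 1\,e \\
F_d &: d \longrightarrow \text{(out)} \\
F_e &: e \longrightarrow \text{(out)}
\end{aligned}
\tag{29}
\]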

This implies that if the reaction r1 converts a, it produces b and c, so the reactions consuming those chemicals have to convert them into d and e to stay consistent with the system 29. This is described by the extreme pathway (figure 20b), and therefore a meta-reaction converting a directly into d and e reflects the considered system (figure 20c):

\[
4\,a \longrightarrow 8\,d + 1\,e. \tag{30}
\]

To systematically build meta-reactions from a set of chemical reactions (and from a set of exchange reactions defining the boundaries of the system), the extreme pathways of the system must be computed. Then a meta-reaction is built for each extreme pathway. The non-null rates of the exchange fluxes are used as stoichiometric coefficients for the corresponding compounds. A negative rate corresponds to a substrate and a positive rate to a product of the meta-reaction. Hence if the considered system has n extreme pathways, n meta-reactions have to be created to respect the steady state and mass conservation of the compounds transformed by the system. It should be noted that n can be smaller than, equal to or greater than the number of reactions in the system.

Formally, to build meta-reactions from the stoichiometric matrix of a chemical system, the extreme pathways matrix P is computed. Then for each extreme pathway pi = (r1, r2, ..., rn | f1, ..., fm), where the ri are the internal reaction rates and the fj are the exchange fluxes, a new reaction is created having as substrates the compounds corresponding to the negative fj and as products those corresponding to the positive fj. The stoichiometry of the reaction is given by the absolute values of those fj. The new meta-reactions, along with their substrates and products, replace all the reactions corresponding to the rates ri in the extreme pathway. This representation of chemical reactions by extreme pathways allows the construction of reactions that are still consistent with the steady state of the system they encapsulate. This means that if extreme pathways are computed in a system containing meta-reactions, the computed steady states are consistent with the steady state of the system prior to the replacement of reactions by meta-reactions. In conclusion, a system or network in which meta-reactions replace subsets of reactions is equivalent to the original one as far as the computation of extreme pathways is concerned.
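The construction rule above can be sketched in a few lines. This is only an illustration of the sign rule, assuming the extreme pathways are already computed; the function names `meta_reaction` and `format_reaction` and the dictionary layout are hypothetical.

```python
def meta_reaction(exchange_rates):
    """Build (substrates, products) from the exchange-flux part of an extreme pathway.

    exchange_rates: dict compound -> rate of its exchange flux
                    (negative = substrate, positive = product, zero = ignored).
    """
    substrates = {c: -r for c, r in exchange_rates.items() if r < 0}
    products = {c: r for c, r in exchange_rates.items() if r > 0}
    return substrates, products


def format_reaction(substrates, products):
    """Render a meta-reaction as text, e.g. '4 a --> 8 d + 1 e'."""
    def side(coeffs):
        return " + ".join(f"{k} {c}" for c, k in sorted(coeffs.items()))
    return f"{side(substrates)} --> {side(products)}"


# Exchange fluxes of the extreme pathway of figure 20b (Fa, Fd, Fe).
third_example = format_reaction(*meta_reaction({"a": -4, "d": 8, "e": 1}))

# Exchange fluxes of the two extreme pathways of figure 19b (Fa, Fc, Fd).
second_example = [
    format_reaction(*meta_reaction(p))
    for p in ({"a": -4, "c": 1, "d": 0}, {"a": -2, "c": 0, "d": 2})
]
print(third_example)
print(second_example)
```

One meta-reaction is produced per extreme pathway, so the system 26 yields two meta-reactions while the system 29 yields the single reaction 30.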

3.4 metabolic subnetworks

In the previous section full systems were considered to build meta-reactions. However meta-reactions can also be used to replace just a subset of the chemical reactions of a metabolic network. To do this, the subsystem on which the extreme pathways are computed and used to build the meta-reactions must be extracted from the network. Such a subsystem is called in this thesis a metabolic subnetwork.

A metabolic subnetwork is almost a subgraph of the bipartite digraph representing a metabolic network. Let this bipartite digraph be M = {C, R, F, E}, where C, R and F are respectively the sets of chemical compounds, reactions and exchange fluxes. The union of these three sets is the set of vertices (V = C ∪ R ∪ F). The set of edges is denoted by E. A metabolic subnetwork is also a metabolic network M′ = {C′, R′, F′, E′} where C′ ⊆ C and R′ ⊆ R. The exchange fluxes F′ of the metabolic subnetwork are not the same as those of the metabolic network, hence the set E′ also differs. Therefore M′ is not a subgraph of M (as illustrated in figure 21).

A metabolic subnetwork is extracted from a set of reactions R′. Then all the products and substrates, i.e. the neighbors of R′, form the set of compounds:

\[
C' = \bigcup_{r_i \in R'} N_M(r_i), \tag{31}
\]

where N_M(ri) is the set of neighbors of the reaction ri in the graph M. The edges linking the compounds in C′ with the reactions in R′ in M are preserved in M′. To construct the set F′, every compound in C′ that is linked to a reaction in the graph M produces a new exchange reaction added to F′ in M′. Extra care must be taken when setting the exchange flux constraint. The logic table determining the constraint in Schilling et al. [2000] identifies 16 cases. However only seven are of real importance here, because some cases are impossible after the simplification (or pruning) operation and because the zero constraint corresponds in our case to an internal chemical, which is not involved in the creation of a new exchange reaction. So nine rows of the table in Schilling et al. [2000] have no correspondence or are ignored for readability.

The exchange fluxes of the subnetwork do not match the logic table determining the constraint in Schilling et al. [2000]. Indeed the semantics of consumed and produced by the rest of the system is not the same. Here it is a network property defined by the edge linking a chemical to a reaction. If there is an edge from a chemical to a reaction the chemical is consumed, and if there is an edge from a reaction to a chemical it is produced (table 9). This is an important difference, as it can lead to a wrong subnetwork producing improper meta-reactions. The error comes from the fact that the exchange fluxes of the subsystem may not be sufficiently constrained, because the subnetwork is extracted without computing its extreme pathways and without knowing the extreme pathways of the rest of the system. This happens only for the first row of table 9, which should in fact correspond to the second or third row. However those improper reactions are easily detectable after the reconstruction of the extreme pathways of the whole system (see the cycle cleaner, section 3.6.4 in the current chapter). Figure 21 illustrates a metabolic subnetwork extracted from a metabolic network.

Figure 21: A network composed of seven metabolites (a, b, c, d, e, f, g, h), six reactions (the Ri) and four exchange fluxes (illustrated by arrows traversing the dashed box). Following the thick arrows: the two reactions to be extracted from the metabolic network, highlighted with bold borders; the propagation to the compounds that should also be extracted from the network; the identification of the new exchange fluxes and their direction; and the extracted metabolic subnetwork.

Table 9: The criterion to decide the type of exchange flux. The semantics of the type is: out implies a positive constraint, in implies a negative constraint and in/out implies no constraint.

Subnetwork             System                 Type
Consumed   Produced    Consumed   Produced
yes        yes         yes        yes         In/out
yes        yes         yes        no          Out
yes        yes         no         yes         In
yes        no          no         yes         In
no         yes         yes        no          Out
yes        no          yes        yes         In
no         yes         yes        yes         Out
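The extraction and the rule of table 9 can be sketched as follows. Several points are assumptions made for illustration: the network is a plain dictionary in which the exchange fluxes of M would be stored as ordinary entries, "system" is read as "the reactions of M outside the selected set", and combinations absent from table 9 are treated as internal compounds. The names `TABLE_9` and `extract_subnetwork` are hypothetical.

```python
# Table 9 as a lookup: (consumed in subnetwork, produced in subnetwork,
#                       consumed by the rest, produced by the rest) -> type
TABLE_9 = {
    (True, True, True, True): "in/out",
    (True, True, True, False): "out",
    (True, True, False, True): "in",
    (True, False, False, True): "in",
    (False, True, True, False): "out",
    (True, False, True, True): "in",
    (False, True, True, True): "out",
}


def extract_subnetwork(reactions, selected):
    """Return the selected reactions, their compounds and the new exchange fluxes."""
    sub = {r: reactions[r] for r in selected}
    rest = {r: st for r, st in reactions.items() if r not in selected}
    compounds = {c for st in sub.values() for c in st}  # equation 31
    exchange = {}
    for c in sorted(compounds):
        key = (
            any(st.get(c, 0) < 0 for st in sub.values()),
            any(st.get(c, 0) > 0 for st in sub.values()),
            any(st.get(c, 0) < 0 for st in rest.values()),
            any(st.get(c, 0) > 0 for st in rest.values()),
        )
        kind = TABLE_9.get(key)  # None: no row applies, treated as internal here
        if kind is not None:
            exchange[c] = kind
    return sub, compounds, exchange


# Hypothetical linear chain a -> b -> c -> d -> e; R2 and R3 are extracted.
network = {
    "R1": {"a": -1, "b": 1},
    "R2": {"b": -1, "c": 1},
    "R3": {"c": -1, "d": 1},
    "R4": {"d": -1, "e": 1},
}
sub, compounds, exchange = extract_subnetwork(network, {"R2", "R3"})
print(exchange)
```

In this chain b only enters the subnetwork and d only leaves it, while c stays internal and receives no exchange flux.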

3.5 packing the metabolic network

The goal of the packing algorithm is to reduce the number of chemical reactions in a metabolic network while producing extreme pathways equivalent to those of the unpacked metabolic network. Hence the next step is to find subsets of chemical reactions in a metabolic network that are good candidates to be part of meta-reactions. We are interested in reducing the size of the network, hence good candidates are sets of chemical reactions that form a subnetwork whose number of extreme pathways is smaller than its number of chemical reactions. The hope is that reducing the number of chemicals in the network may help the computation of the extreme pathways, as fewer chemicals have to be balanced by the algorithm.

To pack the network, the first step is to select a subset R′ ⊂ R of chemical reactions used to extract a metabolic subnetwork. Then the extreme pathways of the subnetwork are computed. If the cardinality of the set of extreme pathways of type I and II is smaller than the cardinality of R′, the extreme pathways of type I and II are used to build meta-reactions. Finally all the reactions in R′ are replaced by the meta-reactions in M, producing a new network M′.


The pseudo-code of this algorithm is given as algorithm 1; for now the process of selecting the subset of chemical reactions R′ is ignored. The pseudo-code focuses on one step of replacing the reactions R′ by meta-reactions in the metabolic network. In practice this step is repeated as long as replacing chemical reactions by meta-reactions reduces the size of the network. It should be noted that a meta-reaction can itself be packed into a meta-reaction like any other reaction, because it behaves like a reaction.

Algorithm 1: A single step of metabolic network packing. In practice this step is repeated until a given stop criterion is satisfied. Moreover the selection process of the chemical reactions used to extract the subnetwork (i.e. the input J) is not explained here. The packing produces a metabolic network with fewer reactions and compounds and an equivalent generating set.

Require: A metabolic network M = (R, C, F, E) and a subset of reactions J ⊂ R
Ensure: A packed metabolic network M′ = (R′, C′, F′, E′) with fewer reactions and compounds than M
  T ← the metabolic subnetwork induced by J in M
  P ← compute extreme pathways from T
  P* ← remove the type III extreme pathways from P
  M′ ← (∅, ∅, F′, ∅)
  for all p ∈ P* do
    m ← create a meta-reaction from p
    add m to M′ (i.e. R′ ← R′ ∪ {m})
    for all external fluxes f ∈ m do
      c ← the compound exchanged by the flux
      add c to M′ (i.e. C′ ← C′ ∪ {c})
      if the flux is negative then
        add a directed edge from c to m
      else if the flux is positive then
        add a directed edge from m to c
      end if
    end for
  end for

Once the network is packed, the extreme pathways of the packed network are computed, producing the extreme pathways matrix of the packed network. This matrix and the meta-reactions can be used to compute the extreme pathways of the unpacked network. The operation that combines the packed matrix and the meta-reactions to produce the whole extreme pathways matrix is called extreme pathways unpacking.

3.6 extreme pathways unpacking

The unpacking operation reconstructs the extreme pathways matrix P of the original metabolic network. This is done by combining the extreme pathways pi of the subnetworks and the extreme pathways U of the packed network. Although this operation is straightforward, it is mandatory to check the extreme pathways of the reconstructed matrix. Indeed some extreme pathways may lead to absurd solutions after reconstruction, in which case the result is not equal to P.


Figure 22: Following the thick arrows: the original metabolic network, the metabolic sub-network induced by R2, R3 and R4, and the packed metabolic network

3.6.1 The straightforward case

The simplest case is when the extreme pathway qi only passes through meta-reactions that do not share chemical reactions from the original metabolic network. In this case the rates of the meta-reactions only act as multiplicative coefficients of the encapsulated reactions. The external fluxes of the meta-reaction extreme pathways are dropped. The following example represents the case of a linear chained metabolic network of six chemical reactions (R1 to R6) with two external fluxes at the beginning and the end of the chain (respectively Fa and Fg). A meta-reaction I replaces the chemical reactions R3 to R5 in the chained metabolic network, producing the extreme pathways U for the packed metabolic network. The vector of rates of I is denoted by vI:

\[
U^{T} =
\begin{array}{c|rrrrrr}
 & R_1 & R_2 & I & R_6 & F_a & F_g \\ \hline
p_1 & 1 & 2 & 5 & 1 & -1 & 1
\end{array}
\qquad
v_I =
\begin{array}{rrrrr}
R_3 & R_4 & R_5 & F_c & F_f \\ \hline
1 & 3 & 2 & -1 & 3
\end{array}
\]

Thus the multiplicative coefficient for the reactions R3 to R5 is U1,3 (i.e. 5), so the reconstructed extreme pathways matrix P is given by:

\[
P^{T} =
\begin{array}{c|rrrrrrrr}
 & R_1 & R_2 & R_3 & R_4 & R_5 & R_6 & F_a & F_g \\ \hline
p_1 & 1 & 2 & 5 & 15 & 10 & 1 & -1 & 1
\end{array}
\]

when the column I is substituted by the reactions it encapsulates, with their rates adjusted. The intuition is that if the meta-reaction has to consume and produce a chemical at a given rate r to sustain steady state, all the reactions that it encapsulates must have their rates multiplied by r, as they are the ones allowing the transformation of the meta-reaction inputs into the meta-reaction outputs (i.e. the substrates and products).
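The substitution of the column I can be sketched in a few lines. The helper name `unpack_column` and the list-based layout are illustrative assumptions; the numbers are those of the chained example above.

```python
def unpack_column(packed_labels, packed_rates, meta, inner_labels, inner_rates):
    """Replace one meta-reaction column by the reactions it encapsulates."""
    labels, rates = [], []
    for label, rate in zip(packed_labels, packed_rates):
        if label == meta:
            # Every encapsulated reaction is scaled by the meta-reaction rate.
            labels.extend(inner_labels)
            rates.extend(rate * r for r in inner_rates)
        else:
            labels.append(label)
            rates.append(rate)
    return labels, rates


packed_labels = ["R1", "R2", "I", "R6", "Fa", "Fg"]
u1 = [1, 2, 5, 1, -1, 1]
# v_I restricted to R3, R4 and R5: its external fluxes Fc and Ff are dropped.
labels, p1 = unpack_column(packed_labels, u1, "I", ["R3", "R4", "R5"], [1, 3, 2])
print(labels)
print(p1)
```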

3.6.2 The shared case

This case is a bit more complicated and arises when a single extreme pathway of the packed network goes through several meta-reactions having a non-null rate for the same reaction. In this case, the rates of the common chemical reactions are summed when included in the extreme pathways matrix of the original metabolic network.


To make this clear, an example case and its packing steps are illustrated in figure 22. The selected subnetwork produces the following extreme pathways matrix:

\[
U^{T} =
\begin{array}{c|rrrrrr}
 & R_2 & R_3 & R_4 & F_b & F_c & F_e \\ \hline
u_1 & 1 & 0 & 1 & -1 & 0 & 1 \\
u_2 & 0 & 1 & 1 & 0 & -1 & 1
\end{array}
\tag{32}
\]

and those two extreme pathways produce two meta-reactions (I and II) that replace the reactions R2, R3 and R4. The solution of the packed network is:

\[
Q^{T} =
\begin{array}{c|rrrrr}
 & R_1 & I & II & F_a & F_e \\ \hline
q_1 & 1 & 1 & 1 & 1 & -2
\end{array}
\tag{33}
\]

The matrix Q shows that the only extreme pathway has non-null rates for the two meta-reactions (I and II), and those two meta-reactions are extreme pathways (u1 and u2) that have non-null rates for the same chemical reaction, namely R4. In this case, to rebuild the extreme pathway (the matrix P), the rates $u_{1,3}$ and $u_{2,3}$ must be summed to obtain the correct rate for R4 in P:

\[
P^{T} =
\begin{array}{c|rrrrrr}
 & R_1 & R_2 & R_3 & R_4 & F_a & F_e \\ \hline
p_1 & 1 & 1 & 1 & (1 + 1) & 1 & -2
\end{array}
\tag{34}
\]
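The summation of shared rates can be sketched with a dictionary accumulator. The function name `unpack_shared` and the dictionary layout are assumptions made for the illustration; the rates are those of equations 32 and 33.

```python
from collections import defaultdict


def unpack_shared(packed, meta_reactions):
    """Unpack a pathway whose meta-reactions may share chemical reactions.

    packed: reaction or meta-reaction name -> rate in the packed pathway.
    meta_reactions: meta-reaction name -> {encapsulated reaction: rate}.
    """
    rates = defaultdict(int)
    for name, rate in packed.items():
        if name in meta_reactions:
            for reaction, inner in meta_reactions[name].items():
                rates[reaction] += rate * inner   # shared reactions accumulate
        else:
            rates[name] += rate
    return dict(rates)


# Internal rates of u1 and u2 (equation 32), external fluxes dropped.
meta = {"I": {"R2": 1, "R3": 0, "R4": 1}, "II": {"R2": 0, "R3": 1, "R4": 1}}
q1 = {"R1": 1, "I": 1, "II": 1, "Fa": 1, "Fe": -2}   # equation 33
p1 = unpack_shared(q1, meta)
print(p1)
```

The rate of R4 receives one contribution from each meta-reaction, which gives the value 1 + 1 of equation 34.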

3.6.3 The encapsulated case

An encapsulated case is when a meta-reaction describes an extreme pathway having a non-null rate for another meta-reaction built in a previous packing step. This case can happen if the packing process is applied on an already packed network. Indeed, nothing prevents the inclusion of a meta-reaction in the set of reducible reactions, so it is possible to have a meta-reaction that includes other meta-reactions. These meta-reactions may also embed meta-reactions, hence we speak of hierarchical computation of extreme pathways. The unpacking operation should be applied until there is no row representing a meta-reaction in the reconstructed matrix. Thus the unpacking operations described in the previous cases are applied recursively on the packed network until all meta-reactions have been unpacked.
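The recursive unpacking can be sketched as follows. The function `unpack_recursive`, the meta-reaction names M1 and M2 and their rates are hypothetical and only serve to show the recursion; they are not taken from the thesis.

```python
def unpack_recursive(pathway, meta_reactions):
    """Expand meta-reactions until only original reactions remain."""
    result = {}
    for name, rate in pathway.items():
        if name in meta_reactions:
            # A meta-reaction may itself contain meta-reactions: recurse first.
            inner = unpack_recursive(meta_reactions[name], meta_reactions)
            for reaction, inner_rate in inner.items():
                result[reaction] = result.get(reaction, 0) + rate * inner_rate
        else:
            result[name] = result.get(name, 0) + rate
    return result


# Hypothetical two-level packing: M2 was built on a network that contained M1.
meta = {"M1": {"R2": 1, "R3": 2}, "M2": {"M1": 3, "R4": 1}}
unpacked = unpack_recursive({"R1": 1, "M2": 2}, meta)
print(unpacked)
```

The rates multiply along the nesting: R3 has rate 2 in M1, M1 has rate 3 in M2 and M2 has rate 2 in the pathway, hence a final rate of 12.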

3.6.4 Last independence check

The unpacking operation, as described in this chapter, can produce more extreme pathways than the real number of extreme pathways of the original (or unpacked) network. This problem happens when all the exchange fluxes of an extreme pathway in the metabolic sub-network are produced or consumed by a chemical reaction and its reverse. This case produces wrong extreme pathways containing cycles in the rebuilt extreme pathways. Those wrong extreme pathways are combinations of a type I and a type III extreme pathway of the network. This arises because of an improper constraint adjustment when a subnetwork is extracted, and leads to unwanted extreme pathways. To correct this problem, the set of the rebuilt extreme pathways is simply filtered by removing all extreme pathways that are a combination of a type I and a type III extreme pathway in U, the packed extreme pathways matrix.


Figure 23: Example of a metabolic network that can produce a wrong extreme pathways matrix after rebuilding (see text for explanation).

Figure 24: Metabolic sub-network improperly extracted from the network in figure 23, induced by the chemical reactions R2, R2′, R3, R3′, R4 and R4′.

This problem is illustrated with the metabolic network in figure 23. This network is derived from the following stoichiometric matrix:

0 0 0 0 R1 R1 R2 R2 R3 R3 R4 R4 Fa Fb a  1 −1 1  b  −1 1 −1 1    S = c  1 −1  (35)   d  1 −1 −1 1 1 −1  e −1 1 −1 has only one extreme pathway p1 in the extreme pathways matrix P:

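\[
P^{T} =
\begin{array}{c|rrrrrrrrrr}
 & R_1 & R_1' & R_2 & R_2' & R_3 & R_3' & R_4 & R_4' & F_a & F_b \\ \hline
p_1 & 1 & 0 & 1 & 0 & 1 & 0 & 2 & 0 & -1 & 2
\end{array}
\tag{36}
\]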

Now, let’s assume that a meta-reaction is built from the following chemical reactions: R2, R2′, R3, R3′, R4 and R4′. This is not a bad choice, as the extracted metabolic sub-network is composed of six chemical reactions and contains only four extreme pathways. The resulting packed metabolic network is the one in figure 25 and has the following stoichiometric matrix:

0 R1 R1 IIIIIIIVFa Fb a  1 −1 1  0 b  −1 1 −1 −1 1  S =   (37) c  −1 1 −1  e 1 1 −1 and produce the packed extreme pathway matrix U:


Figure 25: The packed metabolic network of figure 23, based on the extreme pathways of figure 24.

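\[
U^{T} =
\begin{array}{c|rrrrrrrr}
 & R_1 & R_1' & I & II & III & IV & F_a & F_b \\ \hline
u_1 & 1 & 0 & 1 & 1 & 0 & 0 & -1 & 2 \\
u_2 & 1 & 0 & 2 & 0 & 0 & 1 & -1 & 2 \\
u_3 & 1 & 0 & 0 & 2 & 1 & 0 & -1 & 2
\end{array}
\tag{38}
\]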

But this pathways matrix does not allow the reconstruction of the matrix P, as it contains two faulty extreme pathways: u2 and u3. Using the packed system’s extreme pathways matrix U and the subsystem’s extreme pathways to obtain the reconstructed extreme pathways matrix leads to:

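\[
Q^{T} =
\begin{array}{c|rrrrrrrrrr}
 & R_1 & R_1' & R_2 & R_2' & R_3 & R_3' & R_4 & R_4' & F_a & F_b \\ \hline
q_1 & 1 & 0 & 1 & 0 & 1 & 0 & 2 & 0 & -1 & 2 \\
q_2 & 1 & 0 & 2 & 1 & 1 & 0 & 2 & 0 & -1 & 2 \\
q_3 & 1 & 0 & 1 & 0 & 2 & 1 & 2 & 0 & -1 & 2
\end{array}
\tag{39}
\]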

The matrix Q is not equal to P, as the extreme pathways q2 and q3 do not exist in P. However, these two faulty extreme pathways are easily identifiable as they are combined with a type III extreme pathway. Indeed, q2 and q3 are each the sum of q1 and a type III extreme pathway. Hence a simple check at the end of the computation allows their removal in order to obtain the right matrix (i.e. P). There is no need to apply this last check of independence to all extreme pathways. As stated in section 3.4, this case happens when an improper constraint is applied on some exchange fluxes of a subnetwork. The cases that potentially cause this problem can be identified: they correspond to the first row of table 9. When such a case is detected, all the meta-reactions built from the extreme pathways of the subsystem are marked as potentially inaccurate. Then only the extreme pathways having non-null rates for those marked meta-reactions are checked for independence, which avoids useless checks for rays that are sure to be systemically independent.
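A minimal sketch of this final filtering on the matrix Q of equation 39 is given below. It relies on a simplifying assumption that is not the general independence test of the thesis: a rebuilt pathway is rejected when a reaction and its reverse both carry a non-null rate, since the pair then forms a type III cycle added to another pathway. The names `LABELS`, `REVERSE_PAIRS` and `contains_cycle` are illustrative.

```python
LABELS = ["R1", "R1'", "R2", "R2'", "R3", "R3'", "R4", "R4'", "Fa", "Fb"]
REVERSE_PAIRS = [("R1", "R1'"), ("R2", "R2'"), ("R3", "R3'"), ("R4", "R4'")]


def contains_cycle(pathway):
    """True when a reaction and its reverse are both active in the pathway."""
    rate = dict(zip(LABELS, pathway))
    return any(rate[fwd] != 0 and rate[rev] != 0 for fwd, rev in REVERSE_PAIRS)


# Rows q1, q2 and q3 of equation 39.
Q = [
    [1, 0, 1, 0, 1, 0, 2, 0, -1, 2],
    [1, 0, 2, 1, 1, 0, 2, 0, -1, 2],
    [1, 0, 1, 0, 2, 1, 2, 0, -1, 2],
]
kept = [q for q in Q if not contains_cycle(q)]
print(kept)
```

Only q1 survives, and it is equal to the pathway p1 of equation 36.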

3.7 Matrix description of the complete algorithm

Until now the proposed algorithm has been presented in an intuitive way: encapsulating a subsystem and replacing it in the network by its inputs and outputs. However, regarding the extreme pathways enumeration problem, it may not be obvious how the packed system allows the reconstruction of the extreme pathways matrix of the original system. In this section we describe how the previous steps are related to matrix operations, in order to clearly understand how the algorithm solves the problem of extreme pathways enumeration.


The main idea is to split the whole system into distinct subsystems. For instance, a system composed of two disconnected components (that is to say its graph is composed of two connected components) can be represented by a block diagonal matrix if rows and columns are correctly reordered:

\[
Sv = 0 \equiv
\begin{pmatrix} S_1 & 0 \\ 0 & S_2 \end{pmatrix}
\begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = 0
\tag{40}
\]

and in this case each subsystem can be solved independently to find the two extreme pathways¹ matrices P1 and P2. The whole extreme pathways matrix P is therefore:

\[
\begin{pmatrix} P_1 & 0 \\ 0 & P_2 \end{pmatrix} = P.
\tag{41}
\]

Unfortunately, in real² metabolic systems, blocks are rarely disconnected, so the stoichiometric matrix does not have a block diagonal form. However, it is possible to rewrite the system so as to split it into pseudo-independent subsystems and to connect them by adding constraints that link the rates of the chemical reactions between the subsystems, in order to impose a solution sustaining the steady state. This allows the solution of each block to be found first, and then the system imposed by the constraints to be solved in order to find the extreme pathways of the whole system.

The procedure to disconnect a subsystem is similar to the metabolic subnetwork extraction. First, two or more reactions in different subsystems are connected by one or more chemicals. The separation isolating a subsystem is made by adding new equivalent chemicals acting as new inputs or outputs of the subsystem, with the constraint that reactions transport those chemicals in or out of the subsystem. Then all the equivalent chemicals are linked by equality constraints. These new chemicals imply the addition of new rows and columns in the stoichiometric matrix, along with new rows constraining the reaction rates. The matrix is made of an upper part, the block diagonal stoichiometric matrix B, and a lower part, the sparse constraints matrix C:

\[
\begin{pmatrix} B \\ C \end{pmatrix} v = 0
\quad\text{or}\quad
\left(\begin{array}{cccc}
S_1 & & & \\
 & S_2 & & \\
 & & \ddots & \\
 & & & S_n \\ \hline
c_{11} & c_{12} & \cdots & c_{1m} \\
\vdots & \vdots & \ddots & \vdots \\
c_{l1} & c_{l2} & \cdots & c_{lm}
\end{array}\right)
\begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} = 0
\tag{42}
\]

Then the extreme pathways matrix $P_i = [p_1^{(i)}, \ldots, p_{k_i}^{(i)}]$ is computed for each subsystem $S_i v_i = 0$, $i = 1, \ldots, n$. Hence each $v_i$ can be written as a positive linear combination of the columns of the extreme pathways matrix $P_i$:

\[
v_i = \alpha_1^{(i)} p_1^{(i)} + \ldots + \alpha_{k_i}^{(i)} p_{k_i}^{(i)},
\tag{43}
\]

1. That is to say the rays describing the intersection of ker(S) and the positive orthant.
2. The reader should be alerted that this is a malapropism, as there is no guarantee that a reconstructed metabolic network is a reflection of the chemical reality of a given organism.


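where each $\alpha_j^{(i)} \in [0, \infty)$; let’s add that the columns of $P_i$ are chosen such that $\|p_j^{(i)}\| = 1$, with $j = 1, \ldots, k_i$. The next step is to compute the $\alpha_j^{(i)}$ by solving the following under-determined system:

\[
C \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} = 0
\quad\text{or}\quad Cv = 0
\tag{44}
\]

As the $p_j^{(i)}$ are known, the problem can be rewritten as

\[
C
\begin{pmatrix}
P_1 & & & \\
 & P_2 & & \\
 & & \ddots & \\
 & & & P_n
\end{pmatrix}
\begin{pmatrix} \alpha_1^{(1)} \\ \alpha_2^{(1)} \\ \vdots \\ \alpha_{k_n}^{(n)} \end{pmatrix} = 0
\quad\text{or}\quad K\alpha = 0
\tag{45}
\]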

The system Kα = 0 is also an extreme pathways enumeration problem and also represents a chemical system. Let’s denote its solution by U. This solution describes how the connecting reactions behave to sustain steady state. Finally, the subsystems’ solutions need to be linked together to obtain the complete extreme pathways. The sub-solutions linkage is a simple product between the [...]. To do this, let’s denote by X* the matrix X without the rows corresponding to the connector reactions. Then, by constructing and applying the following matrix multiplication:

\[
\begin{pmatrix}
P_1^{*} & & & \\
 & P_2^{*} & & \\
 & & \ddots & \\
 & & & P_n^{*}
\end{pmatrix}
U^{*} = P
\tag{46}
\]

the extreme pathways matrix P obtained by solving the problem Sv = 0 can be computed from the Pi and U. The next section illustrates these operations with a complete example and links the operations with the steps of the algorithm when described in intuitive terms.

3.7.1 The constraints matrix

The constraints matrix C is supposed to be correct but, as seen in this chapter (section 3.4), some cases can lead to the construction of subsystems whose exchange fluxes are not constrained enough. Consequently this can produce pathways that do not satisfy the independence of the solutions. However, one can always construct a correct constraints matrix by computing the extreme pathways of the subsystem without constraining its exchange fluxes, then using the connector reactions to set the definitive constraints of each subsystem and selecting a proper subsystem solution with the logic table described in Schilling et al. [2000]. Therefore, in the mathematical version of the proposed algorithm, one can assume that the constraints matrix is always correct.


Figure 26: The metabolic network corresponding to the stoichiometric matrix of equation 47 and a proposed division of the network. The boundaries of the considered parts are defined by the dashed boxes.

Figure 27: The chosen metabolic subnetworks. The circled numbers i indicate which subnetwork corresponds to which stoichiometric matrix Si in equation 48 (a to d).

3.7.2 Example

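To illustrate how the algorithm works in terms of matrices, it is applied to the system (or network) of figure 26. This network corresponds to the following stoichiometric matrix:

\[
S =
\begin{array}{c|rrrrrrrrrr}
 & R_1 & R_2 & R_3 & R_4 & R_5 & R_6 & R_7 & R_8 & F_a & F_f \\ \hline
a & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
b & 1 & -1 & 0 & 0 & 0 & -2 & 0 & 0 & 0 & 0 \\
c & 0 & 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
d & 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
e & 0 & 0 & 0 & 1 & -2 & 0 & 0 & 2 & 0 & 0 \\
f & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & -1 \\
g & 0 & 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 \\
h & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0
\end{array}
\tag{47}
\]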

Now let’s assume the subsystem division illustrated in figure 26. This produces the subsystems (metabolic subnetworks) of figure 27 and a block matrix B composed of the following blocks:

Fa00 R1 R2 R3 R6 R7 Fd Fh   a −1 1   b  1 −1 −1      c  1 −1  S1 =   (48a) d  1 −1      g  1 −1  h 1 −1

Fd0 Fh0 R4 R8 R5 Ff 00   d0 1 −1   e  1 1 −1  S2 =   (48b) 0   h  1 −1  f 1 −1

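\[
S_3 =
\begin{array}{c|rr}
 & F_a & F_{a'} \\ \hline
a' & 1 & -1
\end{array}
\tag{48c}
\]

\[
S_4 =
\begin{array}{c|rr}
 & F_{f'} & F_f \\ \hline
f' & 1 & -1
\end{array}
\tag{48d}
\]

and the following constraints matrix C:

\[
\begin{array}{c|rrrrrrrrrrrr}
 & F_{a''} & R_{[1,2,3,6,7]} & F_d & F_h & F_{d'} & F_{h'} & R_{[4,8,5]} & F_{f''} & F_a & F_{a'} & F_{f'} & F_f \\ \hline
F_{a'} & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
F_d & 0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
F_h & 0 & 0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
F_{f''} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -1 & 0
\end{array}
\tag{49}
\]

Note that for readability the contiguous zero columns are labeled as R[x,y,...,z] to indicate that the columns Rx, Ry, ..., Rz are zero columns. So the form of the system to be solved is:

\[
\left(\begin{array}{cccc}
S_1 & & & \\
 & S_2 & & \\
 & & S_3 & \\
 & & & S_4 \\ \hline
\multicolumn{4}{c}{C}
\end{array}\right)
\begin{pmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{pmatrix} = 0
\tag{50}
\]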


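By respectively denoting the extreme pathways of the subsystems S1, S2, S3 and S4 by P, Q, R and T, the solutions of the subsystems are:

\[
P =
\begin{array}{c|rr}
 & p_1 & p_2 \\ \hline
F_{a''} & 1 & 2 \\
R_1 & 1 & 2 \\
R_2 & 1 & 0 \\
R_3 & 1 & 0 \\
R_6 & 0 & 1 \\
R_7 & 0 & 1 \\
F_d & 1 & 0 \\
F_h & 0 & 1
\end{array},
\quad
Q =
\begin{array}{c|rr}
 & q_1 & q_2 \\ \hline
F_{d'} & 2 & 0 \\
F_{h'} & 0 & 1 \\
R_4 & 2 & 0 \\
R_8 & 0 & 1 \\
R_5 & 1 & 1 \\
F_{f''} & 1 & 1
\end{array},
\quad
R =
\begin{array}{c|r}
 & r_1 \\ \hline
F_a & 1 \\
F_{a'} & 1
\end{array},
\quad
T =
\begin{array}{c|r}
 & t_1 \\ \hline
F_{f'} & 1 \\
F_f & 1
\end{array}
\tag{51}
\]

and each vi is written as: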

v1 = π1 p1 + π2 p2 (52a)

v2 = σ1q1 + σ2q2 (52b)

v3 = ρ1r1 (52c)

v4 = τ1t1 (52d)

Now the system to find the solutions for the vector $(\pi_1, \pi_2, \sigma_1, \sigma_2, \rho_1, \tau_1)^T$ can be constructed with the constraints matrix C. For readability the zero columns in C and the corresponding rows in the vector $(v_1, v_2, v_3, v_4)^T$ are removed. So the system is rewritten as:

\[
\begin{array}{c|rrrrrrrr}
 & F_{a''} & F_d & F_h & F_{d'} & F_{h'} & F_{f''} & F_{a'} & F_{f'} \\ \hline
F_{a'} & -1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
F_d & 0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 \\
F_h & 0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 \\
F_{f''} & 0 & 0 & 0 & 0 & 0 & 1 & 0 & -1
\end{array}
\begin{pmatrix}
\pi_1 p_{1,1} + \pi_2 p_{2,1} \\
\pi_1 p_{1,7} + \pi_2 p_{2,7} \\
\pi_1 p_{1,8} + \pi_2 p_{2,8} \\
\sigma_1 q_{1,1} + \sigma_2 q_{2,1} \\
\sigma_1 q_{1,2} + \sigma_2 q_{2,2} \\
\sigma_1 q_{1,6} + \sigma_2 q_{2,6} \\
\rho_1 r_{1,2} \\
\tau_1 t_{1,1}
\end{pmatrix} = 0
\tag{53}
\]

and with the solutions from equation 51, it can be rewritten as:

\[
\begin{pmatrix}
-1 & -2 & 0 & 0 & 1 & 0 \\
1 & 0 & -2 & 0 & 0 & 0 \\
0 & 1 & 0 & -1 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & -1
\end{pmatrix}
\begin{pmatrix} \pi_1 \\ \pi_2 \\ \sigma_1 \\ \sigma_2 \\ \rho_1 \\ \tau_1 \end{pmatrix} = 0
\tag{54}
\]

which corresponds to the packed network in the algorithm described in this chapter. A graphical representation of the network is shown in figure 28. This produces the following extreme pathways matrix:


Figure 28: The packed network as described by equation 54. The correspondence between the rate vector and the reactions is: Pa ↔ π1, Pb ↔ π2, Qa ↔ σ1, Qb ↔ σ2, R ↔ ρ1 and T ↔ τ1.

\[
U =
\begin{array}{c|rr}
 & u_1 & u_2 \\ \hline
F_{in} & 1 & 0 \\
\rho_1 & 0 & 1 \\
\pi_1 & 0 & 1 \\
\sigma_1 & 0 & 0 \\
\pi_2 & 0 & 1 \\
\sigma_2 & 0 & 0 \\
\tau_1 & 0 & 0 \\
F_{out} & 0 & 0
\end{array}
\quad\text{and}\quad
U^{*} =
\begin{array}{c|rr}
 & u_1^{*} & u_2^{*} \\ \hline
\rho_1 & 0 & 1 \\
\pi_1 & 0 & 1 \\
\sigma_1 & 0 & 0 \\
\pi_2 & 0 & 1 \\
\sigma_2 & 0 & 0 \\
\tau_1 & 0 & 0
\end{array}
\tag{55}
\]

where U* is the matrix U without the rows corresponding to the external fluxes. A matrix denoted by X* is the matrix X without the rows corresponding to the connector reactions. Then the extreme pathways matrix V corresponding to the system Sx = 0 is given by:

\[
V =
\begin{pmatrix}
P^{*} & & & \\
 & Q^{*} & & \\
 & & R^{*} & \\
 & & & T^{*}
\end{pmatrix}
U^{*} =
\begin{array}{c|rr}
 & v_1 & v_2 \\ \hline
R_1 & 2 & 1 \\
R_2 & 2 & 0 \\
R_3 & 2 & 0 \\
R_6 & 0 & 1 \\
R_7 & 0 & 1 \\
R_4 & 2 & 0 \\
R_8 & 0 & 1 \\
R_5 & 1 & 1 \\
F_a & 2 & 2 \\
F_f & 1 & 1
\end{array}
\tag{56}
\]
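The passage from equation 53 to equation 54 can be checked numerically: the matrix K is the product of the reduced constraints matrix C with the connector rows of the subsystem solutions. The sketch below uses plain Python lists; the helper `matmul` and the variable names are illustrative, while the numbers are the connector rows of P, Q, R and T and the reduced matrix C of equation 53.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Reduced constraints matrix C of equation 53.
# Columns: Fa'', Fd, Fh, Fd', Fh', Ff'', Fa', Ff'
C = [
    [-1, 0, 0, 0, 0, 0, 1, 0],   # row Fa'
    [0, 1, 0, -1, 0, 0, 0, 0],   # row Fd
    [0, 0, 1, 0, -1, 0, 0, 0],   # row Fh
    [0, 0, 0, 0, 0, 1, 0, -1],   # row Ff''
]

# Connector rows of the subsystem solutions.
# Columns: pi1, pi2, sigma1, sigma2, rho1, tau1
connectors = [
    [1, 2, 0, 0, 0, 0],   # Fa'' in p1, p2
    [1, 0, 0, 0, 0, 0],   # Fd   in p1, p2
    [0, 1, 0, 0, 0, 0],   # Fh   in p1, p2
    [0, 0, 2, 0, 0, 0],   # Fd'  in q1, q2
    [0, 0, 0, 1, 0, 0],   # Fh'  in q1, q2
    [0, 0, 1, 1, 0, 0],   # Ff'' in q1, q2
    [0, 0, 0, 0, 1, 0],   # Fa'  in r1
    [0, 0, 0, 0, 0, 1],   # Ff'  in t1
]

K = matmul(C, connectors)
for row in K:
    print(row)
```

The four rows printed are the rows of the matrix of equation 54.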

With this formalization the description of our algorithm is complete. The next step is to evaluate its efficiency. We first present a performance study on network packing, that is to say we evaluate the reduction in size of the network obtained with several approaches. This corresponds to the selection of metabolic subnetworks and the replacement of those subnetworks by meta-reactions. Then the computational gain of the packing and unpacking operations is compared to the direct computation of extreme pathways. The purpose of all this is to assess the benefits of the packing and therefore the usability of the hierarchical algorithm on genome-scale networks.

3.8 Efficiency of the network packing

We define the efficiency of the packing as the reduction of the cardinality of the sets (i.e. the vertices and the edges) composing the network. The gain in computing the extreme pathways of the packed network is not measured for now. The compression algorithm was evaluated on the E. coli metabolic network reconstructed from the Kyoto Encyclopedia of Genes and Genomes database (KEGG). The size of the network (number of vertices and edges) is summarized in table 10. The date of the last fetch from the database is January 2011³.

Table 10: Size of the reconstructed E. coli network built from all the data available in the KEGG and size of the network after the simplification steps.

Organism  Type           Reactions  Fluxes  Compounds  Edges
E. coli   Reconstructed  2056       18      1541       8992
E. coli   Pruned         513        16      275        2121

The selected exchange fluxes are listed below. The inputs and outputs of the system are:
- D-Glucose (carbon source),
- water (H2O),
- oxygen (O2),
- phosphate (Pi, inorganic phosphate),
- carbon dioxide (CO2),
- hydron or proton (H+).

And the currency metabolites (see section 2.1.1) are:
- D-Glucose 6-phosphate,
- L-Lactate,
- adenosine triphosphate (ATP),
- adenosine diphosphate (ADP),
- nicotinamide adenine dinucleotide in the two forms: oxidized (NAD+) and reduced (NADH),
- nicotinamide adenine dinucleotide phosphate (also in the two forms: NADP+ and NADPH).

In this document, unless specified, the term E. coli network refers to the metabolic network as defined above.

Roughly, a packing operator works as follows. From a metabolic network, it first selects a subset of chemical reactions. This subset of chemical reactions is used to extract a metabolic subnetwork, and the extreme pathways of the metabolic subnetwork are computed. If the number of extreme pathways is smaller than the number of chemical reactions in the metabolic subnetwork, each extreme pathway induces a new meta-reaction. Then the new meta-reactions replace the subnetwork in the metabolic network, thus producing a smaller network in terms of vertices. Finally this whole procedure is repeated on the smaller network until a stop criterion is satisfied.
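The loop of a packing operator can be sketched as below. This is a skeleton under strong assumptions: the network is reduced to a set of reaction names, the selection and the extreme pathways computation are passed as callables, and the stop criterion is a number of successive steps without reduction. The names `pack`, `select` and `extreme_pathways` are hypothetical, and the two lambdas are toy stand-ins, not real computations.

```python
def pack(reactions, select, extreme_pathways, max_failures):
    """Iteratively replace subnetworks by meta-reactions.

    reactions: names of the reactions of the network.
    select(reactions): returns a candidate subset of reactions.
    extreme_pathways(subset): returns the pathways of the induced subnetwork.
    max_failures: successive steps without reduction before stopping.
    """
    reactions = set(reactions)
    failures, counter = 0, 0
    while failures < max_failures:
        subset = select(reactions)
        pathways = extreme_pathways(subset)
        if len(pathways) < len(subset):
            reactions -= subset
            for _ in pathways:
                counter += 1
                reactions.add(f"M{counter}")   # one meta-reaction per pathway
            failures = 0
        else:
            failures += 1
    return reactions


# Toy stand-ins: take the three first names, pretend a single pathway exists.
select = lambda rs: set(sorted(rs)[:3])
extreme_pathways = lambda subset: [tuple(sorted(subset))]
packed = pack({"R1", "R2", "R3", "R4", "R5", "R6"}, select, extreme_pathways, 2)
print(packed)
```

With these stand-ins the six reactions of the toy chain collapse into a single meta-reaction after three reductions, and the loop stops after two steps without improvement.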

3.8.1 Random subnetwork packing

We start by describing the random subnetwork packing, which relies on the random selection of chemical reactions. The idea is to select a subnetwork based on the forest fire method [Leskovec and Faloutsos, 2006]. This method uses a vertex as the fire starting point and propagates the fire to the neighboring vertices with a probability pf. Then, from each newly burned vertex, the fire can in turn burn any neighbor that has not yet been burned, with probability pf. In the case of a metabolic subnetwork we propose a variation of this algorithm. First, this process is repeated until a sample size (number of burned vertices) goal is reached. Then only the burned reactions are taken

3. Shortly after this release, KEGG switched to a paid FTP access. Therefore no new release was extracted from the database.


Algorithm 2 Iterative metabolic network packing. The procedure of metabolic subnetwork extraction is described in section 3.4.

Require: A bipartite digraph G = (V, E) representing a metabolic network
Ensure: A bipartite digraph G′ = (V′, E′) representing a packed network
    G′ ← G
    while a stop criterion is not satisfied do
        U ← choose a subset of V′
        R ← the set of all chemical reactions in U
        S ← the metabolic subnetwork generated by R extracted from G′
        P ← compute the set of extreme pathways in S
        if |P| < |R| then
            remove all reactions R from G′
            for all p ∈ P do
                mp ← build a meta-reaction from p
                add the meta-reaction mp to G′
            end for
            remove all compounds with degree zero from G′
        end if
    end while

into consideration to extract a valid subnetwork. When the targeted size is reached, the burned reactions are used to extract the metabolic subnetwork. The selected value for the probability pf is 0.7⁴. The method processes the graph in a random walk manner, but it allows the selection of multiple neighbors in the forest fire manner. Two parameters are needed for this packing in order to set up the stop condition: the targeted size of the compressed metabolic subnetwork (m) and the number of steps k tried when there is no improvement. A step consists in the selection of a subnetwork and the replacement (if needed) of the subnetwork by meta-reactions. If there have been k successive steps without size reduction or if there are m burned vertices, the process stops. Table 11 shows the performance on the E. coli metabolic network with different values for the targeted subnetwork size. With the best compression cases (n between 20 and 75), we obtain a network composed of only ≈ 49% of the reactions and ≈ 35% of the compounds of the simplified E. coli network. The corresponding graph is composed of only ≈ 44% of the vertices and ≈ 54% of the edges. A question arises when observing the packed networks: does this stochastic random packing converge to similar packed metabolic networks? Indeed, even if the selection of subnetworks is random during each reduction experiment, it is possible that (almost) always the same subsets of reactions produce the good meta-reactions, therefore converging to similar packed networks. We can consider a meta-reaction as a cluster of the reactions it encapsulates. Thus the meta-reactions in a packed metabolic network define a partition of the unpacked network. With this, we can measure the similarity between packed networks with the Fowlkes-Mallows index [Fowlkes and Mallows,

4. The influence of pf is not well understood in the case of the packing and does not seem to have an impact on the packing efficiency.


Table 11: Compression of the reconstructed E. coli metabolic network from the KEGG database. The parameters are n, the targeted size of the subnetworks, and the number of repetitions that the algorithm performs without improvement, set to 250. For each n the experiment has been repeated 100 times; the µ and σ columns are respectively the mean and the standard deviation of the sizes for the reactions (rns), the compounds (cpds) and the edges. Let’s recall that the simplified E. coli metabolic network is composed of 348 chemical compounds, 684 reactions and 2856 edges (table 8 in section 3.2).

n    Rns (µ)  Rns (σ)  Cpds (µ)  Cpds (σ)  Edges (µ)  Edges (σ)
5    206.53   10.46    457.55    9.73      1932.5     32.77
10   133.68   5.81     363.17    10.09     1584.73    39.22
20   121.26   4.76     341.39    7.71      1521.63    26.37
50   122.7    6.7      341.53    11.3      1538.84    40.25
75   121.32   9.19     337.34    14.13     1539.13    45.18
100  170.17   19.89    403.58    25.26     1741.38    75.9
150  249.09   6.31     496.85    8.12      2052.45    28.53
200  253.5    1.85     502.38    2.29      2071.71    8.53

1983]. This index is based on the geometric mean of the criterion proposed in Wallace [1983]:

\[
W_i(C_1, C_2) = \frac{m}{\sum_k n_{ki}(n_{ki} - 1)/2}, \quad \text{with } i \in \{1, 2\},
\tag{57}
\]

where m is the number of pairs that are in the same meta-reaction in both packed networks C1 and C2, and $n_{ki}$ is the number of objects in the meta-reaction k of the packed network Ci. Hence the Fowlkes-Mallows index is:

\[
F(C_1, C_2) = \sqrt{W_1(C_1, C_2)\, W_2(C_1, C_2)},
\tag{58}
\]

with F(C1, C2) ∈ [0, 1].

It should be noted that during the iterative process, a meta-reaction can already be part of the selected subnetworks⁵. In this case the corresponding cluster is flattened in order to get the set of reactions. For example, let’s define the following meta-reactions: R1,2,3 = {r1, r2, r3}, R4,5 = {r4, r5} and R1,...,6 = {R1,2,3, R4,5, r6}, where the ri are reactions in the original metabolic network (i.e. the non-packed version) and the Rj are meta-reactions. The flattened cluster or set corresponding to the meta-reaction R1,...,6 is {r1, r2, r3, r4, r5, r6}. The rest of the network that has not been packed in a meta-reaction is considered to be part of a unique cluster. Hence the total number of clusters is the number of meta-reactions in the packed network plus one. The similarity between 50 randomly packed E. coli networks is illustrated in figure 29 and does not allow us to conclude that randomly packed networks converge to a common network. The F obtained has a mean of µ ≈ 0.53 and a standard deviation of σ ≈ 0.045. This means that on average the meta-reactions share common reactions but the packed networks are not similar.

A drawback of the random packing approach is the computation time. Indeed it is the slowest approach among those described in this chapter. The problem is that the majority of the computation time is spent

5. As explained for the encapsulated case in the unpacking procedure (see section 3.6.3).


Figure 29: Fowlkes-Mallows index between all pairs of 50 randomly packed E. coli networks (a). The distribution of the Fowlkes-Mallows index values (without the values when a network is compared to itself) (b). The mean is ≈ 0.53 and standard deviation is ≈ 0.045.

in selecting and computing extreme pathways of subnetworks that produce too many meta-reactions. Recall that we defined a good subnetwork as one producing fewer meta-reactions⁶ than its number of reactions. Most of the drawn subnetworks are discarded according to this definition. To illustrate this, a numerical experiment shows that the ratio of rejected subnetworks (or reaction sets inducing the subnetworks) is 0.982 (with a standard deviation of 0.004). The experiment has been repeated for 100 trials. Those trials have been run with the best targeted subnetwork size (n = 75) and for a number of steps of 250 (table 11). This ratio is reproducible in all subnetwork configurations (i.e. the subnetwork sizes). The distribution of the ratios of rejected sets is illustrated in figure 30. This experiment shows that most of the computation time is used to generate and test unusable subnetworks or reaction sets. In conclusion, the result is good but the majority of the computation time is wasted.
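The similarity measure used above to compare packed networks (equations 57 and 58), together with the flattening of nested meta-reactions, can be sketched as follows. The function names and the small clusterings c1 and c2 are hypothetical; the nested meta-reactions are those of the flattening example.

```python
from itertools import combinations
from math import sqrt


def flatten(meta, definitions):
    """Return the original reactions of a possibly nested meta-reaction."""
    out = set()
    for member in definitions[meta]:
        out |= flatten(member, definitions) if member in definitions else {member}
    return out


def co_clustered_pairs(clusters):
    """All pairs of objects that share a cluster (clusters must be disjoint)."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def fowlkes_mallows(c1, c2):
    pairs1, pairs2 = co_clustered_pairs(c1), co_clustered_pairs(c2)
    m = len(pairs1 & pairs2)                      # pairs together in both
    return m / sqrt(len(pairs1) * len(pairs2))    # sqrt(W1 * W2)


definitions = {
    "R123": ["r1", "r2", "r3"],
    "R45": ["r4", "r5"],
    "R1_6": ["R123", "R45", "r6"],
}
flat = flatten("R1_6", definitions)

# Two hypothetical partitions of the same six reactions.
c1 = [{"r1", "r2", "r3"}, {"r4", "r5", "r6"}]
c2 = [{"r1", "r2"}, {"r3", "r4", "r5", "r6"}]
score = fowlkes_mallows(c1, c2)
print(sorted(flat), score)
```

Because the clusters of a partition are disjoint, the number of co-clustered pairs equals the sum over k of n_k(n_k − 1)/2, which is the denominator of equation 57.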

3.8.2 Hierarchical packing

The hierarchical packing is an approach that selects the reactions to be packed in a meta-reaction by using the clusters obtained with a clustering of the network. It is related to the hierarchical clustering approaches. The idea is to recursively split the network’s vertex set into two non-overlapping subsets. The binary split of the vertices can be used to produce two subnetworks that can be split again (figure 31). This recursive splitting is done until a subset is composed of only one vertex. Once done, the splitting can be viewed as a binary tree where each node represents a vertex set. This tree is then traversed and the extreme pathways of each subnetwork produced by the vertex sets are computed. This allows the creation of meta-reactions and the computation of the reduction induced by the partition. Then, to select which partitions are used for the packing, the tree is pruned from the bottom to the top. The pruning criterion is that, at each node, if the reduction produced by the partition in the node is greater than or equal to the sum of the reductions of its children, the node becomes a leaf. Otherwise it is left as it is. Once the traversal is finished, the leaves of the tree contain the selected partitions that are used to reduce the network (figure 32).
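The bottom-up pruning can be sketched with a small recursive function. The `Node` class, the node names and most reduction values are hypothetical; the two decisions taken inside the subtree X reproduce the cases 2 + 1 = 3 > 2 and 3 + 0 = 3 < 4 discussed for figure 32.

```python
class Node:
    def __init__(self, name, reduction, children=()):
        self.name = name
        self.reduction = reduction      # reduction induced by this vertex set
        self.children = list(children)


def prune(node):
    """Return (best reduction, selected partitions) for the subtree at node."""
    if not node.children:
        return node.reduction, [node.name]
    total, selected = 0, []
    for child in node.children:
        value, names = prune(child)
        total += value
        selected += names
    if node.reduction >= total:
        node.children = []              # the node becomes a leaf
        return node.reduction, [node.name]
    return total, selected              # the subtree is kept as it is


tree = Node("root", 2, [
    Node("X", 4, [
        Node("A", 2, [Node("a1", 2), Node("a2", 1)]),   # 2 + 1 = 3 > 2: kept
        Node("b", 0),                                    # 3 + 0 = 3 < 4: X pruned
    ]),
    Node("Y", 1),
])
best, partitions = prune(tree)
print(best, partitions)
```

The selected partitions are the leaves of the pruned tree, here X and Y, for a total reduction of 5.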

6. This means less extreme pathways of type I and II than chemical reactions.

Figure 30: Distribution of the ratios of the rejected partitions during a compression step. The measure was done with 200 compression experiments with the parameters giving the best compression results: a sample size of 75 with the minimum number of steps set to 250.

Figure 31: Binary split of the vertices of a graph. Each split leads to a subgraph that can itself be split until it is composed of only one vertex.

Figure 32: The figure is read from left to right and top to bottom (a to h) and illustrates the pruning process and the selection process of the partitions. Each node of the tree represents a set of vertices and the number in the node represents a hypothetical reduction induced by the set in the node. The traversal starts from the leaves and the values of the reductions are summed up to the parent node; in this case the sum of the reductions is bigger than the reduction induced by the parent node (2 + 1 = 3 > 2) (b). Thus the subtree rooted on the parent is kept as it is and its reduction becomes 3, symbolized by a black node (c). In the next step of the traversal, the values are again summed up in the parent, but in this case the sum is lower than the reduction induced by the parent node (3 + 0 = 3 < 4). Thus in this case the partition induced by the node is kept and the tree is pruned by converting the parent into a leaf (d). The rest of the figure illustrates the remaining steps leading to the final partition tree (e, f, g, h).


A drawback is that some partitions produce subsystems that are not tractable. Hence, after a given timeout, the computation is stopped and the cluster is invalidated. That is to say the reactions of the subsystem are not replaced by meta-reactions in the newly produced network, which produces a reduction of zero. To produce a sufficiently fast method, the timeout is set to less than one minute per partition. The approaches used to produce the binary splits are the spectral packing and the Schuster packing.

The spectral splitting This approach is based on the spectral clustering (Von Luxburg[ 2007] and chapter 14 in Hastie et al.[ 2005]). This approach is deterministic and parameterless. It only depends on the spectra of the network Laplacian and its results is summarized in table 12 for the E. coli metabolic networks. The purpose of clustering is to partition a set of points into different groups according to their similarity. In the case of a graph the equivalent problem is to separate the vertices into k groups such as the weights of the edges are low between groups and high within the groups. When only two groups are considered the problem is to find the min cut of the graph. The steps of a spectral clustering are: 1. compute the unnormalized Laplacian of the graph L ∈ Rn×n, see equation 59; 2. compute the first k eigenvectors of L (each eigenvector ∈ Rn); 3. arrange the k eigenvectors in a column matrix V ∈ Rn×k; 4. use a clustering algorithm (i.e. the k-means 7 algorithm, MacQueen et al.[ 1967]) to cluster the n points forming the row of V to producing k clusters. k Each one of the n points vi ∈ R in V (i.e. the row of V) corresponds to the vertex i in the graph. The order of the rows is the same as in the adjacency matrix of the graph used to compute the Laplacian of the graph. Hence the result corresponds to the clustering of the vertices. Lets recall that the unnormalized Laplacian of a graph is defined as:

$$L = D - W \qquad (59)$$

where D is the degree matrix, i.e. the diagonal matrix whose elements are:

$$d_i = \sum_j w_{ij}. \qquad (60)$$

We add that in the case of k = 2, it can be shown that the spectral clustering is an approximation of the solution of the min-cut problem (chapter 25, section 4 in Murphy [2012] or chapter 11, section 5 in Newman [2010]).
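The steps above can be sketched for the binary case (k = 2) as follows. This is only an illustrative sketch, not the implementation used in this work: it assumes NumPy is available, the function names and the toy graph are hypothetical, and the k-means step is replaced by a simpler rule (splitting on the sign of the eigenvector associated with the second smallest eigenvalue), which is a common shortcut for two groups.

```python
import numpy as np


def unnormalized_laplacian(W):
    """Return L = D - W (equation 59) for a symmetric weight matrix W."""
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))  # d_i = sum_j w_ij (equation 60)
    return D - W


def spectral_bisection(W):
    """Split the vertices into two groups (labels 0 and 1).

    Assumption: the k-means step is replaced by a sign split on the
    eigenvector of the second smallest eigenvalue of L.
    """
    L = unnormalized_laplacian(W)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    _, eigenvectors = np.linalg.eigh(L)
    second = eigenvectors[:, 1]
    return (second >= 0).astype(int)


# Hypothetical toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3)
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0

labels = spectral_bisection(W)
print(labels)
```

On this toy graph the split separates the two triangles, cutting only the single edge that joins them. Which triangle receives label 0 depends on the arbitrary sign of the eigenvector.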

The Schuster splitting The Schuster splitting is based on the criterion of metabolite classification proposed in Schuster et al.[ 2002b]. The main idea is to convert internal metabolites into external metabolites based on the degree of the metabolite in order to minimize the potential number of extreme pathways or elementary modes in a subsystem.

7. It should be noted that the algorithm has a randomized selection of initial centers, therefore it may be considered as non-deterministic depending on the data. However there exist approaches to make the algorithm deterministic [Celebi and Kingravi, 2012].

62 3.8 Efficiency of the network packing


Figure 33: Conversion of one metabolite into an external metabolite (or an exchange flux). On the upper left, when the high-degree metabolite (the circle) is an internal metabolite, the number of extreme pathways through the metabolite is potentially ten, because five reactions produce it (on the left) and two consume it (on the right). The worst case is then 5 × 2 = 10 (lower left, the extreme pathways are the black arrows). On the upper right, when the metabolite is considered as external, the potential number of pathways of the system is seven, because five use the external flux to exit the metabolite and two use the input flux to enter the metabolite. Thus the worst case is 5 + 2 = 7 (lower right, the extreme pathways are the black arrows).

The observation is simple and intuitive. When a metabolite has a high degree, it is produced or consumed by many processes. Therefore the potential number of pathways transforming this metabolite is the product of the number of processes that produce it and the number of processes that consume it (that is to say, the in and out degrees of the metabolite in the network). By making it external, that is to say by converting it into two exchange fluxes (one input and one output), the number of pathways is no longer the product but rather the sum. Indeed, the pathways are no longer combined and are considered independently. This case, and how a network is split into subnetworks, is illustrated in figure 33. As this criterion has been proposed to minimize the number of elementary modes, it is an interesting choice for building meta-reactions. Indeed, this criterion may produce subnetworks having a minimal number of extreme pathways, thus producing few meta-reactions. This approach is very fast and deterministic (table 12).
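The product-versus-sum argument can be checked with a few lines of code. The function below is a hypothetical helper written for illustration only; it simply encodes the worst-case counts described in the text and in figure 33.

```python
def worst_case_pathways(n_producers, n_consumers, external=False):
    """Worst-case number of pathways through one metabolite.

    Internal metabolite: every producing reaction can be combined with
    every consuming reaction, hence the product of in and out degrees.
    External metabolite: producers and consumers are handled
    independently through exchange fluxes, hence the sum.
    """
    if external:
        return n_producers + n_consumers
    return n_producers * n_consumers


# Values of figure 33: five producing and two consuming reactions
internal_count = worst_case_pathways(5, 2)
external_count = worst_case_pathways(5, 2, external=True)
print(internal_count, external_count)  # 10 7
```

The gain grows quickly with the degree: for a metabolite with 20 producers and 20 consumers, the worst case drops from 400 to 40.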

3.8.3 Comparison of the results

The packing performance of each approach is given in table 12. The results show that all the approaches produce packed networks of similar sizes, and thus one should choose the most time-efficient one. The fastest approach is the one using the Schuster criterion. The random approach is not worth it, as it is the slowest: it is several orders of magnitude slower than the Schuster criterion (up to 1000 times slower). Regarding the extreme pathway metric and the spectral clustering on the metabolic network, the performance is similar but the computation time is up to 100 times slower. Hence the choice of the approach is indisputable.

Table 12: Packing efficiency on the reconstructed E. coli metabolic network from the KEGG database. The reduction is applied on the pruned network. The last row is the size of the unpacked network, i.e. its size after the pruning operation. The slow-down column indicates how many times slower the approach is than the Schuster criterion.

Method       Compounds  Reactions  Vertices  Edges  Slow-down
Random⁸      121        341        462       1522   10^3
Spectral     116        345        461       1600   10^2
Schuster     101        316        417       1477   1
Simplified   275        513        788       2121

3.9 performance of hierarchical extreme pathways computation

Unfortunately the best obtained reduction does not allow the computation and the reconstruction of the extreme pathways of the full E. coli metabolic network. However, performance has been measured on metabolic networks that are easy enough. In this case, easy means a metabolic network whose extreme pathways can be computed in a reasonable amount of time (i.e. less than 24 hours). Otherwise they are considered as hard, that is to say intractable in 24 hours. This time problem is maybe due to the available amount of memory. Indeed, our software is executed on a Java virtual machine⁹, and the automatic memory management by the garbage-collection algorithm may have drastically slowed down the computation of the extreme pathways, thus making the computation impossible. So metabolic subnetworks were extracted from the E. coli metabolic network. A total of 155 subnetworks were extracted and their sizes ranged from 600 to 650. We are aware that this test environment is not ideal, as the packing operation may not have the same reduction ratio on network samples. Thus it may not produce a significant improvement in computation time or memory usage. To give an idea of the real sizes of the metabolic subnetworks (the samples) and of their packing performances, figure 34 summarizes with a box plot the number of edges of the networks (before and after packing) and shows the distributions of the reduction sizes for each component of the networks. We recall that a box plot is a convenient way to present data [Williamson et al., 1989]. The bottom and top of the box are the first and third quartiles. The band inside the box is the median. The ends of the whiskers are the lowest and highest data points that are still within 1.5 times the interquartile range of the lower and upper quartiles. Then time measurements of the duration of the extreme pathways computation are made to compare the computation time between packed and unpacked networks.
To get a general idea of the performance, we plot the computation times of the uncompressed networks versus the compressed networks (figure 35). The points seem to be spread along the y = x line, hence indicating that the compression does not help in accelerating the computation. However, we recall that the compression did not reproduce the same compression ratio. With only a visual inspection of the data, it is very hard to assess whether the computation times are identical. Thus we conducted a Wilcoxon signed-rank test. This test is a non-parametric statistical hypothesis test used to compare two related samples, or repeated observations of the same event, and to assess whether their population mean ranks differ [Wilcoxon, 1945]. It has the advantage that it does not assume that the data follow any particular distribution¹⁰. The test produces a p-value¹¹ that turns out to be ≈ 10⁻⁵ (lower than the 0.05 significance level), hence the null hypothesis, that the populations are identical, is rejected.

8. In the case of the random approach, the values are the means of the results obtained with the best parameters.

9. At the time this work was done, the Java virtual machine version was 1.7.0_09 (icedtea).

Figure 34: Sizes of the sample networks used to assess the performance of hierarchical packing. The sizes are shown for the different components of the graph along with their packed version. The reductions were obtained with the random packing approach (a). Histograms of the reduction sizes for the different components of the networks between the networks and the packed versions (b).
[Panels: Vertices, Compounds, Reactions, Edges; box plots of the original and packed sizes, and frequency histograms of the reductions.]

Figure 35: Plot of the computation times (in [ms]) on the uncompressed networks versus the compressed networks. The times on the packed networks only concern the computation of the extreme pathways and do not include the reconstruction step. The dashed line is the x = y line. The four plots are the same except that the interval is reduced to make visible the distribution of the points for the shorter computation times.
[Axes: time on packed network (horizontal), time on original network (vertical).]

Figure 36: Percentage of 24-hour timeouts as a function of the number of vertices (expressed as a percentage of the E. coli network). The transition is steep, hence the difficulty of obtaining big networks whose extreme pathways are still computable.
[Axes: sample size in percent of the E. coli network (horizontal), percentage of networks causing a timeout (vertical).]

The extreme pathways computation times of a network and of its compressed version differ. It seems that there is a speedup, but to be able to make those measurements we were limited to the computationally achievable networks. Those cases were too easy, thus it is hard to deduce a relation between the compression and the improvement in computation time. It is also hard to find networks that have a computation time acceptable for empirical measurement (i.e. less than 24 hours without timeout) while being fairly compressible. For example, figure 36 shows the percentage of network samples that cause a one hour timeout, and it also shows a steep transition around 0.5 (50% of the E. coli vertices). This illustrates the difficulty of generating samples that are big enough and still easily computable (without any guarantee that those samples are well compressible). We also add that on average the compressed networks are composed of ≈ 93% of the vertices of the uncompressed networks, and the computation on the compressed networks takes ≈ 95% of the time needed for the uncompressed networks. However we will not draw any conclusion from this last statistic. The chapter conclusion will discuss again the problem of network size and extreme pathways computation.
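To make the paired comparison concrete, the following sketch implements a two-sided Wilcoxon signed-rank test with the usual normal approximation. It is a simplified illustration, not the procedure used to obtain the p-value reported above: the tie correction of the variance is omitted, the function name is hypothetical, and the timings are invented for the example.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (W+, p-value). Zero differences are discarded and tied
    absolute differences receive their average rank. Assumption: no
    tie correction is applied to the variance.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks start at 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return w_plus, p_value


# Hypothetical timings in ms: each packed network is slightly faster
original_ms = [120, 340, 95, 410, 220, 180, 510, 75, 260, 150]
packed_ms = [t - gain for t, gain in zip(original_ms, range(1, 11))]

w_plus, p_value = wilcoxon_signed_rank(original_ms, packed_ms)
print(w_plus, p_value)
```

With these invented data every difference is positive, so W+ takes its maximal value n(n+1)/2 = 55 and the null hypothesis is rejected at the 0.05 level, even though each individual gain is small. This mirrors the situation described in the text, where a small but systematic difference is statistically detectable.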

10. It can be shown with a Q-Q plot that the measured times do not follow a normal distribution.

11. Let us recall that the p-value is the smallest level α0 such that the null hypothesis would be rejected at level α0.


Figure 37: Plot of the computation times (in [ms], log scale) on the sample networks versus the number of vertices in the sample. This plot shows a very weak correlation between the computation time and the number of vertices.

3.10 detection of intractable systems

The problem with network packing is that the packed problem may still be as hard as the unpacked one. Indeed, the difficulty of a network is not completely related to its size, but also to its topology. When a network is packed, the number of extreme pathways remains the same and the elements in the matrix generating intermediate rays may not be simplified further. Observing only the size of a network is clearly not enough to assess the complexity of the problem to be solved. There is no clear relation between the computation time of the extreme pathways and the number of vertices, as illustrated in figure 37. For a given number of vertices, the computation time can increase by a factor of up to 1000 in the worst case. This is why we think that the packed matrix (the constraints matrix C in equation 42) still contains the structure that poses the problem, a structure that is probably linked to the degree distribution of the metabolites. To illustrate the cause of the packing inefficiency, we first propose to study the degree distributions of three types of networks: the E. coli metabolic network (referred to as full K12), the compressed E. coli metabolic network (referred to as packed K12) and a set of easy networks. The easy networks are networks whose extreme pathways are computed in a reasonable amount of time¹² and which are composed of a number of vertices similar to that of packed K12. Those easy networks were extracted using the forest fire sampling method on the E. coli network, in the same manner as in the random packing approach (see section 3.8.1). We add that the targeted size of an extracted subnetwork is its size after it has been pruned (section 3.2). This procedure produced 10 different easy networks. To see how similar the easy networks are, we plotted the distributions of the compound degrees and reaction degrees.

12. This means less than 12 hours, however the majority is computed in less than 10 minutes.


Figure 38: Plot of the log of the degree distributions for the compounds (left) and reactions (right) of the 10 easy networks. On the left, the compound degree distribution is fitted with a power law function. This emphasizes the similarity of the distributions, as the slopes are similar (i.e. the γ in the function y = ax^γ). Each color represents one of the 10 easy networks.
[Axes: Degrees (horizontal), distribution on a log scale (vertical).]
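The exponent γ of the power law y = ax^γ mentioned in the caption can be estimated by a straight-line fit in log-log space. The sketch below is one possible way to do it and is an assumption: the text does not state which fitting procedure was used, the function name is hypothetical and the data are synthetic.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**gamma on log-transformed data.

    Taking logs gives log(y) = log(a) + gamma * log(x), so gamma is the
    slope and log(a) the intercept of an ordinary linear regression.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    gamma = sxy / sxx
    a = math.exp(mean_y - gamma * mean_x)
    return a, gamma


# Synthetic degree frequencies generated from y = 0.5 * x**-1.5
degrees = [1, 2, 5, 10, 20, 50]
frequencies = [0.5 * d ** -1.5 for d in degrees]

a, gamma = fit_power_law(degrees, frequencies)
print(a, gamma)
```

Comparing the fitted γ values across networks gives a numerical counterpart to the visual comparison of the slopes.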

The similarity of those networks in terms of degree distribution is assessed in figure 38. Those plots provide a first clue that the easy networks share some common structure in the graph. Then we compared the K12 network to the packed K12 network (figure 39). Both networks are considered as hard problems for the extreme pathways computation. We first note that in the packed K12 network, the distribution of the reaction degrees is now fat-tailed. This is an intuitive consequence of the packing, as a meta-reaction embeds several reactions and its substrates and products are the inputs and outputs of the system it represents. This system is also probably composed of reactions converting few different metabolites, hence explaining the fat tail for the reactions' degrees. Regarding the distribution of the chemicals, it seems that the constraint matrix has an up-shifted distribution. This is also fairly intuitive. The small-degree compounds have a higher probability of being embedded in a meta-reaction and thus of being removed from the network. The replacement of reactions by meta-reactions increases the proportion of reactions having a high degree (i.e. reactions converting a lot of chemicals). Visually it produces a fat tail in the graph. This means that some chemicals are shared by a lot of reactions and meta-reactions and cannot be embedded in meta-reactions. Those metabolites are probably used in several subsystems abstracted by the meta-reactions and therefore cannot be part of any

compression. This conservation with respect to the compression seems to concern mainly the high-degree metabolites. Indeed, the distribution of the packed K12 appears as a translation of that of the K12 network, thus indicating only a reduction of the proportion of the low-degree metabolites (i.e. those encapsulated in a meta-reaction). This tends to demonstrate that packing only concerns sparse parts of the stoichiometric matrix. The distributions of the reaction degrees are similar between the K12 network and the 10 easy networks (figure 39). This is fairly intuitive, as both are modeled from a real metabolic network and a reaction transforming around eight chemicals is rare¹³. Regarding the distribution in the case of the compounds, it is again intuitive that the low-degree metabolites have similar distributions. However, in the easy networks there is a significant difference regarding the distribution of the high-degree metabolites (i.e. a degree greater than fifty). Indeed, those are even nonexistent in the easy networks, thus confirming the conservation of a common structure in the K12 network that makes the extreme pathways computation intractable and that is not part of the easy networks. The last plot (figure 40) shows the degree distributions when only the conserved metabolites are taken into account. In this case the K12 and packed K12 distributions are identical. This shows that those metabolites are conserved during the compression (i.e. they do not lose connections because of encapsulated reactions) and probably form the hard part of the network that characterizes its difficulty regarding the extreme pathways computation. Hence, even if the network is reduced to a size comparable to some easy networks, the packed K12 is still not tractable because of this conserved structure. This argument is unfortunately not based on a mathematical demonstration.
However we are convinced that the hard metabolic networks have in common a structure (in the network, and thus in the stoichiometric matrix) that is incompressible and makes them intractable, regardless of the number of vertices in the network. This makes the strategy of network compression inefficient and not worth considering. However, the constraints matrix may be worth studying. Indeed, the compression or packing may be an indicator of the feasibility of the extreme pathways computation. The intuition comes from the formalization of the algorithm. Indeed, in the case of a hard network, once the matrix S has been block diagonalized, the constraints part (i.e. the matrix C) corresponds to the compressed network and is still as hard as the uncompressed version of the network. Hence the decision to study some statistics of the networks before and after the compression step. The idea is to spot a difference or a conservation in the structure of the network once it has been compressed, in the hard or easy case. Following the preceding observation, we decided to try to measure the difference between an easy network and a hard network. To do this, we generated networks of sizes ranging from 650 to 700 vertices from the K12 network, in the same manner as for the extraction of the easy networks. We allowed a maximum computation time of 24 hours for the extreme pathways. If the extreme pathways can be computed within the time limit, a network is labeled as easy; if not, it is labeled as hard. Then for each network, regardless of its difficulty label, a compression operation is applied. First let's define four types of networks: the easy uncompressed networks (identified only as easy), the easy compressed networks, the hard uncompressed networks (identified only as hard) and the hard compressed networks. The idea is to observe the degree distributions in the networks and more specifically the degrees of the compounds. For example, let's consider a very long linear

13. At least in the available biological databases.

Figure 39: The empirical cumulative function and the log of the degree distribution for the reactions in all three considered networks (K12 full, K12 packed, easy) are plotted in the first row of the figure. The same plots for the compounds in the networks are in the second row.
[Axes: Degrees (horizontal); empirical CDF Fn(x) and distribution on a log scale (vertical).]

Figure 40: Empirical cumulative function and the log of the distribution of the common compounds degrees in the K12 networks (K12 full, K12 packed, easy). The common compounds are those that are also present in the packed network (i.e. compounds that are not encapsulated in a meta-reaction).
[Axes: Degrees (horizontal); empirical CDF Fn(x) and distribution on a log scale (vertical).]


chain of non-reversible reactions with one input flux to the reaction on one side of the chain and one output flux on the other side of the chain. This kind of network contains one unique extreme pathway, which is quickly computed. Moreover, this network has a high compression ratio¹⁴. Indeed, this kind of network can be compressed into a one-reaction network (i.e. only converting the input and output of the system). Interestingly, those two networks (uncompressed and compressed) have very similar compound degree distributions. In fact, in both networks all the compounds have a degree of two. Thus the compression operation on an easy network may act as a scale factor on the network. To quantify this difference, we supposed that the easy, easy compressed and hard networks are generated from the same process, thus the degrees of those networks come from the same distribution. Then we suppose that the degrees of the hard compressed networks come from another distribution (or the same one but with different parameters). When plotting the cumulative distributions, we observe that the difference between the easy and easy compressed networks is not as big as it is between the hard and hard compressed networks. To measure the distance between two distributions P and Q, we use the Kullback-Leibler divergence:

$$K(P \| Q) = \sum_i P(i) \ln \frac{P(i)}{Q(i)}. \qquad (61)$$
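Equation 61 can be evaluated directly on two discrete distributions defined over the same support. The sketch below is illustrative only: the function name and the two toy distributions are hypothetical, and the convention of returning infinity when Q(i) = 0 while P(i) > 0 is an assumption, since the text does not say how empty bins were handled.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence K(P||Q) of equation 61.

    p and q are sequences of probabilities over the same support.
    Terms with P(i) = 0 contribute nothing. Assumption: the result is
    infinite if some Q(i) = 0 while P(i) > 0.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue
        if q_i == 0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical toy distributions over two degree values
P = [0.5, 0.5]
Q = [0.9, 0.1]

k_pq = kl_divergence(P, Q)
k_qp = kl_divergence(Q, P)
print(k_pq, k_qp)
```

The two printed values differ (about 0.51 and 0.37), and the divergence of a distribution with itself is zero.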

One may note that the divergence is not symmetric: K(P‖Q) ≠ K(Q‖P). As the distance is not symmetric, we compute K(P‖Q) such that P is the non-compressed network and Q is the compressed network. For the easy cases the distance is on average 0.0286 and for the hard cases it is on average 0.1461. Hence the compression operation induces a more drastic change in a hard network. However this observation does not allow us to decide whether a network is hard or not. After a compression operation on a network, the divergence should be sufficiently large to deduce that we are working with a hard network, but we do not know what sufficiently large is. Hence we used the two-sample Kolmogorov-Smirnov test to decide if the degree distributions of the vertices before and after a compression operation are different [Young, 1977]. We recall that the test uses the Kolmogorov-Smirnov statistic K_{n,m}, which is based on the empirical distribution functions of each sample (denoted F_{1,n} and F_{2,m}):

$$K_{n,m} = \sup_x |F_{1,n}(x) - F_{2,m}(x)| \qquad (62)$$

and tests if:

$$K_{n,m} > c(\alpha) \sqrt{\frac{n + m}{n \cdot m}}, \qquad (63)$$

where c(α) is a function of the significance level α. The ideal case would be that for easy networks the distributions are the same and for hard networks the distributions differ. The results of the tests are not ideal, but there is a tendency indicating that a hard network of comparable size to an easy network has a higher probability of rejecting the hypothesis that the degree distributions of the compounds are the same (figure 41). In the case of the easy networks, the ratio of networks that have the same distribution between the packed and non-packed versions is 0.94. In the case of the hard networks, the ratio of networks that have the same distribution between the packed and non-packed versions is 0.30 (table 13). That is to say, if the test is used as a criterion to identify tractable systems, we obtain a sensitivity of 0.94 (equation 74), a specificity of 0.7 (equation 75) and a total accuracy of 0.83 (i.e. the proportion of correctly classified networks over the total number of networks). Hence this shows that the packing can be used as a tool to produce a good criterion to identify the hard networks.

14. That is to say the number of vertices before the compression versus the number of vertices after the compression.

Table 13: Performances obtained with the Kolmogorov-Smirnov statistic to identify tractable metabolic networks for extreme pathways computation. The correctly identified values are the proportions of networks whose difficulty is correctly evaluated.

                            Tractable networks  Intractable networks
Identified as tractable     179                 51
Identified as intractable   11                  120
Correctly identified        0.94                0.70

Figure 41: Two-sample Kolmogorov-Smirnov tests for all pairs of networks having similar sizes (≈ 650 vertices). A color indicates that the test is rejected and therefore that the degree distributions are not the same. The different colors indicate which group is compared. The blue ones are the easy networks, the green ones are the easy packed networks, the orange ones are the hard networks and the red ones are the hard packed networks.
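The decision rule of equations 62 and 63 can be sketched as follows. This is an illustration under stated assumptions rather than the code used for table 13: the function names and degree samples are hypothetical, and c(α) is taken from the classical large-sample approximation c(α) = sqrt(−ln(α/2)/2), which the text does not specify.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic K_{n,m} (equation 62)."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    largest_gap = 0.0
    for x in sorted(set(a) | set(b)):
        # advance each pointer so that i/n and j/m are the empirical CDFs at x
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        largest_gap = max(largest_gap, abs(i / n - j / m))
    return largest_gap


def c_alpha(alpha):
    """Assumed asymptotic critical coefficient, about 1.358 for alpha = 0.05."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0))


def ks_rejects(sample_a, sample_b, alpha=0.05):
    """Return True if equation 63 holds, i.e. the distributions differ."""
    n, m = len(sample_a), len(sample_b)
    threshold = c_alpha(alpha) * math.sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) > threshold


# Hypothetical compound degrees before and after packing
easy_before = [2] * 40 + [3] * 10
easy_after = [2] * 20 + [3] * 5      # same proportions, smaller network
hard_before = [2] * 40 + [3] * 10
hard_after = [2] * 5 + [30] * 20     # low degrees removed, hubs conserved

easy_rejected = ks_rejects(easy_before, easy_after)
hard_rejected = ks_rejects(hard_before, hard_after)
print(easy_rejected, hard_rejected)
```

In this toy setting the test is not rejected for the easy pair, whose degree proportions are unchanged by packing, and it is rejected for the hard pair, which is the behaviour the criterion relies on.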

3.11 conclusion and perspective

Although several similar approaches have been proposed (e.g. in Schilling et al. [2000]), we are the first to provide an implementable algorithm with a mathematical formalization and a performance analysis. Moreover, the algorithm described in Schilling et al. [2000] needs to compute all the subsystems' extreme pathways for a given division of the network, otherwise the exchange constraints cannot be correctly set. With improper constraints, the union of the subsystems' solutions will produce wrong rays in the flux cone. Depending on the division of the network this may lead to intractable

partition and therefore it is impossible to set the constraints. Hence this approach may not be suitable for network packing as presented in this thesis. To circumvent this problem, in our algorithm the meta-reactions are still created using the extreme pathways of the purely local subsystems and are added to the network. Purely local means that the extraction of a subnetwork (and its constraints) does not depend on the pathways of the rest of the system (or its division). This has the drawback of potentially producing subnetworks whose exchange fluxes are not constrained enough in the context of the whole system. Consequently this may add undesirable meta-reactions. However this case can only appear when a compound is consumed and produced by both the system and the subnetwork (see table 9). This is not an issue, as the erroneous extreme pathways produced by those reactions are easily identified by a last independence check at the end of the computation, thus allowing the exact computation of the extreme pathways of the whole system. Moreover, we have proposed a formalism of our algorithm (that can easily be extended to the SP-Algorithm), and to our knowledge a full description of this approach in terms of matrix operations has not yet been published. This formalism allowed us to see that the stoichiometric matrix of the packed network is the product of the constraints matrix with the solutions of the decoupled subnetworks. This new stoichiometric matrix seems to be just a more compact (or less sparse) version of the stoichiometric matrix of the network before the preprocessing. Unfortunately, the core structure of the matrix that causes the numerous intermediary non-valid solutions in double description based methods may still be present. Therefore the enumeration problem of the processed network may stay as hard as the unprocessed one.
A mathematical demonstration of the previous statement is beyond the scope of this project. However, we have conducted experiments to empirically assess the usability of an approach based on subdivision and packing of the network. Unfortunately this approach does not seem to allow the computation of extreme pathways in large genome-scale metabolic networks (e.g. E. coli or H. sapiens). We have also proposed and studied the efficiency of several packing approaches. In the case of E. coli, almost all approaches reduce the size of the network with similar performances (more or less half of the vertices). We have compared them and propose that the best approach is the one based on the use of the Schuster splitting in a hierarchical manner. This approach is the fastest to compute and produces a reduction equivalent to those of the other methods described and evaluated in this thesis. Nevertheless this study seems to be the first to formalize the algorithm based on the division of a system into subsystems to compute the extreme pathways. It also seems to be the first to propose a study of the efficiency of network subdivision and packing to compute the extreme pathways. Even if it has not led to publishable results, our study has numerically demonstrated that this approach, although proposed in the literature, should be given up. We have also shown an interesting result regarding the identification of non-tractable networks with the packing (or compression). It seems to be the first time that network subdivision and packing have been used to identify non-tractable problems, and this can be a complementary analytical tool to decide if the extreme pathways computation should be applied to a given metabolic network. We believe that this direction should be explored in order to describe the structural change between the non-packed and packed networks. We also consider that efforts to improve the computation of generating sets should be focused on the improvement

of canonical basis algorithms or the more promising nullspace algorithms. Finally, we think that packing with meta-reactions can be an interesting tool for visualization, as it offers a network with the same metabolic capabilities as the original one but with fewer components, thus displaying a simpler network that can be more easily interpreted and understood by researchers.

4 MODULE DETECTION IN METABOLIC NETWORK

The graph or network is a powerful and convenient tool to represent complex phenomena. Network analysis is widely used in many disciplines, from biology to social science, to understand complex systems. Those networks are often composed of thousands of vertices and as many or more edges, thus a simple inspection of the raw graph does not provide information or knowledge about the phenomena modeled by the network. Hence the use of mathematical tools to analyze the networks. This chapter is devoted to module detection in metabolic networks. The continuity of this chapter with the preceding one is that we are still using the extreme pathways sets to discover modules of a metabolic network.

4.1 motivation

Network analysis is widely used in many disciplines, as the graph is a convenient tool to represent complex systems. An example of a complex system modeled by a graph is the power grid, where analyses have been conducted to evaluate its reliability [Crucitti et al., 2004, Chassin and Posse, 2005, Kinney et al., 2005]. Social interactions can also be modeled by a network and are maybe the most understandable models for a human being, as everybody is concerned by socialization [Borgatti et al., 2009]. An example of social network analysis is the development of tools for law enforcement (understanding criminal organizations or terrorist groups) [Sparrow, 1991, Krebs, 2002, Xu and Chen, 2003, Ressler, 2006, Koschade, 2006, Perliger and Pedahzur, 2011]. Module detection consists in dividing the vertices of the network into groups¹. Ideally, the vertices that belong to a group should share some common or close characteristics. Typically the vertices in a group are more similar to each other than to the vertices in other groups. In the case of a biological network, the vertices in a group should share some common biological, functional or chemical properties. This is for example the case for protein-protein interaction (PPI) networks and metabolic networks. Graph analysis and clustering are used to detect functional modules, which are sets of biological objects that participate together in a particular cellular process at the same time. Such approaches are often used to predict the function of proteins in the network [Sharan et al., 2007, Samanta and Liang, 2003, Hu et al., 2011, Huynen et al., 2003, Pereira-Leal et al., 2004]. Network analysis is also used to discover and predict protein complexes, which are proteins interacting to form a single molecular machine [Yu et al., 2006, King et al., 2004, Jung et al., 2010, Qi et al., 2008]. Detecting modules also allows obtaining a coarse, and thus more convenient, representation of the system, thus providing a better understanding of the role of the components that make up the system. In this chapter we explore the detection of modules in a metabolic network. Indeed, a metabolic network is a complex graph, composed of thousands of vertices and edges. Usually when metabolism is studied or taught, the whole network is never used. Instead it is partitioned into several fragments that are more convenient to understand or interpret. In textbooks, the metabolism

1. Graph partitioning, communities’ detection or clustering, all refer to the same concept.

77 4 module detection in metabolic network is nicely split into clear parts. However in reality everything is entangled and hard to read. The usual proposed separations are maybe too artificial. Moreover, the modules themselves are loosely defined. For example, with social network, it is somewhat easy to find a meaning with community detection according some attributes describing the actors of the network. As the interactions happen from human social behavior, finding an explanation of what gives birth to the clusters is easy. In the case of metabolic networks the task is much harder. Moreover a biological module can be described for historical reasons (i.e. the order of discovery of the enzyme accomplishing the functions) or be influenced by the knowledge of the one defining the module. For example, one can think of an expert in the metabolism of a specific amino acid. The expert may have the tendency to aggregate some loosely related reactions or compounds to the module synthesizing or degrading this particular amino acid. Thus making hard to decide if a given cluster or module is correct. It can often be observed in a way it makes a biological sense in accord to some accepted and artificial splitting of the metabolism. Interestingly the authors in Papin et al.[ 2004a] speak of biased and unbiased clusters. An unbiased view is a mathematical approach to build a partition of the network and a quantitative criterion to assess the correctness of the clusters. Such approaches are described in this chapter in the following sections. The extreme pathways of a metabolic network, once computed, allow to describe the metabolic capacities of the network. The intuition is that chemical reactions belonging to common pathways are more likely to be part of a common function or involved in the same process. Hence the reactions in common pathways may be part of the same regulation process. 
The challenge is particularly interesting as it also try to answer the question, if the structure of the metabolic network allows to infer genetic processing. In the current case the structure is provided by the stoichiometry matrix and the pathways by the extreme pathways.

4.2 an extreme pathways similarity measure

There exist several approaches to detect modules in a network. The graph can be divided into groups by removing edges until the desired number of connected components appears. For example, Girvan and Newman use the edge betweenness measure to successively remove edges in a network until the vertices are split into the desired number of connected components or modules [Girvan and Newman, 2002]. Spectral clustering also divides the graph into several components by applying a min-cut-like criterion ² to separate the vertices into modules [Pothen et al., 1990]. Modules can also be found by searching for the partition that maximizes the network's modularity. The modularity measures the strength of the division of a network into modules. A high modularity indicates that the vertices in a module have more connections within their module than with nodes in different modules [Guimerà et al., 2005, Newman, 2004, 2006]. Vertices can also be grouped according to a measure of similarity between vertices, for example the cosine similarity [Salton, 1989] or the Pearson correlation coefficient between two rows of the adjacency matrix [Stigler, 1989]. However, these classical measures may capture structural similarities that are not of any biological significance.

2. It can be demonstrated that spectral clustering corresponds to a relaxed version of the graph min-cut problem.
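As a small illustration of a classical vertex similarity, the cosine similarity between two rows of an adjacency matrix can be sketched as follows. The graph `A` is a hypothetical four-vertex example and the function name is only illustrative; returning 0 for an isolated vertex is an assumption made to avoid a division by zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two adjacency-matrix rows."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    # Assumption: a vertex without neighbors has similarity 0 with every vertex.
    return dot / norm if norm else 0.0

# Hypothetical undirected graph: the square 0-1-3-2-0.
A = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
same_neighbors = cosine_similarity(A[0], A[3])       # both adjacent to 1 and 2
no_common_neighbor = cosine_similarity(A[0], A[1])   # adjacent, but no shared neighbor
```

Vertices 0 and 3 are not connected, yet they are maximally similar because they have the same neighbors. This is a purely structural notion of similarity.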


Several approaches have been proposed to cluster metabolic networks specifically. For example, one approach uses the metabolites with the smallest degree to detect modules in a metabolic network [Gagneur et al., 2003]. But as the metabolism is strongly linked to the notion of pathways, approaches based on the metabolic capacity of the system have also been proposed. For example, the so-called Schuster criterion has been used to divide a network into subnetworks [Schuster et al., 2002b]. The criterion is detailed in section 3.8.2 and is not based on any computation of fluxes or extreme pathways. One interesting idea is the concept of correlated reaction sets (co-sets) [Papin et al., 2004a, Xi et al., 2011]. The idea is to use the flux matrix to compute the correlation coefficient between each pair of rows. This provides a measure of similarity between the network's reactions. The authors define several reaction sets: the perfect co-sets, the partial co-sets and the directional co-sets. The perfect co-sets are composed of reactions having a squared correlation coefficient equal to one. A positive squared correlation coefficient smaller than one groups the reactions forming a partial co-set [Burgard et al., 2004]. The authors state that co-sets in metabolic networks identify functional modules but, as revealed in this thesis, the modules are not necessarily intuitive. Nevertheless they identify 45 co-sets (out of 66) in which at least half of the genes belong to the same regulon [Papin et al., 2004a]. Another interesting approach is the use of so-called t-invariants, which come from Petri net theory. In the case of biological network analysis or simulation, the t-invariants correspond to the elementary modes [Schuster et al., 1996] and can be applied to metabolic networks. Grafahrend-Belau et al. [2008] propose to use the t-invariants to build clusters. Roughly, they create groups by hierarchical clustering based on the Tanimoto distance between the t-invariants. They apply their approach to the pheromone response pathway in yeast and to the gene regulation of Duchenne muscular dystrophy in human, and were able to find clusters of biological meaning.

However, those approaches require the computation of elementary fluxes or extreme pathways. Our goal is to propose an approach based on the co-occurrence of a pair of reactions in pathways. We also want an approach that does not require the computation of all pathways, as this problem can be intractable. Moreover, we are not interested in the rate of a reaction but rather in the pathways the reaction is involved in. We only want to characterize a reaction (and, by transitivity, the enzymes that catalyze it) by the extreme pathways in which the reaction participates, and to derive a metric or a distance that captures the similarity between reactions.

These ideas are very interesting, and we also tackle the problem of building a metric that can be used to cluster the network. In our case we focus on producing attributes for reactions and on measuring the similarity between those attributes to infer the similarity between two reactions. To do this we rely on the extreme pathways; we are not interested in the rate of a reaction but rather in its participation in a given pathway. For a given reaction i, its value in an extreme pathway indicates the rate at which the reaction converts chemicals. If the value is non-zero, the reaction participates in the transformation of the input metabolites into the output metabolites; otherwise it does not participate. Thus, for a given row $p_i$ of P corresponding to the reaction i, every non-null value $p_{ij}$ corresponds to the boolean ⊤ (logical true) and every null value to ⊥ (logical false). Those booleans define the attribute $a_{ij}$:

\[
a_{ij} =
\begin{cases}
\top & \text{if } p_{ij} > 0, \\
\bot & \text{otherwise.}
\end{cases}
\tag{64}
\]

This forms the attribute vector $a_i$. Hence the attribute of a reaction is simply a set of indexed booleans indicating in which extreme pathways the reaction has a non-null rate. The similarity between a pair of reactions can be computed with the Jaccard index [Jaccard, 1901, pp. 241-272]. The index measures the similarity between two samples. It is the ratio between the cardinality of the intersection and the cardinality of the union of the considered samples. The definition for two attribute vectors a and b is:

\[
J_{a,b} = \frac{|a \cap b|}{|a \cup b|} \in [0, 1].
\tag{65}
\]

The Jaccard index can be computed for two boolean attribute vectors a and b, where the intersection cardinality is the number of positions where $a_i \wedge b_i$ is true and the union cardinality is the number of positions where $a_i \vee b_i$ is true. This definition is also known as the Tanimoto similarity [Rogers and Tanimoto, 1960]. The Jaccard index can be easily transformed into the Jaccard distance:

\[
d_{J_{a,b}} = 1 - J_{a,b} = 1 - \frac{|a \cap b|}{|a \cup b|} \in [0, 1].
\tag{66}
\]

The justification for constructing such a similarity is that a set of reactions frequently participating in the same extreme pathways (or elementary modes) has a high probability of being part of a non-splittable set of reactions. This may be better understood by considering the extreme cases. Suppose a pair of reactions i and j are always part of the same n extreme pathways. Then there exists no extreme pathway in which one of the reactions has a zero rate and the other does not. This implies that, in the considered network, it makes no sense to consider these reactions independently, as the network will never use them independently: they will always convert their chemicals together. Hence i and j are considered to have a perfect similarity. The opposite is also reasonable. Suppose a pair of reactions i and j are never part of the same extreme pathway, that is to say there exists no pathway $p_k$ such that $p_{i,k}$ and $p_{j,k}$ are both non-null. Then it makes no sense to observe those reactions at the same time to study them. Therefore i and j have no similarity. Finally, this can be extended to any fraction of common participation in extreme pathways: the more the reactions i and j participate in common extreme pathways, the more similar i and j are.

Hence, for a network containing n reactions and producing m extreme pathways ($P \in \mathbb{R}^{n \times m}$), the steps needed to compute a matrix containing the pairwise distances between all pairs of reactions are:

1. compute the matrix P;
2. for each reaction i, compute its attribute vector $a_i \in \mathbb{B}^m$ (equation 64), where we define $\mathbb{B}$ as the set {⊤, ⊥};
3. for each pair of reactions i and j, compute the distance with equation 66.

This produces a value for each pair of reactions that can be arranged in a matrix $\in \mathbb{R}^{n \times n}$. The next section illustrates a complete example of the computation of this matrix.
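The three steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this thesis: the function names and the small matrix `P_example` are hypothetical, and the convention chosen for two reactions that participate in no pathway at all (distance 1) is an assumption, since the text does not cover that case.

```python
from itertools import combinations

def attribute_sets(P):
    """Equation 64: for each reaction (row of P), the set of pathway
    indices j such that p_ij > 0."""
    return [{j for j, rate in enumerate(row) if rate > 0} for row in P]

def jaccard_distance(a, b):
    """Equation 66: 1 - |a ∩ b| / |a ∪ b| on two sets of pathway indices."""
    union = a | b
    if not union:
        # Assumption: two reactions without any pathway are maximally distant.
        return 1.0
    return 1.0 - len(a & b) / len(union)

def distance_matrix(P):
    """Symmetric n x n matrix of pairwise Jaccard distances (zero diagonal)."""
    attrs = attribute_sets(P)
    n = len(attrs)
    D = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        D[i][j] = D[j][i] = jaccard_distance(attrs[i], attrs[j])
    return D

# Hypothetical matrix with 3 reactions (rows) and 4 extreme pathways (columns).
# The rates are arbitrary: only zero versus non-zero matters.
P_example = [
    [2.0, 0.0, 1.0, 0.0],   # reaction 0 participates in pathways 0 and 2
    [1.0, 0.0, 3.0, 0.0],   # reaction 1 participates in pathways 0 and 2
    [0.0, 1.0, 1.0, 0.0],   # reaction 2 participates in pathways 1 and 2
]
D_example = distance_matrix(P_example)
```

Reactions 0 and 1 always appear together and are at distance 0, whereas reactions 0 and 2 share one pathway out of the three in which at least one of them is active, which gives a distance of 2/3.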


Figure 42: The toy metabolic network, composed of two irreversible reactions (r1 and r4), two reversible reactions (r2 + r2′ and r3 + r3′) and four exchange fluxes (represented by triangles and unnamed in the figure).

Figure 43: The extreme pathways of the above toy metabolic network (figure 42). This figure only shows which reactions participate in each extreme pathway; it does not give information about the reaction rates, as those are not needed in this case.

4.2.1 Example of extreme pathways similarity

We illustrate through the following example how we can use the similarity defined above. The toy metabolic network is shown in figure 42. It is composed of two irreversible reactions, two reversible reactions, five chemicals and four exchange fluxes. There are seven extreme pathways describing the capabilities of this network. The pathways are graphically represented in figure 43. The matrix $D_{ep}$ contains all the pairwise distances between the reactions of the toy network:

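\[
D_{ep} =
\begin{array}{c|cccccc}
     & R_1 & R_2 & R_2' & R_3 & R_3' & R_4 \\ \hline
R_1  & 0   & 1   & 1    & 1   & 1    & 1   \\
R_2  & 1   & 0   & 1    & 0.5 & 1    & 1   \\
R_2' & 1   & 1   & 0    & 1   & 0.5  & 1   \\
R_3  & 1   & 0.5 & 1    & 0   & 1    & 1   \\
R_3' & 1   & 1   & 0.5  & 1   & 0    & 0.5 \\
R_4  & 1   & 1   & 1    & 1   & 0.5  & 0
\end{array}
\tag{67}
\]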

4.3 the ε-graph

The simplest usage of a dissimilarity measure is to construct the ε-graph of the reactions, which is a graph whose edges are weighted according to the similarity between a pair of vertices. Obviously a weight of zero means that the edge does not exist. Thus, if all similarities are strictly positive, the ε-graph is a complete graph. A clustering based on the spectrum of the Laplacian of the graph can then be computed to detect communities in the graph. The advantage of spectral clustering is that it naturally takes the edge weights into consideration, as the Laplacian of an undirected graph is computed with the weighted adjacency matrix.

Table 14: Partitions produced by spectral clustering of the toy network (figure 42). The partitions for k = 1 and k = 6 are not displayed as they are trivial (a single cluster containing all reactions, and six singleton clusters).

k   C1     C2                        C3            C4      C5
2   {R1}   {R2, R2′, R3, R3′, R4}
3   {R1}   {R2, R3, R4}              {R2′, R3′}
4   {R1}   {R2, R3}                  {R2′, R3′}    {R4}
5   {R1}   {R2, R4}                  {R2′}         {R3}    {R3′}

We recall that the Laplacian L is defined as:

\[
L = D - W,
\tag{68}
\]

where W is the weighted adjacency matrix and D is the diagonal degree matrix:

\[
d_{ij} =
\begin{cases}
\deg(v_i) & \text{if } i = j, \\
0 & \text{otherwise.}
\end{cases}
\tag{69}
\]

As we can see, this approach directly takes the edge weights into account. For example, with the matrix in equation 67, the matrix of eigenvectors is:

\[
U =
\begin{array}{cccccc}
u_1 & u_2 & u_3 & u_4 & u_5 & u_6 \\ \hline
0.41 & 0.0   & 0.0   & 0.0   & -0.0  & 0.91  \\
0.41 & -0.54 & 0.57  & 0.08  & 0.41  & -0.18 \\
0.41 & 0.30  & 0.41  & -0.46 & -0.57 & -0.18 \\
0.41 & 0.30  & -0.41 & -0.46 & 0.57  & -0.18 \\
0.41 & -0.54 & -0.57 & 0.08  & -0.41 & -0.18 \\
0.41 & 0.49  & 0.0   & 0.75  & 0.0   & -0.18
\end{array}
\tag{70}
\]

where each column $u_i$ (i = 1, . . . , 6) of the matrix U is an eigenvector, ordered from the smallest to the biggest eigenvalue ($\lambda_1 < \lambda_2 < \ldots < \lambda_6$). Then, if k clusters are needed, the first k eigenvectors are used to create k clusters (i.e. clustering six points $\in \mathbb{R}^k$). The produced clusters are listed in table 14. This partitioning can be intuitively understood with this simple example (figure 44). Indeed, the first cut (k = 2) isolates the reaction R1, as it is the one sharing the fewest extreme pathways with the others. This reaction makes up the only internal flux of the extreme pathway. Then, when three clusters are built (k = 3), again one cut isolates R1, and the other cut separates the flows: one in the direction producing the chemical “b” from the chemical “d”, the other producing “b” from the chemicals “d” and “e”. The other cuts (k = 4 and k = 5) do not provide an interesting insight into the network structure. The reader should note that the partition obtained with k = l is not the product of splitting one of the clusters produced with k = l − 1 (see for example k = 4 and k = 5 in table 14). These nice results on a toy network make sense and were a motivation to apply our metric to genome-scale networks.

Figure 44: The two clusters (k = 2) and three clusters (k = 3) produced by the spectral clustering algorithm.

We add that in the previous example we have taken into account the direction of the network. That is to say, we have considered a reaction and its reverse as different reactions. It is also possible to detect modules without taking into account the direction of the network. To do this, a reaction and its reverse are considered as one reaction. Its set of pathways is the union of the sets of pathways in which the reaction and its reverse are active. To see an application of this case, see section 4.6.
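The Laplacian of equations 68 and 69 can be sketched as follows. The weighted graph `W_example` is hypothetical, the function names are illustrative, and deg(v_i) is assumed to be the weighted degree (the sum of the weights of the edges incident to v_i). The eigenvector computation and the clustering of the embedded points are not shown. Instead, the sketch illustrates why the Laplacian is linked to cuts: for a 0/1 indicator vector x, the quadratic form xᵀLx equals the total weight of the edges leaving the indicated set.

```python
def laplacian(W):
    """L = D - W (equation 68), with D the diagonal matrix of weighted degrees
    (equation 69). W is assumed symmetric with a zero diagonal."""
    n = len(W)
    degrees = [sum(row) for row in W]
    return [[(degrees[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def quadratic_form(L, x):
    """x^T L x; for a 0/1 indicator vector this is the weight of the cut."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))

# Hypothetical graph: two tightly linked pairs {0, 1} and {2, 3},
# joined by two weak edges of weight 0.1.
W_example = [
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0, 0.0],
]
L_example = laplacian(W_example)
cut_between_pairs = quadratic_form(L_example, [1, 1, 0, 0])  # cuts the two weak edges
cut_single_vertex = quadratic_form(L_example, [1, 0, 0, 0])  # isolates vertex 0
```

Separating the two pairs costs 0.2, whereas isolating vertex 0 costs 1.1, so the cheap cut follows the weak edges. Spectral clustering relaxes the search for such cheap cuts to real-valued vectors, which leads to the eigenvectors of L.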

4.4 hierarchical clustering

The next approach using our metric to detect modules in the network is hierarchical clustering. In an agglomerative hierarchical clustering, the vertices are successively merged together to form modules. This is done until all vertices are part of one unique module. As the hierarchy is built, the agglomeration is made not only between vertices but also between a vertex and a cluster or between clusters. However, the metric is only defined between pairs of vertices. Hence we need to define how the metric is used to compute the distance between clusters. The criterion selected to compute the hierarchy is the unweighted pair group method with arithmetic mean (UPGMA). The general formula to compute the distance between two clusters A and B is:

\[
\frac{1}{|A| \cdot |B|} \sum_{a \in A} \sum_{b \in B} \left( 1 - J_{a,b} \right).
\tag{71}
\]

It should be noted that this measure may group pairs of reactions that are not direct neighbors (i.e. that do not share common metabolites). However, this is not a problem, as we want to potentially catch similarities that are not directly visible in the graph topology. This choice is also validated on the E. coli network with a quantitative criterion that compares the UPGMA with the single linkage criterion,

\[
\min_{a \in A,\, b \in B} \left( 1 - J_{a,b} \right),
\tag{72}
\]

and with Ward's method [Ward Jr, 1963] (see section 4.7.1 and figure 60). Those are the most common linkage methods.

The hierarchical clustering of the toy metabolic network is illustrated in figure 45. This hierarchical decomposition of this simple network is similar to the one obtained by spectral clustering. When k ≠ 3 the partitions are the same. When k = 3, the reaction r4 is merged with the cluster containing r2′ and r3′. This merge is understandable as there is an extreme pathway converting “d” into “e”. There is thus a difference between the hierarchical clustering and the spectral clustering in terms of graph topology (see table 14). However, from a biological point of view both clusterings are equivalent. Indeed, grouping {r4} with {r2, r3} or with {r2′, r3′} is equivalent, as the distinction is only a consequence of converting reversible reactions into pairs of irreversible reactions: both a reaction and its reverse result from the action of the same gene product. Hence the groups are equivalent: {r2′, r3′, r4} ≡ {r2, r3, r4}. Again this nice interpretation motivates the usage of our proposed extreme pathways distance to measure the similarity between reactions.
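The UPGMA criterion of equation 71 can be sketched with a naive agglomerative procedure. This is only an illustration on the distance matrix of equation 67, not the implementation used for the results of this chapter: the function names are hypothetical, the procedure is cubic in the number of reactions, and ties between equally close clusters are broken arbitrarily (first pair found).

```python
def average_linkage(D, A, B):
    """Equation 71: mean of the pairwise distances between clusters A and B."""
    return sum(D[a][b] for a in A for b in B) / (len(A) * len(B))

def upgma_partition(D, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(D, clusters[x], clusters[y])
                if best is None or d < best[0]:   # ties: keep the first pair found
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

names = ["R1", "R2", "R2'", "R3", "R3'", "R4"]
D_ep = [  # distance matrix of equation 67
    [0, 1, 1, 1, 1, 1],
    [1, 0, 1, 0.5, 1, 1],
    [1, 1, 0, 1, 0.5, 1],
    [1, 0.5, 1, 0, 1, 1],
    [1, 1, 0.5, 1, 0, 0.5],
    [1, 1, 1, 1, 0.5, 0],
]
three_modules = [[names[i] for i in c] for c in upgma_partition(D_ep, 3)]
```

With k = 3 the sketch groups R4 with R2′ and R3′: the average distance between {R4} and {R2′, R3′} is (1 + 0.5) / 2 = 0.75, while every other pair of clusters is at distance 1.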

4.5 computation of the distance

The definition of the distance requires the computation of all the extreme pathways of a given network. This can be a drawback: if one is unable to compute the extreme pathways, one cannot compute the similarity between any pair of chemical reactions. As explained in chapter 3, this computation is unachievable for a genome-scale metabolic network. Therefore one cannot count the number of extreme pathways shared by a pair of chemical reactions.

Figure 45: Hierarchical clustering of the toy metabolic network (figure 42) using the UPGMA agglomerative method based on the extreme pathways distance (the Dep matrix). The heights of the dendrogram branches (or the u's) give the distance between objects. [Dendrogram; vertical axis: Height, from 0.5 to 1.0.]


4.5.1 Approximation

To circumvent the problem of computing the extreme pathways, a method to approximate the distance is now detailed. The idea is to combine the extreme pathways distances computed between pairs of reactions in metabolic subnetworks extracted from the metabolic network. If the distance of a given pair is computed in several metabolic subnetworks, the distance is averaged. This idea is somewhat similar to the approximation proposed in algorithm 3.

Algorithm 3 Estimation of the extreme pathways similarity between two reactions
Require: The extreme pathways matrix P and two reactions ri and rj
Ensure: d̃ij, an approximation of the EP distance between ri and rj
  d̃ij ← 0
  k ← 0
  for 1 . . . n do
    p ← a random extreme pathway from P
    pi ← the rate of ri in p
    pj ← the rate of rj in p
    if pi > 0 and pj > 0 then
      k ← k + 1
    end if
  end for
  d̃ij ← k/n
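Algorithm 3 translates almost line by line into Python. The sketch below makes two assumptions that the pseudocode leaves open: pathways are drawn uniformly and with replacement, and P is stored as a list of rows (one row per reaction, one column per extreme pathway). The function name and the matrix `P_toy` are hypothetical. As in the pseudocode, the returned value k/n is the fraction of sampled pathways in which both reactions have a positive rate.

```python
import random

def estimate_co_occurrence(P, i, j, n_draws, rng=None):
    """Algorithm 3: draw n_draws random pathways (columns of P) and return
    the fraction in which reactions i and j both have a positive rate."""
    rng = rng or random.Random()
    n_pathways = len(P[0])
    k = 0
    for _ in range(n_draws):
        col = rng.randrange(n_pathways)   # assumption: uniform, with replacement
        if P[i][col] > 0 and P[j][col] > 0:
            k += 1
    return k / n_draws

# Hypothetical matrix with 3 reactions and 4 extreme pathways.
P_toy = [
    [1.0, 2.0, 0.0, 0.0],   # reaction 0: pathways 0 and 1
    [3.0, 1.0, 1.0, 1.0],   # reaction 1: all four pathways
    [0.0, 0.0, 1.0, 2.0],   # reaction 2: pathways 2 and 3
]
never_together = estimate_co_occurrence(P_toy, 0, 2, 100, random.Random(1))
half_of_the_time = estimate_co_occurrence(P_toy, 0, 1, 10000, random.Random(0))
```

Reactions 0 and 2 never share a pathway, so their estimate is exactly 0. Reactions 0 and 1 share two of the four pathways, so their estimate approaches 0.5 as the number of draws grows.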

This approximation (algorithm 3) shows that it is not required to have computed all the extreme pathways, but only a subset of them. Indeed, in our case, computing the similarity between two reactions is equivalent to computing the Jaccard distance between two sets of extreme pathways. Hence a fair approximation can be obtained by having computed only a subset of the extreme pathways. However, to obtain a subset of the extreme pathways matrix we do not want to provide the complete system or metabolic network. Hence we propose a Monte Carlo method to compute the extreme pathways distance between all possible pairs of reactions. The idea is to randomly select subnetworks in the studied system. The procedure of subnetwork extraction is the same as in section 3.4. Then the extreme pathways of the subnetworks are computed, the similarities (or distances) between all pairs of reactions are computed, and the results are stored until the random sampling stops. Obviously the goal is to have a metabolic subnetwork whose extreme pathways are quickly computed, thus allowing the computation of the distance. Algorithm 4 gives a detailed view of the computation. The potential drawback of algorithm 4 may come from wrong or missed similarities between a pair of reactions. Regarding the missed similarities, it is easy to see that if two reactions indeed share common pathways but are never part of the same sample, the distance estimation would be one instead of a given value ∈ [0, 1). The sampling may also produce wrong similarities. That is to say, it produces extreme pathways coupling reactions that are not part of any common extreme pathway in the complete system. In our case it is the consequence of the subnetwork extraction during the sampling, which does not properly constrain some external fluxes. This issue has been addressed in section 3.4 and a solution has been proposed. However, in the case of the random sampling its correction requires the extreme pathways computation of the rest of the system ³. This is something we wanted to avoid, as we wanted a fast sampling of the network and a fast distance estimation. Nevertheless this is not an issue, as we have empirically seen that the estimation converges to the real matrix.

Algorithm 4 Extreme pathways similarity approximation algorithm
Require: A bipartite graph G = (V, E), with n reactions identified by a unique index
Ensure: A matrix D containing all pairwise distances between chemical reactions
  Rij ← [], an empty list for each reaction pair {i, j}
  while the stop criterion is not met do
    S ← a random metabolic subnetwork extracted from G
    P ← compute the extreme pathways matrix from S
    for all reaction pairs {i, j} in S do
      dij ← compute the extreme pathways similarity
      add dij to the list Rij
    end for
  end while
  D ← a matrix filled with zeros ∈ R^(n×n)
  Dij ← compute the mean on Rij for all i, j

The quality of the estimation of this approach is now demonstrated empirically. There are two parameters for this approximation algorithm, both coming from the metabolic subnetwork extraction: the number and the size of the metabolic subnetworks. To select the right parameters, several values have been scanned and the quality of the approximation obtained with those parameters has been computed. Intuitively, the more subnetworks are sampled and the bigger they are, the better the approximation should be. The reasons are that increasing the number of subnetworks increases the probability of picking a specific reaction, and thus of approximating its distance to other reactions, and that increasing the size of the subnetworks increases the probability of picking a specific pair of reactions. Hence increasing both values should lead to a better approximation of the distance. To experimentally measure the quality of the approximation of the extreme pathways distance in order to select the best pair of parameters, n metabolic subnetworks are extracted from the complete E. coli metabolic network with the forest-fire-like method (section 3.8.1).
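The averaging performed by algorithm 4, together with the Frobenius error used to assess the approximation, can be sketched as follows. Several elements are assumptions: the stop criterion is a fixed number of samples, pairs that are never sampled together are given a distance of 1 (as discussed in the text, whereas the pseudocode initializes D with zeros), and the two callbacks `sample_subnetwork` and `subnetwork_distances` are hypothetical placeholders for the subnetwork extraction and for the extreme pathways computation. In the usage example the second callback simply returns known distances, so it only exercises the averaging and does not compute any extreme pathway.

```python
import math
import random
from collections import defaultdict

def approximate_distances(n_reactions, sample_subnetwork, subnetwork_distances, n_samples):
    """Monte Carlo averaging of algorithm 4.

    sample_subnetwork()      -> list of reaction indices        (hypothetical)
    subnetwork_distances(S)  -> {(i, j): d_ij} for pairs in S   (hypothetical)
    """
    R = defaultdict(list)                       # R[(i, j)]: distances seen for the pair
    for _ in range(n_samples):                  # assumption: fixed number of samples
        S = sample_subnetwork()
        for (i, j), d in subnetwork_distances(S).items():
            R[(min(i, j), max(i, j))].append(d)
    # Assumption: pairs never observed together keep the maximal distance 1.
    D = [[0.0 if i == j else 1.0 for j in range(n_reactions)]
         for i in range(n_reactions)]
    for (i, j), values in R.items():
        D[i][j] = D[j][i] = sum(values) / len(values)
    return D

def frobenius_error(D, D_approx):
    """Frobenius norm of the difference between two matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for row, row_approx in zip(D, D_approx)
                         for x, y in zip(row, row_approx)))

# Usage with stand-in callbacks on the toy distances of equation 67.
D_exact = [
    [0, 1, 1, 1, 1, 1],
    [1, 0, 1, 0.5, 1, 1],
    [1, 1, 0, 1, 0.5, 1],
    [1, 0.5, 1, 0, 1, 1],
    [1, 1, 0.5, 1, 0, 0.5],
    [1, 1, 1, 1, 0.5, 0],
]
rng = random.Random(0)

def sample_subnetwork():
    return rng.sample(range(6), 3)              # three reactions per sample

def subnetwork_distances(S):
    # Stand-in: a real run would compute the extreme pathways of S.
    return {(i, j): D_exact[i][j] for i in S for j in S if i < j}

D_one = approximate_distances(6, sample_subnetwork, subnetwork_distances, 1)
D_many = approximate_distances(6, sample_subnetwork, subnetwork_distances, 200)
```

With a single sample of three reactions at least one of the three close pairs is missed, so the error is strictly positive. With many samples every pair is eventually observed and the error vanishes in this idealized setting.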
The reason is that it is impossible to compute the extreme pathways of a genome-scale reconstructed network, yet we need them to compute the exact distance and hence the error of the approximation. Then, for each of those subnetworks, the approximation algorithm is applied with the following sets of parameters: the subnetwork sizes 10, 25, 50, 100, 200 and 300, and the numbers of subnetworks used to compute the distance between pairs of reactions 1, 50, 100, 250, 500 and 1000. The results in figure 46 show the Frobenius norm between the real distance matrix and the approximated matrix. They first show that if enough samples are drawn from the network as described in section 3.4, the approximated matrix converges to the exact matrix. They then show that the rate of convergence depends on the size of the sample. The consequence of the sample size is better understood with figure 46b. If the sample contains almost all reactions, it obviously has a high probability of already containing all the extreme pathways, thus producing a very good approximation of the distance matrix. The plot in figure 46b also helps to understand why increasing the number of samples increases the quality of the approximation. Indeed, the more samples are drawn, the higher the probability that the samples cover all the network's reactions. Pair distances that had not yet been computed are then computed, which increases the quality of the estimated distances. The drawback of too big samples is that the extreme pathways computation of a sample may take too much time, or may even be intractable. The drawback of too small samples is that the approximation may take a long time to converge. Worse, it may never converge, as it may never capture some pairs of reactions in the same sample. But the intuitive reason for the success of this approximation lies in the fact that it is very unlikely to have a large number of extreme pathways containing reactions that are far from each other in the network. Thus such long extreme pathways account for a very small similarity between such reactions. Most of the extreme pathways containing a given pair of reactions are probably confined in a small subsystem and are caught when the sample size is not too small.

3. It can also be solved by considering a partition of the rest of the system into several subsystems, then solving each subsystem and computing the proper constraints.

[Figure 46, two plots. Each plot has one curve per sample size (10, 25, 50, 100, 200 and 300). Horizontal axis: number of subnetwork samples, from 1 to 1000. Vertical axis: Euclidean norm in panel (a), fraction of common reactions in panel (b).]

(a) Measure of the Frobenius norm $\lVert D - \tilde{D} \rVert_F$, with D being the extreme pathways distance matrix of the reactions and $\tilde{D}$ being the computed approximation of D.

(b) Measure of the fraction of common reactions taken from all the subnetworks for each approximation versus the complete metabolic network.

Figure 46: Measure of the quality of the extreme pathways distance approximations. In this figure, a sample is a subnetwork extracted from the complete metabolic network and whose extreme pathways have been computed. A measure is made for the following sample sizes: 10, 25, 50, 100, 200 and 300, along with the following numbers of samples: 1, 50, 100, 250, 500 and 1000. This experiment has been repeated 100 times and the plots show the mean.

Table 15: The fraction of network components represented by a subnetwork of size n with respect to the network it is extracted from.

       Reactions          Compounds          Vertices
n      µ        σ         µ        σ         µ        σ         expected
10     0.0173   0.0041    0.0621   0.0165    0.0475   0.0118    0.0250
25     0.0452   0.0069    0.1307   0.0228    0.1033   0.0166    0.0625
50     0.0928   0.0094    0.2257   0.0275    0.1843   0.0203    0.1250
100    0.1893   0.0126    0.3716   0.0328    0.3171   0.0240    0.2500
200    0.3898   0.0176    0.5970   0.0310    0.5394   0.0232    0.5000
300    0.5927   0.0192    0.7632   0.0321    0.7206   0.0240    0.7500

Table 16: The standard deviations for each pair of parameters used in the subnetwork sampling. Those correspond to the deviation of the values presented in figure 46a.

Sample size   1 sample   50 samples   100 samples   250 samples   500 samples   1000 samples
10            ≤ 10^-4    ≤ 10^-4      ≤ 10^-4       ≤ 10^-4       ≤ 10^-3       ≤ 10^-3
25            ≤ 10^-4    ≤ 10^-3      ≤ 10^-3       ≤ 10^-3       ≤ 10^-3       ≤ 10^-3
50            ≤ 10^-4    ≤ 10^-3      ≤ 10^-3       ≤ 10^-3       ≤ 10^-3       ≤ 10^-3
100           ≤ 10^-3    ≤ 10^-2      ≤ 10^-2       ≤ 10^-3       ≤ 10^-3       ≤ 10^-3
200           ≤ 10^-4    ≤ 10^-2      ≤ 10^-2       ≤ 10^-3       ≤ 10^-3       ≤ 10^-4
300           ≤ 10^-2    ≤ 10^-2      ≤ 10^-3       ≤ 10^-4       ≤ 10^-4       ≤ 10^-4

In the next sections we assess the quality of our metric and of its approximation by analyzing real metabolic networks. The first one is the human red blood cell (erythrocyte) metabolism. As the network is small, no approximation is used because we are able to compute all the extreme pathways. Then we apply our metric to the E. coli metabolism and propose a quantitative way to assess the quality of our metric and approximation.

4.6 red blood cell functional modules analysis

Red blood cells (or erythrocytes) contain hemoglobin, which binds oxygen and allows its transport to all tissues. Red blood cells require energy to maintain the following functions: glycolysis, synthesis of metabolites such as glutathione, purine metabolism, protection from oxidative denaturation and maintenance of the electrolyte gradient between plasma and cytoplasm. The human erythrocyte is anucleate once mature ⁴. Hence it produces its energy only by anaerobic glucose degradation. These features allow the erythrocyte's metabolism to be represented by a simple network, illustrated in figure 47. This representation can appear more complex than those seen in the literature or in textbooks [Price et al., 2003, van Wijk and van Solinge, 2005, Kanehisa and Goto, 2000] because we decided not to artificially duplicate chemicals (e.g. F6P and GA3P). The purpose is to highlight the complexity of manually identifying modules or pathways. Indeed, in those schemas some chemicals are duplicated in order to clearly show the different modules. This has the disadvantage of providing a view biased toward the historically discovered pathways. We add that in the Embden-Meyerhof pathway (EM pathway or EMP), the reactions catalyzed by HX and PFK each consume one mole of ATP and the reactions catalyzed by PGK and PK each produce two moles of ATP. In the pentose phosphate pathway (PPP), we also add that two molecules of nicotinamide adenine dinucleotide phosphate (NADP+) are reduced to NADPH by the reactions catalyzed by G6PDH and PDGH (figure 47). The abbreviations used are listed in table 18 (enzymes) and table 17 (chemicals). The red blood cell metabolism has already been analyzed in several publications to test computational analyses of the metabolism, especially in the context of the extreme pathways [Wiback and Palsson, 2002, Price et al., 2003, 2004].

We decompose this metabolic network to produce modules with the distance described above and then inspect them to assess the quality or correctness of the clustering. The goal is to do a qualitative evaluation of our metric. First we try to detect functional modules by enzymes rather than by reactions. Indeed, an enzyme may catalyze a reaction in both directions. If a reaction is reversible, in the extreme pathways framework such a reaction is replaced by two irreversible reactions. Thus we have two possibilities to cluster reactions. (i) We can consider that the set of extreme pathways of a reversible reaction is the union of the extreme pathways sets of the reaction and of its reverse. This can be seen as a clustering by the enzymes catalyzing the reactions. (ii) We can consider the reaction and its reverse as two distinct reactions. This variant takes the direction of the fluxes into account and produces a finer clustering. We then use those sets to compute the extreme pathways metric. The hierarchical clustering is shown in figure 48 (a). Despite the fact that we are using an unbiased method to produce unbiased modules (in the sense of Papin et al. [2004a]), the hierarchical clustering produces groups that are part of the classic or historical pathways. The first observation is the split into two modules: the glucose degradation and the nucleotide metabolism. These modules are labeled ① and ② in figure 48. From this point and for the rest of the paragraph, the circled numbers refer to the modules labeled in figure 48. Then, if we focus on the glucose catabolism ①, the hierarchy shows several known groups: the entrance ④ to the PPP from the nucleotide metabolism, and the glucose catabolism and the Rapoport-Luebering shunt (① and ③). The splits yield the following modules: the Rapoport-Luebering shunt ③, the majority of the Embden-Meyerhof pathway ⑤ and ⑦, the oxidative phase of the PPP ⑥, and the non-oxidative phase of the PPP ⑧. We also see the three parts of the

4. The gas transport requires erythrocytes to pass through microcapillaries and therefore constrains their size

90 4.6 Red blood cell functional modules analysis

Figure 47: Network representation of the erythrocyte's metabolism. The dashed border represents the boundaries of the system (the cell membrane). The gray boxed names are the enzymes that catalyze the reactions. The system uses glucose (Glu) as input, and 23DPG and HX as outputs. Adenine (ADE), inosine (INO), pyruvate (Pyr) and lactate (Lac) can act as either inputs or outputs of the system. The currency metabolites are H, CO2, H2O, NH3 and Pi. PGK and PK each produce two moles of ATP. G6PDH and PDGH reduce two molecules of NADP+ to NADPH.


Table 17: List of chemical abbreviations used in the human erythrocyte metabolic network.

Abbreviation   Chemical
13DPG          1,3-diphosphoglycerate
23DPG          2,3-diphosphoglycerate
2PG            2-phosphoglycerate
3PG            3-phosphoglycerate
6PGC           Phosphogluconate
6PGL           Phosphogluco-lactone
ADE            Adenine
ADO            Adenosine
AMP            Adenosine mono-phosphate
DHAP           Dihydroxyacetone phosphate
E4P            Erythrose 4-phosphate
F6P            Fructose 6-phosphate
FDP            Fructose 1,6-bisphosphate
G6P            Glucose 6-phosphate
GA3P           Glyceraldehyde 3-phosphate
Glu            Glucose
HX             Hypoxanthine
IMP            Inosine mono-phosphate
INO            Inosine
Lac            Lactate
PEP            Phosphoenolpyruvate
PRPP           5-phosphoribosyl 1-pyrophosphate
Pyr            Pyruvate
R1P            Ribose 1-phosphate
R5P            Ribose 5-phosphate
RL5P           Ribulose 5-phosphate
S7P            Sedoheptulose 7-phosphate
X5P            Xylulose 5-phosphate


Table 18: List of enzyme abbreviations used in the human erythrocyte metabolic network.

Abbreviation   Enzyme
ADA            Adenosine deaminase
AdPRT          Adenine phosphoribosyl transferase
ALD            Aldolase
AMPase         Adenosine monophosphate phosphohydrolase
AMPDA          Adenosine monophosphate deaminase
DPGase         Diphosphoglycerate phosphatase
DPGM           Diphosphoglyceromutase
EN             Enolase
G6PDH          Glucose-6-phosphate dehydrogenase
HGPRT          Hypoxanthine guanine phosphoribosyl transferase
HX             Hexokinase
IMPase         Inosine monophosphatase
LDH            Lactate dehydrogenase
PDGH           6-phosphogluconate dehydrogenase
PFK            Phosphofructokinase
PGI            Phosphoglucoisomerase
PGK            Phosphoglycerate kinase
PGL            6-phosphogluconolactonase
PGM            Phosphoglyceromutase
PK             Pyruvate kinase
PNPase         Purine nucleoside phosphorylase
PRM            Phosphoribomutase
PRPPSyn        Phosphoribosyl pyrophosphate synthetase
R5PI           Ribose-5-phosphate isomerase
TA             Transaldolase
TKI            Transketolase
TKII           Transketolase
TPI            Triose phosphate isomerase
Xu5PE          Xylulose-5-phosphate epimerase


[Figure 48: (a) dendrogram "Undirected functional modules in erythrocyte" (vertical axis: Height), with circled module numbers 1 to 8; (b) functional view of the cell with modules labeled PPP (oxy), PPP (non-oxy), R/L Shunt, Gly (mid), Gly (low) and Nucleobases.]

Figure 48: Undirected hierarchical clustering of the erythrocyte metabolism (a) and description of the metabolism through functional modules (b). The hierarchy shows a clear separation between the glucose catabolism and the nucleotide metabolism. Several modules can then be extracted in the glucose catabolism: a large part of the EM pathway, the PPP (even split into its oxidative and non-oxidative phases) and the Rapoport-Luebering shunt. Those modules are labeled with circled numbers in the hierarchy and in the functional view of the cell.


Table 19: List of outliers in the hierarchy for the considered human red blood cell states: healthy, G6PD deficiency and PK deficiency.

State             Outliers
Healthy           EN (rev), PNPase (rev), PGL (rev), PRM (rev), GAPDH (rev), TPI (rev), ADA, LDH (fwd), ALD (rev), PGM (rev), LDH (rev), PGK (rev)
G6PD deficiency   LDH (fwd), PGI (rev), EN (rev), PNPase (rev), PDGH, PGL (fwd), ALD (rev), PGL (rev), PRM (rev), GAPDH (rev), PGM (rev), TPI (rev), ADA, LDH (rev)
PK deficiency     LDH (fwd), ADA, PGK (fwd), GAPDH (rev), PNPase (rev), TPI (rev), EN (fwd), ALD (rev), EN (rev), PGL (rev), LDH (rev), PGM (fwd), PGM (rev), PRM (rev)

EM pathway correspond to the three stages of glycolysis as presented in Berg et al. [2002]: (i) phosphorylation of the glucose (ATP consumption), (ii) cleavage of the six-carbon sugar into two three-carbon sugars and (iii) oxidation of the three-carbon fragments, which produces ATP. The Rapoport-Luebering shunt is also isolated as a module; it is an important path used by the cell to regulate oxygen affinity [MacDonald, 1977]. Figure 48 (b) shows a modular view of the mature erythrocyte metabolism produced with the computed hierarchy (with a manually chosen cut). The metric also has the advantage of producing directed clusters since, for a reversible reaction, we cluster two irreversible reactions. This is an interesting feature of our metric, as it allows us to describe in which direction conversions are processed by a module. To illustrate this, we applied it to the red blood cell metabolism. It is very important that the reader realizes that the hypotheses drawn from the following experiments that are not backed by a published study remain purely in silico: they have not been observed in vivo or in vitro and may require further investigation. The first observation is that the tightly coupled reactions produce modules similar to the ones obtained in the undirected case (figure 48). However, the general organization of the modules is different. Indeed, the non-oxidative PPP module is split into two functional modules that appear in well-separated groups in the hierarchy. One balances a pool of F6P to R5P and RL5P and the other balances a pool of RL5P to F6P and R5P (figure 49). This separation indicates that they are less likely to function together to sustain the steady state (i.e. to work like a cycle). To study the modules and obtain a simpler view of the metabolism, we applied the following procedure. First we produced a hierarchical clustering and removed the outliers from the hierarchy.
The outliers for the considered cases are provided in table 19. For a given cut, we kept the clusters that contain more than two tightly coupled reactions and considered each selected cluster as a functional module. Finally, to simplify each module, we applied a network reduction with meta-reactions (see chapter 3, section 3.3). To do this we keep all the input fluxes, output fluxes and currency metabolites. In addition, for every source or sink in the module (see chapter 3, section 3.2), a free exchange flux (i.e. input and output) is added for the corresponding chemical. With this procedure, the hierarchy shows two large functional modules in figure 50. One describes how glucose-6-phosphate (G6P) and hypoxanthine (HX) are balanced with ribulose-5-phosphate (RL5P) and adenosine


[Figure 49: the undirected non-oxidative PPP module and its two directed modules, FWD and REV; chemicals: F6P, E4P, RL5P, X5P, S7P, GA3P, R5P; enzymes: TKI, TKII, TA, Xu5PE.]

Figure 49: Separation of the undirected non-oxidative PPP module into two directed modules. The forward module (FWD) describes the reactions that are likely to function together to balance a pool of RL5P and R5P to F6P. The reverse module (REV) does the opposite and balances a pool of F6P to RL5P and R5P.


[Figure 50: (a) dendrogram "Directed clustering of healthy erythrocyte" (vertical axis: Height), with leaves labeled by reaction and direction (_fwd, _rew); (b) and (c) meta-reaction schemas of the two largest modules.]

Figure 50: The directed hierarchical clustering producing functional modules (a). Only the modules obtained with the two biggest clusters are represented with meta-reactions in (b) and (c). The colored box around a module highlights its corresponding subtrees in the dendrogram. It should be noted that not all the reactions producing or consuming the components and chemicals are drawn in schemas (b) and (c). The arrows pointing to void products or coming from void substrates indicate whether a chemical can exit or enter the system.


(ADO) with inosine (INO) (figure 50 (c)). This module suggests that one function of the nucleotides is to help balance the concentration of the G6P that is not converted through the EM pathway. It should be noted that this module does not use the oxidative phase of the PPP, as no NADPH is produced when those chemicals are balanced. The other module describes the catabolism of glucose (figure 50 (b)). The glucose concentration is balanced by the production of pyruvate (Pyr) and of 2,3-diphosphoglycerate (23DPG) in the Rapoport-Luebering shunt (R/L). It also yields NADPH and a net production of two adenosine triphosphate (ATP) molecules. The module also shows that it balances the concentration of inosine (INO), when available in the cell environment, by producing hypoxanthine or by passing through the production of ribose-5-phosphate (R5P). The latter is then processed by the PPP to pyruvate or 23DPG. This is an interesting observation, as it functionally implies that when inosine is present in the cell environment, it is more likely to be converted to R5P and hypoxanthine and processed with the EM pathway than to be converted back to adenosine. Hence the inosine is likely balanced with 23DPG and pyruvate. Although we find it surprising that those reactions are more likely to be activated together, the action of inosine has received much attention in blood banking. It appears to improve the preservation of erythrocytes, as it allows the production of ATP without using an ATP to phosphorylate the glucose [Gabrio et al., 1956, D'Alessandro et al., 2013]. To conclude this rough analysis, the dendrogram shows that the EM and the PP pathways are more coupled with each other than with the reactions of the nucleotide pathways. This is known to be the way glucose is processed under normal physiologic circumstances.
However, despite these encouraging observations, the metric does not reflect in the dendrogram the fact that under normal conditions the PPP accounts for only 8% of the glucose metabolism, the rest being metabolized through the EM pathway. Only when the cell is under oxidative stress is 90% of the glucose metabolized through the PPP. This is because the EM pathway is tightly coupled with the PPP and, without specific regulation, the reactions of both pathways function together. To assess the potential information or knowledge that our metric can provide, we apply the previous procedure to study the two most frequent enzymopathies in human erythrocytes: the glucose-6-phosphate dehydrogenase deficiency and the pyruvate kinase deficiency. To simulate an enzymopathy, we delete the vertices that correspond to the enzymatic reactions and we proceed to a module detection. Then we compare the modules of a healthy erythrocyte with the detected modules of an unhealthy erythrocyte.
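The procedure used in this section (cluster, drop the outliers, cut the hierarchy, keep the clusters of more than two reactions, then compare a healthy and a deficient state) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the reaction names, the distance values and the cut height are invented toy data, the helper names `upgma_cut`, `modules` and `toy_distances` are hypothetical, and average linkage (UPGMA) is assumed because it is the linkage named for the E. coli hierarchy (figure 56). The distances of the deficient state are passed separately, since the module detection is rerun on the network from which the reaction has been deleted.

```python
from itertools import combinations


def upgma_cut(reactions, dist, cut):
    """Agglomerate with average linkage (UPGMA) and stop at height `cut`.

    `dist` maps frozenset({a, b}) to the distance between reactions a and b.
    Returns the list of clusters (frozensets) present at that height.
    """
    clusters = [frozenset([r]) for r in reactions]

    def average(a, b):
        # UPGMA distance: mean of all pairwise distances between two clusters.
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        if average(a, b) > cut:
            break  # the closest clusters are above the cut: stop merging
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


def modules(reactions, dist, cut, outliers=()):
    """Drop outliers, cluster, and keep clusters of more than two reactions."""
    kept = [r for r in reactions if r not in outliers]
    return {c for c in upgma_cut(kept, dist, cut) if len(c) > 2}


def toy_distances(groups, within=0.1, between=0.9):
    """Toy distance table (assumption): small inside a group, large across."""
    label = {r: i for i, group in enumerate(groups) for r in group}
    return {
        frozenset((x, y)): within if label[x] == label[y] else between
        for x, y in combinations(sorted(label), 2)
    }


# Toy healthy state: two tightly coupled groups plus one outlier reaction.
healthy_groups = [["r1", "r2", "r3"], ["r4", "r5", "r6"], ["r7"]]
healthy = modules(
    [r for g in healthy_groups for r in g],
    toy_distances(healthy_groups),
    cut=0.5,
    outliers={"r7"},
)

# Toy deficient state: r1 is deleted and the distances are recomputed.
deficient_groups = [["r2", "r3"], ["r4", "r5", "r6"], ["r7"]]
deficient = modules(
    [r for g in deficient_groups for r in g],
    toy_distances(deficient_groups),
    cut=0.5,
    outliers={"r7"},
)

vanished = healthy - deficient  # modules lost in the deficient state
print(sorted(sorted(m) for m in vanished))
```

In this toy setting the module containing the deleted reaction drops below the size threshold and therefore vanishes, which is the kind of difference inspected in the two enzymopathies below.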

4.6.1 Glucose-6-phosphate dehydrogenase deficiency

Glucose-6-phosphate dehydrogenase (G6PD) is an enzyme that reduces nicotinamide adenine dinucleotide phosphate (NADP+) to NADPH while oxidizing glucose-6-phosphate (G6P) into 6-phosphoglucono-δ-lactone (6PGL), see figure 47. Deficiency in G6PD is the most common enzyme deficiency in human erythrocytes and we disturb the network by nullifying the G6PDH activity to study its impact [Frank, 2005]. We then proceed to a detection of functional modules in the cell and we compare those modules with the ones obtained in a healthy erythrocyte. Figure 51 shows the dendrogram obtained on the G6PD deficient cell and figure 52 represents several modules which have been simplified with packing and

meta-reactions (see chapter 3). The most striking observation is the vanishing of the oxidative PPP module (dashed branches in the normal erythrocyte dendrogram, figure 51). Hence the cell loses its capacity to reduce NADP+, as the PDGH is not part of any other module (the outliers are not shown in the dendrogram). This may have an impact on the life of the cell, as this specific part of the PPP supplies reducing energy by maintaining the level of NADPH. NADPH maintains the level of reduced glutathione, which is used to convert the organic hydroperoxides that cause oxidative damage [Wamelink et al., 2008, Wood, 1986]. The production of glutathione and the degradation of free radicals are not shown in figure 47, but the reactions are:

$$\begin{aligned}
\mathrm{GSSG} + \mathrm{NADPH} + \mathrm{H}^{+} &\longrightarrow 2\,\mathrm{GSH} + \mathrm{NADP}^{+} \\
2\,\mathrm{GSH} + \mathrm{ROOH} &\longrightarrow \mathrm{GSSG} + \mathrm{ROH} + \mathrm{H_2O}
\end{aligned} \tag{73}$$

where GSH is glutathione, GSSG glutathione disulfide, ROOH an organic hydroperoxide and ROH an alcohol. In case of oxidative stress (see footnote 5) there is a risk of hemolytic anemia, which is an abnormal breakdown of the cell [Fibach and Rachmilewitz, 2008]. A perturbation is also seen in the non-oxidative PPP modules (both directions). Indeed, they are not decoupled as much as in the healthy erythrocyte, and these modules are therefore more likely to work like a reaction that continuously balances its chemicals without letting other reactions consume those chemicals. Moreover, each direction is now tightly coupled with the ribose 5-phosphate isomerase (R5PI). In the healthy cell this reaction was coupled with the hexokinase, working in tandem to input and output the chemicals of the oxidative PPP. Now it is part of the non-oxidative PPP module and makes ribulose 5-phosphate (RL5P) an intermediary metabolite in the module (like sedoheptulose 7-phosphate) instead of a central (hub-like) metabolite. This is illustrated by comparing the extracted modules in figure 51 and figure 52. Finally, we do not understand why, with the vanishing of the oxidative phase of the PPP, the similarity between reactions seems to shift toward the production or consumption of AMP in the modules. Unfortunately, no reference in the literature helped us to understand this shift.

4.6.2 Pyruvate kinase deficiency

Pyruvate kinase (PK) catalyzes the transfer of a phosphoryl group from phosphoenolpyruvate to ADP and produces an ATP molecule and a pyruvate. PK deficiency is the second most common enzymopathy in human erythrocytes. To study the impact of such a deficiency, we again disturb the network, this time by nullifying the PK activity. Then we apply a hierarchical clustering and study the functional modules and their likelihood to convert chemicals together. As for the previous enzymopathy, the most striking difference is the vanishing of a submodule of tightly coupled reactions (we recall that outliers are removed from the hierarchy). This submodule corresponds to the one that balances 3PG and pyruvate. In other words, the cell is not able to refill its ATP reserve. The other difference is that the R/L shunt module is destroyed (neither of its reactions is an outlier) and its reactions are coupled with other reactions. Now there is only a recycling of 13DPG. The extracted modules show a different dynamic. In the first one, adenine, inosine and hypoxanthine are balanced together. That is to say, compared to a normal cell, it is less likely that the cell uses nucleotides to produce ribose 5-phosphate (the latter being potentially itself converted through the EM pathway). In the other extracted module, the cell is more likely to balance the glucose with 23DPG. Moreover, it consumes ATP and does not refill its reserve. We also note that the cell seems to be protected from oxidative stress, as it still supplies reducing energy. Hence, we assume that:

1. The cell is less likely to use inosine to produce a chemical that is balanced by the PRPP.
2. The cell balances the glucose only with 23DPG. Thus the deficient cell may have less affinity for oxygen than a healthy one.
3. The cell does not replenish ATP, and the balancing of glucose will deplete its ATP reserve. This implies an incapacity of the cell to balance glucose with 23DPG, thus making the cell unable to control its oxygen affinity.

Those results are very interesting. The first one indicates that in the presence of nucleotides in its environment, the cell will balance the level of those compounds (by either importing or exporting them) and use ATP. This is not supported by experimental evidence and requires in vivo experiments to confirm our assertion. However, it has been published that a PK deficient human red blood cell shows increased 23DPG levels. This indeed decreases the affinity of hemoglobin for oxygen, thus making oxygen more readily transferred to tissue [van Wijk and van Solinge, 2005, Dabrowska, 1997]. There are also publications that support our observations regarding the cell's ATP reserve [Zerez et al., 1988, Baranowska-Bosiacka et al., 2004].

5. Oxidative stress can result from infection or chemical exposure to drugs or foods [Betteridge, 2000].

[Figure 51: two dendrograms, "Directed clustering of healthy erythrocyte" and "Directed clustering of glucose-6-phosphate dehydrogenase deficiency in erythrocyte" (vertical axis: Height).]

Figure 51: Directed hierarchical clustering of a healthy and a G6PD deficient erythrocyte. The healthy hierarchy is the same as in figure 50. Two large functional modules are highlighted in the normal erythrocyte and their rearrangement is reported in the dendrogram of the G6PD deficient erythrocyte. The dashed group in the normal erythrocyte emphasizes the vanishing of the oxidative PPP module. The asterisks emphasize the relocation of the R5PI reaction and its reverse in new modules.

[Figure 52: (a) dendrogram "Directed clustering of glucose-6-phosphate dehydrogenase deficiency in erythrocyte"; (b), (c) and (d) meta-reaction schemas of the three large modules.]

Figure 52: The directed hierarchical clustering of a G6PD deficient erythrocyte (a) and the three large functional modules (b), (c) and (d). Those modules are more likely to function as a block, i.e. all reactions in a module convert their substrates.

[Figure 53: two dendrograms, "Directed clustering of healthy erythrocyte" and "Directed clustering of pyruvate kinase deficiency in erythrocyte" (vertical axis: Height).]

Figure 53: Directed hierarchical clustering of a healthy and a PK deficient erythrocyte. The healthy hierarchy is the same as in figure 50. Interestingly, the nucleotide metabolism seems to be better separated than in a healthy erythrocyte. The major differences in the organization are the transfer of the reverse non-oxidative PPP module into the glucose degradation module, the destruction of the Rapoport-Luebering shunt in the hierarchy (cf. asterisks) and the vanishing of the lower part of the glycolysis.

4.7 Cluster analysis of the E. coli metabolism

We have applied our extreme pathways based distance approximation to the reconstructed network of E. coli (see method details in section 3.8 and section 3.2). The approximation shows that 70 % of the pairs have a distance equal to 1.0 (i.e. no extreme pathways in common). The pairs having a distance greater than 0.9 correspond to ≈ 99 % of all pairs. Figure 55 illustrates the distribution only for pairs having a distance less than or equal to 0.9, and the hierarchy of the E. coli metabolism is provided in figure 56. We have randomly isolated one module in the hierarchy to see if it was biologically meaningful. The studied module is the one highlighted in figure 56 and is illustrated in figure 57 along with its network representation. This cluster is indeed biologically meaningful, as it isolates a part of the glutamate metabolism. More precisely, this module corresponds to the degradation of several amino acids to an α-ketoglutarate via glutamate (α-ketoglutarate is an intermediary component of the citric acid cycle). It first shows how glutamate, glutamine, aspartate and asparagine can be interconverted, then how glutamate is itself converted to 2-oxoglutarate (an α-ketoglutarate), also producing two reducing agents: NADH and NADPH. The function of this module can be mainly interpreted as the catabolism of those amino acids and thus as the precursor steps to the glycogenesis. However, it also shows how fructose 6-phosphate (through steps of glycogenesis) can be metabolized


[Figure 54: dendrogram "Directed clustering of pyruvate kinase deficiency in erythrocyte" (vertical axis: Height) and meta-reaction schemas of the extracted modules.]

Figure 54: The directed functional modules of a PK deficient erythrocyte. Its functional description seems very simple: the balance of the nucleotide metabolism with its environment and the balance of the glucose with the 23DPG.


[Figure 55: histogram; horizontal axis "Dissimilarity < 1.0" from 0.0 to 1.0, vertical axis "Percent of Total" from 0 to 20.]

Figure 55: Distribution of extreme pathways distances in E. coli for all pairs of reactions. The histogram shows only the distribution for the pairs that have a distance < 1.0. This corresponds to ≈ 30 % of the pairs.

into glutamate. This first analysis shows that functional modules are indeed produced when our metric is applied to a genome-scale metabolic network. However, validating the methodology with a module analysis such as the one done for the red blood cell may be hard on a bigger network. In the case of the E. coli network, expertise in this organism's metabolism is required to draw a conclusion on the function of the detected modules. Moreover, it was feasible to inspect the red blood cell hierarchy, as it contains only a few reactions, but for a genome-scale network it could be impossible to inspect the hierarchy manually. Hence, in this section we propose a quantitative study to assess the potential of our metric with the E. coli metabolism.
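As a rough illustration of how the distance distribution summarized in figure 55 could be tabulated, the sketch below computes pairwise distances between reactions from their sets of extreme pathways and reports the share of pairs at distance 1.0 and above a threshold. The Jaccard-style distance used here is an assumption, chosen only because it equals 1.0 exactly when two reactions share no extreme pathway; the actual distance approximation is the one defined in section 3.8. The reaction names, pathway identifiers and function names are hypothetical toy values.

```python
from itertools import combinations


def ep_distance(paths_a, paths_b):
    """Assumed Jaccard-style distance between two sets of extreme pathways.

    It is 1.0 when the reactions share no extreme pathway and 0.0 when they
    take part in exactly the same pathways.
    """
    union = paths_a | paths_b
    if not union:
        return 1.0  # a reaction without pathways shares nothing with any other
    return 1.0 - len(paths_a & paths_b) / len(union)


def distance_summary(ep_sets, threshold=0.9):
    """Share of reaction pairs at distance 1.0 and strictly above `threshold`."""
    distances = [
        ep_distance(ep_sets[a], ep_sets[b])
        for a, b in combinations(sorted(ep_sets), 2)
    ]
    n = len(distances)
    return {
        "equal_to_one": sum(d == 1.0 for d in distances) / n,
        "above_threshold": sum(d > threshold for d in distances) / n,
    }


# Toy input (assumption): reaction -> identifiers of the extreme pathways using it.
toy_ep_sets = {
    "R_a": {1, 2},
    "R_b": {2, 3},
    "R_c": {4},
    "R_d": {4},
}
summary = distance_summary(toy_ep_sets)
print(summary)
```

On a genome-scale network the same two ratios would correspond to the 70 % and ≈ 99 % figures quoted for E. coli, provided the distance used is the one of the thesis rather than this stand-in.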

4.7.1 Detection of intra-operonic pairs of genes in E. coli

It has been reported that extreme pathways or flux coupling tend to match the co-regulation of metabolic genes [Notebaart et al., 2008]. Hence, the proximity defined by co-regulation has often been used as a criterion to assess the quality of a grouping approach. Criteria based on regulation have been used to assess the quality of network clustering in [Papin et al., 2004a, Gagneur et al., 2003]. The reason is that the genes of an operon are co-expressed and that the operon is the most studied unit of regulation in Prokaryota. When a pair of genes of the same operon is observed, there is a very high probability that both genes are co-expressed. To quantitatively assess the quality of our metric, we used a criterion based on the discovery of intra-operonic gene pairs in a partition. The genes of an operon are used to form so-called intra-operonic pairs. Applied to each operon, this produces the complete set of intra-operonic (gene) pairs. All the other possible pairs that can be formed from the genes that compose the operons form the so-called extra-operonic (gene) pairs. We base the quality of a partition on its capacity to group intra-operonic pairs without grouping


[Figure 56: dendrogram of the E. coli reactions; leaves labeled with reaction identifiers (R00617, R00615, ...), height axis from 0.0 to 1.0.]

Figure 56: Hierarchical clustering of the reactions in the E. coli metabolism. The hierarchy is computed with an approximation of the extreme pathways distance and the UPGMA. The highlighted cluster corresponds to the one that is illustrated in figure 57.


[Figure 57: left, dendrogram of the isolated cluster (reactions R00114, R00256.2, R00762, R01068.2, R00578, R00483, R00256, R00253, R00578.2, R00093, R00485, R00765, R00768; height axis from 0.0 to 0.8); right, network view linking Gln, Asn, Glu, Asp, 2-oxoglutarate, NADPH/NADP+, NADH/NAD+, Glucosamine6P, F6P, F1,6B, GlyceroneP and Glyceraldehyde3P.]

Figure 57: On the left, the cluster isolated after a cut in the dendrogram at height 0.985. As shown on the right, this cluster corresponds to a module (glutamate metabolism) converting glutamate, glutamine, aspartate and asparagine into an α-ketoglutarate (2-oxoglutarate). This module shows the centrality of glutamate in the production of α-ketoglutarate and of two reducing agents (NADH and NADPH).

Table 20: Confusion matrix for the gene pairs. An intra-operonic pair is a pair of genes that are in the same operon. An extra-operonic pair is a pair of genes that are in different operons.

                  Pair in the same cluster    Pair in different clusters
Intra-operonic    True positive               False negative
Extra-operonic    False positive              True negative

extra-operonic pairs. With a partition, pairs can be constructed in the same manner with modules instead of operons (i.e. intra-cluster pairs and extra-cluster pairs). All intra-cluster pairs that are intra-operonic form the set of true positives (tp); the cardinality of this set is the number of true positives. The same reasoning applies to the other outcomes. All extra-cluster pairs that are extra-operonic form the set of true negatives (tn). All extra-cluster pairs that are intra-operonic form the false negatives (fn) and all intra-cluster pairs that are extra-operonic form the false positives (fp) (table 20). We can then compute the sensitivity and the specificity:

$$\mathrm{sensitivity} = \frac{tp}{tp + fn} \tag{74}$$

and

$$\mathrm{specificity} = \frac{tn}{tn + fp}. \tag{75}$$

To obtain the intra-operonic pairs we extract the information from RegulonDB, which is the primary reference database of the regulatory network of E. coli K-12. It contains the list of operons and their genes, as well as other regulatory information such as the interactions between transcription factors and operons or with other transcription factors [Gama-Castro et al., 2011]. We also recall that some reactions in the metabolic network always have a null rate in every steady state. Such reactions will always be outliers in the hierarchy. They are aggregated in a module at the end of the hierarchy, hence their membership is meaningless. So, as we defined our E. coli network, this approach does consider the whole network. Such reactions are removed from the network during the preprocessing step (see pruning


Specificity

Sensitivity

Height 0.6 0.8 1.0 Score 0.0 0.2 0.4

0 100 200 300 400 500 Number of clusters Figure 58: Performance of the intra-operonic pairs detection in E. coli through the hierarchy. On the abscissa the number of clusters correspond- ing to the order in which the reactions are aggregated. The graph displays two scores: the specificity and the sensitivity and the height of the cut in orange. operations in section 3.2). Moreover we cannot compare all genes. On one hand we do not have a complete reconstructed metabolic network and on the other hand not all operon genes encode for enzymes that are part of the metabolism. Hence we are working with a subset of the operons genes. We add that we discarded operons that are composed of only one gene. The results are plotted in figure 58. Obviously we have a perfect specificity when all pair are in one unique clusters, as we never miss an intra-operonic pair and a perfect sensitivity when we are approaching a very large number of clusters because no extra-operonic pairs are regrouped when every object is in its own cluster. The best cut in the hierarchy produces 58 clusters (it is the point where sum of the two scores is maximized) and produce a sensitivity of 0.73 and a specificity of 0.72 (this corresponds to an MCC of 0.12). On a total of 98 790 pairs, 1 073 are true positives, 70 465 true negatives, 26 864 false positives and 388 false negatives. We then compared our approach with other clustering approach. One approach is based on the Schuster splitting (section 3.8.2 and Schuster et al. [2002b]) and another on the approach proposed in Gagneur et al.[ 2003]. We call this latter approach the Gagneur splitting. We recall that the Schuster and Gagneur splittings may not produce all partition sizes. Indeed the Gagneur splitting can merge several groups in one step and the Schuster splitting can also create several groups in one split. We also compare our approach with a random clustering and the spectral graph clustering. 
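The pair counting behind table 20 and the scores reported above for the 58-cluster cut can be sketched as follows. This is a minimal illustration, not the code used in the thesis: the function names and the dictionary-based inputs (`cluster_of`, `operon_of`, mapping a gene name to a label) are assumptions made for the example. The MCC is computed with its usual definition, which the text quotes but does not spell out.

```python
from itertools import combinations
from math import sqrt


def pair_confusion(cluster_of, operon_of):
    """Count pair outcomes (tp, fn, fp, tn) over genes present in both mappings.

    cluster_of and operon_of are hypothetical inputs: dicts mapping a gene
    name to its cluster label and to its operon label.
    """
    genes = sorted(set(cluster_of) & set(operon_of))
    tp = fn = fp = tn = 0
    for a, b in combinations(genes, 2):
        same_operon = operon_of[a] == operon_of[b]
        same_cluster = cluster_of[a] == cluster_of[b]
        if same_operon and same_cluster:
            tp += 1          # intra-operonic and intra-cluster
        elif same_operon:
            fn += 1          # intra-operonic but split by the partition
        elif same_cluster:
            fp += 1          # extra-operonic but grouped together
        else:
            tn += 1          # extra-operonic and extra-cluster
    return tp, fn, fp, tn


def sensitivity(tp, fn):
    return tp / (tp + fn)    # equation (74)


def specificity(tn, fp):
    return tn / (tn + fp)    # equation (75)


def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient (standard definition)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Counts reported in the text for the cut producing 58 clusters.
tp, fn, fp, tn = 1073, 388, 26864, 70465
print(round(sensitivity(tp, fn), 2),
      round(specificity(tn, fp), 2),
      round(mcc(tp, fn, fp, tn), 2))   # prints: 0.73 0.72 0.12
```

The low MCC despite reasonable sensitivity and specificity reflects the class imbalance: intra-operonic pairs are only 1 461 of the 98 790 pairs.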
The random clustering is simply the construction of a random partition of the vertex set into k subsets. This clustering is done without taking into account the structure of the network. Algorithm 5 details its implementation. The spectral clustering corresponds to the partitions obtained on the ε-graph. The weighted graph is the complete graph of reactions, and the weight of an edge linking two reactions is the value of our metric (see sections 4.3 and 4.2 and Shi and Malik [2000]).

Table 21: The area under the curve (AUC) corresponding to the receiver operating characteristic (ROC) curve in figure 59.

Clustering approach                       AUC
Hierarchical with EPs distance            0.77
Schuster criterion                        0.63
Spectral clustering with EPs distance     0.63
Gagneur clustering                        0.52
Random clustering                         0.50

To compare the approaches we use the receiver operating characteristic (ROC) curve, which is a plot of the true positive rate (i.e. sensitivity) against the false positive rate (i.e. 1 - specificity). Each point of the curve represents a binary classifier, which in our case corresponds to the partition produced by a cut in the hierarchical clustering. Thus the cut in the dendrogram is a parameter that acts on the quality of the prediction, as illustrated in figure 58. In the ROC space, the perfect classifier has a true positive rate of 1 and a false positive rate of 0 (see footnote 6). Random guesses produce points on the diagonal line y = x and correspond to the worst case for a classifier. From the ROC curve we can compute the area under the curve (AUC). The AUC is a scalar that is commonly used to compare different approaches in machine learning. The perfect classifier produces an AUC of 1 and a random classifier produces an AUC of 0.5. The ROC analysis and the statistical properties and drawbacks of the AUC are detailed in Fawcett [2006].
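As an illustration of how an AUC can be obtained from a set of cuts, the sketch below integrates the ROC points with the trapezoidal rule. It assumes one (false positive rate, true positive rate) point per cut in the dendrogram; the function name and the choice of the trapezoidal rule are assumptions for the example, since the text does not state how the areas in table 21 were computed.

```python
def roc_auc(points):
    """Area under a ROC curve by the trapezoidal rule.

    points: iterable of (false_positive_rate, true_positive_rate) pairs,
    one per classifier (here, one per cut in the dendrogram).
    """
    # Add the two trivial classifiers so the curve spans the unit interval.
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# A perfect classifier and a point on the diagonal.
print(roc_auc([(0.0, 1.0)]))   # prints: 1.0
print(roc_auc([(0.5, 0.5)]))   # prints: 0.5
```

With few cuts the estimate is coarse, which matters for the Schuster and Gagneur splittings since they may not produce all partition sizes.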

Algorithm 5 Random metabolic network partitions
Require: A graph G = (V, E) with at least k vertices and the number of desired clusters k
Ensure: k sets of vertices P1, P2, ..., Pk whose union is V and whose pairwise intersections are empty
  P1, P2, ..., Pk ← ∅
  for all v ∈ V do
    i ← uniformly draw an integer from [1, k]
    Pi ← Pi ∪ {v}
  end for
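Algorithm 5 translates almost line by line into Python. The following is a minimal sketch; the function name and the optional `seed` parameter are additions made for reproducibility in the example and are not part of the algorithm as stated.

```python
import random


def random_partition(vertices, k, seed=None):
    """Assign every vertex to one of k sets, uniformly at random.

    The network structure is ignored. As in Algorithm 5, some sets may
    end up empty, since each vertex is drawn independently.
    """
    rng = random.Random(seed)
    parts = [set() for _ in range(k)]        # P1, ..., Pk <- empty set
    for v in vertices:
        parts[rng.randrange(k)].add(v)       # i drawn uniformly, Pi <- Pi U {v}
    return parts


parts = random_partition(["R1", "R2", "R3", "R4", "R5"], k=2, seed=1)
assert set().union(*parts) == {"R1", "R2", "R3", "R4", "R5"}
```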

The comparison is illustrated in figure 59 and table 21. Although our metric does not produce the perfect classifier, we obtain better results than with the other approaches. With the spectral clustering, the results obtained are very close in terms of scores to the ones based on the Schuster splitting. We finally add that we tried other linkage criteria (Ward and single) and the results were lower than the ones obtained with UPGMA (figure 60). This experiment confirms two things: our approach still produces meaningful groups in a genome scale network, and extreme pathways better explain the co-expression of genes. Indeed, the other approaches are only based on the vertices' degrees.

6. A classifier that produces points close to a true positive rate of 0 and a false positive rate of 1 is also a good classifier, but it predicts the opposite. To obtain a good classifier we only need to invert its outcome.

Figure 59: Receiver operating characteristic (ROC) curve for the detection of intra-operonic pairs in E. coli (abscissa: false-positive rate, 1 - specificity; ordinate: true-positive rate, sensitivity). The curves shown are: hierarchical with EP, Schuster criterion, Gagneur criterion, spectral clustering, random clustering and the line y = x. The areas under the curve (AUC) are provided in table 21. The parameter that allows the production of different scores is the cut in the dendrogram.

Figure 60: Receiver operating characteristic (ROC) curve for the detection of intra-operonic pairs in E. coli (abscissa: false-positive rate; ordinate: true-positive rate). Three different linkage criteria are compared: the UPGMA, the single and the Ward criterion. As illustrated in this plot, the UPGMA linkage is the best criterion.


4.7.2 Exploring gene pairs in E. coli

Now we are interested in the pairs clustered together that do not form intra-operonic pairs. To narrow our search we focus only on the clusters that group fully similar reactions (i.e. a distance of zero with our metric). We then study some of these pairs and try to find whether a biological link relating the genes of the pair is already known. The pairs that have been found in the hierarchy are summarized in table 22. First we found two pairs that are related by a common transcription factor. We add that we do not take into account genes co-regulated by CRP, FNR or IHF, as those transcription factors regulate respectively 551, 308 and 248 genes (totaling 1107). This accounts for ≈ 26% of all genes directly regulated by a transcription factor in E. coli. The pair purA and purB is co-regulated by PurR and the pair glmS and nagB is co-regulated by NagC.

The following pairs are also interesting: gpp with spoT, ppx with relA, gpp with relA and ppx with spoT. This set of genes encodes products that are part of the so-called stringent response. In E. coli, the stringent response is a sophisticated system that regulates several cell processes in response to harsh environments or conditions. It downregulates proliferative processes like, among others, cell division, DNA replication and ribosome, protein and nucleotide synthesis. It upregulates stress response processes, like amino acid synthesis, fatty acid synthesis and antitoxin systems. The system is potentiated by ppGpp, a nucleotide that acts as an alarmone (or stress signal) in bacteria [Braeken et al., 2006, Kanjee et al., 2012]. This molecule is produced by relA and spoT. The first one produces it from GTP and responds to amino acid starvation:

$$\mathrm{ATP} + \mathrm{GTP} \rightleftharpoons \mathrm{AMP} + \mathrm{ppGpp} \qquad (76)$$

The other one produces it from GDP:

$$\mathrm{ppGpp} + \mathrm{H_2O} \rightleftharpoons \mathrm{GDP} + \mathrm{diphosphate} \qquad (77)$$

and responds, among others, to fatty acid starvation, carbon source starvation, phosphorus limitation and oxidative stress. It should also be noted that spoT also hydrolyzes alarmones when the amino acid balance is restored. ppGpp activates relA and inhibits dksA. ppGpp also inhibits ppx and binds to dksA to produce a dksA-ppGpp complex. This complex in turn inhibits spoT. However the mechanism leading to SpoT activation is not known [Braeken et al., 2006]. Also the nucleoside diphosphate kinase (ndk product) enhances the formation of GTP and ppGpp during starvation conditions in P. aeruginosa [van Delden et al., 2001]. As illustrated in figure 61, the reaction catalyzed by the ndk product is the next reaction added to the cluster.

Also hpt and purA are grouped with the stringent response cluster when the partition is composed of 118 clusters. The cluster composition is {gpp, spoT, relA, ppx, gmk, ndk, pnp, deoD, gsk, hpt, purA, purB, pykA, pykF} (14 genes) and all these gene products are part of the purine metabolism (according to the KEGG pathway classification). Moreover, the purA and hpt products have affinity with (p)ppGpp. Those genes are also paired with the genes in table 23, whose products also interact with (p)ppGpp. The interaction of proteins with ppGpp provides a central regulatory framework for many different types of processes. Table 23 shows other gene products that have (p)ppGpp affinity and are paired with hpt. We recall that some genes can be found in different clusters because of the reverse reaction


Table 22: Discovered pairs in the hierarchy. This table only shows the pairs that have a distance of 0 in the hierarchy. The first two columns are the names of the genes that compose the pair. Dist (bp) is the distance between the genes in the chromosome counted in base pairs and Dist (gene) is the number of genes between the pair. The "++/--" column indicates whether the genes are on the same strand (but not the strand itself). Both distances are only provided for information purposes. TF indicates whether a transcription factor regulates both genes. Dist (EP) and H. are respectively our distance and the height where the pair is formed in the dendrogram.

Gene   Gene   Dist (bp)   Dist (gene)   ++/--   TF    Dist (EP)   H.
gpp    spoT   138237      118           no      no    0           0
ppx    relA   284761      256           no      no    0           0
gpp    relA   1049094     934           yes     no    0           0
ppx    spoT   1195744     1072          yes     no    0           0
gshA   gshB   275439      239           no      no    0           0
mgsA   aldA   461217      437           no      no    0           0
sufS   cysK   771886      696           no      no    0           0
cysE   cysK   1248361     1120          no      no    0           0
pgl    gnd    1300281     1214          no      no    0           0
tdk    surE   1575369     1424          no      no    0           0
ggt    pepN   2045451     1765          no      no    0           0
gloB   dld    1987403     1813          no      no    0           0
cdh    cdsA   726644      623           yes     no    0           0
sufS   cysM   778149      703           yes     no    0           0
tdk    ushA   786961      718           yes     no    0           0
ggt    pepA   897617      770           yes     no    0           0
ggt    pepB   928723      850           yes     no    0           0
trxB   nrdD   1110077     972           yes     no    0           0
tdk    yfbR   1114718     1006          yes     no    0           0
gloB   ldhA   1207072     1099          yes     no    0           0
cysE   cysM   1242158     1113          yes     no    0           0
ggt    pepD   1309088     1120          yes     no    0           0
tdk    yjjG   1325856     1192          yes     no    0           0
idi    ispH   1634316     1449          yes     no    0           0
thiL   rdgB   1980012     1741          yes     no    0           0
cdh    ynbB   2009193     1782          yes     no    0           0
purA   purB   1426283     1267          no      yes   0           0
glmS   nagB   1430795     1242          yes     yes   0           0

Figure 61: At the top of the figure, the part of the dendrogram corresponding to the genes of interest (i.e. gpp, spoT, ppx and relA). At the bottom, the network extracted according to the hierarchical clustering. Interestingly this corresponds to the so-called stringent response. The plain arrows represent the regulation and the dashed arrows the chemical reactions. Labels shown in the figure: reactions R00330, R03409 (ppx/gpp), R00429 (relA) and R00336 (spoT); compounds GDP, GTP, ppGpp and pppGpp; dksA and the dksA-ppGpp complex.

Table 23: Grouping in the same module for pairs of genes encoding proteins that interact with ppGpp. The number of clusters is the result of a cut in the dendrogram at the given height.

Gene   Gene   Dist. (EP)   Height   Number of clusters
guaC   hpt    1            0.33     326
purA   hpt    1            0.83     174
guaC   gpt    0.92         0.73     122
guaC   apt    0.92         0.73     122

catalyzed by the product. Such cases are illustrated in the application to the human red blood cell in section 4.6. Finally we add that for the tightly related pairs sufS, cysK and cysM, the three gene products are PLP-dependent proteins. Pyridoxal 5-phosphate (PLP) is the active form of vitamin B6, a coenzyme (see footnote 7). We also notice that other PLP-dependent proteins are paired with sufS in the hierarchy (table 24).

Table 24: Grouping in the same module for pairs of genes that encode PLP-dependent proteins. The number of clusters is the result of a cut in the dendrogram at the given height.

Gene   Gene   Dist. (EP)   Height   Number of clusters
sufS   cysK   0            0        416
sufS   cysM   0            0        416
sufS   gabT   0.88         0.6      152
sufS   gadB   0.88         0.72     179

7. A coenzyme enhances the catalytic activity of an enzyme (apoenzyme). Often the coenzyme is necessary for the catalytic activity.

4.8 Conclusion

The goal of this chapter was to describe a new method allowing the detection of biologically meaningful modules in a metabolic network that are unbiased (in the sense of Papin et al. [2004a]). We have proposed a new distance to measure the similarity between a pair of reactions, based on the extreme pathways flowing through the reactions. However, because of the definition of the distance, its computation may be intractable. Thus we proposed a Monte Carlo approach to approximate the distance and we have empirically shown the quality of the approximation. We also showed that the metric can be approximated even for large networks. Thus the approach is applicable to genome scale metabolic networks.

We then applied our method to real biological networks in order to assess the interest of the proposed approach from a biological point of view. We have shown that our metric combined with a hierarchical clustering produces functional modules that have a biochemical interpretation. It also allowed us to provide a modular description of the metabolism of the human red blood cell, in an undirected and in a directed form. We also tried to perturb the metabolism by inhibiting enzymes in silico, to see whether conclusions can be drawn from the discovered functional modules by comparing them to those of a normal erythrocyte. In the case of the PK and G6PDH deficiencies in human erythrocytes, we were able to draw conclusions that are supported by experimental evidence. We were able to notice a cell reorganization by inspecting the dendrogram and to study the modules with a simpler representation through meta-reactions. This allowed us to infer the function of the modules and to understand the consequences of the reorganization of the cell metabolism caused by the pathology.

We are aware that, with the human erythrocyte model, similar conclusions can be derived by inspecting the network or by studying the extreme pathways. But because of the size of the extreme pathways set, the pathways are usually not inspected directly but rather processed before being analyzed. Hence we stress that we did not inspect the complete network or the extreme pathways set to draw our conclusions. Indeed, one important feature of our approach is that it avoids the complete computation of the extreme pathways (or of any other generating set, like flux or elementary modes). Finally we wanted to show that our metric produces valid results on a genome scale network.
So we used the E. coli network and tried, with success, to match the operonic structure. We also tried to analyze new tightly coupled gene pairs that are formed at the beginning of the hierarchy and are not part of an intra-operonic gene pair. Some of these pairs, and the clusters in which they were regrouped, were of some biological significance (e.g. the stringent response system).

This method, like other methods based on metabolic networks, depends on the quality of the reconstructed network. If some interactions are missing, the metric may fail to produce several important pairs in the hierarchy. Indeed, we have seen with the human erythrocytes that the deletion of one reaction may strongly perturb the global organization of the hierarchy. Also, clustering approaches based on generating sets suffer from a major drawback compared to approaches based on topological features of the network, because several reactions are total outliers. They share no similarities with other reactions, because no extreme pathways use them. Those reactions are usually removed during the pruning operation (or redundancies elimination). Thus those reactions are not clustered and are meaninglessly added to the top of the hierarchy. It is possible that the efficiency of the pruning is due to an incomplete reconstruction of the network or to a bad setup of the exchange fluxes. From this point of view, our approach may be less convenient to use depending on which part of the metabolism is analyzed.

Nevertheless, we are convinced that this metric combined with a hierarchical clustering method produces biological modules that are meaningful. Validating such an approach is very difficult, and in this work we have focused on E. coli because it has been extensively studied. This organism served as a validation of our metric by allowing us to rediscover already known biological facts. We think we have empirically demonstrated the power of our metric and that it can be used to build new biological hypotheses, like complex formation, binding to specific molecules or co-regulation.


Part III

SEQUENCE ANALYSIS

5 BACKGROUND IN POST-TRANSLATIONAL MODIFICATIONS CLASSIFICATION

In this chapter we briefly recall what the problem of classification is and what its potential issues are. Computer scientists may skip this part as it is considered basic knowledge. The chapter concludes with a quick review of post-translational modifications and of Nα-terminal acetylation prediction.

5.1 Classification

In machine learning, or more precisely in supervised learning, the problem of classification is to identify the relationship between instances and classes. Each instance is a vector of m attributes and a class is a member of a finite set of cardinality k. For example, in a medical application, we can classify a patient in a group corresponding to their health risk level: low-risk, medium-risk and high-risk (the classes), according to a list of attributes that describe the patient instance: sex, height, weight, age, blood pressure (the inspiration for this example comes from Clarke et al. [2009]). We suppose that a function relating those attributes and the health risk level exists. Thus the classification problem is to find a function that relates the attributes of a patient to the risk level. In machine learning, we need a so-called training set, which is a set of instances and their corresponding classes. In the case of the medical application example, a training set is composed of a set of patients along with their health risk levels. The learning algorithm then tries to find the function that relates the patients and their health risk levels. A goal of the algorithm is to also provide a good classification for instances that were not part of the training set. This is called generalization. More formally we suppose that there exists a function f that relates the m attributes to one of the k possible classes:

$$f : X^m \mapsto C \qquad (78)$$

with $X = \{x_1, x_2, \dots, x_n\}$ the instances set and $C = \{c_1, c_2, \dots, c_k\}$ the class set. An instance $x_i$ is represented by the vector of attributes $x_i = \{x_{1i}, x_{2i}, \dots, x_{mi}\}$. The learning algorithm uses the so-called training set, which is a set of examples of the form $\{x_i, c_j\}$ (an instance with its corresponding class), to find the function $g \approx f$:

$$g : X^m \mapsto C \qquad (79)$$

The function g is searched such that it minimizes a given error function on the outputs $\hat{c}_i = g(x_i)$ over all training instances (i.e. for all i). Thus the error function indicates how well g fits the training set. For example in the case of a binary classification we may use:

$$\sum_i |g(x_i) - c_i| \qquad (80)$$

where the classes $c_i$ and the outcomes $g(x_i)$ are encoded as zero or one. The function g is usually taken from a space of possible functions G. Indeed, to approximate the function it is required to make an ansatz on f, as we cannot search g in the space of all possible functions. We also speak of model

selection. Usually it is wise to choose the simplest model (Occam's razor or lex parsimoniae). But other motivations can drive the choice of the model or of the function. In this work, we select the model that produces the most readable classifier. By readable we mean that we are not only interested in the quality of the prediction but also in understanding how the model uses the attribute vector to produce good predictions. Later in the text we speak of white box models, in contrast to black box models that are very difficult to read or understand. For example, G can be the space of linear functions or the set of all decision trees.

A potential problem with supervised learning algorithms is overlearning. As stated before, the algorithm uses a training set to infer the function g. However, there are cases where the algorithm identifies a relationship that is only present in the training set. The algorithm may adjust to attribute values of the training data that are not part of the relationship between the instances and their classes. This occurs for example when the training set is too small. This phenomenon is called overfitting. It can be quantified with the loss function: the loss is well minimized on the training set but not on a separate validation set.

A very simple example can intuitively illustrate this issue. Suppose that the instances are again patients and the classes their cardiovascular disease risk level (only two classes: low and high risk). Let us say that the following attributes are used to represent a patient: age, shirt color, weight and blood pressure. It is more or less known that age, weight and blood pressure are correlated with cardiovascular disease, but not the color of the patient's shirt (see footnote 1). Now let us assume that in the training set almost all the high risk patients are wearing red shirts by chance. There is a non-null probability that the learning algorithm uses this random feature to predict patients with a high cardiovascular disease risk. Even if the function g produces a loss close to zero on the training set, there is a high probability that it will be used to predict, or be tested on, the risk of a patient not wearing a red shirt. In this case the patient could be predicted as having a low risk of disease only because he is not wearing a red shirt.

It is possible to evaluate how a learner will generalize to an independent data set by using a k-fold cross-validation. This process is widely used to evaluate the generalization of a classifier, allowing us to estimate the average generalization error of a machine learning method [Hastie et al., 2001]. A given dataset D is split in k subsets such that their union is D:

$$D = D_1 \cup D_2 \cup \dots \cup D_k \qquad (81)$$

and such that they are pairwise disjoint:

$$D_i \cap D_j = \emptyset \quad \text{for all } i \neq j. \qquad (82)$$

We add that the $D_i$'s are stratified, that is to say each $D_i$ has the same class distribution. Then we build k folds. A fold $F_i$ is composed of a learning set $L_i$ and a test set $T_i$ such that:

$$F_i = \left( L_i = \bigcup_{j \neq i} D_j, \; T_i = D_i \right) \qquad (83)$$

and for each fold $F_i$ the algorithm is trained on $L_i$ and evaluated on $T_i$. Finally the classification results for each fold $F_i$ are aggregated and provide a more accurate estimate of the model performance. Usual choices for k are 5 or 10.

Machine learning is a very broad and complex field. Therefore it is meaningless to try to provide a more complete introduction in this document. The interested and motivated reader is encouraged to read Murphy [2012] or Hastie et al. [2001] to deepen their knowledge of this field.

1. This assertion is provided without any reference and comes from general knowledge. However I apologize to the reader if, in the near future, the color of the shirt turns out to be relevant in detecting people with a high risk of cardiovascular disease.
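The construction of equations (81) to (83) and the binary error of equation (80) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, instances are referred to by their index, and stratification is obtained by dealing the shuffled instances of each class round-robin over the k subsets, which is one simple way to keep class proportions similar and not necessarily the procedure used later in this work.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Return k folds (learning_indices, test_indices) as in equation (83)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    subsets = [[] for _ in range(k)]          # the D_i of equation (81)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            subsets[pos % k].append(idx)      # deal each class round-robin
    folds = []
    for i in range(k):
        test = sorted(subsets[i])             # T_i = D_i
        learn = sorted(idx for j in range(k) if j != i for idx in subsets[j])
        folds.append((learn, test))           # L_i = union of D_j, j != i
    return folds


def binary_error(g, instances, classes):
    """Error of equation (80) for a classifier g with classes coded 0 or 1."""
    return sum(abs(g(x) - c) for x, c in zip(instances, classes))


labels = [0] * 10 + [1] * 5
for learn, test in stratified_folds(labels, k=5):
    # Every test set holds two instances of class 0 and one of class 1.
    assert sorted(labels[i] for i in test) == [0, 0, 1]
```

In use, a learner would be fitted on the instances indexed by `learn` and `binary_error` evaluated on those indexed by `test`, the k results then being aggregated.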

5.2 Prediction of Nα-terminal acetylation

Numerous predictors for post-translational modifications have been developed, based on different machine learning models. For example artificial neural networks have been widely used to predict various post-translational modifications, like phosphorylation [Blom et al., 1999], N-terminal myristoylation [Bologna et al., 2004] and C-mannosylation [Julenius, 2007]. More recently random forests have been successfully used to predict post-translational modification sites, for ubiquitination [Radivojac et al., 2010], γ-carboxylation [Zhang et al., 2012] and glycosylation [Chuang et al., 2012]. Although some of these predictors provide very good prediction capabilities for the problem they tackle, they are often black boxes. Indeed, their mathematical complexity makes them hard to interpret in terms of biological meaning [Berthold et al., 2010]. Unfortunately this restricts the application of these models to biological problems which, in our opinion, require a model providing explanations for the prediction.

In this work we focused on the prediction of initiator methionine cleavage and Nα-terminal acetylation. We propose a method to automatically build a predictor that uses only the information contained in the protein's primary structure and that can be interpreted by biologists [Charpilloz et al., 2014].

Several approaches using machine learning have been proposed to predict the Nα-terminal acetylation. For example support vector machines were proposed in Liu and Lin [2004], but the complete article was unfortunately impossible to fetch. Moreover it was impossible to reproduce their results, as the provided abstract gave no information about the setup of the support vector machine. The proposed online service was also unavailable, so the predictor could not be tested. Their dataset was not available either, hence it is also impossible to compare our score to theirs.
Also, while in the abstract they mention a sensitivity of 0.93 on an independent mammalian dataset, we have no idea of the composition of this dataset. With proteomic datasets care must be taken with the validation or test set, as in the case of Nα-terminal acetylation only the N-terminal part of the sequence is used to train a learner. Thus it is hard to be sure that the training set was really independent from the so-called mammalian set. Such a claim on the score must therefore be taken with care.

A well known predictor is NetAcet, which is based on artificial neural networks [Lars et al., 2005]. The performances are good, but the dataset is composed of proteins extracted from Polevoda and Sherman [2003] and from the Yeast Protein Map [Perrot et al., 1999]. Thus it is only composed of S. cerevisiae proteins, only the potential NatA substrates were taken into consideration, and the sequences are truncated to the first 40 residues. The resulting dataset contains 137 sequences: 61 acetylated and 76 not acetylated. Also, all non-acetylated positions in the dataset were used as negative examples, but only the ones having serine (Ser), threonine (Thr), alanine (Ala) or glycine (Gly) in the first position. Indeed the others are considered as not being part of the NatA substrates. However for unknown reasons cysteine (Cys) was discarded. Besides, it is quite surprising to use non-N-terminal regions to build a negative dataset. Indeed the N-terminal annotation in UniProtKB concerns the first residue (or the second if initiator methionine cleavage occurs), and taking into account other non-acetylated residues may build an inaccurate or noisy set of negative examples. Nevertheless with a 3-fold cross-validation they obtain a Matthews correlation coefficient of 0.69. It is also claimed that they obtain good performances by testing their neural network on an independent mammalian dataset (94 proteins). But this dataset contains only acetylated proteins and thus provides no information about the specificity. Their supposition regarding the generalization of substrate specificities between S. cerevisiae and mammals cannot be made on the basis of this small set of proteins and should be taken with care. Also the predictor is limited to the proposed NatA substrates, so it only predicts proteins that undergo initiator methionine cleavage and whose second residue is a serine, alanine, glycine, threonine, valine or cysteine. We found this very limiting regarding the diversity of the Nα-acetylated proteome.

Non machine learning approaches have also been used to predict Nα-terminal acetylation or to describe its proteome. For example Martinez et al. [2008], Cai and Lu [2008] and more recently Bienvenut et al. [2012] predict post-translational modifications using manually extracted rules based only on the information provided by the first two or three amino acids of the sequence. They produced an online predictor (TermiNator; see footnote 2) based on their patterns [Martinez et al., 2008]. The claimed performances are good and are reproducible.
We also evaluated their predictor on our datasets fetched from UniProtKB, a protein database (see chapter 6), and the scores were consistent with their publication. However these performances should be discussed. Indeed, from a machine learning point of view their work seems to propose an overfitted classifier. They produced their motifs based on the full set of sequenced proteins they have. So it is very difficult to assess whether their motifs generalize well for Nα-terminal acetylation or whether they are just observing a bias due to their data. Indeed when we tested our datasets on TermiNator3 there is a high probability that common sequences had been used to build their patterns. We have also noted that TermiNator3 cannot discriminate Nα-acetylated proteins whose substrate corresponds to NatB or to NatC. Indeed those proteins are respectively always classified as Nα-acetylated and as not Nα-acetylated. In fact their motifs take into account, at most, the first three amino acids, but the information provided by these three amino acids is probably insufficient to decide whether a protein undergoes Nα-terminal acetylation. However we perfectly understand that it is impossible to do a cross-validation in such a situation. It is impossible to ask an expert to find patterns in a fold and validate them on a test set. Moreover an expert will not forget the knowledge acquired with the preceding fold when working with the next fold. Nevertheless, regarding the results, their patterns are not validated and the published error does not provide much information on the quality of the patterns for proteins that were not part of their dataset. We also add that TermiNator3 predicts the initiator methionine cleavage and the N-terminal myristoylation. It also relates the predicted N-terminus to protein half-life.

2. Available at the time at: http://www.isv.cnrs-gif.fr/terminator3/


In the case of Nα-terminal acetylation, as pointed out by Eisenhaber and Eisenhaber, it is unlikely that a unique pattern describing the requirements of all enzymes can be discovered: there is no biological sense in building an acetylation predictor based on an average motif, as no enzyme recognizes this average motif [Eisenhaber and Eisenhaber, 2010]. Our main idea is to combine several discriminant motifs optimized with genetic algorithms. These motifs are combined using a binary decision tree. Our choice is mainly motivated by the need for white box models that are interpretable by biologists, in order to help identify the required biological features. For the interested reader, Johnson recently reviewed the use of motifs to predict post-translational modifications in his master thesis [Johnson, 2015].


6 PROTEINS DATASETS

In chapter 5 we stated that a supervised learning algorithm infers a function from a set of examples called the training data. In our case, the instances are the primary structures of proteins (chapter 1) and the classes are labels indicating whether the protein undergoes a given post-translational modification or not. The proteins are fetched from the UniProtKB (Universal Protein resource) database [Consortium et al., 2014]. UniProtKB is a freely accessible protein database where each entry contains the currently known amino acid sequence along with various annotations (e.g. protein function, structure or post-translational modifications). The entries are also linked to several other databases (e.g. genomic, structural or bibliographic). As of April 29, 2015, the database contains 548 454 sequence entries. Unfortunately, there is no way to directly fetch data with the correct output or label, and a complex query must be submitted in order to get the datasets. Hence we constructed two queries, one to get entries undergoing a post-translational modification and one to get entries not undergoing it. This chapter describes those queries and justifies the criteria selected to build them. Readers used to working with UniProtKB will find all the criteria used to build the queries summarized in table 25 and can jump directly to section 6.5, p. 129 (Datasets composition).

6.1 general criteria

The first four criteria are applied to all entries, regardless of whether the entry undergoes a post-translational modification or not. All selected entries must be reviewed, which means that the record contains information extracted from the literature and evaluated by a human curator. Then the existence of the protein must be proven by experiment. Hence the entry must be qualified as experimental evidence at protein level, indicating that there is clear experimental evidence for the existence of the protein. We add that the experimental evidence at protein level qualifier is automatically added to all entries with at least one of the annotations listed in table 26. The sequence must start with a methionine, and the entry should contain a cross-reference to the EMBL database of genomic DNA or mRNA molecular type. These criteria eliminate peptides that are only protein fragments or peptides that are not encoded by identified genes (e.g. peptides found in animal venom). Finally, the entry should not be linked with a mitochondrial or chloroplastic gene. Indeed, we are trying to identify proteins modified by specific enzymes during translation, and in the case of mitochondrial or chloroplastic genes the biochemistry of post-translational modifications may be different. Hence such entries may not reflect the studied process.

6.2 nα-terminal acetylation criterion

The added criterion to select the Nα-acetylated entries is the presence, in the feature table, of a post-translationally modified residue whose description starts with N-acetyl (e.g. N-acetylalanine or N-acetylserine). The modification must be experimentally proven. This means that the modification annotation must be linked to a reference stating that the post-translational modification has been observed. As a post-translational modification annotates a residue, the modified residue should be at the first position if the initiator methionine is not cleaved (i.e. the initiator methionine is acetylated) or at the second position if the initiator methionine is cleaved. Hence, together with the general criteria, this criterion allows the Nα-acetylated datasets to be built.

Table 25: Criteria used to build the Nα-acetylated and the non Nα-acetylated datasets, specific to the UniProtKB database. IMC stands for initiator methionine cleavage.

Criteria used for both datasets
1. Entry must have been reviewed by UniProtKB curators.
2. Protein evidence must be at protein level.
3. Sequence must start with a Met.
4. Presence of a cross-reference to the EMBL database of genomic DNA or mRNA molecular type.
5. No mitochondrial or chloroplastic gene.

Criteria used for the Nα-acetylated dataset
6. A modified residue starting with N-acetyl at the first position if the initiator methionine is not cleaved, or at the second position if the initiator methionine is cleaved. The modification must be experimentally proven.

Criteria used for the non Nα-acetylated dataset
7. No modified residue starting with N-acetyl at the first or at the second position, depending on the IMC.
8. At least one reference must indicate that the entry has been sequenced at protein level from the first amino acid if the IMC does not occur, and from the second amino acid if the IMC occurs.
9. An entry whose references contain only the term protein sequence of N-terminus and which has an identified signal or transit peptide is subject to manual review to determine whether it is a valid non-acetylated entry.
10. No peptide identification by the COFRADIC method.
11. If the first exposed amino acid is annotated as blocked by an unidentified N-terminal blocking modification, the sequence is rejected. The first exposed amino acid is either the initiator methionine, if it is not cleaved, or else the amino acid at the second position.

Table 26: The protein existence value is assigned automatically based on the annotation elements present in the entry. This table contains the criteria used to assign the experimental evidence at protein level qualifier. The line column identifies the line type as declared in the Swiss-Prot format files or documentation.

| Annotation | Line | Keywords |
|---|---|---|
| Reference position | RP | Characterization, amino-acid composition, catalytic activity, function as, function in, interaction with, subunit, identification in ... complex-binding, identification by mass spectrometry, mass spectrometry, crystallization, X-ray crystallography, electron microscopy, structure by NMR, cleavage, proteolytic processing, topology, disulfide, level of protein expression, PTM information. |
| Comment | CC | Allergen (without the by similarity qualifier), biophysicochemical properties, biotechnology, developmental stage (with the at protein level qualifier), induction (with the at protein level qualifier), interaction, mass spectrometry, pharmaceutical, tissue specificity (with the at protein level qualifier). |
| Database cross-reference | DR | PDB (without model category), the 3D macromolecular structure protein data bank. |
| Keywords | KW | Direct protein sequencing, disease mutation, proteomics identification. |
| Feature table | FT | The presence of one or more residues attached to an oligosaccharide structure, a post-translationally formed amino acid bond, a disulfide bond, a covalently bound lipid moiety, a post-translationally modified residue or a site which has been experimentally altered by mutagenesis. All these only with experimental qualifiers. |

6.3 non-nα-terminal acetylation criteria

Obviously, the first criterion is that no modified residue starting with N-acetyl should be in the feature table. Then, to ensure that an Nα-terminal acetylation would have been observed if present, the entry must contain at least one reference indicating that the protein has been sequenced at protein level (table 26) from the first amino acid if the initiator methionine is not cleaved, and from the second amino acid if it is cleaved. If the first exposed amino acid is annotated as blocked, the entry is rejected, because this annotation implies that the site bears an unknown modification. The first exposed amino acid is either the initiator methionine, if it is not cleaved, or else the amino acid at the second position. Finally, entries having peptides identified with the COFRADIC method are rejected, as this method does not allow acetylated and non-acetylated N-termini to be distinguished [Gevaert et al., 2003]. However, we were not able to produce a rule that fetches only the needed entries. Entries whose references contain only the term protein sequence of N-terminus, together with an identified signal or transit peptide, are subject to manual review to determine their validity as non-Nα-acetylated entries. The reason is that this term may refer to the N-terminal sequencing of the mature protein. We recall that a transit peptide is an N-terminal presequence which directs the protein to an organelle, and that a signal sequence is a peptide, usually present at the N-terminus, of a protein destined to be either secreted or part of membrane components. Normally, the signal sequence is removed from the growing peptide chain by specific peptidases. Hence, if this term is part of the entry, there must be no transit or signal peptide in the sequence; otherwise the entry is rejected.
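As an illustration only, the selection logic of table 25 can be sketched as a labelling function. The `Entry` record and all its field names are hypothetical simplifications introduced here (they are not part of any UniProtKB API), and the manual review of criterion 9 is deliberately left out. An N-acetyl residue found at an unexpected position is treated as a rejection, which is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Hypothetical, simplified view of a UniProtKB entry (illustration only)."""
    sequence: str
    reviewed: bool
    protein_level_evidence: bool
    has_embl_xref: bool
    organelle_gene: bool                      # mitochondrial or chloroplastic gene
    init_met_cleaved: bool
    n_acetyl_position: Optional[int] = None   # 1-based, experimentally proven only
    sequenced_from: Optional[int] = None      # first residue covered by protein sequencing
    n_terminus_blocked: bool = False
    cofradic: bool = False


def passes_general_criteria(e: Entry) -> bool:
    """Criteria 1 to 5 of table 25."""
    return (e.reviewed and e.protein_level_evidence
            and e.sequence.startswith("M")
            and e.has_embl_xref and not e.organelle_gene)


def first_exposed_position(e: Entry) -> int:
    """Position 2 if the initiator methionine is cleaved, position 1 otherwise."""
    return 2 if e.init_met_cleaved else 1


def label(e: Entry) -> Optional[bool]:
    """True: N-alpha-acetylated, False: non-acetylated, None: rejected."""
    if not passes_general_criteria(e):
        return None
    pos = first_exposed_position(e)
    if e.n_acetyl_position is not None:
        # Criterion 6; an N-acetyl residue elsewhere is treated as ambiguous here.
        return True if e.n_acetyl_position == pos else None
    if e.n_terminus_blocked or e.cofradic or e.sequenced_from != pos:
        return None                            # criteria 8, 10 and 11 not satisfied
    return False                               # criterion 7 holds and the N-terminus was observed
```

The three-valued return makes explicit that an entry failing the negative criteria is discarded rather than counted as non-acetylated.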

6.4 quality of the datasets

We now briefly discuss the quality of the datasets fetched from UniProtKB. As with all machine learning approaches, the quality of the predictor depends on the quality of the dataset. By quality we mean that we are highly confident regarding the post-translational modification status assigned to an entry (i.e. Nα-terminal acetylation). There are two kinds of misannotation. The first one is when an entry is wrongly annotated as Nα-acetylated. This case is unlikely to happen because we rely on experimental evidence. The second one is when an entry is indeed processed but was never observed as such, and is thus considered as non-processed. This case is more common; therefore we need to apply extra care when extracting non-processed entries. In biology, negative sets are difficult to build because they rely on the non-observation of a phenomenon, which is not directly annotated in databases. No one can be sure to produce a 100 % noise-free dataset; one can only hope to have as little noise as possible given the tools and databases available to build a training set.


We have proposed rules that seem legitimate to confirm the non-acetylation of a UniProtKB entry. However, errors in the UniProtKB database are possible and can add misclassified instances to the dataset. Moreover, any peculiarity of the sequencing method described in the references of an entry cannot be detected by our query. An example of a potential misannotation is the UniProtKB entry NAGK_RAT. The entry is annotated as Nα-acetylated and the annotation is confirmed by a reference. However, one can read in the reference: Peptide 4 was N-terminally blocked in Edman degradation; its sequence was obtained by mass spectrometry. (The post-source decay spectrum for the N-terminal part cannot be explained by the standard amino acids; it can be matched, however, with an N-terminal Ac-Ala residue.) This illustrates a potential misannotation that cannot be detected, as the query does not parse or analyze the content of the linked publications. Although it is known that Nα-terminal acetylation is not always a total modification, this fact is currently not taken into account in the available protein databases. Hence, we qualify a protein as acetylated if the post-translational modification was experimentally observed, regardless of the modification ratio.

6.5 datasets composition

The queries allow the extraction of two datasets, one for Nα-acetylated proteins and one for proteins that do not undergo Nα-terminal acetylation. The extraction process was repeated for several taxonomic groups: Eukaryota, Metazoa and H. sapiens. Table 27 shows the sizes and the post-translational modification ratios of the datasets extracted for each taxon. The majority of our experiments (see chapter 8 and Charpilloz et al. [2014]) are based on a dataset extracted from release 2012_07 of UniProtKB (11th of July, 2012). The datasets extracted from this release will be identified as 2012. At the end of this thesis we also extracted a new dataset from release 2015_07 of UniProtKB (24th of June, 2015) in order to redo some experiments and assess whether our approach is still valid. The datasets extracted from this release will be identified as 2015. We also stress that the taxon datasets are not mutually exclusive. In the 2012 datasets, 78 % of the Eukaryota dataset is composed of Metazoa sequences and 65 % of the Metazoa dataset is composed of H. sapiens sequences. In the 2015 datasets, 79 % of the Eukaryota dataset is composed of Metazoa sequences and 45 % of the Metazoa dataset is composed of H. sapiens sequences. The main differences between the datasets are that the number of proteins has at least roughly doubled between the releases for all taxa, and that the ratio of Nα-acetylated proteins has risen, mainly in Eukaryota and Metazoa. Regarding initiator methionine cleavage, there is no specific query to build such a dataset. Instead, the initiator methionine cleavage datasets are extracted from the Nα-terminal acetylation datasets by checking the presence of the feature of type initiator methionine with the value removed. The criteria used for the Nα-terminal acetylation datasets imply experimental evidence for the initiator methionine cleavage too. This leads to the datasets whose composition is detailed in table 27.
Table 27: Number of sequences and content of the different datasets extracted from the two releases of UniProtKB (2012_07 and 2015_07) for the initiator methionine cleavage and the Nα-terminal acetylation. PTM and ratio are respectively the number and the ratio of proteins undergoing the corresponding post-translational modification.

| Dataset | Taxon | Total | Init. Met cleavage: PTM | Init. Met cleavage: ratio | Nα-ter. acetylation: PTM | Nα-ter. acetylation: ratio |
|---|---|---|---|---|---|---|
| 2012 | Eukaryota | 2558 | 1831 | 0.72 | 1603 | 0.63 |
| 2012 | Metazoa | 1991 | 1431 | 0.72 | 1404 | 0.71 |
| 2012 | H. sapiens | 1291 | 887 | 0.69 | 1116 | 0.86 |
| 2015 | Eukaryota | 6766 | 4507 | 0.67 | 5543 | 0.82 |
| 2015 | Metazoa | 5352 | 3491 | 0.65 | 4567 | 0.85 |
| 2015 | H. sapiens | 2426 | 1532 | 0.63 | 2155 | 0.89 |

To provide an overview of the composition of the datasets, the following sequence logos and frequency plots are displayed: the initiator methionine cleavage for the 2012 datasets (figure 62) and the 2015 datasets (figure 63), and the Nα-terminal acetylation for the 2012 datasets (figure 64) and the 2015 datasets (figure 65). A sequence logo shows how residues are conserved at each position in a set of aligned sequences by displaying letters of different heights. All the letters composing the sequences are stacked, and the height of the stack is the information content measured in bits. For a position i it is computed as:

$$Y_i = \log_2(20) - (H_i + e_n),$$

where $H_i$ is the Shannon entropy of position $i$ and $e_n$ is the small-sample correction for an alignment of $n$ letters. The term $\log_2(20)$ appears because only the 20 standard amino acids are considered; thus the maximum value is ≈ 4.32. The different residues are scaled in the stack according to their frequency. The height of each residue $a$ at position $i$ is given by:

$$h_{a,i} = f_{a,i} \cdot Y_i,$$

with $f_{a,i}$ being the relative frequency of the amino acid $a$ at position $i$. The taller the letter, the more conserved the residue is at that position. Each logo and plot displays only the first six residues.
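The two formulas can be illustrated with a minimal sketch that computes the letter heights of one alignment column. The function names are introduced here for illustration. The text does not specify the small-sample correction, so the widely used approximation $e_n = (s - 1) / (2 \ln 2 \cdot n)$ with $s = 20$ is an assumption; it is disabled by default.

```python
import math
from collections import Counter


def small_sample_correction(n: int, alphabet_size: int = 20) -> float:
    """Common approximation e_n = (s - 1) / (2 ln(2) n); an assumption of this sketch."""
    return (alphabet_size - 1) / (2.0 * math.log(2) * n)


def logo_column(residues: str, correct: bool = False) -> dict:
    """Return the letter heights h_{a,i} (in bits) for one alignment column."""
    n = len(residues)
    freqs = {a: c / n for a, c in Counter(residues).items()}      # f_{a,i}
    entropy = -sum(f * math.log2(f) for f in freqs.values())      # H_i
    e_n = small_sample_correction(n) if correct else 0.0
    stack_height = math.log2(20) - (entropy + e_n)                # Y_i
    return {a: f * stack_height for a, f in freqs.items()}        # h_{a,i}


# A fully conserved column reaches the maximum height log2(20) ≈ 4.32 bits,
# whereas a column with the 20 residues in equal proportion has a height of about 0.
conserved = logo_column("AAAA")
mixed = logo_column("AAST")
```

With the correction enabled, very small alignments can produce a negative stack height; how such cases are handled for the figures is not stated in the text.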

6.6 conclusion

Proposing biologically meaningful criteria for data extraction from UniProtKB is a non-trivial task. First, there may be no annotation regarding the studied problem (e.g. non-Nα-acetylated). Second, it requires a complete view of the techniques and procedures applied to sequence a protein. Otherwise inadequate queries may be built, leading to improper datasets (i.e. noise, in the terminology of a learning set). An example is the COFRADIC method, which is based on the acetylation of the N-terminal residue and thus does not allow naturally Nα-acetylated proteins to be discriminated. That is why, during this work, we found that a database of criteria is lacking. Such a database could contain a description of the criteria to combine in order to obtain the entries that correspond to a given feature. Such a service could incorporate the improvements made by researchers in order to provide state-of-the-art criteria. We have provided what we think are the state-of-the-art criteria to build a dataset composed of Nα-acetylated and non-Nα-acetylated entries from UniProtKB. The criteria were thoroughly described in Charpilloz et al. [2014].


Figure 62: Sequence logo for the initiator methionine cleavage of the 2012 dataset (extracted from the release 2012_07). There is one logo per taxon and per PTM status (i.e. undergoing initiator methionine cleavage or not).


Figure 63: Sequence logo for the initiator methionine cleavage of the 2015 dataset (extracted from the release 2015_07). There is one logo per taxon and per PTM status (i.e. undergoing initiator methionine cleavage or not).


Figure 64: Sequence logo for the Nα-terminal acetylation of the 2012 dataset (extracted from the release 2012_07). There is one logo per taxon and per PTM status (i.e. undergoing Nα-terminal acetylation or not). Proteins that undergo initiator methionine cleavage have their initiator methionine removed from their sequence. The y-axis has also been rescaled to provide a better view of the representation.


Figure 65: Sequence logo for the Nα-terminal acetylation of the 2015 dataset (extracted from the release 2015_07). There is one logo per taxon and per PTM status (i.e. undergoing Nα-terminal acetylation or not). Proteins that undergo initiator methionine cleavage have their initiator methionine removed from their sequence. The y-axis has also been rescaled to provide a better view of the representation.

7 MOTIFS-TREES

In this chapter we describe the model used to predict the Nα-terminal acetylation and the initiator methionine cleavage, together with its building process. The chapter starts with the definition of the biomolecular motif descriptors (or motifs) and explains how such motifs are used and discovered. Then the manner of combining motifs to create rules is explained, leading to the motifs-tree.

7.1 motivation

One of our goals is to propose a classifier that helps to understand the sequence requirements of the enzymes that process the proteins. We decided to use motifs, as they are the simplest way to describe these requirements. Motifs are already widely used, and there exist databases devoted to motifs, like PROSITE, to identify specific sites in proteins (e.g. the well-known zinc finger structural motif) [Sigrist et al., 2013]. However, a single motif may not be enough to predict the Nα-terminal acetylation. The reason is that there exist at least six N-terminal acetyltransferases (NatA, B, C, D, E and F). Hence, as pointed out by Eisenhaber and Eisenhaber [2010], it is unlikely that a unique pattern describing the requirements of all enzymes can be discovered: there is no biological sense in building an acetylation descriptor based on an average motif, as no enzyme recognizes this average motif. The motifs are therefore combined in a decision-tree manner. Each motif can be used to compute a similarity score by matching the motif with an amino acid sequence [Gonnet and Lisacek, 2002]. These scores are then compared with cutoff values to discriminate sequences into two groups [Bucher et al., 1996]. Hence, these descriptors can be used in a binary decision tree where they correspond to the test nodes [Yan et al., 2011]. Decision trees were still among the ten preferred choices in data mining as of 2008 [Wu et al., 2008]. Researchers and end-users of machine learning approaches tend to prefer methods that produce interpretable models, the so-called white box models. This is especially interesting in biology and bioinformatics, where there is often a strong interest in being able to explain the features or the rules discovered by the model, which can help researchers to understand the underlying phenomena. Both models, the motif and the decision tree, have the property of being highly readable, and thus have the best chance of forming a white box model when combined. We call such a model a motifs-tree.

7.2 sequence motif

Families of proteins sharing similar biological functions often contain conserved domains of amino acids. For instance, the asparagine glycosylation site motif¹ usually contains a subsequence of the following form:

N¬[P][ST]¬[P].    (84)

Such a regular expression is read as: an asparagine followed by any amino acid except a proline, followed by a serine or a threonine, and finally followed by any amino acid except a proline. Such an expression is called a sequence motif and can match 722 sequences: the first position in the motif only matches the asparagine, the third position matches two amino acids, and the second and fourth positions each match 19 amino acids (1 × 19 × 2 × 19 = 722). There are several ways to describe motifs; they are often represented as a regular expression (like the asparagine glycosylation site) or as a position-specific weight matrix (PWM). Such a matrix contains log-odds weights for computing a matching score, which is used with a cutoff value to decide whether a sequence matches the motif or not [Stormo, 1988].

1. This pattern corresponds to the entry PS00001 in the PROSITE database.
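The count of 722 sequences for the N-glycosylation motif can be checked with a few lines of Python, both as a product of the number of admissible residues per position and by brute-force enumeration of all four-residue sequences. The variable and function names are chosen here for illustration.

```python
from itertools import product

AMINO_ACIDS = "GAVLIPFYWSTNQCMDEHKR"   # the twenty standard amino acids

# One set of admissible residues per position of N ¬[P] [ST] ¬[P]
MOTIF = [
    {"N"},
    set(AMINO_ACIDS) - {"P"},
    {"S", "T"},
    set(AMINO_ACIDS) - {"P"},
]


def matches(seq: str) -> bool:
    """True if `seq` has the motif length and every residue is admissible."""
    return len(seq) == len(MOTIF) and all(
        residue in allowed for residue, allowed in zip(seq, MOTIF)
    )


# Product of the number of choices per position: 1 * 19 * 2 * 19
n_by_product = 1
for allowed in MOTIF:
    n_by_product *= len(allowed)

# Brute force over the 20**4 = 160 000 possible four-residue sequences
n_by_enumeration = sum(
    matches("".join(residues)) for residues in product(AMINO_ACIDS, repeat=4)
)
```

Both counts agree, which confirms that only the second and fourth positions contribute a factor of 19.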

Yet in our approach we decided to represent a motif like a regular expression, with the use of a cutoff value to decide whether a sequence matches the motif or not. Moreover, we also want to align the motif, because we have no clue about the positions of the residues that may influence the enzymes responsible for the Nα-terminal acetylation. We are also not sure whether the distances between relevant features are constant among all positive examples. When a motif is aligned with a sequence, the order of its elements is conserved but they do not need to be consecutive anymore; that is to say, gaps are allowed. We think it is a desirable feature to let the elements of the motif slide along the sequence. To illustrate this feature, let us consider the following sequences: ASTVCYWQ, SAVTGEDA, RNASPTFF, FMSCIATT, SHLKWCGT. Computing the frequencies of amino acids at each position does not provide specific information about the composition of those sequences. Now let us assume that the feature specific to these sequences is that they contain a serine followed by a threonine, and that the closer the serine and the threonine are, the more strongly the feature is present. In this case the motif ST is useful for only one sequence, ASTVCYWQ, and is not detected in the other cases because there are amino acids between the serine and the threonine. We may consider the special token X{p,q}, which allows between p and q amino acids, and propose SX{0,6}T. This motif matches all the previous sequences, but it does not penalize residues that are too far from each other. When considering an alignment, it is possible to penalize sequences where the serine and the threonine are too far from each other by introducing gaps. Gaps are elements that can be inserted between the elements of a motif in order to match an amino acid in the sequence.
For example, the alignment of the simple motif ST with the five sequences produces: ST for ASTVCYWQ, S--T for SAVTGEDA, S-T for RNASPTFF, S---T for FMSCIATT and S------T for SHLKWCGT, where each gap is represented by a dash. If a match counts as 1 and a gap is penalized by 0.1, the alignments produce, in order, the scores 2, 1.8, 1.9, 1.7 and 1.4. Each of these scores represents a level of detection of the motif ST in the sequence. When used with a threshold, they allow a fine selection of the sequences that sufficiently match the motif. The following section describes an algorithm to align a sequence and a motif and to obtain alignment scores.
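The five scores of the ST example can be reproduced with a small dynamic-programming sketch. It is not the alignment procedure used in this work (described in the next section): it assumes, as the example suggests, that every motif letter must be matched, that only the residues skipped between two consecutive motif letters are penalized, and that residues before the first or after the last motif letter are free. The function name is hypothetical.

```python
def gapped_motif_score(motif: str, seq: str, gap_penalty: float = 0.1) -> float:
    """Best score of `motif` slid along `seq`, order preserved, gaps allowed.

    Each matched letter counts 1 and each residue skipped between two
    consecutive motif letters costs `gap_penalty` (assumptions of this sketch).
    Returns -inf if the motif letters cannot all be placed.
    """
    neg = float("-inf")
    # best[i]: best score with the current motif letter placed on seq[i]
    best = [1.0 if residue == motif[0] else neg for residue in seq]
    for letter in motif[1:]:
        new = [neg] * len(seq)
        for i, residue in enumerate(seq):
            if residue != letter:
                continue
            for k in range(i):                     # previous letter placed on seq[k]
                if best[k] > neg:
                    skipped = i - k - 1            # residues covered by gaps
                    new[i] = max(new[i], best[k] + 1.0 - gap_penalty * skipped)
        best = new
    top = max(best, default=neg)
    return round(top, 10) if top > neg else neg


sequences = ["ASTVCYWQ", "SAVTGEDA", "RNASPTFF", "FMSCIATT", "SHLKWCGT"]
scores = [gapped_motif_score("ST", s) for s in sequences]
# scores -> [2.0, 1.8, 1.9, 1.7, 1.4]
```

Comparing each score with a cutoff (for instance 1.75) then selects only the sequences in which the serine and the threonine are close enough.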

7.2.1 Aligned motif

In the present work we use the Needleman-Wunsch algorithm to align an amino acid sequence and a motif. The Needleman-Wunsch algorithm is well known in bioinformatics and widely used for optimal global alignment. We provide a short textual description of the algorithm. The reader interested in the Needleman-Wunsch algorithm may find more detailed information in Needleman and Wunsch [1970], and more generally on sequence analysis in Durbin et al. [1998].

The idea is to build the best alignment by using the optimal solutions of smaller fragment alignments (from the end of the sequences to the beginning). With two sequences $x = (x_1, \dots, x_n)$ of length $n$ and $y = (y_1, \dots, y_m)$ of length $m$, we recursively fill the values of an $(n + 1) \times (m + 1)$ matrix $F$ by using the following scheme:

$$F_{i,j} = \max \begin{cases} F_{i-1,j-1} + \phi(x_i, y_j) \\ F_{i-1,j} + \gamma \\ F_{i,j-1} + \gamma \end{cases} \tag{85}$$

where $\phi$ is the match function between an element $x_i$ and an element $y_j$, and $\gamma$ is the gap penalty. The gap penalty can be modeled by a fixed value (usually $\gamma < 0$) or by a function of the number of gaps. The latter may be used to penalize the alignment more heavily if it uses too many consecutive gaps. As the model is additive, the best score is reached in $F_{n,m}$, the last cell of the matrix, whose indices start at zero. Before applying the recursion, the first row and the first column of $F$ must be filled. This is done by applying the scheme in equation 85 with $F_{0,0} = 0$, which gives $F_{i,0} = i\gamma$ and $F_{0,j} = j\gamma$. Once the matrix $F$ is filled, the alignment is built by backtracking the path that produced the value in $F_{n,m}$ down to $F_{0,0}$. When a diagonal move is made in the matrix, i.e. from $(i, j)$ to $(i - 1, j - 1)$, the symbols $x_i$ and $y_j$ are aligned. When a vertical or horizontal move is made, i.e. from $(i, j)$ to $(i, j - 1)$ or $(i - 1, j)$, a gap is added in $x$ or $y$. However, in the case of a matching score we are only interested in the best score, which is provided by $F_{n,m}$. The complexity of the Needleman-Wunsch algorithm is $O(nm)$ when it aligns the sequences $x$ and $y$ ($|x| = n$ and $|y| = m$). This complexity may be a drawback when thousands of long sequences must be aligned with a long motif.
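The recursion of equation 85 and its initialization can be sketched as follows. Only the score is computed (no backtracking), with a fixed gap penalty; the function name and the identity scoring function used in the example are illustrative choices, not the implementation used in this work.

```python
from typing import Callable, Sequence


def needleman_wunsch_score(
    x: Sequence[str],
    y: Sequence[str],
    phi: Callable[[str, str], float],
    gamma: float,
) -> float:
    """Score of the optimal global alignment of x and y (equation 85)."""
    n, m = len(x), len(y)
    # (n + 1) x (m + 1) matrix, indices running from 0 to n and from 0 to m
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gamma                      # first column: only gaps
    for j in range(1, m + 1):
        F[0][j] = j * gamma                      # first row: only gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + phi(x[i - 1], y[j - 1]),   # align x_i with y_j
                F[i - 1][j] + gamma,                         # gap
                F[i][j - 1] + gamma,                         # gap
            )
    return F[n][m]                               # last cell holds the best score


def identity(a: str, b: str) -> float:
    """Toy match function: 1 for identical symbols, 0 otherwise."""
    return 1.0 if a == b else 0.0


score = needleman_wunsch_score("AST", "ST", identity, gamma=-1.0)
# The best alignment puts A against a gap, then S with S and T with T: -1 + 1 + 1 = 1
```

The two nested loops make the $O(nm)$ cost explicit: every cell of the matrix is visited exactly once.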

7.3 tokens

With the use of regular expressions to represent motifs and with the Needleman-Wunsch algorithm, we can now describe the building blocks of a motif, namely the tokens. A token is a symbol with a scoring scheme attached. The scoring scheme measures the similarity of the token with an amino acid. Hence a token can be viewed as a function (see equation 85):

$$\phi_t : \mathcal{A} \to \mathbb{R}, \tag{86}$$

with $\mathcal{A}$ being the set of amino acids (encoded with the one-letter code):

$$\mathcal{A} = \{g, a, v, l, i, p, f, y, w, s, t, n, q, c, m, d, e, h, k, r\}. \tag{87}$$

Note that $\mathcal{A}$ is defined without selenocysteine (Sec or U) and without pyrrolysine (Pyl or O). For instance, a simple token is the fixed amino acid, which is a token imposing the presence of a given amino acid. It can be viewed as character matching with the one-letter code. This token produces a value of one if the symbols match and zero if not. For example, the token imposing the presence of a cysteine (c) is:

$$\phi_c(x) = \begin{cases} 1 & \text{if } x = c \\ 0 & \text{if } x \neq c \\ \gamma & \text{if a gap is inserted} \end{cases} \tag{88}$$

with $\gamma$ the gap penalty. A motif is then an ordered, finite-length list of tokens. The types of tokens selected to build a motif are listed in the following sections.

7.3.1 Any amino acid

The first and simplest token is the any amino acid token. It always returns one when aligned with an amino acid. Its representation is • and the corresponding function is:

$$\phi_\bullet(x) = \begin{cases} 1 & \text{if } x \text{ is an amino acid} \\ \gamma & \text{if a gap is inserted} \end{cases} \tag{89}$$

7.3.2 Fixed amino acid

The second token is the one that has already been given as a first example, the fixed amino acid. The fixed amino acid is a token that matches only one specific amino acid. Its similarity measure is always one for the specified amino acid and zero for the others. Its representation is the one-letter code of the amino acid (e.g. W for tryptophan), and its corresponding function is the one given in equation 88, where c (cysteine) can be replaced by any amino acid of $\mathcal{A}$ as defined in equation 87.

7.3.3 Included or excluded amino acids

The next two tokens are the inclusion token and the exclusion token. Both represent a set of amino acids and are very similar, one being the complement of the other. The inclusion token produces a similarity of one if the aligned amino acid is in a given set of amino acids $A$ ($A \subseteq \mathcal{A}$) and zero otherwise. Its notation is [...], where the amino acids defining the set lie between the brackets (e.g. [ACT]). The associated function is:

$$\phi_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A \\ \gamma & \text{if a gap is inserted} \end{cases} \tag{90}$$

Regarding the exclusion token, its definition is the same except for its representation, which lists the amino acids producing a score of zero: ¬[...]. For example, the token ¬[ACT] produces a score of one when aligned with any amino acid of $\mathcal{A} \setminus \{a, c, t\}$ and zero when aligned with one of $\{a, c, t\}$.
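The any, fixed, inclusion and exclusion tokens of equations 88 to 90 can be sketched as small Python closures. Representing a gap by `None` and building the exclusion token as the inclusion token of the complementary set are illustrative choices of this sketch, not a description of the actual implementation.

```python
from typing import Callable, Iterable, Optional

AMINO_ACIDS = frozenset("GAVLIPFYWSTNQCMDEHKR")   # the set A of equation 87
Token = Callable[[Optional[str]], float]           # None stands for a gap


def make_token(score: Callable[[str], float], gamma: float) -> Token:
    """Wrap a residue scoring function so that a gap returns the penalty gamma."""
    def phi(x: Optional[str]) -> float:
        return gamma if x is None else score(x)
    return phi


def any_token(gamma: float) -> Token:
    """Equation 89: one for every amino acid."""
    return make_token(lambda x: 1.0, gamma)


def fixed_token(residue: str, gamma: float) -> Token:
    """Equation 88: one only for the given amino acid."""
    return make_token(lambda x: 1.0 if x == residue else 0.0, gamma)


def inclusion_token(residues: Iterable[str], gamma: float) -> Token:
    """Equation 90: one for the amino acids contained in the set."""
    allowed = frozenset(residues)
    return make_token(lambda x: 1.0 if x in allowed else 0.0, gamma)


def exclusion_token(residues: Iterable[str], gamma: float) -> Token:
    """Complement of the inclusion token: zero for the listed amino acids."""
    return inclusion_token(AMINO_ACIDS - frozenset(residues), gamma)


not_act = exclusion_token("ACT", gamma=-0.1)
# not_act("G") is 1.0, not_act("A") is 0.0 and not_act(None) is the gap penalty -0.1
```

A motif is then simply a list of such tokens, each of which can serve as the match function of one motif position in the alignment.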

7.3.4 Amino acid physicochemical similarity

The last type of token is the amino acid physicochemical similarity token, which describes how similar an amino acid is to a reference amino acid according to a physicochemical property from the AAindex1 database [Kawashima and Kanehisa, 2000]. The values displayed in table 28 are an example of a physicochemical property and correspond to the hydropathy index for the twenty amino acids. Such tokens are represented by the reference amino acid r followed by the AAindex1 entry p ({r, p}). For example, {S, KYTJ820101} is a token for which the amino acids with a hydropathy index [Kyte and Doolittle, 1982] (table 28) similar to that of serine have a high similarity score. The similarity is computed as follows:

$$\sigma_{\{r,p\},a} = 1 - |\bar{p}(r) - \bar{p}(a)|, \tag{91}$$

where $\bar{p}(x)$ is the value of the property for $x$, normalized between 0 and 1. Hence the associated function is:

$$\phi_{\{r,p\}}(x) = \begin{cases} \sigma_{\{r,p\},x} & \text{if } x \text{ is an amino acid} \\ \gamma & \text{if a gap is inserted} \end{cases} \tag{92}$$

Table 28: Hydropathy index (KYTJ820101) from the AAIndex1 database for the twenty amino acids.

| Amino acid | a | r | n | d | c | q | e | g | h | i |
|---|---|---|---|---|---|---|---|---|---|---|
| Value | 1.8 | -4.5 | -3.5 | -3.5 | 2.5 | -3.5 | -3.5 | -0.4 | -3.2 | 4.5 |

| Amino acid | l | k | m | f | p | s | t | w | y | v |
|---|---|---|---|---|---|---|---|---|---|---|
| Value | 3.8 | -3.9 | 1.9 | 2.8 | -1.6 | -0.8 | -0.7 | -0.9 | -1.3 | 4.2 |

Table 29: Summarized list of the types of tokens used to build a motif.

| Token type | Description | Representation |
|---|---|---|
| Any | Any amino acid | • |
| Fixed | Only one amino acid | A, matches only the alanine (Ala or a) |
| Inclusion | Only the amino acids contained in a set | [MTV], matches only the methionine (Met, m), threonine (Thr, t) and valine (Val, v) |
| Exclusion | Only the amino acids not contained in a set | ¬[GAVLIPFYWS], matches only the Thr, Asn, Gln, Cys, Met, Asp, Glu, His, Lys and Arg |
| Physicochemical | Physicochemical similarity to a given amino acid regarding a given property | {F, KYTJ820101}, produces a higher score for the amino acids having a hydropathy index close to that of the phenylalanine |
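The similarity of equation 91 can be illustrated with the hydropathy values of table 28. The text only states that the property is normalized between 0 and 1; the min-max normalization used below is therefore an assumption of this sketch, as are the function names.

```python
# Hydropathy index KYTJ820101 as listed in table 28
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def normalize(prop: dict) -> dict:
    """Min-max normalization of a property to [0, 1] (assumed scheme)."""
    low, high = min(prop.values()), max(prop.values())
    return {aa: (value - low) / (high - low) for aa, value in prop.items()}


def similarity(reference: str, residue: str, prop: dict = HYDROPATHY) -> float:
    """Equation 91: 1 - |p(r) - p(a)| on the normalized property."""
    p_bar = normalize(prop)
    return 1.0 - abs(p_bar[reference] - p_bar[residue])


# Threonine is almost as hydropathic as serine, so the token {S, KYTJ820101}
# scores it close to 1, whereas isoleucine and arginine lie at opposite ends
# of the scale and get a similarity of 0.
s_vs_t = similarity("S", "T")
i_vs_r = similarity("I", "R")
```

Unlike the inclusion and exclusion tokens, this token returns graded values, which lets a motif position express a preference rather than a strict requirement.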

All the described tokens are summarized in table 29. These tokens can be generated for any of the standard amino acids or for a combination of them. This leads to a total number of tokens of the order of $10^6$. Table 30 summarizes how this order of magnitude is computed.


Table 30: Numbers of possible tokens generated, sorted by type. The total number of possible tokens is of the order of 10^6. The total number of possible motifs is of the order of 10^(6n), with n the maximum length of the motif.

Token type        Number of tokens      Explanation
Any               1
Fixed             20                    Only the old twenty standard amino acids are considered (i.e. A, equation 87)
Inclusion         2^20 = 1 048 576      This is the number of all the possible subsets of A (equation 87)
Exclusion         2^20 = 1 048 576      Same as the inclusion token
Physicochemical   544 · 20 = 10 880     The AAindex1 database is composed of 544 entries (as of May 2015)
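The orders of magnitude quoted in table 30 can be checked with a few lines of Python. The variable names are chosen for this example only; the counts themselves come from the table.

```python
import math

N_AMINO_ACIDS = 20   # standard amino acids
N_AAINDEX1 = 544     # AAindex1 entries quoted in table 30

token_counts = {
    "any": 1,
    "fixed": N_AMINO_ACIDS,
    "inclusion": 2 ** N_AMINO_ACIDS,       # all subsets of the alphabet
    "exclusion": 2 ** N_AMINO_ACIDS,
    "physicochemical": N_AAINDEX1 * N_AMINO_ACIDS,
}
total = sum(token_counts.values())


def motif_space_order(n):
    """Order of magnitude (power of ten) of the number of n-token motifs."""
    return math.floor(n * math.log10(total))
```

The total is dominated by the inclusion and exclusion tokens, which is why the number of motifs grows by roughly six orders of magnitude per additional token.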

7.4 Motif search by genetic algorithm

Several approaches have been proposed to discover motifs. Some search for enriched motifs in families of proteomic or genomic sequences [Bailey et al., 2015, Doğruel et al., 2008, Roepcke et al., 2006, Thijs et al., 2002]. Their drawback is that they assume a priori that one or more motifs are shared by the sequences. This is not the case when we tackle the prediction of Nα-terminal acetylation because of the variety of the acetylated proteome. There are also approaches to discover discriminant motifs. Such methods use positive and negative sequences to discover motifs that globally match the positive sequences and not the negative sequences. However they use a PWM-like representation or regular expressions that are not meant to be aligned [Mehdi et al., 2013, Redhead and Bailey, 2007].

We search for the best discriminant motif with a genetic algorithm (GA). The genetic algorithm is a well-known method used to find good solutions to complex optimization problems [Goldberg, 1989]. It searches for a solution to an optimization problem by mimicking Darwinian evolution (reproduction, inheritance, mutation and selection) to explore the space of admissible solutions. The idea is to reproduce a survival-of-the-fittest model, where several solutions of the problem are generated, then modified by bio-inspired methods, and the best ones are selected for the next round of evolution. In the genetic algorithm terminology, an admissible solution of the problem is called an individual, its representation is called a genome and its elements are called genes. All the individuals of the evolution process form the population.

To understand how we used the genetic algorithm to approximate the best motif, we need to define what an individual is, what the initial population is, how the fitness function is computed, and which genetic operators are used.

7.4.1 Individual

An individual represents a candidate solution of the considered problem. In our setup, an individual is a sequence of tokens of variable length. The initial population is randomly created by generating n individuals with m tokens,


Figure 66: (a) The one-point crossover, as its name implies, uses only one point to select how the parents' genes are recombined into the offspring's genes. The dashed line represents the crossover point. (b) The cut and splice crossover, where a different crossover point (black or white dashed lines) is randomly selected in each parent. Then the genes are swapped between the parents to form the offspring.

randomly drawn from the categories of tokens described in table 29, with m equal to the length of the considered fragment in the dataset.

7.4.2 Genetic operators

The operators used are the crossover (or recombination), the mutation, the selection and a more peculiar one, the plague operator.

The crossover used is the cut and splice crossover. We first describe the one-point crossover, as the cut and splice crossover is a slight variation of it. In the one-point crossover, a single crossover point (or position) is selected, at the same location on both parents. The genes on both sides of the crossover point are swapped between the two parents, thus producing two new individuals forming the offspring. In the cut and splice variant, the difference is that one point is selected independently in each parent (figure 66).

The mutation operator affects only one gene in the individual, that is to say a token. The mutation simulates the errors happening with a given probability during the reproduction phase. This operator helps the exploratory process to escape local optima by jumping to another location in the solution landscape. We used a one-point mutation, which changes the value of one gene in the individual. Our mutation operator can insert a random new token t in the motif at a random position i (ζ(i, t)), delete the token in the motif at a random position i (δ(i)) or substitute the token in the motif at a random position i by a random new token t (ς(i, t)). Figure 67 illustrates the application of the three types of mutation used. It is interesting to note that, for a token, a substitution can be seen as the composition of two consecutive mutations: a deletion followed by an insertion (ς(i, t) = δ(i) ◦ ζ(i, t)), or an insertion followed by a deletion (ς(i, t) = ζ(i, t) ◦ δ(i + 1) = ζ(i + 1, t) ◦ δ(i)).

We also define a plague operator, which is used to simplify an individual (i.e. a motif). This operator removes the tokens that do not improve the quality of the solution, and simplifies inclusion and exclusion tokens. Therefore this operator improves the readability of a motif without altering its discriminant power. This step is important as we try to build an interpretable model. Hence the plague is an operator used to delete the so-called bloat, which consists of useless tokens in a motif that do not help to improve the quality of the solution [Banzhaf et al., 1998]. Individuals are sequences of tokens and it is possible that some tokens in the motif do not improve the discrimination capability of the motif. Figure 68 is an example of bloat in the case of a motif and a sequence aligned with the Needleman-Wunsch


Figure 67: Example of application of the mutation operator used in our evolutionary algorithm setup. This diagram shows the application of the three types of mutation on an individual. It is interesting to note that the substitution is a strong mutation, as it corresponds to the application of an insertion and a deletion.

[Figure 68, alignment panels: (a) the three sequences MATWCKK, SATWCAT and SAPVRMC; (b) their alignments with the motif ATTTTW; (c) their alignments with the motif ATW.]

Figure 68: Example of bloat in alignments with a motif containing bloat (ATTTTW) and a motif without bloat (ATW). In this example, the alignment is a strict match between the sequence and the motif. The bloat is composed of three threonines (TTT). Both motifs provide the same discriminant capacity: they align better on the first two sequences than on the third sequence. In this context, the three additional Thr in the motif reduce its readability, as a single Thr is enough. (a) The three sequences on which the alignments are performed. (b) The alignments of the motif containing bloat. (c) The alignments of the motif without the bloat.

algorithm. The bloat produces longer motifs, leading to the two following issues:

- A computational problem, as an alignment with the Needleman-Wunsch algorithm has a time complexity of O(nm), with n the amino acid sequence length and m the motif length. A shorter motif therefore speeds up the alignment, and thus the fitness evaluation.
- A readability problem, because a motif that is too long can be hard to interpret, and is consequently useless for making a biological interpretation in order to understand the underlying phenomenon.

So it is necessary to have an operator that simplifies the motifs. After an evolution process (i.e. after the last iteration of the evolutionary algorithm) the plague is applied to the individual representing the best solution found over all iterations. The plague operation works as described below.

1. It randomly deletes a token from the motif and evaluates the fitness of the result. If the fitness value stays the same or is improved, the token is definitively deleted, otherwise the token stays in the motif. This process stops after n successive unsuccessful attempts.
2. Every inclusion token or exclusion token goes through a simplification process where an amino acid in the set representing the token is removed. If the fitness stays the same or is improved, the amino acid is definitively removed from the set. Like the previous step, this operation stops after m successive unsuccessful attempts.
3. It randomly replaces a token by the any amino acid token. If the fitness stays the same or is improved, the token is definitively replaced.
4. At the end, every exclusion token or inclusion token is transformed into an inclusion token or exclusion token, respectively, if there are more than 10 amino acids in its set. For example, the token described with the set t̄ = { D, E, F, G, H, I, K, L, M, N, P, Q }, which excludes the presence of the above amino acids, is equivalent to the token including the amino acids in the following set:

t = { A, C, R, S, T, V,W, Y }.
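Steps 1 and 4 of the plague can be sketched as follows. This is an illustrative sketch, not the implementation used in this work: it assumes that the fitness is maximized, that the failure counter is reset after each successful deletion, and that a motif is never reduced to zero tokens. The function names and the `fitness` callable are hypothetical.

```python
import random

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def plague_remove(motif, fitness, max_failures, rng=random):
    """Step 1: delete tokens whose removal does not lower the fitness."""
    motif = list(motif)
    failures = 0
    while failures < max_failures and len(motif) > 1:
        i = rng.randrange(len(motif))
        candidate = motif[:i] + motif[i + 1:]
        if fitness(candidate) >= fitness(motif):
            motif = candidate        # deletion accepted
            failures = 0             # assumed: counter restarts on success
        else:
            failures += 1            # deletion rejected, token stays
    return motif


def flip_large_set(kind, residues):
    """Step 4: swap inclusion/exclusion when the set has more than 10 residues."""
    residues = set(residues)
    if len(residues) <= 10:
        return kind, residues
    flipped = "inclusion" if kind == "exclusion" else "exclusion"
    return flipped, AMINO_ACIDS - residues
```

Applied to the exclusion set of the example above, `flip_large_set` returns the inclusion set { A, C, R, S, T, V, W, Y }, which is shorter to read while matching exactly the same amino acids.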

The first step improves computing performance, while the others improve the readability of the motif. Finally, the selection operator selects the individuals that will produce new offspring with the crossover operator. We use the k-tournament selection, which draws k randomly chosen individuals and selects the one having the best fitness value.
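The crossover, mutation and selection operators described in this section can be sketched as below. The sketch treats a motif as a Python list of tokens; `random_token` is a hypothetical callable returning a new token, and the mutation probability (table 31) is assumed to be applied by the caller. Edge cases such as empty offspring are not handled the way a full implementation would need to.

```python
import random


def cut_and_splice(parent_a, parent_b, rng=random):
    """Cut each parent at its own random point and swap the tails."""
    i = rng.randint(0, len(parent_a))
    j = rng.randint(0, len(parent_b))
    # Offspring lengths may differ from the parents' (one may even be empty).
    return parent_a[:i] + parent_b[j:], parent_b[:j] + parent_a[i:]


def mutate(motif, random_token, rng=random):
    """Apply one insertion, deletion or substitution at a random position."""
    motif = list(motif)
    op = rng.choice(["insert", "delete", "substitute"])
    if op == "insert":
        motif.insert(rng.randint(0, len(motif)), random_token())
    elif op == "delete" and len(motif) > 1:
        del motif[rng.randrange(len(motif))]
    elif motif:
        # Substitution (also used instead of deleting the last token).
        motif[rng.randrange(len(motif))] = random_token()
    return motif


def tournament(population, fitness, k, rng=random):
    """Draw k individuals at random and keep the fittest one."""
    return max(rng.sample(population, k), key=fitness)
```

Unlike the one-point crossover, the cut and splice variant conserves the total number of tokens across the two offspring but not the length of each one, which is what allows motifs of variable length to emerge.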

7.4.3 Fitness computation

The proposed fitness is based on the information gain [Russell and Norvig, 2010] and is computed in the same way as in the C4.5 algorithm. To understand how it is done, we start by defining what is considered as an attribute. Indeed, the evolutionary algorithm used should be seen as a generator of attributes, each having only two possible values for any instance. Let us first denote a motif-sequence matching or scoring function

\[
\psi : A^n \times T^q \to \mathbb{R}, \tag{93}
\]

with A the set of amino acids and T the set of all possible tokens. Thus A^n is the set of sequences of length n and T^q the set of motifs composed of q tokens. Next, a pair composed of a motif m and a threshold τ is an attribute having only two possible values for an instance (i.e. a sequence): no match and match (respectively symbolized by {⊥, ⊤}). The attribute value for the sequence s is computed as:

\[
v(s, m, \tau) =
\begin{cases}
\bot & \text{if } \psi(s, m) < \tau \\
\top & \text{otherwise}
\end{cases} \tag{94}
\]

With the previous definitions and the next one, we can compute the information gain G for a given motif against a set of instances I:

\[
S_\star = \{ (s, c) \in I \mid v(s, m, \tau) = \star \} \tag{95}
\]

and

\[
G(I) = H(I) - \left[ \frac{|S_\bot|}{|I|} \cdot H(S_\bot) + \frac{|S_\top|}{|I|} \cdot H(S_\top) \right] \tag{96}
\]

for a given motif m, a given threshold τ, and where H(·) is the entropy. With X taking the values {x1, ..., xn}, the entropy is computed as:

\[
H(X) = - \sum_{i} P(x_i) \log_b P(x_i) \tag{97}
\]

where P(xi) is the probability of xi occurring². Usually we use the Shannon entropy, hence b = 2. The pair {m, τ} producing the highest gain is the one selected by the fitness function.

The individuals in the evolutionary algorithm are only motifs, but to compute the fitness it is necessary to provide the alignment score thresholds. For a given motif, the candidate thresholds are the matching scores computed with ψ(·, ·). With n sequences there are obviously at most n different scores, thus n candidates to test to find the threshold. However several sequences can produce the same matching score for a given motif. Hence in an evolutionary process with g generations, m individuals and a learning set containing n instances, the number of attributes generated is bounded by gnm.
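The fitness computation of equations 94 to 97 can be sketched in a few lines of Python. The sketch assumes that the alignment scores ψ(s, m) have already been computed for every sequence; the function names are chosen for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels (equation 97)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(scores, labels, tau):
    """Equation 96 for the split 'score < tau' versus 'score >= tau'."""
    no_match = [c for s, c in zip(scores, labels) if s < tau]
    match = [c for s, c in zip(scores, labels) if s >= tau]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (no_match, match) if part)
    return entropy(labels) - remainder


def best_threshold(scores, labels):
    """Try every distinct score as threshold and return (gain, tau)."""
    return max((information_gain(scores, labels, tau), tau)
               for tau in set(scores))
```

For instance, with the scores 0.2, 0.4, 0.8 and 0.9 and the classes no, no, yes and yes, the threshold 0.8 separates the two classes perfectly and yields the maximum gain of one bit.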

7.5 Motifs-tree: motif combination

A motifs-tree is a combination of motifs arranged in a decision tree manner. This means that each intermediate node of the decision tree corresponds to a test based on a motif as described in the previous sections. Each leaf is labeled with a class. A sequence traverses the tree from the root to a leaf. When a sequence reaches a node (including the root node), it is matched with the node’s motif m to get a similarity score (equation 94). This score is then compared to the node threshold (cutoff values) in order to select the next branch to take. When a sequence reaches a leaf, it is classified as undergoing a specific post-translational modification or not, depending on the leaf label. This representation is highly readable as each path in the tree from the root to a given leaf can be represented as a logical clause in conjunctive normal form. Such a tree and how a sequence moves through the tree to its assigned label are illustrated in figure 69.

2. In the case of P(xi) = 0 the value of 0 · logb(0) is 0.


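[Figure 69, tree diagram: the input sequence MTSVD enters the root node M[ACGPS]...., whose branches are labeled no match (left) and match (right). The left branch leads to the node M[TV]..¬[EK], whose branches are again labeled no match (left) and match (right). The sequence MTSVD ends in the leaf reached through the match branch of this second node.]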

Figure 69: A graphical representation of a motifs-tree. Each node is a test, in this case a regular expression producing either a match or a no match. A bracket expression [...] matches a single character that is contained within the brackets. A bracket expression preceded by a negation, ¬[...], matches a single character that is not contained within the brackets. A dot (•) matches any single character. The leaves are labeled with the class prediction, in this case PTM and No PTM. An input amino acid sequence, represented with the one-letter code (mtsvd), is matched with the expression in the root node. If it matches, the sequence moves through the tree by following the right branch, otherwise it follows the left branch. As there is no match between the sequence and the expression in the root, the sequence follows the left branch and is tested against the second node. The sequence matches this expression, follows the right branch and ends in a leaf labeled PTM, which is the predicted class for mtsvd.
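The traversal illustrated in figure 69 can be reproduced with a short Python sketch. As in the figure, the node tests are plain regular expressions rather than aligned motifs with a score threshold, and ¬[EK] is written [^EK]. The labels of the two leaves that the sequence mtsvd does not reach are not given in the text; they are marked as assumed placeholders.

```python
import re

# Each internal node is a tuple (pattern, no_match_child, match_child).
# Leaves are strings. Labels marked "assumed" are placeholders only.
SECOND_NODE = (r"M[TV]..[^EK]", "No PTM (assumed)", "PTM")
ROOT = (r"M[ACGPS]....", SECOND_NODE, "PTM (assumed)")


def classify(sequence, node):
    """Walk from the root to a leaf; return (label, visited tests)."""
    path = []
    while not isinstance(node, str):
        pattern, no_match_child, match_child = node
        matched = re.match(pattern, sequence.upper()) is not None
        path.append((pattern, matched))
        node = match_child if matched else no_match_child
    return node, path
```

Calling `classify("mtsvd", ROOT)` follows the no match branch at the root, since threonine is not in [ACGPS], then the match branch at the second node, and ends in the leaf labeled PTM, as described in the caption.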


7.5.1 Motifs-tree growth

The learning method used to build a motifs-tree is based on the well-known decision tree growth algorithm C4.5 [Quinlan, 1992, 1986]. It consists of a recursive split of the learning set until a stop condition is met on the learning (sub)set or on the tree structure, like depth or number of nodes. Such a stop condition is usually called prepruning. The idea is that each node composing the tree is built in the same manner but with a different subset of the learning set. Each node of the tree is a test that splits the learning (sub)set into two subsets.

First, the root node motif is built on the full learning set. The root motif splits the learning set into two subsets, which imply the creation of two branches from the root node. Then, for each subset, if the stop condition is met, the class distribution of the subset is used to build a leaf. Its function is to decide which class the tree will assign to the proteins or sequences reaching it. Otherwise the node building procedure is repeated, producing two children that may lead to new splits of the learning subsets. This procedure is recursively repeated until a stop condition is met.

In the case of a classification tree, the stop condition may be based on the learning subset (size or subset class composition). For example, if the composition of the subset shows a significant majority for one class, the subset is not split anymore and the class represented by the leaf corresponds to the majority class. Or, if the subset is small enough, again the subset is not split anymore and the leaf class corresponds to the majority class. A pseudo-code description is provided in algorithm 6.

7.5.2 Motifs-tree pruning

Once the motifs-tree is grown, the tree may be too specific to the instance set used to build it (i.e. overfitting). Hence it may not have inferred rules that are general to the tackled problem and may be inefficient in predicting the right label for unknown sequences (i.e. sequences not used during the growth phase). To minimize the error of the motifs-tree we first use prepruning to stop the growth and avoid the full development of subtrees. As an example of prepruning, we can enforce a minimum number of examples required to do a split. Below this minimum the split is not applied and the set of examples is used to build a leaf. The depth of the branch can also be used in prepruning. If the branch is too deep, a leaf is inserted and labeled with the majority class within the remaining instances.

Then we also use postpruning, which may replace a subtree by a single leaf. The postpruning goes from the leaves to the root (bottom-up traversal) and for each internal node decides whether the node is replaced by a leaf. The label of the new leaf is the majority class of the training set reaching the subtree's root node. To make the pruning decision we rely on the estimated error of the tree and of the tree when a subtree is replaced by a leaf. We use the error estimation described in Quinlan [1987] because it has the advantage of estimating the error with the training or learning set. If the estimated error of the pruned tree is lower, the subtree is replaced by a leaf. If not, the pruning operator goes to the ancestor node in the tree and tries to replace its subtree by a leaf.

This final step concludes the description of the model used to predict post-translational modifications. Readers interested in deepening their knowledge of genetic algorithms or of decision tree growth and pruning may read the following references: Banzhaf et al. [1998] and Hastie et al. [2001].


Algorithm 6 Motifs-tree growth

Require: a set of examples I = {(si, cj)}, where si is a sequence and cj is one of the two class labels. The stop function is an implementation of a stop criterion depending on the instance set and the depth of the tree.
Ensure: the root of a decision tree.

function partition(I, m, τ)
    L, R ← ∅
    for (x, y) ∈ I do
        s ← the alignment score between m and x (equation 93)
        if s < τ then
            L ← L ∪ {(x, y)}
        else
            R ← R ∪ {(x, y)}
        end if
    end for
    return (L, R)
end function

function leaf(I)
    K ← the most frequent class in I
    return a leaf predicting K
end function

function grow(I, d)
    if stop(I, d) then
        return leaf(I)
    else
        (m, τ) ← a motif searched by GA on I
        (L, R) ← partition(I, m, τ)
        l ← grow(L, d + 1), the child when a sequence does not match m
        r ← grow(R, d + 1), the child when a sequence matches m
        return a node using the motif (m, τ) with l and r as children
    end if
end function

root ← grow(I, 0)
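A runnable counterpart of algorithm 6 can be sketched in Python as below. The genetic algorithm search is abstracted behind a hypothetical `find_split` callable returning a scoring function and a threshold, and the split convention follows equation 94 and figure 69 (no match on the left, match on the right). The guard against degenerate splits is an addition of this sketch, not part of the algorithm as stated.

```python
from collections import Counter


def majority_class(instances):
    """Return the most frequent class among (sequence, class) pairs."""
    return Counter(c for _, c in instances).most_common(1)[0][0]


def partition(instances, score, tau):
    """Split instances into (no match, match) following equation 94."""
    left = [(s, c) for s, c in instances if score(s) < tau]
    right = [(s, c) for s, c in instances if score(s) >= tau]
    return left, right


def grow(instances, depth, find_split, stop):
    """Recursively build a tree of nested dicts; leaves are class labels."""
    if stop(instances, depth):
        return majority_class(instances)
    score, tau = find_split(instances)      # stands in for the GA search
    left, right = partition(instances, score, tau)
    if not left or not right:               # degenerate split: make a leaf
        return majority_class(instances)
    return {
        "score": score,
        "tau": tau,
        "no_match": grow(left, depth + 1, find_split, stop),
        "match": grow(right, depth + 1, find_split, stop),
    }
```

A typical `stop` function for this sketch would return true when the depth limit is reached, when the subset is smaller than the bucket size, or when all the remaining instances share the same class.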


8 Motifs-trees performances and proteomic analysis for H. sapiens

In this chapter the prediction performances for the initiator methionine cleavage and the Nα-terminal acetylation are presented and discussed, along with an analysis of the two studied post-translational modifications in H. sapiens with the help of the generated motifs-trees. The performance and analysis are based on the 2012 dataset (see chapter 6).

8.1 Initiator methionine cleavage

We start by showing the classification scores obtained for the initiator methionine cleavage. As stated in section 1.4, the Nα-terminal acetylation may occur either on the initiator methionine or on the first exposed amino acid when this methionine is cleaved. Therefore, for a given protein, we need either to have the cleavage information on its initiator methionine or to use a tool to predict it. We decided to use the motifs-trees to predict the initiator methionine cleavage. This problem has already been tackled and the prediction is already quite successful [Frottin et al., 2006, Martinez et al., 2008]. We therefore also use this post-translational modification as a validation, as it is well characterized in the human proteome. This allows us to assess whether the motifs-trees can be regarded as white-box models.

8.1.1 Parameters selection and classification performance

The setup of the algorithm is detailed in table 31. We have tested a range of parameters by scanning. We obtained similar classification performances, so there was no benefit in tuning them. The search for a good discriminant motif does not seem to be sensitive to these parameters, probably because we are not using a sole motif but are combining several of them in a tree. Indeed, if a path in the tree does not provide enough performance, another motif is added. Hence the new motif compensates for the errors of the previous motif.

However there is a parameter that impacts the quality of the prediction: the number of considered N-terminal residues (the number of amino acids parameter). The original choice for this parameter is six, which is the size that disambiguates the 2012 Nα-terminal acetylation Eukaryota dataset¹. Indeed, even if we have 2558 different proteins in the Eukaryota 2012 dataset, considering only the first n residues reduces the number of possible polypeptides. For example, with only the first two residues we theoretically have 400 (20^2) possible peptides. This will produce a dataset with a lot of repetitions and ambiguities. Indeed, with so few possible peptides, some combinations will certainly appear in both the Nα-acetylated and non-Nα-acetylated examples in the training set. In this condition it is very unlikely that the algorithm is able to find a correct relationship between the sequences and the acetylation. Hence such ambiguities are removed from the dataset. So the first justification of our choice for the number of residues is that it makes all the Nα-acetylated sequences different from the non-Nα-acetylated

1. At the origin the goal was to predict Nα-terminal acetylation and this is why the parameters selection is mainly based on the Nα-terminal acetylation datasets.

sequences. With six residues there are enough possible peptides (between 20^5 and 20^6)². This lower bound for the number of residues is a legitimate choice in order to have a clean dataset. However we can question the upper bound, because the amino acids that are further in the sequence may influence the acetylation of the protein by NATs. To do a parameter selection we apply a nested cross-validation on the 2012 Nα-terminal acetylation datasets with the following values: 6, 7, 8, 9, 10, 15, 20 and 40 residues. The nested cross-validation is composed of an outer cross-validation and an inner cross-validation. The outer cross-validation is composed of k folds Fi = {Li, Ti}, with i = 1, ..., k, built on the available dataset. Then:

1. For a given Fi, the training or learning set Li is used to build k′ folds F′j = {L′j, T′j}, with j = 1, ..., k′.

2. Each parameter value (or combination of values if there is more than one parameter) is trained and tested on the F′j (the inner step).

3. The parameter value that minimizes the error is selected to be trained on Li and tested on Ti (the outer step).

4. These steps are repeated for all Fi.

5. When all Fi have been processed, the scores on the Ti are aggregated.

It should be noted that the best selected value is not necessarily the same across the inner cross-validations. Thus the Fi are not all evaluated with the same number of N-terminal residues. We applied the outer cross-validation with k = 10 and the inner cross-validation with k′ = 10. This procedure provides a good estimation of the error of the algorithm. The results are included in table 32 and are above the baseline for each taxon. The selected lengths after the inner cross-validation phase are:

- for H. sapiens: 6 (six times), 7 (three times) and 8 (one time);
- for Metazoa: 6 (four times), 7 (two times), 8 (one time), 9 (two times) and 10 (one time);
- for Eukaryota: 6 (six times), 7 (one time), 8 (one time) and 9 (two times).

Most of the time a length of 6 residues offers the best performances. However, to select the best parameter value to build a model, these values are evaluated with another 10-fold cross-validation, and the length minimizing the error is the selected value for the parameter. From these results, we already see that there is no need to use a sequence length of more than 10 residues. The following residues add noise to the data and the algorithm loses its capacity to generalize. In the case of the Nα-terminal acetylation the selected value is 6 and is confirmed by cross-validation (see dedicated section 8.2.1 and table 34). The same parameter is used for initiator methionine cleavage.

Two potential problems arise from the algorithm we used to build our classifier. The first is common to all machine learning algorithms and is the lack of generalization (see chapter 5 or Hastie et al. [2001]). The second problem is the stability of our model, that is to say the consistency of the results despite the stochastic nature of the genetic algorithm.
Indeed, we have no guarantee that every evolution will converge to a good solution. To evaluate the stability we simply applied 10 independent stratified 10-fold cross-validations on our datasets, combining the cross-validation results to obtain the average and the standard deviation. So if the cross-validation results have a high average classification score with low standard deviations,

2. These bounds arise because methionine and alanine are frequently found as the first residue.


Table 31: Parameters used to build the motifs-trees. It regroups the parameters of the genetic algorithm, plague, individual, alignment and decision tree. Those parameters are used for the motifs-trees predicting initiator methionine cleavage and Nα-terminal acetylation for the three considered taxa.

Parameter                  Value
Tournament size            5
Population size            250
Max generations            150
Mutation probability       0.75
Number of plague remove    20
Number of plague clean     100
Number of amino acids      6
Gap penalty                -0.0625
Pruning factor (α)         0.5
Bucket size                6

Table 32: Results obtained by outer and inner cross-validation (k = 10, k′ = 10) assessing the quality of the algorithm in predicting Nα-terminal acetylation on the 2012 datasets. BL stands for baseline and MCC for Matthews correlation coefficient. For a list of the selected N-terminus lengths, see text.

Taxon        BL     Accuracy   Sensitivity   Specificity   MCC
Eukaryota    0.63   0.84       0.87          0.79          0.66
Metazoa      0.71   0.86       0.90          0.77          0.67
H. sapiens   0.86   0.88       0.93          0.60          0.54

the method is stable and produces classifiers with a good generalization capability.

As we are only taking the first six residues, several proteins can be represented by the same six amino acids. In order to avoid that instances in the training (or learning) set also appear in the test set, we have removed those duplicates from the training set of each fold. This ensures that the test set contains only sequences never seen during the learning phase. The prediction results are detailed in table 33. They are very good, and we stress that the standard deviation of the scores between the 10 cross-validations is ≈ 10^-4.

However, we must pay attention to the accuracy values. Since the classes of our training set are imbalanced, a trivial classifier could easily reach a high accuracy (table 27). For instance, 72 % of our eukaryotic proteins undergo initiator methionine cleavage in our dataset, and a trivial classifier based on the majority that predicts all proteins as cleaved will obtain an accuracy score of 0.72. Nevertheless the motifs-trees greatly improve the accuracy over the baseline. For H. sapiens the baseline is 0.69 and the accuracy improvement is 0.26, for Metazoa it is 0.72 and 0.22, and for Eukaryota it is 0.72 and 0.23. These performances encourage us to produce a model for human initiator methionine cleavage prediction based on the complete human dataset. The purpose is to produce a MetAP specificity analysis based on the motifs-tree.
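The nested cross-validation used in this section for parameter selection can be sketched as follows. This is a minimal sketch under stated assumptions: the folds are built by simple shuffling (no stratification), and `train` and `error` are hypothetical callables standing for the construction of a classifier with a given parameter value and for its error on a test set.

```python
import random


def k_folds(items, k, rng):
    """Shuffle and split items into k (learning set, test set) pairs."""
    items = list(items)
    rng.shuffle(items)
    chunks = [items[i::k] for i in range(k)]
    return [([x for j, c in enumerate(chunks) if j != i for x in c], chunks[i])
            for i in range(k)]


def nested_cv(data, values, train, error, k=10, k_inner=10, seed=0):
    """Return the outer-fold errors and the value selected in each fold."""
    rng = random.Random(seed)
    outer_errors, chosen = [], []
    for learn, test in k_folds(data, k, rng):
        inner = k_folds(learn, k_inner, rng)    # inner folds built on L_i only

        def inner_error(value):
            # Mean error of one parameter value over the inner folds.
            return sum(error(train(l, value), t) for l, t in inner) / len(inner)

        best = min(values, key=inner_error)     # inner step
        chosen.append(best)
        outer_errors.append(error(train(learn, best), test))  # outer step
    return outer_errors, chosen
```

Because the selection is repeated inside each outer fold, the list of chosen values may contain different parameter values, as observed above for the number of N-terminal residues.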

8.1.2 Human MetAPs specificity analysis

The human dataset is the only dataset we have that is dedicated to one organism. So, to avoid problems with homologs in other organisms, we focus on H. sapiens. The analyzed tree is the product of learning on the full dataset. We point out that the motifs found during different training runs are close and are combined in similar trees. The tree is illustrated in figure 70.


Figure 70: The motifs-tree predicting the initiator methionine cleavage from the H. sapiens 2012 dataset. This model has been built on the full H. sapiens dataset. Each node of the tree is represented by the motif used for its test. Leaves are represented using a sequence logo made with all the sequences ending in that leaf. The label under a leaf specifies the class corresponding to the prediction made at the leaf and its accuracy on the human dataset (this cannot be used to infer its error on unseen proteins, see table 33). The initiator methionine is always present in all sequences, but is not displayed in the logo as it does not provide any information. Moreover each sequence logo is rescaled according to its highest value (the maximum being 4.32 bits). The branches are labeled with the alignment score condition required on the test to follow the path indicated by the branch. The sequence logo on top illustrates the composition of the H. sapiens proteins extracted from UniProtKB (table 27).


Table 33: Results assessing the quality of the initiator methionine cleavage prediction on the 2012 dataset. Score values are the mean over 10 independent stratified cross-validations, each made with 10 folds. MCC means Matthews correlation coefficient. The score standard deviation between the 10 cross-validations is ≈ 10^-4 for each taxon.

Taxon        Baseline   Accuracy   Sensitivity   Specificity   MCC
Eukaryota    0.72       0.93       0.95          0.89          0.83
Metazoa      0.72       0.94       0.96          0.91          0.86
H. sapiens   0.69       0.95       0.96          0.93          0.89

We analyze the discovered motifs to propose assumptions about the substrates of the enzymes catalyzing the post-translational modification. We study how sequences are split at each node and we try to extract the features that separate the two sets of sequences induced by the split. Note that there are two genes encoding MetAPs in human, MetAP1 and MetAP2 [Bradshaw et al., 1998], but the information about which enzyme catalyzes the cleavage is not known and is not taken into consideration in the model. First of all, we see that the motifs-tree is composed of three motifs, all of them leading to at least one leaf (i.e. a predicted class):

- the sequences that do not contain the signal described by the first motif are classified as not undergoing the initiator methionine cleavage;
- the sequences containing both the first two motif signals are classified as undergoing the initiator methionine cleavage;
- the sequences reaching the last node are classified as not undergoing the initiator methionine cleavage if the signal of the third motif is detected in the sequence.

Hence a match on the first two motifs induces cleavage of the initiator methionine. A match on the first two motifs and a match on the last motif do not induce cleavage. To understand what features are exploited by the motifs-tree to discriminate the sequences, we will focus on the first node. The first motif is described by the following token sequence:

•{S,CHAM830104}{S,CHAM830104}•{F,GARJ730101}••    (98)

and the analysis is split into two steps:

1. scores analysis and identification of discriminant tokens in the motif,

2. identification of position of interest in the amino acids sequences.

We begin by identifying the discriminant tokens (step 1). To do so, we compute the average motif score profile. The profile is computed for a set of sequences aligned on a motif. For a given alignment, each token contributes to the alignment score either by its similarity with the aligned amino acid or by being gapped. If the contributions of each token on each sequence are summed and normalized, we obtain an average motif score profile. Formally, let m = (t1, t2, ..., tk) be a motif of k tokens, and S = {sj} a set of amino acid sequences, sj = (aj1, aj2, ..., ajn). The profile of m on all sequences in S is a vector c = (c1, c2, ..., ck), whose ci are given by:

\[ c_i = \frac{1}{|S|} \sum_{j=1}^{|S|} \sigma(t_i, x_{ji}) \tag{99} \]

153 8 motifs-trees performances and proteomic analysis for h. sapiens

where xj is the aligned sequence, i.e. sj with the alignment gaps. So xji is the i-th symbol of the aligned sequence j, the one aligned with token ti of m. It can be either an amino acid or a gap (σ with a gap always equals the gap penalty γ).

To identify the discriminant tokens in the motif, we compute the profiles for the sequences following the left (cl) and the right (cr) branch and plot the difference cr − cl. A positive difference points to a token that increases, on average, the score of the sequences following the right branch; a negative difference points to a token that increases, on average, the score of the sequences following the left branch. As we want to identify the features contributing to the signal strength, we are interested in the positive differences.

In the case of the first motif, the profile difference emphasizes the discriminant power of the tokens at positions 2 and 3 in the motif (figure 71 (a)). The two tokens are the same, namely {S,CHAM830104}. This property is interesting because it gives the maximum similarity (i.e. 1.0) between Ser and the following amino acids: Ala (A), Cys (C), Gly (G), Pro (P), Thr (T) and obviously Ser (S). It gives a similarity of 0.5 between Ser and the other amino acids, except Leu (L), with which it has no similarity. So it clearly promotes the presence of A, C, G, P, S and T. Regarding the amino acids producing a similarity of 0.5, it is interesting to note that the threshold is 5.4375, which is the maximum alignment score possible with the motif minus 0.5. Losing 0.5 on one of these tokens therefore leaves the score at best equal to the threshold, which is not enough for a match. So the use of this property in the first motif acts as a selector for the amino acids having a similarity score of 1.

Now that we have identified two tokens having an impact on the alignment score, we must identify where, in the protein sequence, the specificity induced by these tokens is discriminant (step 2).
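The profile of equation 99 and the difference cr − cl can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alignment is taken as already computed, each aligned sequence is represented as one symbol (or None for a gap) per token, and `toy_sigma` is a hypothetical stand-in for the similarity function σ, not the similarity used in the thesis.

```python
from typing import Callable, List, Optional, Sequence

GAP_PENALTY = -0.0625  # gap penalty (gamma) quoted in the text

# One aligned sequence x_j: entry i is the residue aligned with token t_i,
# or None when token t_i is gapped.
AlignedSeq = Sequence[Optional[str]]


def score_profile(motif: Sequence[str],
                  aligned: Sequence[AlignedSeq],
                  sigma: Callable[[str, str], float],
                  gap: float = GAP_PENALTY) -> List[float]:
    """Average motif score profile c = (c_1, ..., c_k) of equation 99."""
    if not aligned:
        raise ValueError("need at least one aligned sequence")
    totals = [0.0] * len(motif)
    for x in aligned:
        for i, (token, symbol) in enumerate(zip(motif, x)):
            # A gapped token contributes the gap penalty, otherwise sigma.
            totals[i] += gap if symbol is None else sigma(token, symbol)
    return [t / len(aligned) for t in totals]


def profile_difference(motif, right, left, sigma) -> List[float]:
    """Difference c_r - c_l between right-branch and left-branch profiles."""
    c_r = score_profile(motif, right, sigma)
    c_l = score_profile(motif, left, sigma)
    return [r - l for r, l in zip(c_r, c_l)]


def toy_sigma(token: str, residue: str) -> float:
    """Hypothetical similarity: 1.0 if the residue is in the token's set."""
    return 1.0 if residue in token else 0.0


# Toy data (not from the thesis): a two-token motif, two sequences per branch.
motif = ["ACGPST", "ACGPST"]
right = [["A", None], ["S", "G"]]
left = [["L", None], [None, "K"]]
diff = profile_difference(motif, right, left, toy_sigma)
print(diff)
```

Positive entries of `diff` point to tokens that raise the score of the right-branch sequences, which is how the tokens at positions 2 and 3 are singled out in the text.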
To locate these positions, we rely on a plot showing how many times a token i is aligned with the residue at position j of the sequences following the right branch. This histogram shows that the two tokens of interest are mainly aligned with the second amino acid (the one immediately after the initiator methionine) and, to a lesser extent, with the third amino acid (figure 71 (b)). This rough analysis allows us to conclude that this node splits the protein set based on the presence of an alanine, cysteine, glycine, proline or serine immediately after the initiator methionine. Moreover, as this node leads to a leaf for the sequences in which the signal is not detected, we can observe that the proteins not having those amino acids at the second position do not undergo the initiator methionine cleavage. Therefore the following rule can be proposed: if a sequence starts with M¬[ACGPSTV], the methionine is not cleaved. This has been experimentally observed [Burstein and Schechter, 1978, Meinnel et al., 2005] and is corroborated by our model. Moreover, this rule is compatible with the pattern found by Martinez et al. [2008] (table 1 of that publication). If we take into account only the information regarding the initiator methionine cleavage in the cited publication, we can build the following rule: a match with M¬[ACGPST] for the first two amino acids implies no initiator methionine cleavage.

The same approach can be used to extract information from the other motifs. We summarize the main lines here. First, it is important to remember that we are going through a decision tree, and that the alignments are applied to sequences that have been selected by the preceding motifs. The profile difference of the second motif indicates that the token at position 10 makes a major contribution to producing discriminant alignment scores between proteins (figure 72). The token is [AFIKNQSW] and the histogram of aligned positions shows that it is almost always aligned with the second amino acid

154 8.1 Initiator methionine cleavage

[Figure 71 plots. (a) Average profiles score difference; x-axis: token (position in the motif); y-axis: average alignment score. (b) Left positions histogram and right positions histogram; x-axis: token (position in the motif), 1 to 7; y-axis: alignment scores; legend: residue 1 to residue 6.]

Figure 71: (a) The motif score profile (equation 99) difference between the sequences achieving an alignment score less than or equal to the threshold and those achieving a score greater than the threshold. On this plot we can see that the tokens at positions 2 and 3 in the motif make an important contribution to the alignment scores of the sequences achieving a score higher than the threshold. (b) The normalized histogram of aligned positions illustrates on which positions in the amino acid sequence a token is aligned. The sequences considered to build this histogram are the ones following the right branch after the first motif. The colors of the stack indicate the position in the amino acid sequences. A stack lower than 1.0 reflects that the token is sometimes aligned with sequence gaps. E.g., a stack with a height of 0.4 means the token is aligned with an amino acid for 40 % of the alignments, and is gapped for the remaining 60 %.

in the sequence. But we already know that the sequences reaching this node should carry [ACGPSTV] as the second residue. So we can denoise this token by only considering the intersection between [AFIKNQSW] and [ACGPST], leading to a simplified form of the token: [AS]. So the motif seems to detect the presence of an alanine or a serine at the second position. Another token contributes strongly to the profile difference: token 13, which is a fixed amino acid token for Ser. This token is mainly aligned with the second and third amino acids in the sequences. As a relevant match implies that the sequence undergoes the initiator methionine cleavage, this leads us to propose that sequences starting with M[AS] are cleaved. But the MA sequences are highly represented in the set of sequences having a relevant match with the motif (68 % of the set), and may hide the contribution of other tokens. So we removed those sequences from the protein set and produced a new profile difference. These new profiles emphasize the contribution of the second token in the motif, which is a fixed amino acid token for Pro and is always aligned with the second amino acid in the sequence. So, considering the preceding motif and the information provided by the tokens at positions 2, 10 and 13, we can conclude that proteins starting with M[APS] undergo initiator methionine cleavage.

Therefore proteins reaching the last motif should be mainly composed of sequences starting with M[CGTV]. Again, the profile difference indicates that token 9 ([EK]) has the greatest difference in the profiles. The histogram shows that it is always aligned with the fifth residue (figure 73). This is a very interesting feature, because it shows that the MetAPs activity is influenced not only by the amino acids at the second and third positions in the sequence, but also by amino acids farther along the sequence.
In this case it is also interesting to note that these two amino acids are charged. Token 3 selects [AINSVW] and is frequently aligned with the second residue, hence it can also be denoised with M[CGTV] to produce M[V]. But this accounts for only approximately 55 % of the alignments. In approximately 30 % of the cases, the second residue is instead aligned with token 4 ([ETLPW]), which can be simplified in the same way to MT. We conclude that a protein starting with M[TV] does not undergo initiator methionine cleavage if a glutamic acid or a lysine is present at position 5 of the sequence.

We have extracted simplified rules for each motif. These rules can be combined to produce the sequence requirements for the initiator methionine cleavage to occur or not:

- a match with M[ACGPS] implies that initiator methionine cleavage occurs;
- a match with M[TV]..¬[EK] implies that initiator methionine cleavage occurs;
- otherwise the protein does not undergo initiator methionine cleavage.

To conclude this analysis, we check whether the simplified rules perform as well as the motif tree. The rules produce a Matthews correlation coefficient (MCC) of 0.94, while the full motif tree produces an MCC of 0.97. This is a very good result: we have been able to use the model to infer good rules that allowed us to find the sequence requirements for initiator methionine cleavage in H. sapiens. As the motif tree produces better classification scores than the inferred rules, it also shows that the motif tree is sensitive to subtle features that are hardly detectable by a human. Our model, and the tools proposed to analyze it, lead us to draw conclusions on human MetAPs substrates that are very similar to the experimental results published


[Figure 72 plots. (a) Average profiles score difference; x-axis: token (position in the motif); y-axis: average alignment score. (b) Left positions histogram and right positions histogram; x-axis: token (position in the motif), 0 to 20; y-axis: alignment scores; legend: residue 1 to residue 6.]

Figure 72: (a) The motif score profile difference between the sequences achieving an alignment score less than or equal to the threshold and those achieving a score greater than the threshold. On this plot we can see that the tokens at positions 10 ([AFIKNQSW]) and 13 (S) in the motif make an important contribution to the alignment scores. (b) The normalized histogram of aligned positions. The token at position 10 is gapped in only 20 % of the alignments when the motif is detected. When the motif is not detected, it is gapped in 82 % of the alignments.


[Figure 73 plots. (a) Average profiles score difference; x-axis: token (position in the motif); y-axis: average alignment score. (b) Left positions histogram and right positions histogram; x-axis: token (position in the motif), 1 to 10; y-axis: alignment scores; legend: residue 1 to residue 6.]

Figure 73: (a) The motif score profile difference between the sequences achieving an alignment score less than or equal to the threshold and those achieving a score greater than the threshold. On this plot we can see that the tokens at positions 3 ([AINSVW]) and 9 ([EK]) in the motif make an important contribution to the alignment scores. (b) The normalized histogram of aligned positions. The token at position 9 is almost never gapped (less than 10 % in both sets of alignments). However, its contribution is very different in the two sets, which suggests a misalignment when the motif is not detected.

158 8.2 N-terminal acetylation

Table 34: Results obtained by a 10-fold cross-validation assessing the quality of the Nα-terminal acetylation prediction for the selected N-terminus lengths (N-ter. len.) on the 2012 datasets. MCC stands for Matthews correlation coefficient.

Taxon        N-ter. len.   Accuracy   Sensitivity   Specificity   MCC
Eukaryota    6             0.88       0.91          0.82          0.72
Eukaryota    7             0.86       0.88          0.82          0.70
Eukaryota    8             0.87       0.89          0.81          0.71
Eukaryota    9             0.89       0.89          0.81          0.70
Metazoa      6             0.88       0.92          0.79          0.71
Metazoa      7             0.88       0.91          0.79          0.70
Metazoa      8             0.89       0.92          0.81          0.72
Metazoa      9             0.89       0.93          0.79          0.72
Metazoa      10            0.87       0.88          0.78          0.69
H. sapiens   6             0.90       0.94          0.67          0.57
H. sapiens   7             0.92       0.95          0.67          0.63
H. sapiens   8             0.90       0.94          0.62          0.57
H. sapiens   9             0.89       0.95          0.56          0.53

in literature [Burstein and Schechter, 1978, Meinnel et al., 2005, Frottin et al., 2006, Martinez et al., 2008, Xiao et al., 2010].
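The token denoising and the simplified cleavage rules extracted in this section can be written down compactly. The sketch below is an illustration only: the function names are hypothetical, the rules are transcribed as regular expressions (¬[EK] becomes `[^EK]`), and the example sequences are made up rather than taken from the dataset.

```python
import re


def denoise(token_set: str, allowed: str) -> str:
    """Keep only the residues of a token that can occur at that position."""
    return "".join(sorted(set(token_set) & set(allowed)))


# Simplified rules for initiator methionine cleavage, as listed in the text.
CLEAVED_PATTERNS = [
    re.compile(r"^M[ACGPS]"),       # M[ACGPS]: cleavage occurs
    re.compile(r"^M[TV]..[^EK]"),   # M[TV]..¬[EK]: cleavage occurs
]


def imet_cleaved(sequence: str) -> bool:
    """True if the simplified rules predict initiator methionine cleavage.

    A sequence shorter than five residues cannot match the second rule
    and falls through to the default (no cleavage).
    """
    return any(p.match(sequence) for p in CLEAVED_PATTERNS)


# Denoising steps described in the text.
print(denoise("AFIKNQSW", "ACGPST"))  # token 10 of the second motif
print(denoise("AINSVW", "CGTV"))      # token 3 of the third motif
print(denoise("ETLPW", "CGTV"))       # token 4 of the third motif

# Made-up N-termini, for illustration only.
for seq in ("MASDEF", "MTAAAG", "MTAAEG", "MLKDEF"):
    print(seq, imet_cleaved(seq))
```

Writing the rules as an ordered list of patterns with a default class keeps the same shape as the three-item rule list, which makes it easy to compare their predictions with those of the full motif tree.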

8.2 N-terminal acetylation

Now we discuss the use of the motifs-trees to predict the Nα-terminal acetylation and present their performance scores. Following the success of the analysis of the human MetAPs specificities, we apply the same approach to the analysis of the human motifs-tree for Nα-terminal acetylation prediction.

8.2.1 Classification performance

We have discussed the choice of the number of N-terminal residues taken into account, and applied a nested cross-validation to obtain a (pessimistic) estimation of the performance of our approach and a selection of potentially good values (see section 8.1.1). To confirm the choice of the number of N-terminal residues, we evaluated the performance of the classifier for the selected N-terminal lengths. The classification performances are summarized in table 34. For Eukaryota and Metazoa the scores are quite similar, and for computational reasons the smallest length is selected. In the case of human, the classifier using 7 N-terminal residues seems to offer better performance. However, we can question the significance of the differences between the classifiers (table 34). To do so, we group the classification results on each Ti of each fold and use them to build a vector of successes and failures: if the prediction is correct there is a one in the vector, otherwise a zero. When this is done for each N-terminus length, it is possible to count the numbers of agreements and disagreements between two classifiers. Then we apply McNemar's test [McNemar, 1947]. This test allows us to decide, for a given p-value threshold (usually 0.05), whether two classifiers offer significantly different performance. The test has been applied to each pair of selected numbers of residues for each taxon (table 35). The results of the statistical tests indicate that the


Table 35: McNemar's tests to assess whether classifiers differ in classification performance for the Nα-terminal acetylation on the 2012 datasets. The only significantly different classifiers, with a p-value ≤ 0.005, are those using N-termini of length 6 and 8 in Eukaryota. Rows and columns give the two N-terminus lengths compared; a dash marks a pair that is not reported.

Taxon        Len.   7       8       9       10
Eukaryota    6      0.151   0.001   0.204   -
Eukaryota    7      -       0.757   0.916   -
Eukaryota    8      -       -       0.876   -
Eukaryota    9      -       -       -       -
Metazoa      6      0.379   0.345   0.809   0.854
Metazoa      7      -       1       0.255   0.534
Metazoa      8      -       -       0.208   0.485
Metazoa      9      -       -       -       0.624
H. sapiens   6      0.162   0.721   0.501   -
H. sapiens   7      -       0.349   0.038   -
H. sapiens   8      -       -       0.301   -

Table 36: Performance for Nα-terminal acetylation prediction by TermiNator3 on the 2012 datasets. The predictions made with TermiNator3 are equivalent to the ones obtained by the motifs-trees with the 10-fold cross-validations. MCC stands for Matthews correlation coefficient.

Taxon        Accuracy   Sensitivity   Specificity   MCC
Eukaryota    0.87       0.92          0.77          0.71
Metazoa      0.88       0.92          0.80          0.72
H. sapiens   0.90       0.91          0.82          0.63

only significantly different classifiers, with a p-value ≤ 0.005, are those using N-termini of length 6 and 8 in Eukaryota. Thus choosing the smallest number of N-terminal residues (i.e. six residues) is a reasonable choice. The performance of the motifs-trees is very good with Eukaryota and Metazoa. With H. sapiens the performance is lower because of a lower specificity with this organism. This is probably because the baseline of this dataset is quite high and because the non-Nα-acetylated proteins form more particular instances from which it is harder to generalize during the learning phase. Moreover, we are on par with TermiNator3 when we compare our cross-validated performance with their results (table 36).
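The comparison of two classifiers from their success/failure vectors can be sketched as follows. The text does not say which variant of McNemar's test was used; this sketch assumes the exact two-sided binomial form, which depends only on the two disagreement counts. The function name and the toy vectors are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(success_a: Sequence[int],
                  success_b: Sequence[int]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired success (1) / failure (0) vectors.

    Returns (b, c, p_value), where b counts the instances classifier A gets
    right and B gets wrong, and c counts the opposite disagreements.
    """
    if len(success_a) != len(success_b):
        raise ValueError("both vectors must cover the same instances")
    b = sum(1 for a, o in zip(success_a, success_b) if a == 1 and o == 0)
    c = sum(1 for a, o in zip(success_a, success_b) if a == 0 and o == 1)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the classifiers never disagree
    # Under the null hypothesis each disagreement goes either way with
    # probability 1/2, so min(b, c) follows a Binomial(n, 0.5) tail.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


# Toy vectors (not thesis data): one entry per test instance.
len6 = [1, 1, 1, 1, 0, 1, 1, 0]
len8 = [1, 0, 1, 0, 0, 0, 1, 1]
result = mcnemar_exact(len6, len8)
print(result)
```

Only the instances on which the two classifiers disagree influence the p-value, which is why the agreements are counted but not used in the test statistic.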

8.2.2 NatB and NatC potential substrates

Before the publication of the motifs-trees [Charpilloz et al., 2014], the best available predictor was TermiNator3. It achieves good performance, but we have shown that the motifs-trees outperform TermiNator3. Moreover, the motifs-tree is the first predictor that achieves better-than-random predictions in cross-validation for potential NatB and NatC substrates. The patterns proposed in [Martinez et al., 2008] and refined in [Bienvenut et al., 2012] have the drawback of producing random predictions for the potential substrates of NatB and NatC. As introduced in chapter 1, several enzymes catalyze the Nα-Ac; however, the information regarding which Nat processes a given substrate is rarely available. But by looking at the known


Table 37: Cross-validated scores obtained by the Eukaryota classifiers versus TermiNator3 on the sequences potentially acetylated by NatB or NatC (i.e. matching the proposed substrates) in the 2012 dataset.

Pot. sub.   Nb. seq.   N-Acet.   Predictor     Acc.   Sen.   Spe.   MCC
NatB        384        0.91      motifs-tree   0.89   0.93   0.39   0.31
NatB        384        0.91      TermiNator3   0.90   1.00   0.00   0.00
NatC        100        0.38      motifs-tree   0.66   0.55   0.73   0.28
NatC        100        0.38      TermiNator3   0.64   0.00   1.00   0.00

potential specific sequences, authors have proposed requirements to identify the Nat catalyzing the acetylation from the first two amino acids. See table 7 for NatB and NatC, or [Polevoda et al., 2009].

Unfortunately, the number of experimentally identified substrates of those specific Nats is small. To estimate the capability of our classifiers with regard to NatB and NatC substrates, we built two new datasets: one for potential NatB substrates and one for potential NatC substrates. From the Eukaryota dataset, all proteins matching the theoretical requirements for NatB or NatC are considered as potential substrates and extracted into those new datasets. The proteins are extracted with their original class (i.e. Nα-acetylated or not Nα-acetylated), because not all proteins matching the substrate requirements are acetylated.

We applied a 10-fold CV on the whole Eukaryota dataset. Then we measured the performance of the model only on the potential NatB and NatC substrates. The results in table 37 show that the patterns used in TermiNator3 are too stringent (see chapter 5 and Martinez et al. [2008]). Indeed, the results show that if a sequence starts like the proposed NatB substrates, it is always classified as Nα-acetylated (the sensitivity is 1.0 and the specificity is 0.0). For the sequences starting like the known NatC substrates it is the opposite (a sensitivity of 0.0 and a specificity of 1.0), indicating that all these sequences are classified as not Nα-acetylated. Therefore, in both cases the MCC obtained is 0.0, meaning that in this case TermiNator3 performs no better than random prediction. The patterns used by TermiNator3 take into account, at most, the first three amino acids [Martinez et al., 2008], but the information provided by these three amino acids is probably insufficient to decide whether a protein undergoes Nα-Ac. Our model takes into account the first six amino acids and produces a cross-validated MCC higher than 0.0, therefore it performs better than random. So our model has been able to find specificities between proteins undergoing Nα-Ac, as shown by the increase of specificity in the case of the NatB substrates (+0.39) and the increase of sensitivity in the case of the NatC substrates (+0.55).

We also tested our predictor on the 7 experimentally identified substrates of NatB and C in H. sapiens [Starheim et al., 2008, 2009]. As shown in table 38, all substrates were correctly predicted by the motifs-tree. This shows that our model is able to discover subtle features specific to those proteins, even when they account for less than 20 % of the whole dataset. The error on unseen proteins is on par with TermiNator3, and when the models are trained and evaluated on the complete 2012 dataset, the motifs-trees offer excellent performance (table 39).
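The link between a sensitivity of 1.0 with a specificity of 0.0 (or the reverse) and an MCC of 0.0 follows directly from the definition of the Matthews correlation coefficient. The sketch below uses the standard formula with the usual convention that the MCC is 0 when the denominator vanishes; the confusion-matrix counts are illustrative and are not the counts of the thesis datasets.

```python
from math import sqrt


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: a classifier that outputs a single class (or a dataset
        # with a single class) gets an MCC of 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Illustrative counts only.
always_positive = mcc(tp=90, fp=10, tn=0, fn=0)   # sensitivity 1.0, specificity 0.0
always_negative = mcc(tp=0, fp=0, tn=60, fn=40)   # sensitivity 0.0, specificity 1.0
perfect = mcc(tp=50, fp=0, tn=50, fn=0)
print(always_positive, always_negative, perfect)
```

A predictor that assigns the same class to every sequence can still reach a high accuracy on an unbalanced set, which is why the MCC is the more informative score for the potential NatB and NatC substrates.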


Table 38: Predictions of acetylated proteins with known Nats using the Terminus H. sapiens classifier. The sequences are displayed from their initiator methionine along with the NAT catalyzing the Nα-terminal acetylation. The sequences marked with a † were included in the training.

UniProt ID   Taxon        Sequence   NAT   motifs-tree   TermiNator3
Q04206       H. sapiens   MDELFPL    B     Ac-M(1)       Ac-M(1)
Q9NVJ2†      H. sapiens   MLALISR    C     Ac-M(1)       M(1)
P42345       H. sapiens   MLGTGPA    C     Ac-M(1)       M(1)
P31943†      H. sapiens   MLGTEGG    C     Ac-M(1)       M(1)
P52597†      H. sapiens   MLGPEGG    C     Ac-M(1)       M(1)

Table 39: Performance of the motifs-trees when the models are built on the complete 2012 datasets to predict Nα-terminal acetylation. We also show for comparison the results of TermiNator3. Obviously the motifs-trees outperform TermiNator3 in this setup. MCC stands for Matthews correlation coefficient.

Taxon        Predictor     Accuracy   Sensitivity   Specificity   MCC
Eukaryota    motifs-tree   0.99       0.98          0.99          0.97
Eukaryota    TermiNator3   0.87       0.92          0.77          0.71
Metazoa      motifs-tree   0.99       1.0           0.96          0.96
Metazoa      TermiNator3   0.88       0.92          0.80          0.72
H. sapiens   motifs-tree   0.99       0.99          0.96          0.96
H. sapiens   TermiNator3   0.90       0.91          0.82          0.63


Table 40: Results assessing the quality of the algorithm in predicting Nα-terminal acetylation based on the dataset provided in [Martinez et al., 2008] and by taking into account only the first three N-terminal residues. The validation is made on our Eukaryota 2012 dataset. BL means baseline.

Predictor     BL     Accuracy   Sensitivity   Specificity   MCC
Motifs-tree   0.63   0.83       0.93          0.64          0.62
TermiNator3   0.63   0.85       0.92          0.77          0.70

8.2.3 Can a motifs-tree learn like Martinez et al.?

The TermiNator3 motifs achieve good performance, but it is hard to compare the performance scores of motifs discovered by machine learning with rules provided by human experts. We are not sure of the dataset used to build the motifs, but after contacting the authors of [Bienvenut et al., 2012], it seems that the motifs were mainly built from the dataset provided in the supporting information of [Martinez et al., 2008]. However, the authors also specified that the motifs were updated according to the study in [Bienvenut et al., 2012], and it is unclear which part of their sample was used to update the rules. In conclusion, we suppose that their dataset was composed of a total of 297 proteins or fragments, of which 235 are Nα-acetylated and 62 are not. We call this dataset the TermiNator dataset.

Although this dataset is very small, we decided to do the following experiment. We train a motifs-tree with only this small dataset, using the first three residues, and we evaluate the produced tree on our Eukaryota dataset. The idea is to have the same setup as the authors of Martinez et al. [2008] and Bienvenut et al. [2012]; we also evaluate TermiNator3 on the Eukaryota dataset. In other words, this is the man versus the machine. The obtained results are in table 40. These results are means over 10 training and test runs. The standard deviation of the results is of the order of 10^-2. The first observation is that, in both cases, the scores are good and well above the baseline of the Eukaryota dataset. But the machine learning algorithm has difficulties coping with only 62 negative (non-Nα-acetylated) proteins: the specificity is lowered from 0.77 to 0.64 (-0.13). So the experts have a better capability to generalize from such a small negative dataset. However, we want to stress that the machine learning algorithm has no a priori knowledge whatsoever about the phenomenon of Nα-terminal acetylation.
The experts have knowledge about the enzymes that catalyze the reactions (the NATs). They have knowledge about some more or less basic properties of the amino acids that compose the sequences. For example, experts know that the active site of one of the enzymes is small and that alanine is a very small amino acid. The motifs-tree has no a priori knowledge before starting to learn from the sequences. That is why, despite the lower specificity, we find the results very good in comparison with the experts. We were also surprised by the complexity of the motifs in the motifs-trees trained on the TermiNator dataset. For example, the following motif is the root of one of the built motifs-trees:

    ¬[ETGVQPK]HD{T,COSI940101}R{Y,KRIW790102}D[EM]{D,WILM950104}    (100)


Table 41: Motifs used in the H. sapiens motifs-tree for Nα-terminal acetylation prediction. The node column corresponds to the node path and label in the tree, see figure 74. The threshold (thresh.) is also shown and corresponds to the score that an alignment must exceed to be considered as a match with the motif. We also recall that the gap penalty used is -0.0625 and that we are aligning six residues.

Node   Thresh.   Motif
Root   5.204     [GVL]E{P,QIAN880129}{P,QIAN880129}{R,MITS020101}KPR••••
0      5.040     {S,AURR980120}{Y,FAUJ880110}DN{E,CHOP780207}¬[NYILKRD]SQ¬[YAK]¬[LKR]EQEA[AD]•
1      3.640     [TM]P[NYFMQLCK]RM{I,HOPA770101}KC[NYFMQLCK]{I,LEVM780102}¬[YAMPHWD]
01     5.690     M¬TDK{K,DESM900101}{Q,RACS820105}[EA]•¬[EL]{N,RICJ880110}
10     5.054     •[ENYQCD]{S,FINA910101}S¬[QWK]{I,COHE430101}[GQL]•{L,NAKH900112}•{V,CHOP780209}NP{Y,ROBB760107}[TV]{P,PRAM820103}
100    5.483     •[GL][NTAIDS]{E,CHOC760102}[MLK][FIGQ][RD]{F,FUKS010108}¬[AIQ]
1000   5.181     [TA]•[TGD][EYI]•[TAD][EIG]C{S,NAKH900101}P{S,NAKH900101}V

This motif seems rather complicated in comparison with the motifs used by TermiNator3. Such motifs may be more efficient for more subtle cases; therefore we evaluated the performance of the motifs-trees only on the potential NatB and NatC substrates. The purpose was to evaluate whether, despite their lower specificity, the motifs-trees have the advantage of being less stringent than TermiNator3. Regarding the NatB substrates, the motifs-trees perform on average no better than TermiNator3 (i.e. an MCC close to zero on average). However, in the case of the NatC substrates, the prediction is not random: on average it achieves a score of 0.23, with a standard deviation of 0.08. This is still better than TermiNator3, and it emphasizes again that the motifs-tree is able to find very subtle features in the sequences that are very hard to detect for humans, even for experts.

8.3 Human NATs specificity analysis

Now we study a motifs-tree that predicts the Nα-terminal acetylation in H. sapiens. The construction of the tree is based on the 2012 dataset. As in our previous study of initiator methionine cleavage, we limit ourselves to this organism, as the other datasets correspond to taxa that contain several organisms. In this case we are more likely to propose features explaining what drives, or not, the Nα-terminal acetylation without being biased by homologs in other organisms. Moreover, this set was the largest single-organism dataset obtained from UniProtKB. The tree is illustrated in figure 74 and its motifs are listed in table 41. The analysis is harder in the case of this tree, as the motifs are more complicated. About 75 % of the proteins with a free N-terminus

164 8.3 Human NATs specificity analysis

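[Figure 74 diagram: a tree whose internal nodes include the root and the nodes labeled 1 and 0. Leaves, in reading order (class, correctly predicted / reaching the leaf): Acet 813/815; Acet 40/44; NotAcet 24/28; NotAcet 122/122; Acet 6/6; Acet 230/230; Acet 23/23; NotAcet 18/21.]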

Figure 74: The motifs-tree for the prediction of Nα-terminal acetylation in H. sapiens. This model has been built on the complete H. sapiens 2012 dataset. For clarity, the motifs in the nodes are detailed in table 41. The node labels (the circles) identify the motifs in table 41. The leaf labels indicate the class (i.e. the prediction) and the number of instances from the H. sapiens dataset that reach the leaf. The ratio in each leaf gives the number of correctly predicted instances over the number of instances reaching the leaf.

fall in the same leaf. The tree can therefore be pruned while still producing an acceptable score (see footnote 3). Hence, for this analysis, we consider the pruned tree illustrated in figure 75. Thus we only study the following motifs:

    [GVL]E{P,QIAN880129}{P,QIAN880129}{R,MITS020101}KPR••••    (101)

and the motif

    {S,AURR980120}{Y,FAUJ880110}DN{E,CHOP780207}¬[NYILKRD]SQ¬[YAK]¬[LKR]EQEA[AD]•    (102)

8.3.1 The root motif

We start with the root motif (equation 101). If we study the alignment score pattern, we can see that there are only a few positions or tokens of interest (figure 76). Indeed, almost all of the alignments have their last four amino acids aligned with the last four tokens, which are any-amino-acid tokens. Hence, out of a total of six amino acids, we only have to focus on the first two N-terminal residues. Roughly, we have two kinds of alignment. The first one is when a perfect match on only the first residue is enough to achieve a high alignment score. This is the case when the sequence starts with Gly (G), Leu (L), Val (V) or Glu (E). The other kind of alignment is when the score has to be accumulated from two of the chemical similarity tokens at positions 3, 4 and 5. These correspond to {P,QIAN880129} (twice) and {R,MITS020101} (equation 101). The descriptions of the chemical properties are provided in table 42. We also note that having a Gly, Leu or Val as the first residue, or a Lys (K) or Arg (R) as the second residue, is specific to the proteins matching the motif.

We can first split the proteins into two categories: the ones undergoing initiator methionine cleavage and the ones that do not. The latter are characterized by protein sequences starting with a methionine. With a methionine as the first residue, it has to be aligned with one of the tokens at positions 3, 4 or 5, as the first two tokens produce a score of zero for the methionine. Hence, when a sequence starts with an M followed by any amino acid except K (Lys), P (Pro) or R (Arg), it is aligned with {P,QIAN880129}; otherwise the next amino acid cannot be aligned (i.e. it is gapped). So the next residue is aligned with {P,QIAN880129} or {R,MITS020101}. To achieve a score over the threshold, the alignment of these two residues must produce a score greater than 1.5789. This is because the threshold is 5.2039, there are four any-amino-acid tokens at the end of the motif, and six gaps must be inserted (see footnote 4).
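The value 1.5789 can be checked with a short computation. The sketch below assumes that an any-amino-acid token always scores 1.0, that the best similarity a token can give is 1.0, and that every token not aligned with a residue is gapped; the function names are hypothetical. Under the same assumptions it also reproduces the threshold of the first motif of section 8.1 (equation 98), which is the maximum score minus 0.5.

```python
GAP_PENALTY = -0.0625     # fixed gap score quoted in the text
ANY_TOKEN_SCORE = 1.0     # assumption: an any-amino-acid token always scores 1.0


def max_alignment_score(n_tokens: int, n_residues: int,
                        gap: float = GAP_PENALTY) -> float:
    """Best possible score when every residue matches perfectly and every
    token left without a residue is gapped."""
    n_gaps = n_tokens - n_residues
    return n_residues * 1.0 + n_gaps * gap


def score_needed(threshold: float, n_any_tokens: int, n_gaps: int,
                 gap: float = GAP_PENALTY) -> float:
    """Score the remaining residues must exceed to pass the threshold,
    once the any-amino-acid tokens and the gaps are accounted for."""
    return threshold - n_any_tokens * ANY_TOKEN_SCORE - n_gaps * gap


# Root motif (equation 101): 12 tokens, 6 residues, so 6 gaps; the last four
# tokens are any-amino-acid tokens.
root_needed = score_needed(5.2039, n_any_tokens=4, n_gaps=6)
print(round(root_needed, 4))

# First motif of section 8.1 (equation 98): 7 tokens, 6 residues.
first_motif_threshold = max_alignment_score(7, 6) - 0.5
print(first_motif_threshold)
```

The remaining score has to come from the first two residues, which is why the analysis that follows only looks at which residue pairs can exceed it.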
The methionine produces a score of 0.618, thus the other residue must produce a score over 0.961. With {P,QIAN880129} this is only achievable with I (Ile) and P (Pro). With {R,MITS020101} this is not relevant, because it is only achievable with an R (Arg) and the token at position 8 already imposes the presence of this amino acid. Consequently, if a sequence starts with a methionine, the motif is detected only if the sequence corresponds to the proposed motif M[IPKR]. The other proteins that do not undergo methionine cleavage are considered as Nα-acetylated in this simplified motifs-tree. Next, for proteins that undergo initiator methionine cleavage, we start by focusing on proteins starting with an alanine. Indeed, alanine is known to

3. The score of the pruned tree is still over the baseline.
4. We recall that the gaps have a fixed score of −0.0625 in our model.


[Figure 75 diagram: the pruned tree, with sequence logos (y-axis in bits) drawn for the sequences ending in each leaf; the leaves are labeled Acetylated, Not acetylated and Acetylated.]

Figure 75: Manually pruned motifs-tree for Nα-terminal acetylation prediction in H. sapiens. Leaves are represented using a sequence logo made with all the sequences ending in that leaf. The label under a leaf specifies the class corresponding to the prediction made at the leaf. The scores of this tree on the H. sapiens dataset are an accuracy of 0.96, a sensitivity of 0.99 and a specificity of 0.71. These scores are the performance on the dataset itself. For an estimation of the error, see the scores in table 34.


Table 42: Description of the physico-chemical property tokens used. The table contains the reference amino acid to which the aligned residue should be close, the index in the database and the description [Kawashima and Kanehisa, 2000].

Token            Close to    Index        Description
{R,MITS020101}   Arginine    MITS020101   Amphiphilicity index [Mitaku et al., 2002]
{P,QIAN880129}   Proline     QIAN880129   Weights for coil at the window position of -4 [Qian and Sejnowski, 1988]
{S,AURR980120}   Serine      AURR980120   Normalized positional residue frequency at helix termini C4' [Aurora and Rosee, 1998]
{Y,FAUJ880110}   Tyrosine    FAUJ880110   Number of full nonbonding orbitals [Fauchère et al., 1988]
{E,CHOP780207}   Glutamate   CHOP780207   Normalized frequency of C-terminal non helical region [Chou and Fasman, 1978]

Table 43: Amino acid frequency for the second residue in the H. sapiens proteome, extracted from the 2015_07 UniProtKB release. Only the proteins sequenced from the first residue and starting with an initiator methionine are taken into account (146 704 entries). AA stands for amino acid and Freq. for frequency.

AA   Freq.    AA   Freq.    AA   Freq.    AA   Freq.    AA   Freq.
MA   0.127    MC   0.008    MD   0.036    ME   0.056    MT   0.051
MF   0.017    MG   0.045    MH   0.008    MI   0.013    MW   0.009
MK   0.030    ML   0.042    MM   0.013    MN   0.033    MV   0.030
MP   0.040    MQ   0.017    MR   0.037    MS   0.068    MY   0.007


[Figure 76: three panels. Top: average profiles score difference (average alignment score against token position in the motif). Bottom: left positions histogram and right positions histogram (alignment scores of residues 1 to 6 against token position in the motif).]

Figure 76: The average score differences and histograms of aligned positions of the root motif of the motifs-tree predicting Nα-terminal acetylation in H. sapiens. The histograms show that it is only required to study the first two residues in the sequences. The average score differences also show that having a G, L or V as the first residue, or a K or R as the second residue, is very specific to sequences matching the motif.


be a very common residue at the second position [Tats et al., 2006]. To confirm this, we have extracted the human proteome from the 2015_07 UniProtKB release and only considered the proteins sequenced from the first residue and starting with an initiator methionine. In the human proteome we have measured that an alanine as the second residue accounts for more than 12 % of the proteins (table 43). Thus we now study how these sequences are split by the motif. This residue is only aligned with {P,QIAN880129}, producing a score of 0.066, or with {R,MITS020101}, producing a score of 0.65. Thus it can only achieve a score over the threshold if the next residue is a Lys (K), Pro (P) or Arg (R): A[KPR]. The other proteins starting with an alanine are considered as acetylated.

Finally we characterize the pairs of amino acids that produce an alignment score high enough, except the ones that start with Gly, Val, Leu, Glu, Met or Ala, as we have already characterized those sequences. The pairs that produce high scores in the motif are: [CHQW][EHKQR], [FIP]¬W, K¬[YW], [NR]R and Y[EHR]. Some of these combinations are rarely found in nature (see table 43). Indeed, Cys (C), His (H), Tyr (Y) and Trp (W) each account for less than 1 % in the human proteome, and Gln (Q), Phe (F), Ile (I) and Met (M) for less than 2 % each. So we are more interested in the following motifs as matching the sequences: [PK]¬W, K¬Y and [NR]R. Finally, when a Lys (K), Pro (P) or Arg (R) is the second residue, the motif is considered detected in the sequence: •[KPR]. The rest is considered as acetylated. All these extracted features are summarized in table 44.

Table 44: Simplified rules of the second motif in the simplified motifs-tree for Nα-terminal acetylation in H. sapiens. The motifs that match the same set of sequences have been merged (e.g. A[KPR] with •[KPR]).

W/o init. Met cleavage    With init. Met cleavage
M[IKPR]                   •[KPR], [GLVE]•, P¬W, K¬[WY]
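The frequency argument used to discard the rare pairs can be checked against table 43. The following is a minimal sketch: the dictionary copies the values of table 43, while the helper name `residues_below` and the way the two cutoffs are applied are our own illustrative choices.

```python
# Frequencies of the residue following the initiator methionine,
# copied from table 43 (key = second residue of the sequence).
SECOND_RESIDUE_FREQ = {
    "A": 0.127, "C": 0.008, "D": 0.036, "E": 0.056, "T": 0.051,
    "F": 0.017, "G": 0.045, "H": 0.008, "I": 0.013, "W": 0.009,
    "K": 0.030, "L": 0.042, "M": 0.013, "N": 0.033, "V": 0.030,
    "P": 0.040, "Q": 0.017, "R": 0.037, "S": 0.068, "Y": 0.007,
}


def residues_below(freq, cutoff):
    """Return the residues whose frequency is strictly under the cutoff."""
    return sorted(aa for aa, f in freq.items() if f < cutoff)


# Residues under 1 %, then residues between 1 % and 2 %.
rare = residues_below(SECOND_RESIDUE_FREQ, 0.01)
uncommon = [aa for aa in residues_below(SECOND_RESIDUE_FREQ, 0.02)
            if aa not in rare]

print("under 1 %:", rare)
print("between 1 % and 2 %:", uncommon)
print("alanine:", SECOND_RESIDUE_FREQ["A"])
```

With the values of table 43, the first group contains C, H, W and Y and the second F, I, M and Q, which agrees with the residues named in the text.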

8.3.2 The second motif

The profile of the second motif is more difficult to understand than the previous one. The first observation is that a match implies acetylation; otherwise the sequence is classified as not acetylated. Then we can see in figure 77 that all the alignments align the first three residues on the first 10 tokens. Moreover, approximately 85 % of the first two residues are aligned on the first six tokens, with the first residue aligned on the first token in 85 % of the alignments. In this case the first token is paired with the token at position 2 (48 %), 3 (9 %), 4 (2 %), 5 (11 %) or 6 (18 %). As the threshold is 5.040, the score to reach for the first two residues is 1.67 (see footnote 5). For the combination with the tokens 2 and 5, we start by studying the sequences that start with a Met (M) or an Ala (A). The pairs that produce a significant score give the following motifs: [MA][ACEILMQSTY] or A[DFHKNR]. But in the case of the methionine it has to be intersected with M[IPKR] from the previous motif. Thus only MI can produce a score high enough for motif detection, in contrast to M[PKR]. In the case of alanine it has to be intersected with K (Lys), P (Pro) and R (Arg), thus A[KR].

5. There are 10 gaps to insert and, in a correct alignment, the token at position 6 produces a binary score (zero or one, hence a score of 4); the score to reach is therefore 5.040 + 10 · 0.0625 ≈ 1.67 + 4.


Table 45: Scores achievable by lysine, proline and arginine when aligned on one of the physico-chemical property tokens present in equation 102.

Amino acid   {S,AURR980120}   {Y,FAUJ880110}   {E,CHOP780207}
Lysine       0.72             0.75             0.47
Proline      0.34             0.5              0.42
Arginine     0.72             0.75             0.61

We also have, from the previous motif, the sequences starting like •[KPR]. As one of the residues of the pair is a Lys (K), Pro (P) or Arg (R) and the pair must produce a score of at least 1.67, only a few amino acids can produce a score over the threshold. Indeed, when aligned on any of the three physico-chemical property tokens (motif positions 1, 2 and 5), only the following pairs are possible: [MACS][KR]. When a proline is involved, there is no pair that can produce a score high enough. In table 45 we show the scores achievable by those three amino acids with the physico-chemical property tokens. We quickly see that with some properties the threshold is unreachable, as it would need a token producing a score over one. Moreover an alignment with {S,AURR980120} is not favorable because the first residue would be gapped, and would therefore produce a score lower than a misalignment. As the methionine and alanine have already been discussed and as the cysteine has been considered as rare at that position, the new contribution is simplified to S[KR].

Next we focus on the sequences starting with a G (Gly), V (Val), L (Leu) or E (Glu). The Glu can never produce a significant score in this case, unless the five other residues provide a score of 5 in the fixed amino acid part, that is to say unless the Glu is aligned with the token 5 (best possible match). For the other amino acids the conditions are: [GVL][AEMSTY], [GL]Q or L[CIL]. Otherwise the sequence is not acetylated. After this analysis it is already clear that producing a simpler construction based on simple motifs will be more complex than in the initiator methionine cleavage case. With the next tokens, the motif is detected in the sequence only when there is no misalignment.
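The threshold arithmetic of footnote 5 and the reading of table 45 can be reproduced with a short sketch. The numerical values come from the text and from table 45; the upper bound of one per token is taken from the remark that the threshold would "need a token producing a score over one", and the variable names are illustrative assumptions.

```python
THRESHOLD = 5.040       # detection threshold of the second motif (text)
GAP_SCORE = -0.0625     # fixed gap score (footnote 4)
N_GAPS = 10             # number of gaps to insert (footnote 5)
BINARY_PART = 4.0       # score of the binary part (footnote 5)
MAX_TOKEN_SCORE = 1.0   # assumption: one token scores at most one

# Score that the first two residues must reach together (about 1.67).
pair_target = THRESHOLD - N_GAPS * GAP_SCORE - BINARY_PART

# Scores of table 45, in the column order of the table.
TOKENS = ["{S,AURR980120}", "{Y,FAUJ880110}", "{E,CHOP780207}"]
TABLE_45 = {
    "Lysine": [0.72, 0.75, 0.47],
    "Proline": [0.34, 0.5, 0.42],
    "Arginine": [0.72, 0.75, 0.61],
}

# A token is usable only if the partner residue needs a score <= 1.
reachable = {
    amino_acid: [token for token, score in zip(TOKENS, scores)
                 if pair_target - score <= MAX_TOKEN_SCORE]
    for amino_acid, scores in TABLE_45.items()
}

print(f"pair target: {pair_target:.3f}")
for amino_acid, tokens in reachable.items():
    print(amino_acid, "->", tokens if tokens else "threshold unreachable")
```

Under these assumptions no token lets a proline reach the target, and the {E,CHOP780207} token is excluded for lysine and arginine, which is in line with the pairs [MACS][KR] retained in the text.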

The third residue is characterized mainly by the tokens at positions 6, 9 and 10 (and to a lesser extent 7 and 8, i.e. a serine or a glutamine). This corresponds to ¬[NYILKRD], ¬[YAK] or ¬[LKR]. It is interesting to note that the lysine (K) is present in all of them and that the tyrosine (Y), leucine (L) and arginine (R) are present in two tokens. Next, for the residues at positions 4 and 5 in the sequence, the most frequent patterns are ¬[YAK]¬[LKR], ¬[YAK][EAQD] and ¬[LKR][EAQD]. The sixth residue is of no interest as it is always aligned on the last token, which accepts any amino acid.

The complexity is that any combination (that respects the order of the aligned residues) of these simplified sub-motifs produces an alignment that characterizes a sequence considered as acetylated. These sub-motifs are summarized in table 46. This corresponds to 45 possible assemblies of the previously described sub-motifs. The simplified aligned motif can itself be described with a decision tree. Even if it provides information, for example the avoidance of positively charged amino acids between the third and fourth residues, or the preference for a negatively charged amino acid as the fifth residue, the possible combinations make the motif hard to read.
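The count of 45 assemblies follows from the number of sub-motifs per column of table 46 (5 × 3 × 3 × 1). The sketch below enumerates them and turns each assembly into an ordinary regular expression. It is only an approximation under stated assumptions: ¬[...] is read as a negated character class, • as any residue, and the gaps and alignment scores of the real aligned motif are ignored. The function names are hypothetical.

```python
import itertools
import re

# Sub-motifs of table 46, one list per column.
SUB_MOTIFS = [
    ["[ASDN][KR]", "L[CIL]", "[GL]Q", "[GVL][ADEMNSTY]", "MI"],  # residues 1-2
    ["¬[NYILKRD]", "¬[YAK]", "¬[LKR]"],                           # residue 3
    ["¬[YAK]¬[LKR]", "¬[YAK][EAQD]", "¬[LKR][EAQD]"],              # residues 4-5
    ["•"],                                                        # residue 6
]


def to_regex(sub_motif):
    """Translate the notation of table 46 into regular expression syntax."""
    return sub_motif.replace("¬[", "[^").replace("•", ".")


# One compiled rule per ordered combination of the columns.
RULES = [
    re.compile("".join(to_regex(part) for part in combination))
    for combination in itertools.product(*SUB_MOTIFS)
]


def matches_any_rule(sequence):
    """True if the start of the sequence matches at least one assembly."""
    return any(rule.match(sequence) for rule in RULES)


print(len(RULES))                  # number of assemblies
print(matches_any_rule("MIAAEA"))  # MI, then A, then A followed by E
print(matches_any_rule("AKKKKK"))  # lysine as third residue is excluded
```

The two example sequences are invented for illustration; the second one is rejected because every sub-motif for the third residue excludes lysine.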


[Figure 77: three panels. Top: average profiles score difference (average alignment score against token position in the motif). Bottom: left positions histogram and right positions histogram (alignment scores of residues 1 to 6 against token position in the motif).]

Figure 77: Average score difference and histograms of aligned positions for the second motif in the simplified motifs-tree for the prediction of Nα-terminal acetylation in H. sapiens. We can see from the difference that the presence of an alanine or an aspartate as the fifth residue is specific to the sequences matching the motif. The histograms also show that residues at positions 4 and 5 composed of glutamate, glutamine, alanine and aspartate are specific to proteins matching the motif. See text for a more detailed explanation.


Table 46: Simplified rules of the second motif in the simplified motifs-tree for Nα-terminal acetylation in H. sapiens. Any ordered combination of the columns produces a rule that implies Nα-terminal acetylation, as illustrated in figure 75.

Residues 1 and 2    Residue 3     Residues 4 and 5    Residue 6
[ASDN][KR]          ¬[NYILKRD]    ¬[YAK]¬[LKR]        •
L[CIL]              ¬[YAK]        ¬[YAK][EAQD]
[GL]Q               ¬[LKR]        ¬[LKR][EAQD]
[GVL][ADEMNSTY]
MI

Table 47: 10-fold cross-validated results for Nα-terminal acetylation prediction with the motifs-trees on the 2015 version of the datasets. BL stands for baseline and MCC for Matthews correlation coefficient.

Taxon        BL     Accuracy   Sensitivity   Specificity   MCC
Eukaryota    0.82   0.91       0.95          0.73          0.70
Metazoa      0.85   0.93       0.97          0.70          0.69
H. sapiens   0.88   0.90       0.94          0.58          0.51

8.3.3 UniProtKB release 2015_07

Finally we present the scores obtained with the 2015 dataset (see chapter 6 for the composition of the datasets). We use the same learning parameters as the ones used on the 2012 dataset and we train motifs-trees to predict Nα-terminal acetylation (table 31). This dataset is more unbalanced than the 2012 dataset, which may make the learning more difficult. The results are summarized in table 47 and are obtained without parameter tuning. Hence it may be possible to obtain better results. However the results are still on par with the ones obtained on the 2012 dataset and are still above the baseline. We also evaluated TermiNator3 on the latest (2015) version of the Eukaryota dataset, and the results for Nα-terminal acetylation are lower than the ones obtained with the 2012 dataset (table 48). Indeed the specificity is slightly lower and, probably because the dataset is very unbalanced toward acetylated proteins, the Matthews correlation coefficient drops. Thus there may be a need to update the patterns used in TermiNator3.
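To illustrate why the Matthews correlation coefficient can drop on a more unbalanced dataset even when sensitivity and specificity are unchanged, the sketch below computes the scores reported in the tables from a confusion matrix. The formulas are the standard ones. The confusion matrices are invented for illustration and are not taken from our datasets, and the baseline is assumed here to be the proportion of the majority class.

```python
import math


def classification_scores(tp, fn, tn, fp):
    """Scores of a binary classifier from its confusion matrix counts."""
    total = tp + fn + tn + fp
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        # Assumption: baseline = accuracy of always predicting the majority class.
        "baseline": max(tp + fn, tn + fp) / total,
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


# Hypothetical classifiers with sensitivity 0.9 and specificity 0.7
# on a balanced set (500/500) and on an unbalanced set (900/100).
balanced = classification_scores(tp=450, fn=50, tn=350, fp=150)
unbalanced = classification_scores(tp=810, fn=90, tn=70, fp=30)

for name, scores in (("balanced", balanced), ("unbalanced", unbalanced)):
    print(name, {key: round(value, 2) for key, value in scores.items()})
```

In this invented example the accuracy increases on the unbalanced set while the Matthews correlation coefficient decreases, which is why both the baseline and the MCC are reported alongside the accuracy.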

8.4 Ensemble learning

One hypothesis is that we have a variance problem. In machine learning the variance of a classifier is defined as the error due to sensitivity to noise in the training set. If the learning algorithm aligns itself on these small variations,

Table 48: Results assessing TermiNator3 quality of prediction for Nα-terminal acetylation. The predictor has been evaluated on both versions of the dataset. BL is the baseline.

Dataset version   BL     Accuracy   Sensitivity   Specificity   MCC
2012              0.63   0.87       0.92          0.77          0.71
2015              0.82   0.86       0.89          0.71          0.58

this may alter the generalization capability of the classifier. Hence a high variance may cause overfitting because the algorithm models the noise in the data. A way to reduce the variance of a classifier is to combine several classifiers. Hence we want to combine motifs-trees in the hope of improving the performance of Nα-Ac prediction. Such an approach is called ensemble learning. The references used to introduce ensemble learning, bagging and the random forest come mainly from [Russell and Norvig, 2010, Han et al., 2006, Murphy, 2012]. For clarity those references will not be cited again in the following sections.

Intuitively, an ensemble learning method builds a complex classifier by combining a set of simple classifiers. To predict the output of an instance x, each one of the simple classifiers predicts its output for x. Then the output of the complex classifier is the output occurring the most frequently among the simple classifiers. In other words, the majority wins. So we have to build k simple classifiers (usually k is odd to avoid a draw). To train the k simple classifiers, the complete training set is not used, but only a fraction of randomly chosen instances from the set. Indeed, using the same set of instances for all simple classifiers may lead to overfitting the training data. However, if subsets of the training set are used, variability between the simple classifiers is added.

Here, we focus on bagging (bootstrap aggregation). As usual we have a training or learning set L and a test set T, but the training set is sampled into k smaller training sets Li. Each Li is composed of m = β|L| instances (rounded to the nearest integer), with 0 < β ≤ 1. Each one of the m instances in Li is drawn with replacement from L. That is to say that not all instances from L may be used as a training instance:

\[ \bigcup_{i=1}^{k} L_i \subseteq L. \tag{103} \]

It should be noted that some instances can be drawn several times. The goal of bagging is to improve the stability and accuracy of the learning method and to reduce the variance [Breiman, 1996]. Then each one of these Li is used to train one simple classifier, and the simple classifiers are combined to form the complex classifier. At the end of the learning phase, each instance of the test set T is tested on the complex classifier to evaluate its performance. These steps are illustrated in figure 78.
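The sampling step of bagging described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function name `bootstrap_samples`, the `seed` parameter and the toy learning set are assumptions introduced for the example.

```python
import random


def bootstrap_samples(learning_set, k, beta, seed=0):
    """Draw k training sets of m = beta * |L| instances, with replacement."""
    if not 0 < beta <= 1:
        raise ValueError("beta must satisfy 0 < beta <= 1")
    rng = random.Random(seed)
    # m is rounded to the nearest integer.
    m = int(beta * len(learning_set) + 0.5)
    # random.choices draws with replacement, so an instance may appear
    # several times in a sample and some instances may never be drawn.
    return [rng.choices(learning_set, k=m) for _ in range(k)]


# Toy learning set of 100 instance identifiers (illustrative only).
L = list(range(100))
samples = bootstrap_samples(L, k=5, beta=0.6)

drawn = set().union(*samples)
print([len(sample) for sample in samples])
print(len(drawn), "distinct instances drawn out of", len(L))
```

The union of the samples is always included in L, as stated by the inclusion above, but it does not have to cover L entirely.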

8.4.1 Motifs forest

The motifs forest algorithm is inspired by the random forest method and is a variation of the motifs-tree algorithm. It uses bagging to generate k training sets and each of these sets is used to grow a motifs-tree. Then the outputs of the motifs-trees are combined by a majority vote to produce the output of the motifs forest. Algorithm 7 illustrates the procedure to build a motifs forest. This approach is somewhat similar to random forest. A random forest is an ensemble learning method that combines k random decision trees to produce an enhanced classifier [Breiman, 2001]. Each one of the k trees is grown using a training set produced with bagging. But during the learning phase of a decision tree, the construction of a test node is not based on all available features, but only on a randomly selected subset of these features. Then the information gain ratio is computed to select the best feature from the


[Figure 78: schematic in three panels (a), (b) and (c) showing the learning set L, the test set T and the bootstrap samples L1, L2, L3, ..., Lk.]

Figure 78: Construction of an ensemble classifier based on decision trees. (a) The dataset is split into a train set L and a test set T. (b) Then bagging is used to build k training sets from the training set L. The Li are not simply a partition of L but are drawn independently with replacement from L. Then each Li is used to train one of the k decision trees. (c) Then the trees are grouped in one complex classifier which is evaluated with the test set T. Each instance of T (an x in the figure) is predicted by each one of the k decision trees (the yi) and the most frequent outcome (the y*) is set as the outcome of the complex classifier.


subset to build the test node. For each test node construction, a new subset of features is randomly selected. Otherwise the growth of the tree is similar to the C4.5 algorithm.

In the motifs forest, only bagging is applied to the motifs-trees, but this ensemble approach with motifs-trees makes it close to a random forest. Indeed, for a motifs-tree, each split during the learning phase can also be seen as a search for the best feature in a random subset of features. Considering that the feature space is the set of all possible motifs, the feature selection method of C4.5 was not applied due to the size of the space. The genetic algorithm setup used in motifs-tree learning is, in a sense, a search in a random subset of the possible motifs because:
- the initial population on which all genetic operators will be applied is randomly drawn and is of finite size;
- the number of generations is bounded by a parameter, and the value used is much lower than the number of applications of genetic operators needed to generate all the motifs from the initial population (i.e. it is not an exhaustive search).
So the search is bound to a subset of motifs defined by the initial population, the number of generations and the genetic operators used (see footnote 6). In other words, using bagging with motifs-trees is similar to a random forest with trees built on features included in the motifs space.

Algorithm 7 Motifs forest algorithm
Require: A learning set of instances L
Ensure: A motif forest M*
M* ← ∅
for i = 1 to k do
    Li ← bootstrap a sample from L
    Mi ← build a motifs-tree from Li
    M* ← M* ∪ Mi
end for
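Algorithm 7 and the majority vote can be sketched in a few lines. The motifs-tree learner itself is not reproduced here: `toy_tree_builder` is a hypothetical stand-in that labels a sequence from its first residue only, and the sequences in the toy learning set are invented. The default values k = 99 and β = 0.6 are the ones reported in section 8.4.2.

```python
import random
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent prediction (k is odd to avoid a draw)."""
    return Counter(predictions).most_common(1)[0][0]


def build_forest(learning_set, build_tree, k=99, beta=0.6, seed=0):
    """Grow one classifier per bootstrap sample, as in algorithm 7."""
    rng = random.Random(seed)
    m = int(beta * len(learning_set) + 0.5)
    forest = []
    for _ in range(k):
        sample = rng.choices(learning_set, k=m)  # drawn with replacement
        forest.append(build_tree(sample))
    return forest


def predict(forest, sequence):
    """Combine the outputs of all the trees by a majority vote."""
    return majority_vote(tree(sequence) for tree in forest)


def toy_tree_builder(sample):
    """Hypothetical stand-in for motifs-tree learning (first residue only)."""
    counts = {}
    for sequence, label in sample:
        counts.setdefault(sequence[0], Counter())[label] += 1
    default = Counter(label for _, label in sample).most_common(1)[0][0]

    def classify(sequence):
        if sequence[0] in counts:
            return counts[sequence[0]].most_common(1)[0][0]
        return default

    return classify


# Invented learning set: (sequence, label) pairs.
toy_set = [("SEEK", "acetylated")] * 20 + [("PKRL", "not acetylated")] * 20
forest = build_forest(toy_set, toy_tree_builder)
print(len(forest), predict(forest, "SAAA"), "/", predict(forest, "PAAA"))
```

Passing the tree builder as a parameter keeps the bagging and voting logic independent from the way each simple classifier is learned.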

8.4.2 Classification performances of the motifs forest

As often observed with a bagging or random forest approach, the ensemble classifier achieves better prediction performance. However, the improvements obtained when evaluated with the 2012 dataset are not impressive. The complete cross-validated performances are provided in table 49. The parameters used to learn the forest are:
- initial instances set (the one used with bagging): the same set as the one used to train a single motifs-tree;
- number of motifs-trees: 99;
- bagging size: 0.6;
- genetic algorithm parameters: the same as the ones used for a single motifs-tree.
Moreover, the obvious drawback of this approach is that the white box property of the motifs-tree vanishes. The analysis of a single motifs-tree is not always obvious, and combining trees in an ensemble makes it very difficult to extract which features of an instance drive the prediction. Therefore the use of ensemble learning as described here is not the best way to improve performances.

6. However, all the possible motifs can be generated if the evolution runs for a sufficiently long amount of time.


Table 49: Cross-validated performances of the ensemble learning method on the 2012 datasets. MCC means Matthews correlation coefficient. The provided gain is the increase in MCC versus the cross-validated performances of a single motifs-tree.

Dataset      Accuracy   Sensitivity   Specificity   MCC    MCC gain
Eukaryota    0.90       0.95          0.81          0.78   +0.01
Metazoa      0.91       0.97          0.76          0.76   +0.04
H. sapiens   0.94       0.99          0.59          0.69   +0.06

8.5 Conclusion and perspective

In this chapter we demonstrate that the motifs-trees can be used to build a good predictor for Nα-terminal acetylation. Moreover, in comparison to the tools currently published and available, we propose a new state-of-the-art approach to predict Nα-terminal acetylation. There are mainly two available predictors: NetAcet and the previous state of the art, TermiNator3. Regarding NetAcet, its limitations make the comparison useless, as it works only on potential NatA substrates and only for S. cerevisiae (section 5.2). When compared with the state of the art, namely TermiNator3, we obtain comparable performances regarding the generalization. However the motifs used in TermiNator3 seem to need an update. Indeed, when tested with the 2015 datasets, the performances of the predictor drop. But finding new motifs without a machine learning approach in a set composed of more than 6 000 proteins is a hard task. With the 2015 Eukaryota dataset, the motifs-trees still produce good results, with the same generalization performances as with the 2012 dataset. The comparison of the N-terminal methionine cleavage prediction is not discussed because both the motifs-trees and TermiNator3 produce excellent results.

Another important contribution is the proposed criteria to build a clean set of Nα-terminally acetylated proteins and especially of non-acetylated proteins. It is crucial for supervised learning to have a clean dataset, that is to say proteins correctly labeled as Nα-acetylated, which is easy information to extract from UniProtKB, and proteins without the acetylation annotation that are really not acetylated. Obtaining those non-acetylated proteins is the hard part in building a dataset. We hope that the criteria we have proposed will be used by other groups that tackle the problem of Nα-terminal acetylation prediction, and even refined if needed.

We have also tried to produce a white box classifier by using motifs. To describe a sequence, a regular-expression-like descriptor seems to be the easiest model to use. In the case of the human initiator methionine cleavage we succeeded in extracting features from the motifs in the tree. Regarding the Nα-terminal acetylation it was more difficult, even for the short H. sapiens tree. We realize that aligned motifs are able to detect very subtle features in the sequence, but this can make them difficult to read or interpret.

We think that there are several potential improvements. First, the motifs are sometimes too complicated, and this may be improved by allowing another plague operator (see chapter 7.4) that can slightly reduce the fitness of the individual. It can be done by accepting the modification of the operator on the motif if the decrease in fitness stays under a given percentage of the fitness of the unmodified individual. Another variant can be applied in a tree pruning manner. Usually a tree is pruned in a bottom-up fashion, and for each node we estimate the error made when the node is converted into

a leaf against the error when the subtree is conserved. This principle and estimation can be applied to a plague operator. At the end of the growth phase, we traverse the tree in a bottom-up manner and at each node we apply a plague operator that decreases the fitness of the individual. Thus we have two subtrees: one with the unaltered motif as the root node and one with the altered motif. Then the error estimation is applied to both subtrees, and the subtree to keep is selected in the same manner as the decision to replace a subtree by a leaf is made in pessimistic pruning [Quinlan, 1987]. Such a step may have the advantage of producing trees that generalize better and that are more readable by a human.
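The acceptance rule suggested above for a fitness-decreasing plague operator can be written as a simple test. This is only a sketch of the proposal, which has not been implemented in this work: the function name, the default tolerance of 5 % and the fitness values are hypothetical.

```python
def accept_plague(fitness_before, fitness_after, max_relative_loss=0.05):
    """Accept a simplified motif if the fitness loss stays under a given
    percentage of the fitness of the unmodified individual."""
    if fitness_before <= 0:
        # Assumption: without a positive reference fitness, accept only
        # modifications that do not decrease the fitness.
        return fitness_after >= fitness_before
    loss = fitness_before - fitness_after
    return loss <= max_relative_loss * fitness_before


# Hypothetical fitness values before and after simplifying a motif.
print(accept_plague(0.80, 0.78))  # small loss: the simpler motif is kept
print(accept_plague(0.80, 0.70))  # loss too large: the motif is unchanged
```

The tolerance plays the role of the percentage mentioned in the text and would have to be tuned like the other parameters of the genetic algorithm.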

Part IV

CONCLUSION

CONCLUSION & PERSPECTIVE

In the first part we started this work by subdividing the networks in order to compute the extreme pathways of large networks. We have experimentally demonstrated that, for some dividing strategies, this is not an approach to use, despite being proposed in some publications as a possibility to improve the computation. This negative result is still a strong result because it states that the problem should not be tackled in the presented manner. This work should be a base for anyone who wants to try to modularize the network to compute extreme pathways or any generating sets. Then we have proposed another metric based on the extreme pathways (that can be applied with other generating sets) to measure the similarities between objects (reactions, enzymes, genes) in a metabolic network. This measure has been successfully applied to the human red blood cell metabolic network and allowed us to study some enzymopathies in silico. Our approach should be reevaluated on a more complete red blood cell network, like the one available in BiGG, which is more detailed than the one presented in this thesis [Schellenberger et al., 2010]. Ideally the conclusions drawn should also be discussed with a hematologist. We have also been able to discover a part of the E. coli operon structure when the method is applied to a genome-scale network. However the data for the E. coli network have increased in size since the network described in this manuscript, and the results on the new networks may differ. Moreover our knowledge about the regulation network also increases through time, and new elements of the regulation (i.e. the transcription factors or the sigma units) may be used to better assess the quality of module detection. This observation is applicable to all similar methods [Gagneur et al., 2003, Guimera and Amaral, 2005, Schuster et al., 2002b] because we do not have an idea of the validity of those methods through time. That is to say: will a reapplication of the approaches on more complete data produce the same performance results? Hence those methods should be periodically evaluated with the evolution of the datasets.

Nevertheless it seems that assessing the partition of a metabolic network is often based on qualitative analysis, which makes the comparison with other approaches difficult. When a quantitative criterion is used, it seems that the comparison is not systematic in the publications. The criterion is rather used as a self-justification of the approach, but does not allow one to really understand what the advantages or improvements are. We think that it can really improve the quality of the publications if such comparisons are systematically done. This will also help researchers to select the method that best suits their needs. In short, some bioinformatics publications should be more methodology oriented instead of application oriented.

In the second part we have proposed a classifier to predict the Nα-terminal acetylation and we have produced the state of the art for Nα-terminal acetylation prediction in Eukaryota. We have also proposed criteria to automatically build the most accurate dataset for Nα-terminal acetylation (positive and negative examples). However the observation made on the data for metabolic or regulation networks is also valid for sequence analysis: we are very dependent on the quality of the dataset. Between our publication and the end of this thesis, the size of the dataset has more or less doubled. Obviously we can also question the

validity of our method to solve the problem of Nα-terminal acetylation. In our case the results are stable between the UniProtKB releases. However the already published approaches are not reevaluated by their authors with the growing dataset. This makes the choice of a tool very difficult for a researcher. We can greatly improve the quality of these tools if they are updated with the new datasets and if their performances and new datasets are made public (without the need of publishing). Lastly, although this is usually the case, the datasets used to build any model based on sequence analysis should be available, so that an approach can be compared with another one already published.

Computer science offers outstanding tools to help scientists analyze the massive amount of data collected by biologists or biochemists. Even if some biologists are trained in computer science, a computer scientist will have a different angle to solve problems and a wider view on how to tackle data problems and how to validate a methodology. Moreover computer scientists are also sensitive to high performance and may provide tools of high quality. This also becomes more and more important with the new computer models used to understand biophysical phenomena. Therefore integrating computer scientists into laboratories and training them to understand biological problems could be a valuable asset for a research group in life science.

BIBLIOGRAPHY

S. M. Arfin and R. A. Bradshaw. Cotranslational processing and protein turnover in eukaryotic cells. Biochemistry, 27(21):7979–7984, Oct 1988.

Masanori Arita. The metabolic world of escherichia coli is not small. Proceedings of the National Academy of Sciences of the United States of America, 101(6):1543–1547, 2004.

Thomas Arnesen. Towards a functional understanding of protein N-terminal acetylation. PLoS Biol, 9(5): e1001074, May 2011. doi: 10.1371/journal.pbio.1001074. URL http://dx.doi.org/10.1371/journal.pbio.1001074.

Thomas Arnesen, Petra Van Damme, Bogdan Polevoda, Kenny Helsens, Rune Evjenth, Niklaas Colaert, Jan Erik Varhaug, Joël Vandekerckhove, Johan R. Lillehaug, Fred Sherman, and Kris Gevaert. Proteomics analyses reveal the evolutionary conservation and divergence of N-terminal acetyltransferases from yeast and humans. Proc Natl Acad Sci U S A, 106(20):8157–8162, May 2009. doi: 10.1073/pnas.0901931106. URL http://dx.doi.org/10.1073/pnas.0901931106.

Rajeev Aurora and George D Rosee. Helix capping. Protein Science, 7(1):21–38, 1998.

Timothy L Bailey, James Johnson, Charles E Grant, and William S Noble. The meme suite. Nucleic acids research, page gkv416, 2015.

Wolfgang Banzhaf, Frank D. Francone, Robert E. Keller, and Peter Nordin. Genetic programming: an introduction: on the automatic evolution of computer programs and its applications. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998. ISBN 1-55860-510-X.

Albert-Laszlo Barabasi and Zoltan N Oltvai. Network biology: understanding the cell’s functional organization. Nature reviews genetics, 5(2):101–113, 2004.

I Baranowska-Bosiacka, AJ Hlynczak, B Wiszniewska, and M Marchlewicz. Disorders of purine metabolism in human erythrocytes in the state of lead contamination. Polish Journal of Environmental Studies, 13(5): 467–476, 2004.

Jeremy M Berg, John L Tymoczko, and Lubert Stryer. Biochemistry. New York, 2002.

P Bernhard. Systems biology: properties of reconstructed networks, 2006.

Michael R. Berthold, Christian Borgelt, Frank Höppner, and Frank Klawonn. Guide to Intelligent Data Analysis: How to Intelligently Make Sense of Real Data. Springer Publishing Company, Incorporated, 1st edition, 2010. ISBN 1848822596, 9781848822597.

D John Betteridge. What is oxidative stress? Metabolism, 49(2):3–8, 2000.

Willy V. Bienvenut, David Sumpton, Aude Martinez, Sergio Lilla, Christelle Espagne, Thierry Meinnel, and Carmela Giglione. Comparative large scale characterization of plant versus mammal proteins reveals similar and idiosyncratic N-α-acetylation features. Mol Cell Proteomics, 11(6):M111.015131, Jun 2012. doi: 10.1074/mcp.M111.015131. URL http://dx.doi.org/10.1074/mcp.M111.015131.

N. Blom, S. Gammeltoft, and S. Brunak. Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J Mol Biol, 294(5):1351–1362, Dec 1999. doi: 10.1006/jmbi.1999.3310. URL http://dx.doi.org/10.1006/jmbi.1999.3310.

Thomas Blumenthal. Operons in eukaryotes. Briefings in functional genomics & proteomics, 3(3):199–211, 2004.

Guido Bologna, Cédric Yvon, Séverine Duvaud, and Anne-Lise Veuthey. N-terminal myristoylation predictions by ensembles of neural networks. Proteomics, 4(6):1626–1632, Jun 2004. doi: 10.1002/pmic.200300783. URL http://dx.doi.org/10.1002/pmic.200300783.

Stephen P Borgatti, Ajay Mehra, Daniel J Brass, and Giuseppe Labianca. Network analysis in the social sciences. science, 323(5916):892–895, 2009.

R. A. Bradshaw, W. W. Brickey, and K. W. Walker. N-terminal processing: the methionine aminopeptidase and n alpha-acetyl transferase families. Trends Biochem Sci, 23(7):263–267, Jul 1998.

Kristien Braeken, Martine Moris, Ruth Daniels, Jos Vanderleyden, and Jan Michiels. New horizons for (p) ppgpp in bacterial and plant physiology. Trends in microbiology, 14(1):45–54, 2006.

Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996. URL http://link.springer.com/article/10.1007/BF00058655.


Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001. URL http://link.springer.com/article/10.1023/A:1010933404324.

Ali H Brivanlou and James E Darnell. Signal transduction and the control of gene expression. Science, 295(5556):813–818, 2002.

JL Brown and WK Roberts. Evidence that approximately eighty per cent of the soluble proteins from ehrlich ascites cells are nalpha-acetylated. Journal of Biological Chemistry, 251(4):1009–1014, 1976.

P. Bucher, K. Karplus, N. Moeri, and K. Hofmann. A flexible motif search technique based on generalized profiles. Comput Chem, 20(1):3–23, Mar 1996.

Anthony P Burgard, Evgeni V Nikolaev, Christophe H Schilling, and Costas D Maranas. Flux coupling analysis of genome-scale metabolic network reconstructions. Genome research, 14(2):301–312, 2004. URL http://genome.cshlp.org/content/14/2/301.short.

Y. Burstein and I. Schechter. Primary structures of n-terminal extra peptide segments linked to the variable and constant regions of immunoglobulin light chain precursors: implications on the organization and controlled expression of immunoglobulin genes. Biochemistry, 17(12):2392–2400, Jun 1978.

Yu-Dong Cai and Lin Lu. Predicting n-terminal acetylation based on feature selection method. Biochem Biophys Res Commun, 372(4):862–865, Aug 2008. doi: 10.1016/j.bbrc.2008.05.143. URL http://dx.doi.org/10.1016/j.bbrc.2008.05.143.

Caroline H Yi, Heling Pan, Jan Seebacher, Il-Ho Jang, Sven G Hyberts, Gregory J Heffron, Matthew G Vander Heiden, Renliang Yang, Fupeng Li, Jason W Locasale, et al. Metabolic regulation of protein n-alpha-acetylation by bcl-xl promotes cell survival. Cell, 146(4):607–620, 2011.

M Emre Celebi and Hassan A Kingravi. Deterministic initialization of the k-means algorithm using hierarchical clustering. International Journal of Pattern Recognition and Artificial Intelligence, 26(07):1250018, 2012.

Christophe Charpilloz, Anne-Lise Veuthey, Bastien Chopard, and Jean-Luc Falcone. Motifs tree: a new method for predicting post-translational modifications. Bioinformatics, page btu165, 2014. URL http://bioinformatics.oxfordjournals.org/content/early/2014/03/28/bioinformatics.btu165.short.

David P Chassin and Christian Posse. Evaluating north american electric grid reliability using the barabási–albert network model. Physica A: Statistical Mechanics and its Applications, 355(2):667–677, 2005. URL http://www.sciencedirect.com/science/article/pii/S0378437105002311.

P. Y. Chou and G. D. Fasman. Prediction of the secondary structure of proteins from their amino acid sequence. Adv Enzymol Relat Areas Mol Biol, 47:45–148, 1978.

Gwo-Yu Chuang, Jeffrey C. Boyington, M Gordon Joyce, Jiang Zhu, Gary J. Nabel, Peter D. Kwong, and Ivelin Georgiev. Computational prediction of n-linked glycosylation incorporating structural properties and patterns. Bioinformatics, 28(17):2249–2255, Sep 2012. doi: 10.1093/bioinformatics/bts426. URL http://dx.doi.org/10.1093/bioinformatics/bts426.

Bertrand Clarke, Ernest Fokoue, and Hao Helen Zhang. Principles and theory for data mining and machine learning. Springer Science & Business Media, 2009.

UniProt Consortium et al. Uniprot: a hub for protein information. Nucleic acids research, page gku989, 2014.

David Croft, Antonio Fabregat Mundo, Robin Haw, Marija Milacic, Joel Weiser, Guanming Wu, Michael Caudy, Phani Garapati, Marc Gillespie, Maulik R. Kamdar, Bijay Jassal, Steven Jupe, Lisa Matthews, Bruce May, Stanislav Palatnik, Karen Rothfels, Veronica Shamovsky, Heeyeon Song, Mark Williams, Ewan Birney, Henning Hermjakob, Lincoln Stein, and Peter D’Eustachio. The reactome pathway knowledgebase. Nucleic Acids Res, 42(Database issue):D472–D477, Jan 2014. doi: 10.1093/nar/gkt1102. URL http://dx.doi.org/10.1093/nar/gkt1102.

Paolo Crucitti, Vito Latora, and Massimo Marchiori. A topological analysis of the italian electric power grid. Physica A: Statistical Mechanics and its Applications, 338(1):92–97, 2004. URL http://www.sciencedirect.com/science/article/pii/S0378437104002249.

A Dabrowska. Red cell pyruvate kinase maybe control of oxygen delivery from erythrocyte. Postępy Higieny i Medycyny Doświadczalnej, 51(3):305–318, 1997.

Angelo D’Alessandro, Federica Gevi, and Lello Zolla. Red blood cell metabolism under prolonged anaerobic storage. Molecular BioSystems, 9(6):1196–1209, 2013.

Thomas Dandekar, Ferdinand Moldenhauer, Sascha Bulik, Helge Bertram, and Stefan Schuster. A method for classifying metabolites in topological pathway analyses based on minimization of pathway number. Biosystems, 70(3):255–270, 2003. URL http://www.sciencedirect.com/science/article/pii/S0303264703000674.

Mutlu Doğruel, Thomas A Down, and Tim JP Hubbard. Nestedmica as an ab initio protein motif discovery tool. BMC bioinformatics, 9(1):19, 2008.

Richard Durbin, Sean R Eddy, Anders Krogh, and Graeme Mitchison. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge university press, 1998.

Birgit Eisenhaber and Frank Eisenhaber. Prediction of posttranslational modification of proteins from their amino acid sequence. In Oliviero Carugo and Frank Eisenhaber, editors, Data Mining Techniques for the Life Sciences, volume 609 of Methods in Molecular Biology, pages 365–384. Humana Press, 2010. ISBN 978-1-60327-241-4.

Rune Evjenth, Kristine Hole, Odd A Karlsen, Mathias Ziegler, Thomas Arnesen, and Johan R Lillehaug. Human naa50p (nat5/san) displays both protein nα- and nε-acetyltransferase activity. Journal of Biological Chemistry, 284(45):31122–31129, 2009.

Jean-Luc Fauchère, Marvin Charton, Lemont B Kier, Arie Verloop, and Vladimir Pliska. Amino acid side chain parameters for correlation studies in biology and pharmacology. International journal of peptide and protein research, 32(4):269–278, 1988.

Tom Fawcett. An introduction to roc analysis. Pattern recognition letters, 27(8):861–874, 2006.

Eitan Fibach and Eliezer Rachmilewitz. The role of oxidative stress in hemolytic anemia. Current molecular medicine, 8(7):609–619, 2008.

Theresa M Filtz, Walter K Vogel, and Mark Leid. Regulation of transcription factor activity by interconnected post-translational modifications. Trends in pharmacological sciences, 35(2):76–85, 2014.

Gabriella M A. Forte, Martin R. Pool, and Colin J. Stirling. N-terminal acetylation inhibits protein targeting to the endoplasmic reticulum. PLoS Biol, 9(5):e1001073, May 2011. doi: 10.1371/journal.pbio.1001073. URL http://dx.doi.org/10.1371/journal.pbio.1001073.

Edward B Fowlkes and Colin L Mallows. A method for comparing two hierarchical clusterings. Journal of the American statistical association, 78(383):553–569, 1983. URL http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1983.10478008.

Jennifer E Frank. Diagnosis and management of g6pd deficiency. American family physician, 72(7):1277–1282, 2005.

Frédéric Frottin, Aude Martinez, Philippe Peynot, Sanghamitra Mitra, Richard C. Holz, Carmela Giglione, and Thierry Meinnel. The proteomics of n-terminal methionine cleavage. Mol Cell Proteomics, 5(12):2336–2349, Dec 2006. doi: 10.1074/mcp.M600225-MCP200. URL http://dx.doi.org/10.1074/mcp.M600225-MCP200.

Komei Fukuda and Alain Prodon. Double description method revisited. In Combinatorics and computer science, pages 91–111. Springer, 1996. URL http://link.springer.com/chapter/10.1007/3-540-61576-8_77.

Beverly W Gabrio, Clement A Finch, and Frank M Huennekens. Erythrocyte preservation: a topic in molecular biochemistry. Blood, 11(2):103–113, 1956.

Julien Gagneur and Steffen Klamt. Computation of elementary modes: a unifying framework and the new binary approach. BMC bioinformatics, 5(1):175, 2004.

Julien Gagneur, David B Jackson, and Georg Casari. Hierarchical analysis of dependency in metabolic networks. Bioinformatics, 19(8):1027–1034, 2003. URL http://bioinformatics.oxfordjournals.org/content/19/8/1027.short.

Socorro Gama-Castro, Heladia Salgado, Martin Peralta-Gil, Alberto Santos-Zavaleta, Luis Muñiz-Rascado, Hilda Solano-Lira, Verónica Jimenez-Jacinto, Verena Weiss, Jair S García-Sotelo, Alejandra López-Fuentes, et al. Regulondb version 7.0: transcriptional regulation of escherichia coli k-12 integrated within genetic sensory response units (gensor units). Nucleic acids research, 39(suppl 1):D98–D105, 2011.

Matthias Gautschi, Sören Just, Andrej Mun, Suzanne Ross, Peter Rücknagel, Yves Dubaquié, Ann Ehrenhofer-Murray, and Sabine Rospert. The yeast n(alpha)-acetyltransferase nata is quantitatively anchored to the ribosome and interacts with nascent polypeptides. Mol Cell Biol, 23(20):7403–7414, 2003.

Kris Gevaert, Marc Goethals, Lennart Martens, Jozef Van Damme, An Staes, Grégoire R Thomas, and Joël Vandekerckhove. Exploring proteomes and analyzing protein processing by mass spectrometric identification of sorted n-terminal peptides. Nature biotechnology, 21(5):566–569, 2003.

C. Giglione, A. Boularot, and T. Meinnel. Protein n-terminal methionine excision. Cell Mol Life Sci, 61(12):1455–1474, Jun 2004. doi: 10.1007/s00018-004-3466-8. URL http://dx.doi.org/10.1007/s00018-004-3466-8.

Michelle Girvan and Mark EJ Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002. URL http://www.pnas.org/content/99/12/7821.short.

David E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition, 1989. ISBN 0201157675.

Pedro Gonnet and Frédérique Lisacek. Probabilistic alignment of motifs with sequences. Bioinformatics, 18(8):1091–1101, Aug 2002.

Eva Grafahrend-Belau, Falk Schreiber, Monika Heiner, Andrea Sackmann, Björn H Junker, Stefanie Grunwald, Astrid Speer, Katja Winder, and Ina Koch. Modularization of biochemical networks based on classification of petri net t-invariants. BMC bioinformatics, 9(1):90, 2008.

Darina Gromyko, Thomas Arnesen, Anita Ryningen, Jan Erik Varhaug, and Johan R Lillehaug. Depletion of the human nα-terminal acetyltransferase a induces p53-dependent apoptosis and p53-independent growth inhibition. International Journal of Cancer, 127(12):2777–2789, 2010.

Roger Guimera and Luis A Nunes Amaral. Functional cartography of complex metabolic networks. Nature, 433(7028):895–900, 2005. URL http://www.nature.com/nature/journal/v433/n7028/abs/nature03288.html.

Roger Guimerà, Stefano Mossa, Adrian Turtschi, and LA Nunes Amaral. The worldwide air transportation network: Anomalous centrality, community structure, and cities’ global roles. Proceedings of the National Academy of Sciences, 102(22):7794–7799, 2005. URL http://www.pnas.org/content/102/22/7794.short.

Roger Guimerà, Marta Sales-Pardo, and Luis A Nunes Amaral. A network-based method for target selection in metabolic networks. Bioinformatics, 23(13):1616–1622, 2007.

Jiawei Han, Micheline Kamber, and Jian Pei. Data mining: concepts and techniques. Morgan kaufmann, 2006.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2nd edition, 2001.

Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2005.

Michael Hecker, Sandro Lambeck, Susanne Toepfer, Eugene Van Someren, and Reinhard Guthke. Gene regulatory network inference: data integration in dynamic models–a review. Biosystems, 96(1):86–103, 2009.

Avram Hershko, Hannah Heller, Ethy Eytan, Gangadhar Kaklij, and Irwin A Rose. Role of the alpha-amino group of protein in ubiquitin-mediated protein breakdown. Proceedings of the National Academy of Sciences, 81(22):7021–7025, 1984.

Kristine Hole, Petra Van Damme, Monica Dalva, Henriette Aksnes, Nina Glomnes, Jan Erik Varhaug, Johan R Lillehaug, Kris Gevaert, and Thomas Arnesen. The human n-alpha-acetyltransferase 40 (hnaa40p/hnatd) is conserved from yeast and n-terminally acetylates histones h2a and h4. PloS one, 6(9):e24713, 2011.

Jolien Hollebeke, Petra Van Damme, and Kris Gevaert. N-terminal acetylation and other functions of nα-acetyltransferases. Biological chemistry, 393(4):291–298, 2012.

Lele Hu, Tao Huang, Xiaohe Shi, Wen-Cong Lu, Yu-Dong Cai, and Kuo-Chen Chou. Predicting functions of proteins in mouse based on weighted protein-protein interaction network and protein hybrid properties. PLoS One, 6(1):e14556, 2011.

Martijn A Huynen, Berend Snel, Christian von Mering, and Peer Bork. Function prediction and protein networks. Current opinion in cell biology, 15(2):191–198, 2003. URL http://www.sciencedirect.com/science/article/pii/S0955067403000097.

Cheol-Sang Hwang, Anna Shemorry, and Alexander Varshavsky. N-terminal acetylation of cellular proteins creates specific degradation signals. Science, 327(5968):973–977, Feb 2010. doi: 10.1126/science.1183147. URL http://dx.doi.org/10.1126/science.1183147.

Paul Jaccard. Etude comparative de la distribution florale dans une portion des Alpes et du Jura. Impr. Corbaz, 1901.

Hawoong Jeong, Bálint Tombor, Réka Albert, Zoltan N Oltvai, and A-L Barabási. The large-scale organization of metabolic networks. Nature, 407(6804):651–654, 2000.

Dimitrije Jevremovic, Cong T Trinh, Friedrich Srienc, and Daniel Boley. A simple rank test to distinguish extreme pathways from elementary modes in metabolic networks. Univ. of Minnesota, Computer Science and Eng. Dept. Tech. Rep, pages 08–029, 2008.

Steven Michael Johnson. Algorithms to detect motifs and predict post-translational modification sites. 2015.

Karin Julenius. Netcglyc 1.0: prediction of mammalian c-mannosylation sites. Glycobiology, 17(8):868–876, Aug 2007. doi: 10.1093/glycob/cwm050. URL http://dx.doi.org/10.1093/glycob/cwm050.

Suk Hoon Jung, Bora Hyun, Woo-Hyuk Jang, Hee-Young Hur, and Dong-Soo Han. Protein complex prediction based on simultaneous protein interaction network. Bioinformatics, 26(3):385–391, 2010. URL http://bioinformatics.oxfordjournals.org/content/26/3/385.short.

Christoph Kaleta, Luís Filipe de Figueiredo, and Stefan Schuster. Can the whole be less than the sum of its parts? pathway analysis in genome-scale metabolic networks using elementary flux patterns. Genome research, 19(10):1872–1883, 2009. URL http://genome.cshlp.org/content/19/10/1872.short.

Masahiro Kamita, Yayoi Kimura, Yoko Ino, Roza M Kamp, Bogdan Polevoda, Fred Sherman, and Hisashi Hirano. Nα-acetylation of yeast ribosomal proteins and its effect on protein synthesis. Journal of proteomics, 74(4):431–441, 2011.

Minoru Kanehisa and Susumu Goto. Kegg: kyoto encyclopedia of genes and genomes. Nucleic acids research, 28(1):27–30, 2000.

Usheer Kanjee, Koji Ogata, and Walid A Houry. Direct binding targets of the stringent response alarmone (p)ppGpp. Molecular microbiology, 85(6):1029–1043, 2012.

S. Kawashima and M. Kanehisa. AAindex: amino acid index database. Nucleic acids research, 28(1):374, January 2000. ISSN 0305-1048. doi: 10.1093/nar/28.1.374. URL http://dx.doi.org/10.1093/nar/28.1.374.

Ingrid M Keseler, Amanda Mackie, Martin Peralta-Gil, Alberto Santos-Zavaleta, Socorro Gama-Castro, César Bonavides-Martínez, Carol Fulcher, Araceli M Huerta, Anamika Kothari, Markus Krummenacker, et al. Ecocyc: fusing model organism databases with systems biology. Nucleic acids research, 41(D1):D605–D612, 2013.

Raya Khanin and Ernst Wit. How scale-free are biological networks. Journal of computational biology, 13(3):810–818, 2006. URL http://online.liebertpub.com/doi/abs/10.1089/cmb.2006.13.810.

Andrew D King, N Pržulj, and Igor Jurisica. Protein complex prediction via cost-based clustering. Bioinformatics, 20(17):3013–3020, 2004. URL http://bioinformatics.oxfordjournals.org/content/20/17/3013.short.

Ryan Kinney, Paolo Crucitti, Reka Albert, and Vito Latora. Modeling cascading failures in the north american power grid. The European Physical Journal B-Condensed Matter and Complex Systems, 46(1):101–107, 2005. URL http://www.springerlink.com/index/V163831253T3M122.pdf.

Steffen Klamt and Jörg Stelling. Two approaches for metabolic pathway analysis? Trends in biotechnology, 21(2):64–69, 2003.

Stuart Koschade. A social network analysis of jemaah islamiyah: The applications to counterterrorism and intelligence. Studies in Conflict & Terrorism, 29(6):559–575, 2006. URL http://www.tandfonline.com/doi/abs/10.1080/10576100600798418.

Valdis E Krebs. Mapping networks of terrorist cells. Connections, 24(3):43–52, 2002.

Kelly Kruger, Paula J Grabowski, Arthur J Zaug, Julie Sands, Daniel E Gottschling, and Thomas R Cech. Self-splicing rna: autoexcision and autocyclization of the ribosomal rna intervening sequence of tetrahymena. cell, 31(1):147–157, 1982.

J. Kyte and R. F. Doolittle. A simple method for displaying the hydropathic character of a protein. J Mol Biol, 157(1):105–132, May 1982.

Lars Kiemer, Jannick Dyrløv Bendtsen, and Nikolaj Blom. NetAcet: prediction of N-terminal acetylation sites. Bioinformatics, 21(7):1269–1270, April 2005. ISSN 1367-4803. doi: 10.1093/bioinformatics/bti130. URL http://dx.doi.org/10.1093/bioinformatics/bti130.

Lie-Quan Lee, Jeff Varner, and Kwok Ko. Parallel extreme pathway computation for metabolic networks. In Computational Systems Bioinformatics Conference, 2004. CSB 2004. Proceedings. 2004 IEEE, pages 636–639. IEEE, 2004.

Tong Ihn Lee, Nicola J Rinaldi, François Robert, Duncan T Odom, Ziv Bar-Joseph, Georg K Gerber, Nancy M Hannett, Christopher T Harbison, Craig M Thompson, Itamar Simon, et al. Transcriptional regulatory networks in saccharomyces cerevisiae. science, 298(5594):799–804, 2002.

Jure Leskovec and Christos Faloutsos. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 631–636. ACM, 2006. URL http://dl.acm.org/citation.cfm?id=1150479.

Mike Levine. Transcriptional enhancers in animal development and evolution. Current biology, 20(17):R754–R763, 2010.

Gipsi Lima-Mendez and Jacques van Helden. The powerful law of the power law and other myths in network biology. Molecular BioSystems, 5(12):1482–1493, 2009.

Glen Liszczak. The molecular basis for amino-terminal acetylation by NAT proteins. PhD thesis, University of Pennsylvania, 2013.

Ying Liu and Yuanlie Lin. A novel method for n-terminal acetylation prediction. Genomics Proteomics Bioinformatics, 2(4):253–255, Nov 2004.

Harvey F Lodish, Arnold Berk, S Lawrence Zipursky, Paul Matsudaira, David Baltimore, James Darnell, et al. Molecular cell biology, volume 4. WH Freeman New York, 2000.

Rosemary MacDonald. Red cell 2, 3-diphosphoglycerate and oxygen affinity. Anaesthesia, 32(6):544–553, 1977.

James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA., 1967.

Matthias Mann and Ole N Jensen. Proteomic analysis of post-translational modifications. Nature biotechnology, 21(3):255–261, 2003.

Aude Martinez, José A. Traverso, Benoit Valot, Myriam Ferro, Christelle Espagne, Geneviève Ephritikhine, Michel Zivy, Carmela Giglione, and Thierry Meinnel. Extent of n-terminal modifications in cytosolic proteins from eukaryotes. Proteomics, 8(14):2809–2831, Jul 2008. doi: 10.1002/pmic.200701191. URL http://dx.doi.org/10.1002/pmic.200701191.

Quinn McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157, 1947.

Ahmed M Mehdi, Muhammad Shoaib B Sehgal, Bostjan Kobe, Timothy L Bailey, and Mikael Bodén. Dlocalmotif: a discriminative approach for discovering local motifs in protein sequences. Bioinformatics, 29(1):39–46, 2013.

T Meinnel, Y Mechulam, and S Blanquet. Methionine as translation start signal: a review of the enzymes of the pathway in escherichia coli. Biochimie, 75(12):1061–1075, 1993.

Thierry Meinnel, Philippe Peynot, and Carmela Giglione. Processed N-termini of mature proteins in higher eukaryotes and their major contribution to dynamic proteomics. Biochimie, 87(8):701–712, Aug 2005. doi: 10.1016/j.biochi.2005.03.011. URL http://dx.doi.org/10.1016/j.biochi.2005.03.011.

Marija Milacic, Robin Haw, Karen Rothfels, Guanming Wu, David Croft, Henning Hermjakob, Peter D’Eustachio, and Lincoln Stein. Annotating cancer variants and anti-cancer therapeutics in reactome. Cancers (Basel), 4(4):1180–1211, 2012. doi: 10.3390/cancers4041180. URL http://dx.doi.org/10.3390/cancers4041180.

Shigeki Mitaku, Takatsugu Hirokawa, and Toshiyuki Tsuji. Amphiphilicity index of polar amino acids as an aid in the characterization of amino acid preference at membrane–water interfaces. Bioinformatics, 18(4):608–616, 2002.

Theodore S Motzkin, Howard Raiffa, Gerald L Thompson, and Robert M Thrall. The double description method. 1953.

Kevin P Murphy. Machine learning: a probabilistic perspective. MIT Press, 2012.

Tokumasa Nakamoto. Evolution and the universality of the mechanism of initiation of protein synthesis. Gene, 432(1):1–6, 2009.

Kozo Narita. Isolation of acetylpeptide from enzymic digests of tmv-protein. Biochimica et biophysica acta, 28:184–191, 1958.

S. Needleman and C. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, March 1970. ISSN 00222836. doi: 10.1016/0022-2836(70)90057-4. URL http://dx.doi.org/10.1016/0022-2836(70)90057-4.

Mark Newman. Networks: an introduction. Oxford University Press, 2010.

Mark EJ Newman. Fast algorithm for detecting community structure in networks. Physical review E, 69(6):066133, 2004.

Mark EJ Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006.

Richard A Notebaart, Bas Teusink, Roland J Siezen, and Balázs Papp. Co-regulation of metabolic genes is better explained by flux coupling than by network distance. PLoS computational biology, 4(1):e26, 2008.

J Nuno, I Sánchez-Valdenebro, C Pérez-Iratxeta, E Meléndez-Hevia, and F Montero. Network organization of cell metabolism: monosaccharide interconversion. Biochem. J, 324:103–111, 1997. URL http://www.biochemj.org/bj/324/bj3240103.htm.

Jason A Papin, Nathan D Price, Jeremy S Edwards, and Bernhard Ø Palsson. The genome-scale metabolic extreme pathway structure in haemophilus influenzae shows significant network redundancy. Journal of Theoretical Biology, 215(1):67–82, 2002.

Jason A Papin, Jennifer L Reed, and Bernhard O Palsson. Hierarchical thinking in network biology: the unbiased modularization of biochemical networks. Trends in biochemical sciences, 29(12):641–647, 2004a. URL http://www.sciencedirect.com/science/article/pii/S0968000404002610.

Jason A Papin, Joerg Stelling, Nathan D Price, Steffen Klamt, Stefan Schuster, and Bernhard O Palsson. Comparison of network-based pathway analysis methods. Trends in biotechnology, 22(8):400–405, 2004b.

Laura Vanda Papp, Jun Lu, Arne Holmgren, and Kum Kum Khanna. From selenium to selenoproteins: synthesis, identity, and their role in human health. Antioxidants & redox signaling, 9(7):775–806, 2007.

Jose B Pereira-Leal, Anton J Enright, and Christos A Ouzounis. Detection of functional modules from protein interaction networks. PROTEINS: Structure, Function, and Bioinformatics, 54(1):49–57, 2004. URL http://onlinelibrary.wiley.com/doi/10.1002/prot.10505/full.

Arie Perliger and Ami Pedahzur. Social network analysis in the study of terrorism and political violence. PS: Political Science & Politics, 44(01):45–50, 2011. URL http://journals.cambridge.org/abstract_S1049096510001848.

Michel Perrot, Francis Sagliocco, Thierry Mini, Christelle Monribot, Ulrich Schneider, Andrej Shevchenko, Mathias Mann, Paul Jenö, and Hélian Boucherie. Two-dimensional gel protein database of saccharomyces cerevisiae (update 1999). Electrophoresis, 20(11):2280–2298, 1999.

A. Pestana and H. C. Pitot. Acetylation of nascent polypeptide chains on rat liver polyribosomes in vivo and in vitro. Biochemistry, 14(7):1404–1412, Apr 1975.

B. Polevoda, J. Norbeck, H. Takakura, A. Blomberg, and F. Sherman. Identification and specificities of N-terminal acetyltransferases from saccharomyces cerevisiae. EMBO J, 18(21):6155–6168, Nov 1999. doi: 10.1093/emboj/18.21.6155. URL http://dx.doi.org/10.1093/emboj/18.21.6155.

Bogdan Polevoda and Fred Sherman. N-terminal acetyltransferases and sequence requirements for N-terminal acetylation of eukaryotic proteins. J Mol Biol, 325(4):595–622, Jan 2003.

Bogdan Polevoda, Steven Brown, Thomas S. Cardillo, Sean Rigby, and Fred Sherman. Yeast n(alpha)-terminal acetyltransferases are associated with ribosomes. J Cell Biochem, 103(2):492–508, Feb 2008. doi: 10.1002/jcb.21418. URL http://dx.doi.org/10.1002/jcb.21418.

Bogdan Polevoda, Thomas Arnesen, and Fred Sherman. A synopsis of eukaryotic nalpha-terminal acetyltransferases: nomenclature, subunits and substrates. BMC Proc, 3 Suppl 6:S2, 2009. doi: 10.1186/1753-6561-3-S6-S2. URL http://dx.doi.org/10.1186/1753-6561-3-S6-S2.

Alex Pothen, Horst D Simon, and Kang-Pu Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM journal on matrix analysis and applications, 11(3):430–452, 1990.

Nathan D Price, Jason A Papin, and Bernhard Ø Palsson. Determination of redundancy and systems properties of the metabolic network of helicobacter pylori using genome-scale extreme pathway analysis. Genome research, 12(5):760–769, 2002. URL http://genome.cshlp.org/content/12/5/760.short.

Nathan D Price, Jennifer L Reed, Jason A Papin, Sharon J Wiback, and Bernhard O Palsson. Network-based analysis of metabolic regulation in the human red blood cell. Journal of Theoretical Biology, 225(2):185–194, 2003.

Nathan D Price, Jan Schellenberger, and Bernhard O Palsson. Uniform sampling of steady-state flux spaces: means to design experiments and to interpret enzymopathies. Biophysical journal, 87(4):2172–2186, 2004.

Yanjun Qi, Fernanda Balem, Christos Faloutsos, Judith Klein-Seetharaman, and Ziv Bar-Joseph. Protein complex identification by supervised graph local clustering. Bioinformatics, 24(13):i250–i268, 2008. URL http://bioinformatics.oxfordjournals.org/content/24/13/i250.short.

Ning Qian and Terrence J Sejnowski. Predicting the secondary structure of globular proteins using neural network models. Journal of molecular biology, 202(4):865–884, 1988.

J. R. Quinlan. Induction of decision trees. Mach. Learn., 1(1):81–106, March 1986. ISSN 0885-6125. doi: 10.1023/A:1022643204877. URL http://dx.doi.org/10.1023/A:1022643204877.

J. Ross Quinlan. Simplifying decision trees. International journal of man-machine studies, 27(3):221–234, 1987.

J. Ross Quinlan. C4.5: Programs for Machine Learning (Morgan Kaufmann Series in Machine Learning). Morgan Kaufmann, 1st edition, October 1992. ISBN 9781558602380. URL http://www.worldcat.org/isbn/1558602380.

Predrag Radivojac, Vladimir Vacic, Chad Haynes, Ross R. Cocklin, Amrita Mohan, Joshua W. Heyen, Mark G. Goebl, and Lilia M. Iakoucheva. Identification, analysis, and prediction of protein ubiquitination sites. Proteins, 78(2):365–380, Feb 2010. doi: 10.1002/prot.22555. URL http://dx.doi.org/10.1002/prot.22555.

Anna Radzicka and Richard Wolfenden. A proficient enzyme. Science, 267(5194):90–93, 1995.

Erzsébet Ravasz, Anna Lisa Somera, Dale A Mongru, Zoltán N Oltvai, and A-L Barabási. Hierarchical organization of modularity in metabolic networks. science, 297(5586):1551–1555, 2002.

Emma Redhead and Timothy L Bailey. Discriminative motif discovery in dna and protein sequences using the deme algorithm. BMC bioinformatics, 8(1):385, 2007.

Steve Ressler. Social network analysis as an approach to combat terrorism: past, present, and future research. Homeland Security Affairs, 2(2):1–10, 2006.

Stefan Roepcke, Degui Zhi, Martin Vingron, and Peter F Arndt. Identification of highly specific localized sequence motifs in human ribosomal protein gene promoters. Gene, 365:48–56, 2006.

David J Rogers and Taffee T Tanimoto. A computer program for classifying plants. Science, 132(3434):1115–1118, 1960.

Michael Rother and Joseph A Krzycki. Selenocysteine, pyrrolysine, and the unique energy metabolism of methanogenic archaea. Archaea, 2010, 2010.

Stuart J. Russell and Peter Norvig. Artificial Intelligence - A Modern Approach. Pearson Education, 3rd edition, 2010. ISBN 978-0-13-207148-2.

Gerard Salton. Automatic text processing: The transformation, analysis, and retrieval of information by computer. Reading: Addison-Wesley, 1989.

Manoj Pratim Samanta and Shoudan Liang. Predicting protein functions from redundancies in large-scale protein interaction networks. Proceedings of the National Academy of Sciences, 100(22):12579–12583, 2003. URL http://www.pnas.org/content/100/22/12579.short.

Jan Schellenberger, Junyoung O Park, Tom M Conrad, and Bernhard Ø Palsson. Bigg: a biochemical genetic and genomic knowledgebase of large scale metabolic reconstructions. BMC bioinformatics, 11(1):213, 2010.

Christophe H Schilling, David Letscher, and Bernhard Ø Palsson. Theory for the systemic definition of metabolic pathways and their use in interpreting metabolic function from a pathway-oriented perspective. Journal of theoretical biology, 203(3):229–248, 2000.

Ida Schomburg, Antje Chang, Christian Ebeling, Marion Gremse, Christian Heldt, Gregor Huhn, and Dietmar Schomburg. Brenda, the enzyme database: updates and major new developments. Nucleic acids research, 32(suppl 1):D431–D433, 2004.

Ida Schomburg, Antje Chang, Sandra Placzek, Carola Söhngen, Michael Rother, Maren Lang, Cornelia Munaretto, Susanne Ulas, Michael Stelzer, Andreas Grote, et al. Brenda in 2013: integrated reactions, kinetic data, enzyme function data, improved disease classification: new options and contents in brenda. Nucleic acids research, page gks1049, 2012.

S Schuster, C Hilgetag, and R Schuster. Determining elementary modes of functioning in biochemical reaction networks at steady state. In Biomedical and Life Physics, pages 101–114. Springer, 1996.

Stefan Schuster, Claus Hilgetag, John H Woods, and David A Fell. Reaction routes in biochemical reaction systems: algebraic properties, validated calculation procedure and example from nucleotide metabolism. Journal of mathematical biology, 45(2):153–181, 2002a. URL http://link.springer.com/article/10.1007/s002850200143.

Stefan Schuster, Thomas Pfeiffer, Ferdinand Moldenhauer, Ina Koch, and Thomas Dandekar. Exploring the pathway structure of metabolism: decomposition into subnetworks and application to mycoplasma pneumoniae. Bioinformatics, 18(2):351–361, 2002b. URL http://bioinformatics.oxfordjournals.org/content/18/2/351.short.

Daniel Schwartz, Michael F. Chou, and George M. Church. Predicting protein post-translational modifications using meta-analysis of proteome scale data sets. Mol Cell Proteomics, 8(2):365–379, Feb 2009. doi: 10.1074/mcp.M800332-MCP200. URL http://dx.doi.org/10.1074/mcp.M800332-MCP200.

Roded Sharan, Igor Ulitsky, and Ron Shamir. Network-based prediction of protein function. Molecular systems biology, 3(1), 2007.

Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):888–905, 2000.

Christian J A. Sigrist, Edouard de Castro, Lorenzo Cerutti, Béatrice A. Cuche, Nicolas Hulo, Alan Bridge, Lydie Bougueleret, and Ioannis Xenarios. New and continuing developments at prosite. Nucleic Acids Res, 41(Database issue):D344–D347, Jan 2013. doi: 10.1093/nar/gks1067. URL http://dx.doi.org/10.1093/nar/gks1067.

Malcolm K Sparrow. The application of network analysis to criminal intelligence: An assessment of the prospects. Social networks, 13(3):251–274, 1991. URL http://www.sciencedirect.com/science/article/pii/037887339190008H.

Kristian K. Starheim, Thomas Arnesen, Darina Gromyko, Anita Ryningen, Jan Erik Varhaug, and Johan R. Lillehaug. Identification of the human n(alpha)-acetyltransferase complex b (hnatb): a complex important for cell-cycle progression. Biochem J, 415(2):325–331, Oct 2008. doi: 10.1042/BJ20080658. URL http://dx.doi.org/10.1042/BJ20080658.

Kristian K. Starheim, Darina Gromyko, Rune Evjenth, Anita Ryningen, Jan Erik Varhaug, Johan R. Lillehaug, and Thomas Arnesen. Knockdown of human n alpha-terminal acetyltransferase complex c leads to p53-dependent apoptosis and aberrant human arl8b localization. Mol Cell Biol, 29(13):3569–3581, Jul 2009. doi: 10.1128/MCB.01909-08. URL http://dx.doi.org/10.1128/MCB.01909-08.

Stephen M Stigler. Francis galton’s account of the invention of correlation. Statistical Science, pages 73–79, 1989.

G. D. Stormo. Computer methods for analyzing sequence recognition of nucleic acids. Annu Rev Biophys Biophys Chem, 17:241–263, 1988. doi: 10.1146/annurev.bb.17.060188.001325. URL http://dx.doi.org/10.1146/annurev.bb.17.060188.001325.

Steven H Strogatz. Exploring complex networks. Nature, 410(6825):268–276, 2001.

GJAM Strous, AJM Berns, and H Bloemendal. N-terminal acetylation of the nascent chains of α-crystallin. Biochemical and biophysical research communications, 58(3):876–884, 1974.

Age Tats, Maido Remm, and Tanel Tenson. Highly expressed proteins have an increased frequency of alanine in the second amino acid position. BMC genomics, 7(1):28, 2006.

Sarah A Teichmann and M Madan Babu. Gene regulatory network growth by duplication. Nature genetics, 36(5):492–496, 2004.

Marco Terzer. Large scale methods to enumerate extreme rays and elementary modes. PhD thesis, Diss., Eidgenössische Technische Hochschule ETH Zürich, Nr. 18538, 2009.

Marco Terzer and Jörg Stelling. Accelerating the computation of elementary modes using pattern trees. In Algorithms in Bioinformatics, pages 333–343. Springer, 2006. URL http://link.springer.com/chapter/10.1007/11851561_31.

Marco Terzer and Jörg Stelling. Large-scale computation of elementary flux modes with bit pattern trees. Bioinformatics, 24(19):2229–2235, 2008. URL http://bioinformatics.oxfordjournals.org/content/24/19/2229.short.

Marco Terzer, Nathaniel D Maynard, Markus W Covert, and Jörg Stelling. Genome-scale metabolic networks. Wiley Interdisciplinary Reviews: Systems Biology and Medicine, 1(3):285–297, 2009.

Gert Thijs, Kathleen Marchal, Magali Lescot, Stephane Rombauts, Bart De Moor, Pierre Rouzé, and . A gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes. Journal of Computational Biology, 9(2):447–464, 2002.

Tina L Tootle and Ilaria Rebay. Post-translational modifications influence transcription factor activity: A view from the ets superfamily. Bioessays, 27(3):285–298, 2005.

Robert Urbanczik and C Wagner. Functional stoichiometric analysis of metabolic networks. Bioinformatics, 21(22):4176–4180, 2005a. URL http://bioinformatics.oxfordjournals.org/content/21/22/4176.short.

Robert Urbanczik and Carl Wagner. An improved algorithm for stoichiometric network analysis: theory and applications. Bioinformatics, 21(7):1203–1210, 2005b. URL http://bioinformatics.oxfordjournals.org/content/21/7/1203.short.

Petra Van Damme, Kristine Hole, Ana Pimenta-Marques, Kenny Helsens, Joël Vandekerckhove, Rui G Martinho, Kris Gevaert, and Thomas Arnesen. Natf contributes to an evolutionary shift in protein n-terminal acetylation and is important for normal chromosome segregation. PLoS genetics, 7(7):e1002169, 2011.

Christian van Delden, Rachel Comte, and Marc Bally. Stringent response activates quorum sensing and modulates cell density-dependent gene expression in pseudomonas aeruginosa. Journal of bacteriology, 183(18):5376–5384, 2001.

Richard van Wijk and Wouter W van Solinge. The energy-less red blood cell is lost: erythrocyte enzyme abnormalities of glycolysis. Blood, 106(13):4034–4042, 2005.

Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.

C Wagner. Nullspace approach to determine the elementary modes of chemical reaction systems. The Journal of Physical Chemistry B, 108(7):2425–2431, 2004. URL http://pubs.acs.org/doi/abs/10.1021/jp034523f.

Clemens Wagner and Robert Urbanczik. The geometry of the flux cone of a metabolic network. Biophysical journal, 89(6):3837–3845, 2005.

David L Wallace. Comment. Journal of the American Statistical Association, 78(383):569–576, 1983. URL http://amstat.tandfonline.com/doi/pdf/10.1080/01621459.1983.10478009.

Christopher Walsh. Posttranslational Modification Of Proteins: Expanding Nature’s Inventory. Roberts and Company Publishers, 2006. ISBN 9780974707730.

MMC Wamelink, EA Struys, and C Jakobs. The biochemistry, metabolism and inherited defects of the pentose phosphate pathway: a review. Journal of inherited metabolic disease, 31(6):703–717, 2008.

Joe H Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American statistical association, 58(301):236–244, 1963.

Sharon J Wiback and Bernhard O Palsson. Extreme pathway analysis of human red blood cell metabolism. Biophysical Journal, 83(2):808–818, 2002.

Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics bulletin, pages 80–83, 1945.

David F Williamson, Robert A Parker, and Juliette S Kendrick. The box plot: a simple visual method to interpret data. Annals of internal medicine, 110(11):916–921, 1989.

T Wood. Physiological functions of the pentose phosphate pathway. Cell biochemistry and function, 4(4):241–247, 1986.

Xindong Wu, Vipin Kumar, J Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J McLachlan, Angus Ng, Bing Liu, S Yu Philip, et al. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–37, 2008.

Yanping Xi, Yi-Ping Phoebe Chen, Chen Qian, and Fei Wang. Comparative study of computational methods to detect the correlated reaction sets in biochemical networks. Briefings in bioinformatics, 12(2):132–150, 2011. URL http://bib.oxfordjournals.org/content/12/2/132.short.

Qing Xiao, Feiran Zhang, Benjamin A. Nacev, Jun O. Liu, and Dehua Pei. Protein n-terminal processing: substrate specificity of escherichia coli and human methionine aminopeptidases. Biochemistry, 49(26):5588–5599, Jul 2010. doi: 10.1021/bi1005464. URL http://dx.doi.org/10.1021/bi1005464.

Jennifer Xu and Hsinchun Chen. Untangling criminal networks: A case study. In Intelligence and Security Informatics, pages 232–248. Springer, 2003. URL http://link.springer.com/chapter/10.1007/3-540-44853-5_18.

Rui Yan, Paul C Boutros, and Igor Jurisica. A tree-based approach for motif discovery and sequence classification. Bioinformatics, 27(15):2054–2061, 2011.

Matthew Yeung, Ines Thiele, and Bernard Ø Palsson. Estimation of the number of extreme pathways for metabolic networks. BMC bioinformatics, 8(1):363, 2007.

Ian T Young. Proof without prejudice: use of the kolmogorov-smirnov test for the analysis of histograms from flow systems and other sources. Journal of Histochemistry & Cytochemistry, 25(7):935–941, 1977.

Haiyuan Yu, Alberto Paccanaro, Valery Trifonov, and Mark Gerstein. Predicting interactions in protein networks by completing defective cliques. Bioinformatics, 22(7):823–829, 2006. URL http://bioinformatics.oxfordjournals.org/content/22/7/823.short.

Charles R Zerez, Mitchell D Wong, Neil A Lachant, and Kouichi R Tanaka. Impaired erythrocyte phosphoribosylpyrophosphate formation in hemolytic anemia due to pyruvate kinase deficiency. Blood, 72(2):500–506, 1988.

Ning Zhang, Bi-Qing Li, Shan Gao, Ji-Shou Ruan, and Yu-Dong Cai. Computational prediction and analysis of protein γ-carboxylation sites based on a random forest method. Mol Biosyst, 8(11):2946–2955, Nov 2012. doi: 10.1039/c2mb25185j. URL http://dx.doi.org/10.1039/c2mb25185j.

Yan Zhang, Pavel V Baranov, John F Atkins, and Vadim N Gladyshev. Pyrrolysine and selenocysteine use dissimilar decoding strategies. Journal of Biological Chemistry, 2005.

CURRICULUM VITAE

personal information

Name: Christophe CHARPILLOZ

Address: Ch. des Tulipiers, 18 1208 Genève Switzerland

Phones: +41 76 615 51 10 (mobile), +41 22 379 02 03 (work)

E-mails: [email protected] (home), [email protected] (work)

Nationality: Swiss

Date of birth: April 12th, 1980

Languages: French (mother tongue, L1), English (C1), German (A2)

education

2015 PhD in Computer science, University of Geneva, Geneva, CH

2007 MSc in Computer science, University of Geneva, Geneva, CH

2000 High School Diploma, Collège de Candolle, Geneva, CH

employment

07.15 – Now Scientific programmer, Federal office of meteorology and climatology, Zurich, CH

09.14 – 05.15 Research and development software engineer, Equinoxe MIS Development, Lausanne, CH

10.09 – 10.13 Teaching assistant, University of Geneva, Geneva, CH

04.08 – 04.10 Specialisation internship in computational biology, University of Geneva, Geneva, CH

03.06 – 05.06 Bioinformatician (fixed-term contract), Serono Pharmaceutical Research Institute, Geneva, CH

07.05 – 10.05 Bioinformatician (internship), Serono Pharmaceutical Research Institute, Geneva, CH

teaching experience

Autumn 12,13 and spring 12 Introduction to computer science for biologists

Spring 12 Parallel algorithms for computer scientists

Spring 11,12 Modeling and simulation of natural phenomena, for physicists, biologists and computer scientists

Spring 11 Probabilistic algorithms for computer scientists

Autumn 09,12 Parallel computing for computer scientists

Spring 09 Introduction to software and network components for computer scientists

technical skills

Scala, Java (SE), Matlab, R, Python, Fortran, C++, Ruby, SQL, PL/SQL.

scientific contribution

Christophe Charpilloz, Anne-Lise Veuthey, Bastien Chopard and Jean-Luc Falcone. Motifs tree: a new method for predicting post-translational modifications. Bioinformatics 2014 30: 1974-1982.

software contribution

An online service based on the motif trees that allows users to submit their sequences to predict initiator methionine cleavage and Nα-terminal acetylation. The service has been developed with Jean-Luc Falcone and is available at http://terminus.unige.ch/.

award

Ardity Award in 2007 for the best master's thesis in computer science.
