Supertree Methods for Phylogenomics Celine Scornavacca

To cite this version: Celine Scornavacca. Supertree methods for phylogenomics. Bioinformatics [q-bio.QM]. Université Montpellier II - Sciences et Techniques du Languedoc, 2009. English. tel-00842893

HAL Id: tel-00842893
https://tel.archives-ouvertes.fr/tel-00842893
Submitted on 9 Jul 2013

UNIVERSITY OF MONTPELLIER II
DOCTORAL SCHOOL I2S
INFORMATION STRUCTURES SYSTEMES

PHD THESIS
to obtain the title of PhD of Science of the University of Montpellier II
Specialty: Computer Science

Defended by Celine Scornavacca

Supertree methods for phylogenomics
Méthodes de superarbres pour la phylogénomique

Defended on December 8th, 2009

Jury:
Olivier Gascuel, Directeur de recherche, CNRS (Advisor)
Vincent Berry, Professeur, UM2 (Co-advisor)
Vincent Ranwez, Maître de conférences, UM2 (Co-advisor)
Marie-France Sagot, Directeur de recherche, INRIA (Referee)
Daniel Huson, Professeur, Tübingen University (Referee)
Simonetta Gribaldo, Chargé de recherche, Institut Pasteur (Examiner)

Acknowledgments

Everyone says that the acknowledgments are the last thing to write. So it is for me, but since my body is falling to pieces after enduring all these months of hard work, I hope I do not forget anyone.

First of all, I would like to thank Marie-France Sagot, Daniel Huson and Simonetta Gribaldo for agreeing to be members of my thesis committee.
I would like to thank Olivier Gascuel for finding the time to read my thesis, although his timetable is full until 2020, and for improving it with his helpful comments.

What can I say about VB & VR, i.e., Vincent Berry and Vincent Ranwez? It is difficult for me to imagine two better co-advisors. They accompanied me into the world of research, always encouraging me, inspiring me and leaving me free to make my own choices. They are two excellent scientists and two nice and funny guys. A huge thank-you to my two-headed ogre (a reference for Warcraft fans only).

I wish to thank all the MAB team at the LIRMM for tolerating the volume of my voice in the hallways and for the interesting discussions.

A sincere thank-you to the members of the PPP team at the ISEM, especially Emmanuel Douzery and Frédéric Delsuc, for showing me that evolution is not only about reconstructing supertrees.

I would like to thank the members of the LBBE team at the University of Lyon I for welcoming me to France and for making the effort to understand the very bad French that I spoke at that time. A particular thank-you to Marie-France Sagot and Eric Tannier, who believed in me and gave me the opportunity to start working in bioinformatics, to Vincent Daubin for his precious suggestions and to Sophie Abby and Simon Penel for their friendship.

Thanks also to all those people who have made my life in Montpellier pleasant. It is impossible to list them all, but a particular thank-you goes to Rasta-Amine, for bringing fun and love into my life, to Georgia Tsagkogeorga, for being a fantastic flatmate and a really good friend, my moral support during this thesis, and to Andrew Rodrigues, a true friend and a faithful climbing partner. I think everyone around me should thank him for letting me take it out on climbing walls rather than on people during the last part of the thesis.

Thank you to my friends who are far away, in Italy and all over the world.
The Italian saying “Lontano dagli occhi, lontano dal cuore” (“far from the eyes, far from the heart”) does not work for me, as they are still in my heart, especially Dada, Giacomone, Circy and Claudia.

A particular thank-you to Fabio Pardi for his help with the language of Shakespeare (yes, some Italians can speak good English!), to Juan Escobar and Samuel Blanquart for answering all my silly questions about evolution and to Julien Dutheil, who helped me a lot with the Bio++ libraries.

Thanks also to Professor Gianpaolo Oriolo for introducing me to Eric Tannier and for giving me the opportunity to do my Master's internship at the University of Lyon I.

Finally, I want to thank my family for their support. They taught me to love science and that hard work always pays. I really love you.

This thesis is dedicated to the memory of Roberta Dal Passo, an inspiring mathematician and a true and frank person.

Contents

Introduction
1 Inferring phylogenies
  1.1 From Aristotle to Darwin: an introduction
  1.2 Different types of biological data
  1.3 Parsimony methods
  1.4 Models of sequence evolution
    1.4.1 Nucleotide models
    1.4.2 Protein models
    1.4.3 Codon models
  1.5 Distance-based methods
    1.5.1 Estimation of evolutionary distances
    1.5.2 Least-squares methods
    1.5.3 Minimum-evolution methods
  1.6 Likelihood methods
  1.7 Bayesian methods
  1.8 Testing the reliability of inferred phylogenies
2 Deeper insights on multiple dataset analysis
  2.1 Model inadequacy
    2.1.1 Compositional bias
    2.1.2 Heterotachy
    2.1.3 Rapidly evolving lineages
  2.2 Macro events
    2.2.1 Gene duplications and losses
    2.2.2 Horizontal gene transfers (HGT)
    2.2.3 Incomplete lineage sorting
    2.2.4 Interspecific recombination
    2.2.5 Interspecific hybridization
  2.3 Combining data
    2.3.1 Combining data through a supermatrix approach
    2.3.2 Combining data through a supertree approach
    2.3.3 The eternal dilemma: supermatrix or supertree?
3 Methods for combining trees
  3.1 Basic concepts
    3.1.1 Splits and clusters
    3.1.2 Quartets and triplets
    3.1.3 Interpretations of polytomies
  3.2 Consensus methods for phylogenetic trees
    3.2.1 Consensus methods defined for both rooted and unrooted forests
    3.2.2 Consensus methods defined only for rooted forests
  3.3 Supertree methods
    3.3.1 The OneTree supertree method and its variants
    3.3.2 Matrix Representation-based methods
    3.3.3 Median supertrees
    3.3.4 Other approaches to the supertree problem
  3.4 Which method to choose?
4 Supertree methods from new principles
  4.1 The PI and PC properties
  4.2 PhySIC
    4.2.1 The PhySIC_PC algorithm
    4.2.2 The PhySIC_PI algorithm
    4.2.3 The PhySIC algorithm
  4.3 PhySIC_IST
    4.3.1 The CIC criterion
    4.3.2 The PhySIC_IST algorithm
    4.3.3 Rooting the source trees
    4.3.4 The PhySIC_IST validation
  4.4 Combining supermatrix and supertree in Triticeae
    4.4.1 Triticeae: a problematic group
    4.4.2 Materials and Methods
    4.4.3 Results
    4.4.4 Discussion
  4.5 Conclusions
5 Methods to include multi-labeled phylogenies in a supertree framework
  5.1 Basic concepts and preliminary results
    5.1.1 Basic concepts
    5.1.2 Identifying observed duplication nodes in linear time
  5.2 Methods
    5.2.1 Isomorphic subtrees
    5.2.2 Auto-coherency of a MUL tree
    5.2.3 Computing a largest duplication-free subtree of a MUL tree
    5.2.4 Compatibility of single-labeled subtrees obtained from MUL trees
  5.3 Experiments
    5.3.1 Enlarging the amount of gene families to be used for species tree building
    5.3.2 Running times
    5.3.3 Improvement in supertree inference
  5.4 Conclusions
6 Conclusions and further research
7 Résumé en français
A Appendix to Chapter 4
  A.1 Outline of main PhySIC subroutines
  A.2 Outline of main PhySIC_IST subroutines
  A.3 Supplementary materials of Section 4.4
Bibliography

Introduction

I arrived in France three and a half years ago. At the time, my work was focused on algorithms for Wi-Fi LAN. Yes, it was interesting, but I needed to work on something warmer than computers. This is
