Analysis of Early Evolutionary Events During the Transition from Prokaryotes to Eukaryotes
Total Page:16
File Type:pdf, Size:1020Kb
Analysis of early evolutionary events during the transition from prokaryotes to eukaryotes INAUGURAL DISSERTATION For the attainment of the title of Doctor rerum naturalium (Dr. rer. nat.) in the Faculty of Mathematics and Natural Sciences at the Heinrich Heine University Düsseldorf presented by Julia Brückner, M.Sc. from Oldenburg (Oldb), Germany Düsseldorf, March 2021 From the Institute of Molecular Evolution at the Heinrich Heine University Düsseldorf Published by permission of the Faculty of Mathematics and Natural Sciences at the Heinrich Heine University Düsseldorf Supervisor: Prof Dr. William F. Martin Co-supervisor: Prof. Dr. Martin J. Lercher Date of oral examination: June 17th, 2021 Statement of declaration I hereby certify that this dissertation is the result of my own work. I confirm that no other person’s work has been used without due acknowledgement. This work was done wholly while in the candidature for a research degree at the Heinrich Heine University Düsseldorf and has not been submitted in the same or similar form to other institutions. I have not previously failed a doctoral examination procedure. Düsseldorf, August 10th, 2021 Julia Brückner An meine Familie und Freunde. Danke, dass ihr mich durchgehend unterstützt und mir die wichtigen Dinge im Leben deutlich macht. Acknowledgements I never thought I would come this far in research. I didn’t even think of studying after school but only wanted to do my apprenticeship, get a good job as a biological laboratory assistant and that’s it. But as we all know: ‘Life happens to you while you’re busy making other plans’ (John Lennon). Almost all my experiences in my studies at the Heinrich Heine University were lined with lucky events. One of them being the attendance of the genome analysis Master course by Dr. Mayo Röttger from Prof. Dr. William F. Martin’s institute. I was lucky enough to get one of the last waiting list places of this course and found that I liked bioinformatics and the area of studying early evolution. And lucky as I was, there were enough free spaces in the institute at the time and I was able to do my Master thesis, which led to the PhD. Therefore, I want to especially thank Prof. Martin for giving me the opportunity to work in his group and always encouraging me and helping me out. Without him I would probably have quit early on. Additionally, I want to thank Prof. Lercher for his role as co-supervisor. Mayo was responsible for me finding joy in bioinformatics and without Nils and Michael in our office, I would not have felt as welcomed as I did. The banter and scientific discussions and especially the help with programming that I got from those two (mostly Michael, Nils was the emotional support) helped me out a lot in continuing to work hard. Thank you, Verena and Renate, for helping with questions regarding the bureaucratic side of the PhD process and being there for any and all important matters. I also want to thank all other members of the team and students that I got to know during my four years at the institute, for creating a welcoming and sociable work environment. Last but most importantly, I want to thank my mother, my brother and my best friends. You always support me, sometimes question me and my decisions (so I can make better ones), and most of all are always there if I need to talk. I feel special gratitude towards you for always believing in me and for believing that I could do it. Guess you were right. I really can do it after all. And I will never forget. Publications submitted in the course of this thesis: 1. Brueckner J, Martin WF. 2020. Bacterial genes outnumber archaeal genes in eukaryotic genomes. Genome Biol Evol; 12: 282–292. https://doi.org/10.1093/gbe/evaa047 2. Nagies FSP*, Brueckner J*, Tria FDK, Martin WF. 2020. A spectrum of verticality across genes. PLoS Genet; 16: e1009200. https://doi.org/10.1371/journal.pgen.1009200 3. Tria FDK*, Brueckner J*, Skejo J, Xavier JC, Kapust N, Knopp M, Wimmer JLE, Nagies FSP, Zimorski V, Gould SB, Garg SG, Martin WF. 2021. Gene duplications trace mitochondria to the onset of eukaryote complexity. Genome Biol Evol; 13: evab055. https://doi.org/10.1093/gbe/evab055 4. Xavier JC*, Gerhards R*, Wimmer JLE, Brueckner J, Tria FDK, Martin WF. 2021. The metabolic network of the last bacterial common ancestor. Commun Biol; 4: 413. https://doi.org/10.1038/s42003-021-01918-4 * These authors contributed equally to the publication. Abstract There are two forms of living cells on earth — prokaryotes and eukaryotes. Prokaryotic cells are very simple organisms while eukaryotes present more complex cells that can organize into multicellular forms. The most widely accepted theory of eukaryote evolution is the endosymbiotic theory, depicting eukaryotes as descendants of archaea through the acquisition of a bacterial endosymbiont into its archaeal host. There has been more than one endosymbiotic event during eukaryote evolution with two major occurrences. The first gave rise to the mitochondrion from an archaeon incorporating a proteobacterium, generating the first eukaryote; the second resulted in the origin of the plant kingdom as a cyanobacterium was enveloped by an existing eukaryotic host. However, the mechanisms of how these events occurred are still mostly unknown and are the basis of many debates to date. Phylogenomic analyses have become the standard tool to investigate these early evolutionary events. Many studies focus on only a few genomes to represent a diverse spectrum of organisms from the three domains of life, some only examine a small number of genes. To obtain a more comprehensive view on how eukaryotes arose, the investigation of a broad spectrum of genomic data from a varied sample of lineages is desirable. Moreover, genome sequencing has become faster and simpler each year, enabling analyses containing a substantial number of organisms and genes. However, this is accompanied by extensively increased computational demands which can become a limiting factor that has to be addressed. This thesis reports the analysis of all protein coding genes from 5,655 prokaryotic and 150 eukaryotic genomes, the construction of their corresponding protein families, their alignments and phylogenetic trees in order to analyze evolutionary events at the border of prokaryotic and eukaryotic life. I Zusammenfassung Es gibt zwei unterschiedliche Formen von Leben auf der Erde — Prokaryoten und Eukaryoten. Prokaryotische Zellen sind sehr einfache Organismen während Eukaryoten komplexe Zellen darstellen, die sich in multizelluläre Formen organisieren können. Die am weitesten verbreitete Hypothese der Evolution von Eukaryoten ist die Endosymbiontentheorie: die Aufnahme eines bakteriellen Endosymbionten in seinen archaeellen Wirt. Im Laufe der Evolution von Eukaryoten gab es mehrere endosymbiontische Ereignisse — als erstes entstand das Mitochondrion durch Aufnahme eines Proteobakteriums in ein Archaeon und später ist das Pflanzenreich durch den Einschluss eines Cyanobakteriums in einen bestehenden Eukaryoten entstanden. Allerdings sind die Mechanismen, welche sich hinter diesen Vorgängen befinden, immer noch unzureichend untersucht und die Basis fortlaufender Debatten. Mittlerweile sind phylogenomische Analysen eine der Standardmethoden, um diese frühen Evolutionsvorgänge zu erforschen. Viele Untersuchungen verwenden nur wenige Genome, um die drei Domänen des Lebens zu repräsentieren, oder fokussieren sich auf eine geringe Anzahl an Genen. Um jedoch einen umfassenderen Überblick von der Entwicklung der Eukaryoten zu bekommen, sollte ein breites Spektrum von Sequenzdaten aus Genomen aller bekannten taxonomischen Gruppen herangezogen werden. Da die Sequenzierung von Genomen jedes Jahr einfacher und schneller wird, können nun auch Untersuchungen mit einer umfangreichen Anzahl an Organismen und Genen durchgeführt werden. Allerdings ist damit eine gesteigerte Nutzung von Rechenkapazitäten verbunden, welches schnell ein limitierender Faktor wird, der beachtet werden muss. Diese Arbeit stellt die Analyse aller proteinkodierenden Gene aus 5,655 prokaryotischen und 150 eukaryotischen Genomen dar. Die daraus rekonstruierten Proteinfamilien und phylogenetische Stammbäume wurden zur Untersuchung von evolutionären Ereignissen an der Grenze von prokaryotischem und eukaryotischem Leben verwendet. II Table of contents Abstract ..................................................................................................................................... I Zusammenfassung ................................................................................................................ II 1 Introduction ........................................................................................................................ 1 1.1 Protein sequences, gene sequences and genome sequences for phylogenetics ....... 4 1.2 Large-scale protein family reconstruction ................................................................................ 6 1.2.1 The importance of detecting orthologous proteins ................................................... 6 1.2.2 Influence of stringency on protein family properties ............................................ 10 1.2.3 Clustering of large protein sequence datasets .......................................................... 12 1.2.4 Applications for protein clusters ..................................................................................... 16 1.2.4.1 Available databases of protein clusters .............................................................. 16 1.2.4.2 Eukaryote-prokaryote