<<

Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1176

Inferring Ancestry

Mitochondrial Origins and Other Deep Branches in the Tree of

DING HE

ACTA UNIVERSITATIS UPSALIENSIS ISSN 1651-6214 ISBN 978-91-554-9031-7 UPPSALA urn:nbn:se:uu:diva-231670 2014 Dissertation presented at Uppsala University to be publicly examined in Fries salen, Evolutionsbiologiskt centrum, Norbyvägen 18, 752 36, Uppsala, Friday, 24 October 2014 at 10:30 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Professor Andrew Roger (Dalhousie University).

Abstract He, D. 2014. Inferring Ancestry. Mitochondrial Origins and Other Deep Branches in the Eukaryote . Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1176. 48 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-9031-7.

There are ~12 supergroups of complex-celled (), but relationships among them (including the ) remain elusive. For Paper I, I developed a dataset of 37 eukaryotic of bacterial origin (euBac), representing the conservative core of the proto- . This gives a relatively short distance between ingroup (eukaryotes) and outgroup (mitochondrial progenitor), which is important for accurate rooting. The resulting phylogeny reconstructs three eukaryote megagroups and places one, Discoba (), as sister group to the other two (neozoa). This rejects the reigning “Unikont-” root and highlights the evolutionary importance of Excavata. For Paper II, I developed a 150- dataset to relationships in supergroup SAR (Stramenopila, Alveolata, ). Analyses of all 150- give different trees with different methods, but also reveal artifactual signal due to extremely long rhizarian branches and illegitimate sequences due to (HGT) or contamination. Removing these artifacts leads to strong consistent support for Rhizaria+Alveolata. This breaks up the core of the chromalveolate hypothesis (Stramenopila+Alveolata), adding support to theories of multiple secondary endosymbiosis of . For Paper III, I studied the of cox15, which encodes the essential mitochondrial protein Heme A synthase (HAS). HAS is nuclear encoded (nc-cox15) in all aerobic eukaryotes except godoyi (Jakobida, Excavata), which encodes it in mitochondrial DNA (mtDNA) (mt-cox15). Thus the gene was postulated to represent the ancestral gene, which gave rise to nc-cox15 by endosymbiotic gene transfer. However, our phylogenetic and structure analyses demonstrate an independent origin of mt-cox15, providing the first strong evidence of to mtDNA HGT. Rickettsiales or SAR11 often appear as sister group to modern mitochondria. However these bacteria and mitochondria also have independently evolved AT-rich . For Paper IV, I assembled a dataset of 55 mitochondrial proteins of clear α-proteobacterial origin (including 30 euBacs). Phylogenies from these data support mitochondria+Rickettsiales but disagree on the placement of SAR11. Reducing amino-acid compositional heterogeneity (resulting from AT-bias) stabilizes SAR11 but moves mitochondria to the base of α-. Signal heterogeneity supporting other alternative hypotheses is also detected using real and simulated data. This suggests a complex scenario for the origin of mitochondria.

Keywords: Molecular Phylogeny; ; Mitochondria; Eukaryote tree of life

Ding He, Department of Organismal , Systematic Biology, Norbyv. 18 D, Uppsala University, SE-75236 Uppsala, Sweden.

© Ding He 2014

ISSN 1651-6214 ISBN 978-91-554-9031-7 urn:nbn:se:uu:diva-231670 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-231670)

“If I have seen further, it is by standing on the shoulders of giants.”

- Isaac Newton, 1676

For Family & Friends

List of Papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I He, D.*, Fiz-Palacios, O.*, Fu, C.-J., Fehling, J., Tsai, C.-C., & Bal- dauf, S. L. (2014) An Alternative Root for the Eukaryote Tree of Life. Current Biology, 24(4):465–470.

II He, D., Sierra R., Pawlowski J., & Baldauf, S. L. Reducing long- branch effects in a multi-protein data set identifies a new Rhiza- ria+Alveolata . [Manuscript]

III He, D., Fu, C.-J. & Baldauf, S. L. Horizontal transfer of a heme A synthase gene from bactera to jakobid mitochondrial DNA. Submit- ted.

IV He, D., Whelan S., & Baldauf, S. L. Mining the phylogenetic signal for inferring the origin of mitochondria. [Manuscript]

The reprint of Paper I was made with permission from Elsevier Limited (license number: 3464701032242).

Ding He made primary contribution to the experimental design, data assem- bly and analyses in all four papers. He jointly conceived the ideas, wrote the first drafts and made extensive editing for the final versions of Paper II, III and IV. He contributed minor editing in Paper I.

*Equal contribution.

The cover mosaic image “Tree of Life” was built with ’s illustrations (, 1899-1904) using AndreaMosaic www.andreaplanet.com/ andreamosaic/ based on the drawing by Hwei-yen Chen.

Contents

Introduction ...... 9 Molecular ...... 10 Phylogenomics – Uniting Phylogeny and ...... 10 Part I: Eukaryote Tree of Life ...... 12 Overview of current eukaryote supergroups ...... 12 The root of the eukaryote tree of life ...... 19 Phylogeny of supergroup SAR (Stramenopila, Alveolata and Rhizaria) ...... 20 Part II: Mitochondrial Evolution ...... 22 The endosymbiotic theory ...... 22 Movement of the genetic material ...... 22 The bacterial origin(s) of mitochondria ...... 23 Challenges for Molecular Phylogeny ...... 24 Quality control for phylogenomic data set ...... 24 Long-branch attraction ...... 25 Compositional bias ...... 25 Objectives ...... 27 Materials and Methods ...... 28 Results ...... 30 Summary of Paper I ...... 30 Universal eukaryotic proteins of bacterial origin (euBac) ...... 31 A Rooted Phylogeny of Eukaryotes ...... 32 Analyzing the contradictory data ...... 32 The of the proto-mitochondrial ...... 32 Evolutionary implications of the “neozoan-excavate” root ...... 33 Summary of Paper II ...... 34 The phylogenetic position of Rhizaria ...... 34 Summary of Paper III ...... 35 The jakobid mtDNA cox15 gene does not reflect a primitive state. .... 35 Summary of Paper IV ...... 36 A phylogenomic data set of 55 universal mitochondrial proteins ...... 36 Mitochondria-Rickettsiales signal and compositional heterogeneity .. 36 Testing phylogenetic signal heterogeneity ...... 37 Svensk Sammanfattning ...... 38 Acknowledgments ...... 40

Abbreviations

DNA Deoxyribonucleic acid RNA Ribonucleic acid SSU-rDNA Small subunit ribosomal deoxyribonucleic acid rRNA Ribosomal ribonucleic acid ATP DHFR Dihidrofolate reductase TS Thymidylate ToL Tree of life eToL Eukaryote tree of life RGC Rare genomic change MtMt Mitochondrial encoded mitochondrial NcMt Nuclear genome encoded mitochondrial euBac Eukaryotic proteins of bacterial origin LMCA Last mitochondrial common ancestor MROs Mitochondrial-related FECA First eukaryote common ancestor LECA Last eukaryote common ancestor LMCA Last mitochondrial common ancestor EGT Endosymbiotic gene transfer HGT Horizontal gene transfer RPS-BLAST Reversed position specific basic local alignment search tool BLAST Basic local alignment search tool BLASTp Basic local alignment search tool for protein mlBP Maximum likelihood bootstrap proportion biPP Bayesian inference posterior probabilities EST Expressed sequence tag

Introduction

“Nothing in biology makes sense except in the light of evolution”

- Dobzhansky, T., 1973

“Nothing in evolution makes sense except in the light of phylogeny”

- Society of Systematic , 2001

Figure 1. The first diagram of an evolu- tionary tree from 's First Notebook on Transmutation of (1837). Interpretation of handwriting: "I think the case must be that one genera- tion should have as many living as now. To do this and to have as many species in the same (as is) requires extinc- tion. Thus between A + B the immense gap of relation. C + B the finest grada- tion. B + D rather greater distinction. Thus genera would be formed. - …”

9 The science of phylogenetics concerns the study of evolutionary relation- ships between organisms. In other words, a phylogeny (or evolutionary tree) describes a hypothesis of the evolutionary history between groups of organ- isms (Figure 1). It is It is also an important tool in the study of – the science of biological classification that was pioneered by the Swedish botanist and naturalist Carolus Linnaeus and later integrated with Darwin- ism, providing a comparative framework to study organismal biology in both present and past. The development of was a major advance for modern phylogenetic practice (molecular phylogenetics) allowing it to leap into vir- tually every level of biological diversity. Since then phylogenetic analysis has no longer been restricted to the analysis of small amounts of morpholog- ical characteristics for studying relationships primarily within macroscopic organisms (mostly , fungi and ). Molecular sequences, i.e. DNA or protein, have enabled tens of thousands of characters to be applied to phylogenetic reconstruction. With every new generation of molecular technologies, the exponential increase in accumulated data con- tinues to revolutionize our understanding of the Tree of Life (ToL).

Phylogenomics – Uniting Phylogeny and Genomics The last decade was a period of major DNA sequencing revolution. The technologies progressed rapidly and have not only made “genomic scale” sequencing practical but also cost-effective. Compared to “first-generation” Sanger sequencing, “next-generation” high-throughput sequencing can pro- duce more than 1000-fold more DNA bases in the same run- but also with less than one-thousandths of the cost per million bases (Liu et al. 2012). This has opened up a whole new spectrum of genomic research and has ena- bled exponential growth of genome sequencing projects. This has resulted in at least 50913 complete or on-going genome sequencing projects across all 3 domains of life (Bacteria, and Eukaryotes) as of August 2014 (Ge- nome OnLine Database v.5). One of the most important implications is that we now have an unprece- dented wealth of molecular data, not just a handful of genes but whole ge- nomes, to infer and improve organismal phylogenies. This is particularly true for eukaryote phylogeny, for which, unlike archaeal and especially bac- terial phylogeny, we have a good chance of recovering species relationships. These data also include genomes from the vast diversity of uncultivated mi- cro-eukaryotes found in “environmental DNA” (total DNA extracted from environmental samples such as or water) (Moreira and Lopez-Garcia 2002). Genomics-enabled phylogenetic approaches – phylogenomics - offer

10 us a vastly improved platform to investigate the eukaryote tree of life (eToL) both in terms of breadth and depth (Burki 2014). The two most frequently used phylogenomic approaches are the “super- alignment” (or sometimes called supermatrix) and “” methods. The superalignment method involves the concatenation of multiple single gene alignments into a single multi-gene alignment. This approach requires that all concatenated genes have the same underlying evolutionary history, ie. all genes trace the same genealogical history. If this requirement is satisfied, then can be analyzed using the standard phylogenetic methods. Alternative- ly, one can perform separate phylogenetic reconstruction for each gene and then combine these single gene trees into a “supertree”. One of the main disadvantages of the supertree approach is that it cannot solve any phyloge- netic problem that is not solved by at least one of the individual genes. How- ever, no single gene alone has enough information to solve deep branches in the tree of life. Therefore I have used the supermatrix approach throughout my thesis research. In this thesis, I describe my work utilizing molecular phylogenetics to study two fundamental aspects of eukaryotic evolution: (I) the deep phylog- eny of eukaryotes, i.e. the relationships between major groups of eukaryotes, as well as the root of the eukaryote tree of life, and (II) the origin and evolu- tion of one of the defining eukaryotic organelles, the mitochondrion.

11 Part I: Eukaryote Tree of Life A eukaryotic is defined by the presence of at least one distinct mem- brane-bound , the nucleus, which contains the bulk of the cells ge- netic material (i.e, the chromosomal DNA). This is in contrast to nearly all known prokaryotic cells, which lack such a structure for storing their ge- nomic material. Eukaryotic cells are also far more complex than by having other cellular structures such as a and organelles like mitochondria and chloroplasts. Eukaryotic organisms that we see everyday, e.g. and fungi and plants, are the sources of much of our fundamental knowledge in biology. It is undeniable that studying them will hold a wealth of critical information for a better understanding of biological mechanism. Moreover, reconstructing their “family history” or ancestry, represented as a bifurcating tree (Figure 1), is crucial for developing a timeline in order to fully comprehend the remarkable inventions that ultimately led to the world as we now know it. Although modern biology is still to a large extent dominating by the study of “higher” (i.e. multi-cellular) organisms, it is now clear that the microbial eukaryotes (the -sized and relatively “simple” single-celled eu- karyotes, the so-called ) representing a huge “hidden” diversity may be in fact the phylogenetic majority (Figure 2). Molecular phylogenies have identified many novel lineages that appear to be only distantly related to the known taxa or even branch outside of the recognized eukaryote major groups, suggesting that the diversity of these is probably far greater than we thought. Nevertheless, there are few relatively stable phylogenetic assemblages representing the majority of eukaryotes across entire eToL (Baldauf 2008). These major divisions of eToL are usually called super- groups.

Overview of current eukaryote supergroups The assemblages of eukaryote supergroups as well as their relationships are based on the knowledge combining molecular, morphological and cytologi- cal evidence (Adl et al. 2012). Molecular phylogeny now can assign most of eukaryotes to one of 3 supragroups – Diahporetickes, and Ex- cavata. These contain roughly 5 supergroups - SAR, , Opis- thokonta, and Discoba, which can be further divided into 12 well-supported major groups (Figure 2). Some of the relationships between them (the deep phylogeny) are strongly supported, such as Fungi (or more accurately ) as the sister group of . Others are less cer- tain because of a scarcity of data leading to apparently contradicting evi- dence, such as the relationship between Haptophyta, SAR and Archaeplas- tida (Figure 2).

12 Rhizaria

Alveolata SAR

Stramenopila

5 Haptophyta Cryptophyta

Katablepharids Diaphoretickes Centrohelida Chloroplastida Archaeplastida 4 1 Rhodophyta Fungi 3 Opisthokonta Holozoa 2 Amoebozoa Amorphea 6 Discoba Jakobida Metamonada Excavata

Figure 2. A consensus phylogeny of the major divisions (major-, super- and mega-groups) of eukaryotes. The tree topology and branch lengths (drawn to scale as indicated by the bottom left scale bar) are adapted from the eukaryote phylogenies in Papers I and II. The 12 well- established major groups are collapsed and represented as colored triangles with designated group names. Some of these 12 have been assigned to higher-order supergroups indicated to their immediate right with wavy brackets, including SAR, Archaeplastida, Opisthokonta, Amoebozoa and Discoba (Adl et al. 2012). All these taxa are subsumed into 3 mega-groups Diaphoretickes, Amorphea and Excavata. Dash lines indicate ambiguous relationships (Adl et al. 2012). The red-circled number on the branch indicates the new eukaryote root identified in Paper I. The black-circled numbers on the branches indicate previously proposed alternative eukaryote root positions, which are discuss in the text.

13 Opisthokonta Supergroup Opisthokonta includes Fungi (now Holomycota, including Fon- ticula and Nucleariidae) and Holozoa (Metazoa, , Filasteria). This supergroup has a few relatively stable synapomorphies including a ~12- insertion in a protein synthesis 1α (EF-1α (Steenkamp et al. 2006)), mainly , posterior and flat mitochondrial cristae (Cavalier-Smith and Chao 2003). Major group Fungi, once considered a prior to molecular phylogeny, contains a large diversity of organisms including the familiar budding widely used for alcohol production and bread baking, and the -bearing fruiting body mushrooms. The only potential synapomorphic character that unites fungi is the use of a chitin-rich during the polarized move- ment of the . The earliest branching fungi appear to be a putative monophyletic and diverse Cryptomycota + (possibly with Aphelidea) clade possessing unique reproductive with flagella. (Jones et al. 2011; James et al. 2013; Karpov et al. 2013). It is suggested that the last common ancestor of fungi was a parasitic single-celled , perhaps living in an aquatic environment. This later evolved into multicellu- larity with filamentous growth such as hyphae and rhizoids capable of at- taching to substrates and acquiring/exchanging nutrition from its environ- ment or from hosts (James et al. 2006; James et al. 2013). A number of major lineages of early branching fungi have been discov- ered in the last few , including the , Nuclear- iidae+, and the (Lara et al. 2010). The Glomeromycota appear to be exclusively endomycorhizzal, developing intimate mutualistic interactions with the of ~90% of land plants. These “haustorial com- plexes” facilitate nutrient exchange and are critical to plant fitness and pos- sibly the key to plants’ successful colonization of land (James et al. 2006; Stajich et al. 2009). The most recent revision of Opisthokonta phylogeny was a strongly supported fungal sister group, Fonticula + Nucleariidae, branching deeply in the Holoza-Fungi split (Brown et al. 2009). Both Nucle- ariids and Fonticula are amoebae, the latter being capable of aggregative multicellularity cellular slime Fonticula alba within Opisthokonta that may provide interesting aspects to the evolution of fungal multicellularity. The Rosidae are the most recently recognized new deep of Fungi, but these organisms are largely known from environmental samples (ref). Major group Holozoa is probably one of the most extensive studied groups of organisms, consisting largely of Metazoa (or Animalia) including . However, the deepest branch(es) of metazoa are Choanozoa. Meta- zoa can be divided into Diploblastica (animals with radial symmetry) and (animals with bilateral symmetry). Bilateria represents the primary metazoan diversity identified so far. This definite monophyletic group is further divided into two monophyletic groups Protostomia and Deuterosto-

14 mia with distinct early developmental differences (Erwin and Davidson 2002). Diploblastica is the group of early branching animals including (Porifera), (), comb jellies () and the simplest “flat-animal” () adhaerens (Srivastava et al. 2008). However, the exact branching order within this group and the nature of the last metazoan common ancestor remains elusive (Paps et al. 2010).

Amoebozoa Amoebae are almost exclusively unicellular organisms with a crawling-like locomotion and irregular cell shapes. The amoeboid lifestyle is scattered across the eToL tand herefore it is not a clade-specific character. However the morphology and movement of vary widely and various types seem to define specific groups. As with many groups of micro- bial eukaryotes, the identification and classification of amoebozoan amoebae is difficult due to limited morphological/cytological traits and insufficient molecular data (Bapteste et al. 2002; Cavalier-Smith 2002). They tend to have lobose or tube-like pseudopods and tubular mitochondrial cristae (Adl et al. 2012). This supergroup contains mostly heterotrophic free-living naked amoe- bae such as Hartmanella and one group of testate , the . There is also a majr division of mitochondria-lacking (amitochondriate) taxa including the parasite Entaamoeba. Probably the best studied amoe- bozoans are the social amoebae which undergoes aggregative multicellularity, the giant amoebae of the (e.g. ), which form often complex macroscopic fruiting bodies from a single undif- ferentiated . The of Amoebozoa is frequently recov- ered (Parfrey et al. 2006; Lahr et al. 2011; Brown et al. 2013). However, contrary to the generally well-resolved phylogeny of their close kin Opistho- konta, phylogeny within Amoebozoa is largely uncertain due to a paucity of molecular data (Glöckner and Noegel 2013). The most recent phylogenetic analysis based on SSU-rDNA and genes failed to support most of the higher-level classification, e.g. the two major divisions Lobosea and Cono- sea (Lahr et al. 2011). Two mysterious taxa Apusozoa (aerobic , e.g. Thecamonas) and Breviatea (anaerobic or microaerophilic amoebae, e.g. ) have recently been identified as the closest sister group to Opisthokonta, and these two lineages, together with Amoebozoa form a mega-group Amorphea (Fig- ure 2) (Adl et al. 2012; Brown et al. 2013; Cavalier-Smith 2013; He et al. 2014).

Archaeplastida An ancestral primary cyanobacterial endosymbiosis gave rise of this alnost- exclusively photosynthetic supergroup, formerly known as Plantae. All taxa carry type a (with few secondarily lost exceptions),

15 along with cell walls and mitochondrial with flat cristae (Adl et al. 2012). Archaeplastida includes three major divisions Rhodophyta (red al- gae), Glaucophyta (cyanelle-containing ) and Chloroplastida ( and land plants). Although extensive phylogenetic analyses support a monophyletic Archaeplastida (Reyes-Prieto et al. 2007), some recent phylo- genomic data sets of nuclear-encoded genes with novel deep-branching taxa such as Picobiliphytes (Burki et al. 2012) and Palpitomonas (Yabuki et al. 2014) have added renewed uncertainty. One of the most important evolutionary aspects about Archaeplastida is their primary role in the evolution of plastids. The single cyanobacterial en- dosymbiosis event that gave rise to plastids took place about 1.5 million years ago (Yoon et al. 2004). This then led to multiple secondary endosym- biotic events, where a primary--containing eukaryote was engulfed by a second, non-photosynthetic eukaryote, which then converted this endo- symbiont into its own plastid by transferring nuclear-encoded plastid genes from the ’s nucleus to its own. Several instances are also known of tertiary endosymbiosis, whereby a secondary-plastid-containing endosymbiont was taken up and converted to an endosymbiont by a new eukaryotic host (which may or may not itself have possessed a plastid). As a result, endosymbiosis is frequently found in many non-archaeplastid protists, and the exact evolutionary paths of these events are still a of consid- erable bated (Archibald 2009).

SAR SAR (Stramenopila, Alveolata and Rhizaria) is the most recently recognized supergroup primarily because of the rapidly increased molecular data from rhizarians in the last few years. These data have also revealed a tremendous diversity within the major group Rhizaria (Adl et al. 2012) The monophyly of SAR is strictly supported by molecular phylogeny, and there are no known or even proposed morphological synapomorphies for the group (Burki 2014). share the characteristic of cortical alveoli and their monophyly is strongly supported (Harper et al. 2005). They include three also well sup- ported monophyletic subgroups: apicomplexans (e.g. Plasmodium, the ma- laria parasite), (including numerous beneficial as well as harmful producers) and (e.g. , an important with unique nuclear dimorphism) (Bachvaroff et al.). Re- cently, several lineages, called colponemids, were found retaining ancestral mitochondrial genome structure and content, and indicating more early- diverging and diverse alveolates yet to be identified (Janouškovec et al. 2013). (or ) are very diverse ranging from multicellu- lar to unicellular to non-photosynthetic microbes and par- asites. The motile cell features biflagella with a hairless poste-

16 rior and an anterior flagellum covered with unique hollow tri- partite hairs (stramenopiles). Molecular phylogenies support a monophyletic group , which includes most (if not all?) of the photosynthetic stramenopiles. These are all aquatic algae and range from the microbial and vastly abundant diatoms, which produce beautiful bilaterally symmetrical silica , to the giant multicellular that are used as economical important food. Meanwhile the non-photosynthetic stramenopiles appear to be paraphyletic, including the fungal-like (so-called water ) such as infestans, which include the most aggressive plant (Riisberg et al. 2009). Rhizarians include naked and testate amoeboid with filose or reticulose pseudopodia. The most famous rhizarian lineages are radiolarians, producing radiate-form that were beautifully illustrated by German naturalist Ernst Haeckel. Their skeletons are also important micro- for geological and paleontological studies from the period and onwards. Monophyly of this major group is strongly supported by mo- lecular phylogenies (Burki and Keeling 2014), but the relationships among the members within the group, including , , Plasmo- diophora and are still under debate (Sierra et al. 2012). Numerous taxa of uncertain phylogenetic position are nevertheless appear to be closely associated with SAR and/or Archaeplastida. These include cryptophytes, , , and rappemonads, which have together been proposed to form an additional supergroup, (Okamoto et al. 2009). Together all these taxa - Archaeoplastida, SAR and Hacrobia - have recently been designated as megagroup Diaphoretickes (Adl et al. 2012).

Discoba (Excavata) The “excavate” organisms normally exhibit distinct suspension-feeding groove and share a particular set of flagellar apparati (Simpson 2003). All taxa identified so far are unicellular, and their life style can be free-living, symbiotic or parasitic. Support for this group being monophyly remains problematic (Hampl et al. 2009). Some of these lineages were previously shown to form some of the deepest branches in the first eToLs based on ri- bosomal RNA (rDNA) sequences (Sogin et al. 1989). Based on these data, Cavalier-Smith named these taxa “archezoans”, and proposed that they di- verged before the advent of mitochondria (Cavalier-Smith and Chao 1996). However the hypothesis was later disproved with the demonstra- tion of mitochondrial-like organelles and genes in most “archaeazoa” (Embley and Martin 2006). Furthermore, the discovery that one branch of “archaezoa”, the microsporidia, belong in fungi, was striking evidence of how much the problem of could distort molecular trees (Stiller and Hall 1999). This megagroup (as Excavata) also features extraordinary mitochondrial evolution, and can be divided into two assem-

17 blages – Discoba and the amitochondriate excavates (AME or Metamonada) – based on their mitochondrial characteristics. However, it is important to note that the monophyly of Excavata has never been tested in a rooted multi- gene tree, and given the eukaryote root as re-defined in Paper I of this thesis it is possible that Excavata could be paraphyletic (He et al. 2014). Discoba (excavates with aerobic mitochondria) consists of Discicristata (Heterolobosea + ) and Jakobida, while Heterolobosea includes mostly free-living amoeba with eruptive pseudopodia, such as and Acrasis. Naegleria contains mostly free-living species, but can also causes fatal in human such as N. fowleri. Most of these taxa can switch between and amoeboid forms (Fritz-Laylin et al. 2010; Herman et al. 2013). Acrasis, which appears to lack flagella, was originally thought to be a fungal taxon as it is spore-producing. Later it was considered as the sister group to because its also undergoes aggregative social behavior and produces tiny tree-like fruiting bodies consisting of multiple cells (Olive 1970). However some protozologists noted that Acrasis had heterolobosean-type pseudopodia (Olive 1960), and molecular phylogeny now clearly places Acrasis in Heterolobosea together with Naegleria (Brown, Silberman, et al. 2012). Euglenozoa includes both free-living (e.g. ) and parasitic (e.g. ) flagellates. Euglena has long been studied as model organism and is capable of both phototrophy, due to a secondary green chloroplasts as well as (with the engulfed green algae) and heterotrophy. Trypanosoma is a member of the larger group Kinetoplastid, which contains many parasites, including human pathogens caus- ing sleeping sickness and . Jakobida currently only consists of a handful of described species. However, they possess remarkable gene-rich mitochondrial genomes that are hypothesized to be the result of retention of ancestral characters from the last mitochondrial common ancestor (LMCA) (Lang et al. 1997; Burger et al. 2013) (but see Paper III of this thesis). Metamonada (excavates with anaerobic mitochondrial-related organelles, or MROs) consists of four major divisions, the Fornicata, Parabasalia and Preaxostyla. All members of Fornicata are anaerobic or microaerophilic and are either symbiotic (e.g. hypermastigids, the dominant eukaryote in the gut) or parasitic (e.g. an intestinal parasite in ). All described species so far have no respiratory mitochondrion. Instead some of them have reduced forms of “genomeless” MROs (e.g. and ) that show clear evidence of being mitochondrial relics (Shiflett and Johnson 2010). To potentially important but currently unclassified eukaryotes are Mala- wimonas and Collodictyon. Both are free-living mitochondriate amoeboflag- ellates with excavate-like characteristics (Simpson 2003). However, recent phylogenomic data sets demonstrated their deep-branching positions (some- grouping together as sister taxa) without clear affiliation to any super- groups (Zhao et al. 2012; Brown et al. 2013; Yabuki et al. 2014). Their un-

18 certain, yet possibly deep phylogenetic positions make them intriguing tar- gets for understanding the nature of the last eukaryote common ancestor (LECA).

The root of the eukaryote tree of life The immense eToL diversity is astounding and will certainly keep growing for some time. Having access to full genomic data for every living organism on is possible, should enable us to develop a much more sophisticate understanding of the molecular eukaryote phylogeny (Grant and Katz 2014). One of, if not the most, critical “point” in any is the posi- tion of the root (Huelsenbeck et al. 2002). For eukaryote phylogeny, this single critical point corresponds to the last shared ancestor of eukaryotes (LECA) and can be used to infer the common heritage of all extant species. The root also orientates the evolutionary paths within the tree and helps to define the fundamental relationships among species (Baldauf 2003; Roger and Simpson 2009; Burki 2014). Specifically, identifying relationships among the major divisions of eukaryotic (supergroups, majorgroups, mega- groups, Figure 2), and even relationships within the groups, will depend on where the root lies. Essentially, a “group” is no longer a group if the root is placed within it. However, rooting the eToL, meaning to infer the first major diversifica- tion of extant species since eukaryotes first arose (probably 1~2 billion years ago (Knoll et al. 2006)), is no easy task by nature. Without a time machine, we must rely primarily, possibly even exclusively, on finding shared charac- teristics among extant species to infer their common ancestry. Under a phy- logenetic framework, the primary method to root a tree is to use an “out- group” as a reference that is known to be equally related all of the “ingroup” taxa. Given our current three- understanding of the universal tree of life, this outgroup usually come either from Archaea or Bacteria (Woese et al. 1990). The earliest molecular attempts to root eToL started from the use of con- servative (slowly evolving) information-processing genes that are universal- ly present in all three domains of life. Many of these strongly placed Eukar- yota as closer to Archaea than to Bacteria (Iwabe et al. 1989). Hence, the ubiquitous ribosomal RNA (rRNA) gene (Sogin 1991; Cavalier-Smith and Chao 1996) or elongation factor proteins (Baldauf et al. 1996; Kamaishi et al. 1996; Roger et al. 1999) were used to root the eToL with Archaea as the closest outgroup. However, these trees led to the “Archezoa hypothesis” (Cavalier-Smith 1989), which was eventually questioned and ultimately abandoned when it was shown to be affected, at least for some taxa, by the phylogenetic artifact long-branch attraction (LBA) (Stiller and Hall 1999; Philippe 2000; Roger and Hug 2006). Ever since, the mistrust of molecular phylogeny, particularly for the study of ancient divergences (deep branches),

19 never seems to have really disappeared (Brinkmann et al. 2005). Even with multigene data yielding vastly greater ancestral information for inferring the root, there has been continued distrust of deep rootings due to the usually large genetic distance between ingroup (eukaryotes) and outgroup (prokary- otes) (Arisue et al. 2005). Thus, most of the more recent attempts at rooting eToL have focused on macromolecular characters, which are assumed to be free of the impact of LBA artifacts (ref). The popular “unikont-bikont” dichotomy (root 2, Figure 2) was based on the presence/absence of a gene fusion between dihidrofolate reductase (DHFR) and thymidylate synthase (TS) (Stechmann and Cavalier- Smith 2002) as well as the of domains (Richards and Cavalier-Smith 2005). However, such gene fusion/ events that were presumed to be rare, irreversible and therefore highly reliable synapo- morphies have turned out to be more frequent and reversible than expected and therefore far from homoplasy-free (Leonard and Richards 2012). Not surprisingly, the debate about the eukaryote root continues. A recent study showing the taxonomic distribution of the (ER) – mitochondria encounter structure (ERMES) has been interpreted as sup- porting a root lying between Amorphea+Excavata and Diaphoratickes (root 5, Figure 2) (Wideman et al. 2013). An alternative approach using “rare ge- nomic change” (RGC) combined with statistical tests places the root be- tween either Archaeplastida or Opisthokonta and everything else (root 4 or 3, respectively, Figure 2) (Rogozin et al. 2009; Katz et al. 2012). In my re- search, I developed a new data set to test the eukaryote using molecular phy- logeny and a closer outgroup than Archaea (Paper I). These data show very strong support for a eukaryote root lying between Excavata and Neozoa (root 1, Figure 2), and statistical tests standing strongly against the possibility that this new root is an LBA artifact (He et al. 2014).

Phylogeny of supergroup SAR (Stramenopila, Alveolata and Rhizaria) The newly recognized supergroup SAR is supported by many recent phylo- genomic analyses, although there is still no consensus on relationships with- in the group, i.e., between Stramenopila, Alveolata and Rhizaria (Burki et al. 2007; Parfrey et al. 2010; Brown, Kolisko, et al. 2012; Burki et al. 2012; Sierra et al. 2012; Burki et al. 2013; Yabuki et al. 2014). Solving this tri- chotomy is complicated by intrinsic problems arising primarily with the Rhi- zaria data. First, cultivation of Rhizaria species has been largely unsuccessful. This poses a prime challenge to obtaining enough pure material for sequencing. Thus, most of the recently available diverse rhizarian sequence data have been derived from samples collected directly from the wild. This leads to

20 contamination issues coming from the symbionts, that are common and var- ied in many Rhizaria, organisms that get trapped in the sometimes complex rhizarian tests, and undigested food remaining within the rhizarian cells themselves (Sierra et al. 2012). A second problem I uncovered in my anal- yses is that many rhizarians mysteriously have high rates of molecular evolu- tion (Paper III). Some of these can be explained by, for example, the para- sitic life style of the rhizarian Mikrocytos mackini, which appears to be the fastest-evolving species to date (Burki et al. 2013). However, I found many long-branching rhizarians, including most of the examined free-living spe- cies (Paper III). Therefore solving the branching order in SAR has an in- creased risk of LBA artifact that could lead to artifactual grouping of unre- lated groups with long branches, or disrupt a real grouping by drawing Rhi- zaria towards the long branches found in some of the outgroups.

21 Part II: Mitochondrial Evolution The mitochondrion is a cellular organelle found in nearly all eukaryotic cells, and one that appears to have been well established in LECA. This or- ganelle acts as the “power generator” in most non-photosynthetic cells, pro- ducing nearly all of the cell’s adenosine triphosphate (ATP), the chemical carrier of life. ATP is essential for enabling nearly all cellular activi- ties such as , mobility and signaling. Since it is well accepted that all eukaryotes have, or once had, mitochondria, the evolution of mitochon- dria is a major part of eukaryotic cell evolution.

The endosymbiotic theory The concept of “” was originally proposed by the Russian Konstantin Mereschcowsky in 1910, in which he postulated that less complex organisms could form a more complex one through a symbiotic relationship. Later in 1967, revived this idea, and published the paper “On the Origin of Mitosing Cells”. This illustrated her endosymbi- otic theory to explain what she saw as the major discontinuities between eukaryotic and prokaryotic cells, i.e. the mitochondria, 9+2 bodies of the eukaryotic flagella, and chloroplasts. She proposed that all of these were derived from once free-living prokaryotes that gradually became parts of eukaryotic cells through endosymbiotic processes (Sagan 1967). Although theories of how various aspects of the eukaryotic cell arose are still a matter of debate (Embley and Martin 2006), molecular studies have provided solid evidence that the endosymbiotic origins of mitochondria and chloroplasts from proteobacteria and , respectively (Küntzel and Köchel 1981; Gray et al. 1989; Archibald 2011; Gray 2012). Although commensal relationships are common among prokaryotes, it appears that endosymbiosis is only found in eukaryotic cells. So far there is no direct evidence showing endosymbiosis between two prokaryotic cells that could be inferred for eu- karyogenesis.

Movement of the genetic material Endosymbiotic gene transfer (EGT) The bulk of genetic material transmission, at least in eukaryotes, is from parent to offspring – i.e. . However there are two im- portant exceptions. The process of transforming the eukaryotic endosymbi- onts into an organelle also involved transmission of DNA from the endo- symbiont (both mitochondria and chloroplasts) to the host nucleus. This transfer was presumably gradual, involving random release of DNA from the organelle, possible due to organellar degradation, uptake of the free DNA by the nucleus, integration of the sequences in nuclear DNA and acquisition of

22 expression and organelle targeting signals, finally followed by random loss of the nuclear or organelle copy of the gene. The reasons for this trend are debated, but it is likely that transfer of genes led to cellular and genetic con- solidation for both host and endosymbiont. Therefore, one of the critical criteria in the establishment of endosymbio- sis is the genomic reduction for the endosymbiont (McCutcheon and Moran 2012). Thus during the development of a stable endosymbiotic relationship, the endosymbiont reshapes its genome content by the combined means of gene loss, for no longer required functions, and gene transfer of genetic ma- terials to the nucleus. The transfer process, which is termed endosymbiotic gene transfer (EGT), relocates genes from the endosymbiont genome to the host nuclear genome. This further requires a transporting system to send the protein products of the transferred genes, which are now translated in the rather than in the organelles themselves, back into the endosym- biont (McCutcheon and Keeling 2014). Evidence of EGT can be found in the highly reduced nature of organelle genomes compared with those of free- living bacterium as well as molecular phylogenies of nuclear-encoded orga- nelle genes (Adams and Palmer 2003; Timmis et al. 2004). Moreover, EGT apparently is still an ongoing process and seems to be happening frequently, especially for chloroplasts (Baldauf and Palmer 1990; Huang et al. 2003; Stegemann et al. 2003; Stegemann and Bock 2006).

Horizontal gene transfer (HGT) A second, perhaps more intriguing type of non-vertical transmission of DNA is horizontal gene transfer (HGT), which is gene acquisition between two unrelated species (although EGT could be considered a specialized type of HGT). Convincing evidence began piling up as sequences began to accumulate to indicate that HGT is a major force shaping the evolu- tion of bacterial genomes (Doolittle 1999). Moreover, HGT can occur in any direction in a eukaryotic cell, between the nuclear genome and organelle genomes, or between mitochondria and in photosynthetic eukar- yotes (Stegemann et al. 2012). Hence, HGT (EGT) has been a major force in the origin and evolution of mitochondria and chloroplasts, as well as giving rise to new species and advancing genetic innovation within bacteria and archaea (Husnik et al. 2013; Rice et al. 2013; Fuentes et al. 2014).

The bacterial origin(s) of mitochondria Molecular evidence unambiguously supports a bacterial origin of mitochon- dria (Lang et al. 1997; Andersson et al. 1998). This based on the fact that all taxonomically widespread mitochondrial genes clearly trace to bacteria, and the fact that roughly half of the nuclear-encoded mitochondrial protein genes do as well (Andersson and Kurland 1998). Extensive phylogenetic studies further suggest a proteobacterial origin of mitochondria, most likely coming

23 from the alpha division (α-proteobacteria) (Gray 2012). This signal is espe- cially strong for genes in the mitochondrial DNA (mtDNA) itself. However, the genetic relics from the last mitochondrial common ancestor (LMCA) that are retained in modern mtDNA are scarce. That is there are very few univer- sal mitochondrially encoded mitochondrial (MtMt) genes (for example, most animals mtDNAs contain only 13 protein-coding genes). Furthermore only roughly half of the nuclear encoded mitochondrial (NcMt) genes show clear bacterial origin, the rest appearing to be eukaryote specific inventions to enable functions unique to an enslaved organelle (Gray 2014). While rough- ly 10-15% of these NcMt proteins show a strongly a-proteobacterial signal, the major do not trace to any specific bacterial group but rather to a taxo- nomically broad sampling of bacteria (Thiergart et al. 2012; He et al. 2014). Unraveling mitochondrial history is further complicated by the fact that the- se genomes tend to have high evolutionary rates (except for plant mtDNA that appears to evolve slower than nuclear DNA or chloroplast DNA), biased composition, and sometimes use an altered or edit their transcripts, as well as there being a large variation in mtDNA gene con- tent between eukaryotic groups. Thus, although the most consistent bacterial signal for mitochondrial origin comes from α-proteobacteria, determining which α-proteobacterial lineage is most closely related to LMCA is extreme- ly challenging.

Challenges for Molecular Phylogeny In this thesis, I study some of the deepest branches in eukaryote phylogeny including the origin of mitochondria using molecular phylogenetic ap- proaches. Molecular phylogeny has been described as molecular archaeolo- gy, essentially trying to construct ancient events based on random signatures of ancient genomes that survive in living organisms (Pääbo et al. 1989). Therefore it is not surprising that there are many potential problems associ- ated with phylogenetic inference, including molecular phylogeny (Philippe et al. 2011). I particularly address three specific phylogenetic challenges that are most relevant to resolving deep phylogenies.

Quality control for phylogenomic data set Inference of the organismal phylogeny requires authentic orthologous phy- logenetic marker, i.e. genes that are only transmitted vertically and therefore represent genuine events. Experience has taught that it is virtually impossible to confidently resolve deep eukaryote phylogeny with trees based on single genes. This is primarily because there is limited genetic infor- mation in single genes – possibly the largest universal protein is the large subunit of RNA polymerase, which is still less than 2000 amino acid in

24 length, and most universal proteins are more like 200-600 amino acids long. In addition, individual genes have individual quirks, such as rate acceleration or specialized function in restricted lineages, which biases their reconstruc- tion of organismal relationships. The phylogenomic approach is to combine multiple independent loci thereby resulting in an accumulation of weak sig- nal from individual single genes. This assumes that all the individual genes carry a weak signal, and that this weak signal supports the same deep phy- logeny, i.e. the single individual genes carry a single congruent signal. Thus, one of the main challenges for this approach is to vet the individual genes to determine that all copies in the data set are 1) genuinely ortholo- gous, and 2) that all genes carry congruent signal. Moreover, this all has to be done without having a strong phylogenetic signal in the individual genes. This means, among other things, that it is nearly impossible to develop a completely automated protocol for this propose. Therefore the work requires considerable manual inspection to minimize risks of having non-orthologous information, such as paralogy ( instead of speciation event) and HGT. Thus the problem of developing a large multi-gene data set is a bioinformatic as well as phylogenetic challenge.

Long-branch attraction LBA is a phylogenetic methodological phenomenon that was statistically described by Felsenstein in 1978. Before it was even observed in empirical data, he showed the potential for “positively misleading” signal in the avail- able phylogenetic methods, particularly parsimony. Thus, under certain con- ditions, as more data are accumulated the method converges on an incorrect solution, i.e. the method is inconsistent (Felsenstein 1978). LBA artifact is still surprisingly poorly understood, but appears to occur when the phyloge- netic methods misinterpret homoplasy as synapomorphy. LBA artifact in phylogenetic reconstruction is that two long branches (e.g. two fast evolving taxa) are grouped together regardless of their relatedness. In other words, fast evolving taxa may be grouped together due to similarities by chance arising from limited character states (four or 20 amino acids) rather than genuine . Thus, this problem could potentially become even more serious for the phylogenomic data set.

Compositional bias Another potential problem in deep phylogeny, especially in the case of mito- chondrial evolution, is composition bias. For practical reasons, most phylo- genetic methods implemented homogenous evolutionary models. These as- sume that the evolutionary processes and characteristics remain unchanged throughout the evolution. This assumption of course does not entirely reflect real evolutionary processes, leading to problems of model misspecification.

25 Efforts have been made to incorporate some heterogeneous processes in phylogenetic models and methods in order to better describe the evolution- ary mechanisms and therefore more accurately reconstruct evolutionary rela- tionships. For example, it is now possible to model both rate variation and variation in substitution processes across different sites in a multiple se- quence alignment. This has been an important advance that has led to sub- stantially improved accuracy of phylogenetic inference (Yang 1994; Lartillot and Philippe 2004). However, there are some actual biological phenomena that will also po- tentially lead to phylogenetic artifacts. One important model misspecifica- tion problem that specifically pertains to certain work in this thesis is the problem of compositional bias. Compositional bias naturally exists in many genomes, and many genomes, especially bacteria but also many eukaryotes that use very different subsets of the full genetic code for all genes, or certain subsets of genes. The difference in nucleotide AT/GC content can be quite dramatic, for example, ranging from 25-75 % in bacteria (Sueoka 1961). Extreme compositional bias is a serious problem leading to phylogenetic artifacts due to erroneously grouping together unrelated that share the same bias (Foster and Hickey 1999).

26 Objectives

This thesis addresses several fundamental questions in eukaryotic evolution using molecular phylogenetic approaches. All phylogenetic markers used in this thesis are protein sequences. These possess much larger character than nucleotide sequences (20 amino acids vs. four nucleotides). This makes proteins sequences more suitable for deep phylogeny. Specific aims are:

1. Identify and analyze proteins of close bacterial ancestry that are present in all major eukaryotic lineages. As most, if not all of the- se proteins are targeted to the mitochondrion, these eukaryotic proteins with bacterial origin (euBac) should represent the core of the proto-mitochondrial proteome (Paper I). 2. Reconstruct a rooted global phylogeny of eukaryotes with a closer outgroup than previously used in an attempt to avoid LBA artifact. The euBac proteins (aim 1) were used for this task based on the reasoning that they provide a close outgroup reference point. This is because mitochondrial endosymbiosis occurred after the first eukaryote common ancestor (FECA) but before LECA (Paper I). 3. Expand the robust euBac protein markers (aim 1) in order to test the phylogenetic position within supergroup SAR of Rhizaria, the least sequenced major group of eukaryotes (Paper II). 4. Investigate the origin of an unusual mitochondrial genome encod- ed gene, cox15, in version of the universal (nearly all nuclear- encoded) mitochondrial gene cox15. Both genes encode a Heme A synthase, which appears to be essential for mitochondrial function (Paper III). 5. Analyze phylogenetic signal in universal mitochondrial proteins for inferring the closest extant α-proteobacterial lineage of LMCA (Paper IV).

27 Materials and Methods

Quality control for phylogenomic data set Throughout this thesis, various vetting steps were used to ensure the quality of the phylogenomic data. Controls vary depending on the question of inter- ests; therefore every phylogenomic data set was tailored to address specific questions. Since the phylogenomic data are generally designed for studying organismal phylogeny, these controls primarily relied on single gene phy- logeny to eliminate illegitimate sequences that disrupt the orthologous rela- tionship, such as HGT (xenology), paralogy and contamination. Long branch sequences are often problematic therefore were also removed whenever pos- sible. I primarily used MUSCLE (Edgar 2004) or MAFFT (Katoh et al. 2002) for multiple sequence alignment, and Gblocks (Castresana 2000), trimAl (Capella-Gutiérrez et al. 2009) or SeqFire (Ajawatanawong et al. 2012) for extracting unambiguous alignment regions, and RAxML (Stamatakis et al. 2008) for phylogenetic control tree reconstruction. Costumed Python/Perl scripts were developed to facilitate and pipeline the procedures wherever possible.

Phylogenomic data for SAR phylogeny Rhizaria is one of the most poorly sequenced major eukaryotic groups in the SAR supergroup, both in terms of taxa and individual genes. Therefore only one Rhizaria was tentatively included the 37 euBac phylogenetic markers in Paper I. To address the SAR phylogenetic problem, we added extensive rhizarian data (ESTs) to the euBac data set. Rhizarian se- quences were added by BLASTp search against 14 rhizarian translated EST databases (Sierra et al. 2012). All hits with e-value < 1e-5 cut-off were added to existing euBac alignments. Furthermore, we also included an independent previously published phylogenomic data set of 119 genes that contains ex- tensive Rhizara data (Burki et al. 2013). After phylogenetic quality control, two euBac proteins were excluded due to lack of reliable rhizarian sequences (strong indication of non-orthology or HGT), and four proteins included in both data sets were reduced to single copies. The remaining 150 partitions were assembled into a single 150-gene phylogenomic data set containing 35195 amino-acid alignment positions (Paper II).

28 Phylogenomic data for inferring the origin of mitochondria Although many of the 37 euBac proteins (Paper I) are also suitable for in- ferring the origin of mitochondria, they are all nuclear encoded. To tailor a phylogenomic data set to address this particular question, I combined 30 euBac proteins showing clear phylogenetic affinities to α-proteobacteria with previously published 19 mitochondrial proteins encoded in the most gene-rich mitochondrial genome of Andalucia godoyi (Burger et al. 2013), and 6 previously described phylogenomic markers that are of α- proteobacterial origin (Derelle and Lang 2011). Blastp searches against NCBI non-redundant protein database with e-value cutoff 1e-20 were used to find homologs of each mitochondrial protein, followed by phylogenetic quality controls, particularly for making sure eukaryotic monophyly (details see Paper IV). Total 55 universal mitochondrial proteins were assembled into a phylogenomic data set containing 14542 amino-acid alignment posi- tions.

Phylogenomic tree reconstruction Both maximum-likelihood and Bayesian inference phylogenetic recon- struction were used in all phylogenomic data sets. RAxML (Stamatakis 2014) was used for maximum-likelihood method, and PhyloBayes (Lartillot et al. 2009; Lartillot et al. 2013) for Bayesian method. The WAG (Whelan and Goldman 2001) or LG (Le and Gascuel 2008) evo- lutionary models were used for site-homogeneous amino-acid substitution and the rate variation across alignment sites was modeled by gamma distri- bution (Yang 1994), in both maximum-likelihood and Bayesian methods. The site-heterogeneous amino-acid substitution model implemented in PhyloBayes was also used (Lartillot and Philippe 2004).

Hypothesis testing Phylogenetic (topological) hypothesis tests were conducted using Approxi- mately unbiased (AU) test (Shimodaira 2002). The processes include con- straining alternative topologies followed by re-estimation of maximum- likelihood model parameters and per-site log-likelihoods computation by RAxML (Stamatakis et al. 2008), then p-values associated with correspond- ing topologies were calculated from multi-scale RELL bootstrap replicates with default settings using CONSEL (Shimodaira and Hasegawa 2001).

29 Results

Summary of Paper I

He, D., Fiz-Palacios, O., Fu, C.-J., Fehling, J., Tsai, C.-C., & Baldauf, S. L. (2014) An Alternative Root for the Eukaryote Tree of Life. Current Biology, 24(4):465–470.

The position of the eukaryote root is one of the classic problems in evolu- tionary biology, yet has also proven extremely challenging. Resolving this problem has major impact on addressing a set of fundamental evolutionary questions – the identity of the major groups of eukaryotes, the branching order among them and the nature of the last eukaryote common ancestor (LECA). One of the main issues for robustly placing this single rooting point with molecular phylogeny is the long-branch attraction (LBA) that causes long-branch taxa artificially grouping together and being drawn toward the long-distant outgroups. LBA is especially problematic for rooting the eukar- yote tree, as the extraordinary genetic distance between Eukaryota and Ar- chaea or Bacteria would intrinsically exacerbate the problem. A potential useful information that could shorten the distance between in- group (Eukaryota) and outgroup (Archaea or Bacteria) is the endosymbiotic event that happened before LECA. The logic is based on that genetic influx of the single mitochondrial endosymbiotic event (i.e. EGT) should be con- sistent with the eukaryote genealogy, and the genetic distance should be shorter between the mitochondrial progenitor (bacterial donor) and eukary- otes (recipient). This makes nuclear encoded (nc) mitochondrial proteins potentially good phylogenetic markers to reconstruct a rooted eToL. However, there are no systematic characteristics of nc mitochondrial pro- teins in eukaryotes. Since the bacterial origin of mitochondria is well accept- ed, we therefore conducted a systematic search for eukaryotic proteins of clear bacterial origin that are evidently mitochondrial localization or func- tion.

30

Figure 3. A rooted phylogeny of eukaryotes based on 37 euBac proteins. The tree shown was derived from the full euBac data set ( M) by maximum likelihood using RAxML 7.6.6. Branches are drawn to scale as indicated by the scale bar. Ma- jor groups are indicated by different colors and to the right with brackets and names. Nodal support from maximum-likelihood bootstrapping (mlBP, above branches) and Bayesian inference posterior probabilities (biPP, below branches) is shown only for nodes that did not receive full support (100% mlBP and 1.0 biPP). The inferred topologies from different data sets or analyses were identical except where indicated by dashed lines, for which only the highest support values are shown. Statistical support for the two nodes flanking the root (denoted a and b) is shown separately in the upper left for analyses with all alignment positions (matrix M), analyses with fast-evolving sites removed (matrix M’), or analyses with bacterial sequences masked for the ten euBac proteins with IOScore > 1.0 (matrix M”).

Universal eukaryotic proteins of bacterial origin (euBac) Bacterial homologs in eukaryotic genomes are abundant, but their distribu- tion vary significantly in different taxa and their origins are usually uncer- tain. The identification of “universal euBac” is two-folded based on two criteria: 1) eukaryote proteins with clear bacterial origin; 2) eukaryote pro- teins that are universally present (so they are more likely to be present in the LECA). We developed two protocols employing a combination of homologous clustering and phylogenetic screening that were used to identify proteins suitable for deep eukaryote phylogeny based on two topological characteris- tics of (1) clear bacterial origin, and (2) legitimate phylogenetic signal indi-

31 cating genuine orthology (out-paralog free, and absence of both HGT and strong conflicts with well-supported eukaryotic grouping). Two search pro- tocols preliminarily yielded 281 unique euBac clusters. Each of these clus- ters was then expanded to include 54 eukaryotes and 13 bacteria (Table S2 in Paper I) using the corresponding S. cerevisiae (or, if unavailable, D. mela- nogaster or A. thaliana) sequence as query. These database assembly and screening steps were facilitated with an inspection-enabled program (Lundin 2010). A final set of 37 universal euBac clusters were determined to be suit- able for deep eukaryote phylogeny and concatenated into a phylogenomic data set containing 14350 amino-acid alignment positions.

A Rooted Phylogeny of Eukaryotes The 37-euBac data set produced a well-resolved global eukaryote phyloge- ny, rooted with bacterial sequences (Figure 3). All supergroups of eukary- otes receive maximum statistical supports (100% mlBP and 1.00 biPP). Most importantly, we were able to place the rooting point between Discoba and the rest of eukaryotes with very strong support (90% mlBP and 1.00 biPP). Furthermore, we have also shown that this root appears to be robust to re- moval of fast-evolving site and partitions with larger ingroup-outgroup dis- tance. Other two most likely alternatives were significantly rejected (p < 0.05, AU test), even with only a random subset of the data. Based on these analyses, we proposed a novel “neozoan-excavate” dichotomy for the eToL.

Analyzing the contradictory data The only recent phylogenomic data also developed based on mitochondrial endosymbiotic theory generated drastic different result by supporting the classic “unikont-” theory (Derelle and Lang 2011). However, we found potential incongruent signals within the contradictory data regarding the phylogenetic positions of Discoba. Therefore, we conducted the side-by- side analyses with different Discoba subsets and found that our euBac data continued to robustly reconstruct “neozoan-excavate” root. On the contrary, the contradictory data was unable to recover “unikont-bikonts” root with nearly all combinations of Discoba subsets, instead two non-canonical groupings strongly emerged (+plants and heterolo- boseans+amoebozoans).

The nature of the proto-mitochondrial proteome Although a portion of the mitochondrial proteins consistently traces back to α-proteobacterial origin, it is certain not the case for the majority of other mitochondrial proteins. For example, only 10% of yeast mitochondrial pro- teome can be certain of α-proteobacterial origin (Karlberg et al. 2000). Nev-

32 ertheless, what we found was consistent with the known chimeric nature of proto-mitochondrial proteome. Regardless of the mysterious histories of proto-mitochondrial proteome evolution, a single evolutionary history of eukaryotes should be expected after that.

Evolutionary implications of the “neozoan-excavate” root We feel that this new root should stimulate research on supergroup Discoba, as it may hold crucial information for better understanding of the early eu- karyotic evolution. Although the amitochondriate “core” excavates were unable to be included in the euBac data set, the monophyletic Excavata sug- gested by another independent phylogenomic data set may underscore the importance of any synapomorphy of this ancient eukaryote supergroup.

33 Summary of Paper II

He, D., Sierra R., Pawlowski J., & Baldauf, S. L. Reducing long-branch effects in a multi-protein data set identifies a new Rhizaria+Alveolata clade. [Manuscript]

Rhizaria are primarily single-celled organisms living in diverse ecological niches with fascinating biological diversity, albeit still largely unexplored (Burki and Keeling 2014). They include naked and testate amoeboid, hetero- trophic protists with thread-like pseudopodia functioning to acquire food and direct movement. The testate amoeba such as Foraminifera with carbonate shells and Radiolaria with their ornate silica shells, contribute substantial that widely used for paleogeographic and paleocli- matic reconstruction (Boudagher-Fadel 2012). The naked amoeba, such as Cercozoa, including the photosynthetic Chlorarachiophytes, the parasitic , and possibly a recent discovered multicellular sorocarpic Guttulinopsis vulgaris (Brown et al. 2012). Although the most recently recognized supergroup of SAR (Stramenopila, Alveolata and Rhizaria) appears to be robust based on recent phylogenomic studies (Adl et al. 2012; Burki 2014), the exact branching order within the SAR clade remains uncertain (Parfrey et al. 2010; Burki et al. 2012; Sierra et al. 2012). One important aspect of the relationships within SAR is the con- tradictory “chromalveolate hypothesis”, which predicts one single endosym- biosis event of secondarily acquiring a red alga by many photosynthetic line- ages within “chromalveolate” clade (Cavalier-Smith 1999). However, it would require a minimum monophyletic Stramenopila+Alveolata clade.

The phylogenetic position of Rhizaria To test the phylogenetic branching order within SAR, we assembled a data set of 150 genes with extensive rhizarian coverage. We first found conflict- ing topologies that maximum-likelihood method placed Rhizaria sister to Alveolata, while Bayesian inference recovered the more popular topology of Rhizaria sister to Stramenopila+Alveolata. By further analyzing some illegit- imate sequences (HGT or contamination) and extreme long branches within Rhizaria, we developed two data subsets consisting of 1) only genes recon- structing rhizarian monophyly (39 proteins), and 2) only genes lacking ex- tremely long rhizarian branches (36 proteins). The 36- and 39-protein data subsets consistently recovered a Rhizaria+Alveolata clade with nearly full support regardless of method used. Moreover, this relationship was also re- covered by all data sets and phylogenetic methods when only using close outgroups (Archaeaplastida and/or Haptophyta). Therefore, the results in Paper II suggest a new Rhizaria+Alveolata assemblage.

34 Summary of Paper III

He, D., Fu, C.-J. & Baldauf, S. L. Horizontal transfer of heme A synthase gene from bacterium to jakobid mitochondrial DNA. Submitted.

The most “primitive” mitochondrial genomes are believed in jakobid flagel- lates (Jakobida). They possess strikingly bacterial-like and unique mitochon- drial genome architectures such as a four-subunit bacteria-like RNA poly- merase and potential Shine–Dalgarno initiation motifs (Burger et al. 2013). An exciting case is the jakobid Andalucia godoyi whose mito- chondrial genome, besides other jakobid-shared bacterial features, contains the largest gene compliment to date. Moreover, it includes a cox15 gene encoding the essential protein Heme A synthase (HAS) that universally re- sides in nuclear genomes in other eukaryotes (nc-cox15). Therefore, the only mitochondrial genome encoded cox15 (mt-cox15) was inferred as an extra outstanding ancestral character of A. godoyi’s mitochondria.

The jakobid mtDNA cox15 gene does not reflect a primitive state. During the course of analyzing universal mitochondrial proteins (including euBac), I attempted to incorporate the mitochondrial genes of A. godoyi. Surprisingly, instead of confirming its ancestral state, we found that mt- cox15 does not group with nc-cox15 and the majority of α-proteobacteria in molecular phylogeny of HAS. We subsequently performed systematic search to include copies of HAS from all three domains (Eukaryota, Bacteria and Archaea). The global HAS phylogeny clearly showed two fully supported distinct , which we designated two HAS types: type-1 consists mt- cox15 encoded HAS and variety of different Bacteria and Archaea, and type- 2 consists of all nc-cox15 encoded HAS and the majority of α- proteobacteria. The type-2 clade topology was recovered as a well-supported monophyletic eukaryote clade embedded within the α-proteobacterial clade, which is consistent with the single mitochondrial endosymbiosis event. We also discovered distinct structural organization and functional im- portant sites that separate two HAS types. Thus the results in Paper III strongly suggest that the seemingly “primitive” mt-cox15 appears to be hori- zontally acquired by either A. godoyi or the last common ancestor of jaokids from a bacterium, thus effectively rejects its ancestral state.

35 Summary of Paper IV

He, D., Whelan S., & Baldauf, S. L. Mining the phylogenetic signal for inferring the origin of mitochondria. [Manuscript]

It is almost certain that the last mitochondrial common ancestor (LMCA) was a bacterium, most likely a α-proteobacterium. After ~ 2 billion years evolution, the challenge for searching its close extent α-proteobacterial rela- tive is enormous. Many years of research put two prime candidates under the spotlight: Rickettsiales and SAR11 (Andersson et al. 1998; Thrash et al. 2011). Regardless of substantial phylogenetic evidence supporting their close relationships with mitochondria, methodological inadequacy remains the primary obstacle to settle the debate. Two shared genomic characteristics of these three parties (i.e. mitochondria, Rickettsiales and SAR11) are the AT-bias and fast-evolving, which are the two main sources of phylogenetic artifact. I sought to assemble a phylogenomic data set of universal mitochondrial proteins that represent a portion of proto-mitochondrial proteome to test the α-proteobacterial origin of mitochondria. Furthermore, I specifically tested if these two genomic features would affect the branching of mito- chondria clade.

A phylogenomic data set of 55 universal mitochondrial proteins I assembled a 55-protein data set with 13 eukaryotic (mitochondriate) taxa from all eukaryote supergroups consisting ~72% nuclear genome encoded and ~28% mitochondrial genome encoded proteins. I included 34 α- proteobacteria from 6 major groups and 18 proteobacteria as the outgroup. The phylogenetic analyses of this 55-protein data showed consistent mito- chondria+Rickettsiales clade, however, showed inconsistent branching posi- tion of SAR11 with different methods used.

Mitochondria-Rickettsiales signal and compositional heterogeneity Therefore, I tested if the severe amino-acid compositional heterogeneity in the data set that could potential be led by AT-bias and cause the phylogenet- ic artifact. The data after reducing amino-acid compositional heterogeneity indeed consistently recovered the same SAR11 clade regardless of different methods. Surprisingly, the mitochondria+Rickettsiales clade was no longer recovered. Instead, the mitochondrial clade was placed sister to the whole extent α-proteobacterial clade. This novel topology was also robust to fast-

36 evolving site removal suggesting that the fast-evolving genome characteris- tics is responsible for it.

Testing phylogenetic signal heterogeneity To further explore the ancient signal supporting the new hypothesis, we de- veloped novel approaches to study the phylogenetic signal in individual genes and at individual alignment sites. We conducted comparative analyses of real and simulation data and found that while the real data was reasonably fitted with the evolutionary model, unexpected level of signal heterogeneity supporting other alternative branching positions of mitochondrial clade was also revealed. Thus, the results in Paper IV challenged both popular candi- dates being the mitochondrial sister group, and suggested the LMCA may have diverged before the bulk of a-proteobacterial diversity.

37 Svensk Sammanfattning

De tidiga skeendena av eukaryoternas evolution är intimt förknippade med uppkomsten av mitokondrier. Utvecklingen inom molekylär fylogeni och genomik har gett oss verktygen för att få en bättre förståelse av några av de frågor som fortfarande kvarstår om eukaryoternas gemensamma ursprung. I den här avhandlingen så behandlar jag två grundläggande aspekter av eukar- yoternas tidiga utveckling: de djupa förgreningarna i eukaryoternas släktskaps träd och utvecklingen av mitokondrier. Eukaryoternas släktskapsträd visar på de evolutionära förhållandena mel- lan alla eukaryoter. Nästan alla eukaryoter, inklusive oss själva, kan med säkerhet placeras i någon av ett dussin supergrupper. I Paper I så tar vi fram ett fylogenomiskt dataset bestående av 37 eukaryoteproteiner vilka har ett bakteriellt ursprung (euBac) för att använda som fylogenetiska markörer, för att i sin tur testa vart roten på eukaryoternas släktskapsträd ska placeras. Baserat på mitokondrieendosymbiosteorin representerar euBact-datasetet en del av proto-mitokondrie-proteomet, vilket ger ett kortare genetiskt avstånd än den klassiska utgruppen Archaea. Ett kortare genetisk avstånd ger i sin tur mindre problem med ”long-branch attraction” (LBA), d.v.s. att grupper som avskiljs av långa grenar får ett närmare släktskap i analysen än de har i verk- ligheten. EuBac fylogenin identifierade alla välkända supergrupper och pla- cerade roten mellan Discoba (Excavata) och resten av eukaryoterna. Detta förkastar den förhärskande hypotesen om en primär dikotomi mellan Uni- konter och Bikonter. I Paper II, adderar jag transkriptomdata från den minst sekvenserade eu- karyota gruppen, Rhizaria, till euBac-datasetet. EuBact-datasetet kombinera- des också med ett fylogenomiskt dataset med 119 gener som kom uteslu- tande från Rhizaria. Detta gav ett fylogenomiskt dataset med 150 gener för att testa de djupa förgreningarna mellan tre medlemmar av den mest nyligen definierade supergruppen SAR: Stramenopila, Alveolata och Rhizaria. Olika analysmetoder gav motstridiga topologier och avslöjade en falsk signal som beror på LBA, möjlig horisontal genöverföring (horizontal gene transfer; HGT) och sekvenskontamination, orsakad av Rhizaria taxa. Genom att minska dessa falska fylogenetiska signaler så resulterade alla metoder i att Rhizaria och Alveolata formar en klad. Detta motsäger den populära gruppe- ringen Stramenopila+Alveolata. I Paper III, studerar vi den unika evolutionära historien av en universal mitokondriegen, cox15, vilken kodar Heme A synthase (HAS) som är essen-

38 tiellt för anaerob respiration. Den enda cox15 som kodas i det mest bakterie- lika och genrika mitokondriegenomet av jakobid Andalucia godoyi har anta- gits vara ursprunglig (mt-cox15), medan alla andra nukleära genom kopior (nc-cox15) har antagits var resultatet av klassisk endosymbiontisk genöver- föring (endosymbiotic gene transfer; EGT). Det är slående hur vår fylogeni och strukturanalys av HAS proteiner visar ett klart oberoende ursprung av mt-cox15, och därmed bestrider att det är den ursprungliga formen. Nyligen gjorda fylogenomiska studier indikerar att två α-proteobakterie- ordningar, Rickettsiales och SAR11, är de troligaste kandidaterna. Men slutsatsen att de har ett nära förhållande med mitokondrier kan ha påverkas av deras AT-rika genom vilket kan orsaka fylogenetiska artefakter. För att testa ursprunget av mitokondrier sammanställde jag, i Paper IV, ett fyloge- nomiskt dataset med 55 universella mitokondrieproteiner, vilket kombinerar 30 euBacs (nucleärt kodade) och 25, tidigare publicerade, mitokondrieprote- iner med tydligt α -proteobakterieursprung. En fylogenetisk analys av datasetet stöder en mitokondrie+Rickettsiales-klad där SAR11 grenar av i olika positioner beroende på metod. Men om man reducerar de aminosyror som ger högt AT-innehåll, så förkastas mitokondrie+Rickettsiales-kladen, och mitokondrierna blir i stället en systergrupp till alla α -proteobakterier. Jämförande analyser mellan verklig och simulerad data identifierar en ovän- tad signal-heterogenitet som stöder andra hypoteser. Detta antyder ett mer komplext scenario för mitokondriernas ursprung.

39 Acknowledgments

Countless gratitude is what in my heart.

I would like to start by thanking my supervisor Sandra Baldauf. I was lucky enough to meet you in York (UK) when I barely knew what is phylogeny. You not only offered me this great opportunity to study here in Uppsala University, but also taught me very much, from scientific knowledge to how to be a qualified scientist. Your inspiration and encouragement has always brought me confidence, whenever I almost lost it. You are definitely one of the most passionate scientists who I’d look up to! You are also a great friend, who helped me to carry my first bed from IKEA, lent me your car for pick- ing up my cat…and a long list of other things! And yes, I also enjoyed a lot babysitting your (especially Max!). Big thanks to all the members at Systematic Biology! Afsaneh, thank you for your generosity and kindness for helping me get through some of my tough moments. Agneta, thanks for all the help with my book loans. Allison, you have always been very helpful and a good friend, thank you for taking care of our Luna cat and I hope someday I can return the favor taking care of Wally! Andreas, thank you for letting me use your costumed “Fenris” ma- chine during the first two years of my PhD, it really helped! Anders L, thanks for good tips on using your fastest alignment viewer AliView, good job! Anders R, although only for a short time in the department, it’s good to know you and good luck with your further work. Anja, thanks for the help during initial period of my PhD study, and it’s good to know you’re around Uppsala. Anneleen, thanks for checking on me in my office from time to time, but I know my office is no where near as good as yours, especially equipped with that toilet that requires exercise to climb at least 10 stairs eve- ry time one just wanna take a loo! Erik L, my project student, thank you for the great help on the euBac project. Gemma, thank you for your helps on python programming all these time from York to Uppsala, I hope to visit you sometimes in Umeå. Henrik S, thanks for a quick “fungal lesson” along the ride after julbord. Hugo, thank you for being a good colleague and friend, helping and cheering me when I needed the most, and you’re definitely more “chairman” than I ever did. Johanna and Maria L, I will never forget your remarkable courage and strength that truly inspired me. John, thank you for everything! We had so much great time not only in Uppsala, but also in Kyo- to and Taiwan. I am convinced that someday you will make your “banana

40 project” come true! Nahid, thanks for donating some of your lovely plants in my office that really kept me fresh, and I know you’ll get better in no time! Ping, your Thai food was awesome and good luck with your career back in Thailand. Sanea, thanks for the helps with perl scripting and best of luck with your PhD study. Sarina, it’s great to know you in the department and I am still waiting to taste that mysterious (to me) food (or snack?) Chikanda! Stina, you are such an excellent and enthusiastic teacher, and thank you for organizing all these great party for us! Sunniva, thanks for the help at the beginning of PhD study, and I hope to visit you sometime in . Baset, great to chat with you and realized that some orchids are actually eatable! ChengJie, it’s been a great pleasure to have you as a colleague and a friend. You are a great scientist and I am lucky to do projects with you! Many thanks for your critical inputs and productive discussion all around! Of course, a great appreciation to you, 唐文 and 瑞秋 for the hospitality! Catarina, great plant scientist now at Stockholm University, I had a great time in your fossil course and hopefully will be more in the future. Leif and Sanja, great mycologists/lichenologists who I wish to collaborate in the future. Inga, awesome botanist, I hope you will get your big series published online soon! Karin, you are the cutest German girl I’ve ever known. I en- joyed chatting with you very much, and best wishes for your future plan (preferably in Uppsala)! Katarina, thanks for the help on the teaching/duty credit calculation. Henrik L, thanks for your MacPro “legacy”, which turned out to be my primary computing power! Magnus, I really like our regular “flowering alert”, more please because I just started to archive them! Mariet- te, many thanks for your kind regards and all the best to your company! Ma- ria R, the “dicty scientist” thanks for the happy chat. Martin R, thanks for the valuable discussion and suggestion throughout the end of my PhD studies. I wish to collaborate with you (this time for real!). Mats, thanks for the help on plant systematics clarification. Mikael, thanks for explaining (repeatedly) the fundamentals of statistical models, and the helps for getting my teaching in the computer labs running smoothly. Omar, thank you for all the help that got our first project finally published. Paco, always pleasant to chat with you and I wish your projects go very well and manage to visit Taiwan. Petra, thank you for giving me a chance to present my study in the VEM course. Thomas, thank you for your kind regards and I really admire your passion for science! Åsa, thanks for organizing a good book club and hopefully more are coming. David M, it has been fun to chat with you and I wish yo hear more about your fun research. I would also like to especially thank all the personelle and engineer staffs at EBC. You guys are invaluable for improving my working conditions and helping me out with all kinds of troubles that I had. Stefan Å, big thanks to you for helping me so many computer/internet problems, and I also learnt a lot from you! Elisabeth L, your smile always warmed me and thank you for the great help for course credits and Infoglu system.

41 Cosima, Frederik and Jakob, thanks for your kind friendship and the most excited world cup game that I have ever enjoyed, thanks! Yoshi, the all- time-superstar, you and your impressive guitar performance accompanied brought me so many good memories! I wish you all the best for your future life. Peter H, thank you for your kind friendship and good luck with your thesis defense. Camille, Jonathan, KiWoong and Sofia, it was a great ple- asure to spend quality time (scientifically or casually) with you! I would like to thank the people from Molecular Evolution. Feifei (who managed to make a student cry) and Katarzyna, thanks for all the help during the teaching in the Bioinformatics labs. Siv A and Jan A, thank you for your insightful comments for my projects. Johan, Joran and Thijs, thanks for the constructive discussions and technical helps for the mitochondrial project. Lionel, thanks for the bioinformatics help and I missed the time in Český Krumlov. Ola, even though you don’t work here anymore, I won’t forget your help for my computer problem, and yes, I still kept the ”under-water” hat you made in my office as a very nice decoration. Special thanks to my Rhizaria project collaborator Roberto and Jan P at University of Geneva. I had a great summer time in Geneva and it has been my pleasure working with you! Special thanks to my mitochondrial project collaborator Simon W at Evo- lutionary Biology. You keep inspiring me every time we chat and I learnt a great deal of knowledge about evolutionary model and statistics from you. I look forward working with you in the future. Special thanks for the independent fundings, which substantially aid my PhD study. Stiftelsen Lars Hiertas Minne, EBC Graduate School on Ge- nomes and Phenotypes, Anna Maria Lundins resestipendier and Liljewalchs resestipendium. 感謝所有在瑞典的台灣朋友與同學 , you are an amazing big family! 陳姐 (在地的台灣家鄉味代言人,我資金都準備好了快成立公司吧! ), 王大哥 (太極+演奏大師,每次跟您聊天都學到好多,謝謝您! ),薇薇 (我見過最高的十一歲女孩,聰明又乖巧多才多藝! ),謝謝你們 待我像 家人般 的照顧 !吳姐和吳大哥,謝謝你們的多次招待,和我分享許多 在瑞典生活上的經驗!家龍,祝你日後的工作順利! 尚峰,素真,卡 卡,謝謝你們經常的舉辦 ”家聚 ”,以及令人驚艷的 ”辦桌 ”能力! Nancy, Christer and Linnea, thank you for your kindness and all the big helps for many things, I was always amazed by the energetic Linnea, and special thanks to Christer for the linux machine purchase and setting that helped me a lot for my work! Kai-Chun, Ying-Hsiu and the “little beauty”, thank you for all the helps and best wishes for your studies and life here! Frank thanks for taking all that great photos of me and good luck with your future carrier! Hsin-Ho and Guei-Bau, thanks for all kinds of life tips and best wishes with your life in Boston! Thanks to all the Taiwanese students studied here in Sweden, Ching-En, Yi-Fan, Reverb, Serene, Yu-Ting, Craig, Leanne, Liang- Chien, Anthony and Jack T, for all the helps and cheers for your own life

42 paths! Special thanks to Vivianne and Cultural Division, Taipei Mission in Sweden for the help of funding student activities here in Uppsala.

最重要的,要感謝我遠在台灣的家人,奶奶,爸爸,媽媽,弟弟, 無條件的支持我在異鄉求學,雖然你們免不了的還是擔心這擔心那, 但這些年我總是成長了,過程比我想像的顛頗許多,稱不上順利,但 總是完成了。不過新的挑戰才要開始,你們就好好享受人生煩惱留給 我們吧!親愛的老弟,你也辛苦啦,好好加油你很優秀的要相信自己! 文傑,立偉,以及龍潭幫眾幫友,謝謝你們 遠在台灣的吐槽和支持, 對我來說是最珍貴的精神食糧!

慧妍,我摯愛的妻子

這輩子最幸運的事莫過於在這遇見妳

然後

我們呼吸的空氣,騎乘的路,品嚐的美食,看的電視,路過的風景

很快的 同步 了

謝謝妳,出現在我的生命中,加入,成為我的家人

Hwei-yen and Luna, thank you, for being my family

Finally, I am sure that I must have missed people who helped me along the way of my PhD study. In that case, please accept my sincere apology for forgetting to thank you and don’t be mad at me. Being angry is not good for your health, you know!

43 References:

Adams KL, Palmer JD. 2003. Evolution of mitochondrial gene content: Gene loss and transfer to the nucleus. Mol. Phylogenet. Evol. 29:380–395. Adl SM, Simpson AGB, Lane CE, et al. 2012. The revised classification of eukaryotes. J. Eukaryot. Microbiol. 59:429–514. Ajawatanawong P, Atkinson GC, Watson-Haigh NS, Mackenzie B, Baldauf SL. 2012. SeqFIRE: a web application for automated extraction of indel regions and conserved blocks from protein multiple sequence alignments. Nucleic Acids Res. 40:W340–7. Andersson S, Kurland C. 1998. Reductive evolution of resident genomes. Trends Microbiol. 6:263–268. Andersson SG, Zomorodipour A, Andersson JO, et al. 1998. The genome sequence of prowazekii and the origin of mitochondria. Nature 396:133–140. Archibald JM. 2009. The puzzle of plastid evolution. Curr. Biol. 19:R81–8. Archibald JM. 2011. Origin of eukaryotic cells: 40 years on. 54:69–86. Arisue N, Hasegawa M, Hashimoto T. 2005. Root of the eukaryota tree as inferred from combined maximum likelihood analyses of multiple molecular sequence data. Mol Biol Evol 22:409–420. Bachvaroff TR, Handy SM, Place AR, Delwiche CF. phylogeny inferred using concatenated ribosomal proteins. J. Eukaryot. Microbiol. 58:223–233. Baldauf SL, Palmer JD, Doolittle WF. 1996. The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny. Proc. Natl. Acad. Sci. 93:7749–7754. Baldauf SL, Palmer JD. 1990. Evolutionary transfer of the chloroplast tufA gene to the nucleus. Nature 344:262–265. Baldauf SL. 2003. The deep roots of eukaryotes. Sci. (New York, NY) 300:1703– 1706. Baldauf SL. 2008. An overview of the phylogeny and diversity of eukaryotes. J. Syst. Evol. 46:263–273. Bapteste E, Brinkmann H, Lee JA, et al. 2002. The analysis of 100 genes supports the grouping of three highly divergent amoebae: Dictyostelium, , and . Proc. Natl. Acad. Sci. U. S. A. 99:1414–1419. Brinkmann H, van der Giezen M, Zhou Y, Poncelin de Raucourt G, Philippe H. 2005. An empirical assessment of long-branch attraction artefacts in deep eukaryotic phylogenomics. Syst. Biol. 54:743–757. Brown MW, Kolisko M, Silberman JD, Roger AJ. 2012. Aggregative Multicellularity Evolved Independently in the Eukaryotic Supergroup Rhizaria. Curr. Biol. Brown MW, Sharpe SC, Silberman JD, Heiss AA, Lang BF, Simpson AGB, Roger AJ. 2013. Phylogenomics demonstrates that breviate flagellates are related to and apusomonads. Proc. Biol. Sci. 280:20131755.

44 Brown MW, Silberman JD, Spiegel FW. 2012. A contemporary evaluation of the acrasids (, Heterolobosea, Excavata). Eur. J. Protistol. 48:103–123. Brown MW, Spiegel FW, Silberman JD. 2009. Phylogeny of the “forgotten” cellular , Fonticula alba, reveals a key evolutionary branch within Opisthokonta. Mol. Biol. Evol. 26:2699–2709. Burger G, Gray MW, Forget L, Lang BF. 2013. Strikingly bacteria-like and gene- rich mitochondrial genomes throughout jakobid protists. Genome Biol. Evol. Burki F, Corradi N, Sierra R, Pawlowski J, Meyer GR, Abbott CL, Keeling PJ. 2013. Phylogenomics of the Mikrocytos mackini reveals evidence for a in rhizaria. Curr. Biol. 23:1541–1547. Burki F, Keeling PJ. 2014. Rhizaria. Curr. Biol. 24:R103–7. Burki F, Okamoto N, Pombert J-F, Keeling PJ. 2012. The evolutionary history of haptophytes and cryptophytes: phylogenomic evidence for separate origins. Proc. R. Soc. B Biol. Sci. 279:2246–2254. Burki F, Shalchian-Tabrizi K, Minge M, Skjaeveland A, Nikolaev SI, Jakobsen KS, Pawlowski J. 2007. Phylogenomics reshuffles the eukaryotic supergroups. PLoS One 2:e790. Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb. Perspect. Biol. 6:a016147–a016147. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. 2009. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25:1972–1973. Castresana J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol 17:540–552. Cavalier-Smith T, Chao EE. 1996. Molecular phylogeny of the free-living archezoan Trepomonas agilis and the nature of the first eukaryote. J. Mol. Evol. 43:551– 562. Cavalier-Smith T, Chao EEY. 2003. Phylogeny of choanozoa, apusozoa, and other and early eukaryote . J. Mol. Evol. 56:540–563. Cavalier-Smith T. 1989. Archaebacteria and Archezoa. Nature 339:100–101. Cavalier-Smith T. 2002. The phagotrophic origin of eukaryotes and phylogenetic classification of Protozoa. Int. J. Syst. Evol. Microbiol. 52:297–354. Cavalier-Smith T. 2013. Early evolution of eukaryote feeding modes, cell structural diversity, and classification of the protozoan phyla Loukozoa, Sulcozoa, and Choanozoa. Eur. J. Protistol. 49:115–178. Derelle R, Lang BF. 2011. Rooting the eukaryotic tree with mitochondrial and bacterial proteins. Mol Biol Evol. Doolittle WF. 1999. Phylogenetic classification and the universal tree. Science 284:2124–2129. Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797. Embley TM, Martin W. 2006. Eukaryotic evolution, changes and challenges. Nature 440:623–630. Erwin DH, Davidson EH. 2002. The last common bilaterian ancestor. Development 129:3021–3032. Felsenstein J. 1978. Cases in which Parsimony or Compatibility Methods will be Positively Misleading. Syst. Biol. Foster PG, Hickey DA. 1999. Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J. Mol. Evol. 48:284–290.

45 Fritz-Laylin LK, Prochnik SE, Ginger ML, et al. 2010. The genome of illuminates early eukaryotic versatility. Cell 140:631–642. Fuentes I, Stegemann S, Golczyk H, Karcher D, Bock R. 2014. Horizontal genome transfer as an asexual path to the formation of new species. Nature advance on. Glöckner G, Noegel AA. 2013. in the Amoebozoa clade. Biol. Rev. Camb. Philos. Soc. 88:215–225. Grant JR, Katz LA. 2014. Building a Phylogenomic Pipeline for the Eukaryotic Tree of Life – Addressing Deep Phylogenies with Genome-Scale Data. PLoS Curr. Tree Life Edition 1. Gray MW, Cedergren R, Abel Y, Sankoff D. 1989. On the evolutionary origin of the plant mitochondrion and its genome. Gray MW. 2012. Mitochondrial evolution. Cold Spring Harb. Perspect. Biol. 4:a011403. Gray MW. 2014. The pre-endosymbiont hypothesis: a new perspective on the origin and evolution of mitochondria. Cold Spring Harb. Perspect. Biol. 6:a016097. Hampl V, Hug L, Leigh JW, Dacks JB, Lang BF, Simpson AGB, Roger AJ. 2009. Phylogenomic analyses support the monophyly of Excavata and resolve relationships among eukaryotic “supergroups”. Proc. Natl. Acad. Sci. 106:3859–3864. Harper JT, Waanders E, Keeling PJ. 2005. On the monophyly of chromalveolates using a six-protein phylogeny of eukaryotes. Int. J. Syst. Evol. Microbiol. 55:487–496. He D, Fiz-Palacios O, Fu C-J, Fehling J, Tsai C-C, Baldauf SL. 2014. An Alternative Root for the Eukaryote Tree of Life. Curr. Biol. 24:465–470. Herman EK, Greninger AL, Visvesvara GS, Marciano-Cabral F, Dacks JB, Chiu CY. 2013. The mitochondrial genome and a 60-kb nuclear DNA segment from , the causative agent of primary amoebic . J. Eukaryot. Microbiol. 60:179–191. Huang CY, Ayliffe MA, Timmis JN. 2003. Direct measurement of the transfer rate of chloroplast DNA into the nucleus. Nature 422:72–76. Huelsenbeck JP, Bollback JP, Levine AM. 2002. Inferring the root of a phylogenetic tree. Syst. Biol. 51:32–43. Husnik F, Nikoh N, Koga R, et al. 2013. Horizontal gene transfer from diverse bacteria to an genome enables a tripartite nested mealybug symbiosis. Cell 153:1567–1578. Iwabe N, Kuma K, Hasegawa M, Osawa S, Miyata T. 1989. Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes. Proc. Natl. Acad. Sci. U. S. A. 86:9355–9359. James TY, Kauff F, Schoch CL, et al. 2006. Reconstructing the early using a six-gene phylogeny. Nature 443:818–822. James TY, Pelin A, Bonen L, Ahrendt S, Sain D, Corradi N, Stajich JE. 2013. Shared signatures of and phylogenomics unite cryptomycota and microsporidia. Curr. Biol. 23:1548–1553. Janouškovec J, Tikhonenkov D V, Mikhailov K V, Simdyanov TG, Aleoshin V V, Mylnikov AP, Keeling PJ. 2013. Colponemids represent multiple ancient alveolate lineages. Curr. Biol. 23:2546–2552.

46 Jones MDM, Forn I, Gadelha C, Egan MJ, Bass D, Massana R, Richards TA. 2011. Discovery of novel intermediate forms redefines the fungal tree of life. Nature 474:200–203. Kamaishi T, Hashimoto T, Nakamura Y, Nakamura F, Murata S, Okada N, Okamoto K, Shimizu M, Hasegawa M. 1996. Protein phylogeny of translation elongation factor EF-1 alpha suggests microsporidians are extremely ancient eukaryotes. J. Mol. Evol. 42:257–263. Karlberg O, Canbäck B, Kurland CG, Andersson SG. 2000. The dual origin of the yeast mitochondrial proteome. Yeast 17:170–187. Karpov SA, Mikhailov K V, Mirzaeva GS, Mirabdullaev IM, Mamkaeva KA, Titova NN, Aleoshin V V. 2013. Obligately phagotrophic aphelids turned out to branch with the earliest-diverging fungi. Protist 164:195–205. Katoh K, Misawa K, Kuma K. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. Katz LA, Grant JR, Parfrey LW, Burleigh JG. 2012. Turning the crown upside down: gene tree parsimony roots the eukaryotic tree of life. Syst. Biol. Knoll AH, Javaux EJ, Hewitt D, Cohen P. 2006. Eukaryotic organisms in . Philos. Trans. R. Soc. Lond. B. Biol. Sci. 361:1023–1038. Küntzel H, Köchel HG. 1981. Evolution of rRNA and origin of mitochondria. Nature 293:751–755. Lahr DJG, Grant J, Nguyen T, Lin JH, Katz LA. 2011. Comprehensive phylogenetic reconstruction of amoebozoa based on concatenated analyses of SSU-rDNA and actin genes.Ouzounis CA, editor. PLoS One 6:e22780. Lang BF, Burger G, O’Kelly CJ, Cedergren R, Golding GB, Lemieux C, Sankoff D, Turmel M, Gray MW. 1997. An ancestral mitochondrial DNA resembling a eubacterial genome in miniature. Nature 387:493–497. Lara E, Moreira D, López-García P. 2010. The Environmental Clade LKM11 and Form the Deepest Branching Clade of Fungi. Protist 161:116–121. Lartillot N, Lepage T, Blanquart S. 2009. PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics 25:2286–2288. Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol 21:1095–1109. Lartillot N, Rodrigue N, Stubbs D, Richer J. 2013. PhyloBayes MPI: Phylogenetic Reconstruction with Infinite Mixtures of Profiles in a Parallel Environment. Syst. Biol. 62:611–615. Le SQ, Gascuel O. 2008. An Improved General Amino Acid Replacement Matrix. Mol Biol Evol 25:1307–1320. Leonard G, Richards TA. 2012. Genome-scale comparative analysis of gene fusions, gene fissions, and the fungal tree of life. Proc. Natl. Acad. Sci. 109:21402– 21407. Liu L, Li Y, Li S, Hu N, He Y, Pong R, Lin D, Lu L, Law M. 2012. Comparison of next-generation sequencing systems. J. Biomed. Biotechnol. 2012. Lundin E. 2010. AutoPhylo, a bioinformatic tool for identifying and retrieving sequences. McCutcheon JP, Keeling PJ. 2014. Endosymbiosis: Further Erodes the Organelle/Symbiont Distinction. Curr. Biol. 24:R654–R655.

47 McCutcheon JP, Moran NA. 2012. Extreme genome reduction in symbiotic bacteria. Nat. Rev. Microbiol. 10:13–26. Moreira D, Lopez-Garcia P. 2002. The molecular of microbial eukaryotes unveils a hidden world. Trends Microbiol. 10:31–38. Okamoto N, Chantangsi C, Horák A, Leander BS, Keeling PJ. 2009. Molecular phylogeny and description of the novel gen. et sp. nov., and establishment of the Hacrobia taxon nov. PLoS One 4:e7080. Olive LS. 1970. The : A revised classification. Bot. Rev. 36:59–89. Pääbo S, Higuchi RG, Wilson AC. 1989. Ancient DNA and the polymerase chain reaction. The emerging of molecular . J. Biol. Chem. 264:9709–9712. Paps J, Ruiz-Trillo I, Barcelona U De. 2010. Animals and Their Unicellular Ancestors. Parfrey LW, Barbero E, Lasser E, Dunthorn M, Bhattacharya D, Patterson DJ, Katz LA. 2006. Evaluating support for the current classification of eukaryotic diversity. PLoS Genet. 2:e220. Parfrey LW, Grant J, Tekle YI, Lasek-Nesselquist E, Morrison HG, Sogin ML, Patterson DJ, Katz LA. 2010. Broadly Sampled Multigene Analyses Yield a Well-Resolved Eukaryotic Tree of Life. Syst. Biol. Philippe H, Brinkmann H, Lavrov D V, Littlewood DTJ, Manuel M, Wörheide G, Baurain D. 2011. Resolving difficult phylogenetic questions: why more sequences are not enough. PLoS Biol. 9:e1000602. Philippe H. 2000. Phylogeny of Eukaryotes Based on Ribosomal RNA: Long- Branch Attraction and Models of Sequence Evolution. Mol. Biol. Evol. Reyes-Prieto A, Weber APM, Bhattacharya D. 2007. The origin and establishment of the plastid in algae and plants. Annu. Rev. Genet. 41:147–168. Rice DW, Alverson AJ, Richardson AO, et al. 2013. Horizontal transfer of entire genomes via mitochondrial fusion in the angiosperm Amborella. Sci. (New York, NY) 342:1468–1473. Richards TA, Cavalier-Smith T. 2005. Myosin domain evolution and the primary divergence of eukaryotes. Nature 436:1113–1118. Riisberg I, Orr RJS, Kluge R, Shalchian-Tabrizi K, Bowers HA, Patil V, Edvardsen B, Jakobsen KS. 2009. Seven Gene Phylogeny of Heterokonts. Protist 160:191–204. Roger AJ, Hug LA. 2006. The origin and diversification of eukaryotes: problems with molecular phylogenetics and molecular clock estimation. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 361:1039–1054. Roger AJ, Sandblom O, Doolittle WF, Philippe H. 1999. An evaluation of elongation factor 1 alpha as a phylogenetic marker for eukaryotes. Mol. Biol. Evol. 16:218–233. Roger AJ, Simpson AGB. 2009. Evolution: revisiting the root of the eukaryote tree. Curr. Biol. 19:R165–7. Rogozin IB, Basu MK, Csuros M, Koonin E V. 2009. Analysis of Rare Genomic Changes Does Not Support the Unikont-Bikont Phylogeny and Suggests Cyanobacterial Symbiosis as the Point of Primary of Eukaryotes. Genome Biol. Evol. 1:99–113. Sagan L. 1967. On the origin of mitosing cells. J. Theor. Biol. 14:255–274. Shiflett AM, Johnson PJ. 2010. Mitochondrion-related organelles in eukaryotic protists. Annu. Rev. Microbiol. 64:409–429.

48 Shimodaira H, Hasegawa M. 2001. CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics 17:1246–1247. Shimodaira H. 2002. An approximately unbiased test of phylogenetic tree selection. Syst. Biol. 51:492–508. Sierra R, Matz M V, Aglyamova G, Pillet L, Decelle J, Not F, de Vargas C, Pawlowski J. 2012. Deep relationships of Rhizaria revealed by phylogenomics: A farewell to Haeckel’s Radiolaria. Mol. Phylogenet. Evol. Simpson AGB. 2003. Cytoskeletal organization, phylogenetic affinities and systematics in the contentious taxon Excavata (Eukaryota). Int. J. Syst. Evol. Microbiol. 53:1759–1777. Sogin ML, Gunderson JH, Elwood HJ, Alonso RA, Peattie DA. 1989. Phylogenetic meaning of the concept: an unusual ribosomal RNA from Giardia lamblia. Sci. (New York, NY) 243:75–77. Sogin ML. 1991. Early evolution and the origin of eukaryotes. Curr. Opin. Genet. Dev. 1:457–463. Srivastava M, Begovic E, Chapman J, et al. 2008. The Trichoplax genome and the nature of placozoans. Nature 454:955–960. Stajich JE, Berbee ML, Blackwell M, Hibbett DS. 2009. The fungi. Curr. Biol. Stamatakis A, Hoover P, Rougemont J. 2008. A rapid bootstrap algorithm for the RAxML Web servers. Syst. Biol. 57:758–771. Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post- analysis of large phylogenies. Bioinformatics 30:1312–1313. Stechmann A, Cavalier-Smith T. 2002. Rooting the eukaryote tree by using a derived gene fusion. Sci. (New York, NY). Steenkamp ET, Wright J, Baldauf SL. 2006. The protistan origins of animals and fungi. Mol Biol Evol 23:93–106. Stegemann S, Bock R. 2006. Experimental reconstruction of functional gene transfer from the tobacco plastid genome to the nucleus. 18:2869–2878. Stegemann S, Hartmann S, Ruf S, Bock R. 2003. High-frequency gene transfer from the chloroplast genome to the nucleus. Proc. Natl. Acad. Sci. U. S. A. 100:8828–8833. Stegemann S, Keuthe M, Greiner S, Bock R. 2012. Horizontal transfer of chloroplast genomes between plant species. Proc. Natl. Acad. Sci. U. S. A. 109:2434–2438. Stiller JW, Hall BD. 1999. Long-branch attraction and the rDNA model of early eukaryotic evolution. Mol Biol Evol 16:1270–1279. Sueoka N. 1961. Correlation between base composition of deoxyribonucleic acid and amino acid composition of protein. Proc. Natl. Acad. Sci. U. S. A. 47:1141–1149. Thiergart T, Landan G, Schenk M, Dagan T, Martin WF. 2012. An Evolutionary Network of Genes Present in the Eukaryote Common Ancestor Polls Genomes on Eukaryotic and Mitochondrial Origin. Genome Biol. Evol. 4:466–485. Thrash JC, Boyd A, Huggett MJ, et al. 2011. Phylogenomic evidence for a common ancestor of mitochondria and the SAR11 clade. Sci. Rep. 1:13. Timmis JN, Ayliffe MA, Huang CY, Martin W. 2004. Endosymbiotic gene transfer: organelle genomes forge eukaryotic . Nat. Rev. Genet. 5:123– 135.

49 Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18:691–699. Wideman JG, Gawryluk RMR, Gray MW, Dacks JB. 2013. The ancient and widespread nature of the ER-mitochondria encounter structure. Mol. Biol. Evol. 30:2044–2049. Woese CR, Kandler O, Wheelis ML. 1990. Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc. Natl. Acad. Sci. 87:4576–4579. Yabuki A, Kamikawa R, Ishikawa SA, Kolisko M, Kim E, Tanabe AS, Kume K, ISHIDA K-I, Inagki Y. 2014. represents a basal cryptist lineage: insight into the character evolution in . Sci. Rep. 4:4641. Yang Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. 39:306– 314. Yoon HS, Hackett JD, Ciniglia C, Pinto G, Bhattacharya D. 2004. A molecular timeline for the origin of photosynthetic eukaryotes. Mol. Biol. Evol. 21:809– 818. Zhao S, Burki F, Brate J, Keeling PJ, Klaveness D, Shalchian-Tabrizi K. 2012. Collodictyon--An Ancient Lineage in the Tree of Eukaryotes. Mol Biol Evol.

50

Acta Universitatis Upsaliensis Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1176 Editor: The Dean of the Faculty of Science and Technology

A doctoral dissertation from the Faculty of Science and Technology, Uppsala University, is usually a summary of a number of papers. A few copies of the complete dissertation are kept at major Swedish research libraries, while the summary alone is distributed internationally through the series Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology. (Prior to January, 2005, the series was published under the title “Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology”.)

ACTA UNIVERSITATIS UPSALIENSIS Distribution: publications.uu.se UPPSALA urn:nbn:se:uu:diva-231670 2014