Origins and divergence of the eukaryotic kinetochore

Jolien van Hooff (2018) Origins and divergence of the eukaryotic kinetochore. PhD thesis, Utrecht University.
Cover and layout by Alessia Peviani (www.photogenicgreen.nl)
Printed by Ridderprint BV (www.ridderprint.nl)
ISBN 978-94-6375-165-0

Oorsprong en divergentie van het eukaryote kinetochoor

(with a summary in Dutch)

Thesis submitted for the degree of Doctor at Utrecht University, by authority of the Rector Magnificus, Prof. dr. H.R.B.M. Kummeling, in accordance with the decision of the Doctorate Board, to be defended in public on Monday 10 December 2018 at 2.30 pm

by

Jolien Johanna Elisabeth van Hooff

born on 10 January 1988 in Breda

Supervisors (promotoren): Prof. dr. G.J.P.L. Kops and Prof. dr. B. Snel

Table of Contents

1. Introduction
   Introduction
   The diversity and origins of eukaryotes
   Cell division and chromosome segregation
   Comparative genomics
   Scope and outline of this thesis
2. Inferring the evolutionary history of your favorite protein: A guide for cell biologists
   Summary
   Abstract
   Introduction
   Studying the evolution of a protein: what do we mean?
   A quick (and dirty) guide to inferring the evolutionary history of a protein
   Life is more complicated
   Conclusions & Outlook
   Acknowledgments
   Author contributions
3. Evolutionary dynamics of the kinetochore network as revealed by comparative genomics
   Abstract
   Introduction
   Results
   Discussion
   Materials and Methods
   Acknowledgments
   Author contributions
   Supplementary Material
4. Unique phylogenetic distributions of the Ska and Dam1 complexes support functional analogy and suggest multiple parallel displacements of Ska by Dam1
   Abstract
   Main text
   Materials and Methods
   Acknowledgments
   Author contributions
   Supplementary Text
   Supplementary Material
5. Mosaic origin of the eukaryotic kinetochore
   Abstract
   Introduction
   Results
   Discussion
   Data and Methods
   Author contributions
   Acknowledgments
   Supplementary Text
   Supplementary Material
6. Timing large-scale duplications during eukaryogenesis suggests relatively recent origins of eukaryote-specific proteins
   Abstract
   Introduction
   Results
   Conclusion & Discussion
   Materials & Methods
   Author contributions
   Acknowledgments
   Supplementary Material
7. Discussion
   Kinetochore as a model for evolution of eukaryotic cellular processes?
   Predictions on non-model kinetochores
   More data from extant species make more complex ancestors
   Small- and large-scale studies complement each other in illuminating eukaryogenesis
References
Abbreviations
Samenvatting (summary in Dutch)
Curriculum vitae
Publications
Acknowledgments

1 Introduction

Introduction

Virtually all life we can see with the naked eye is eukaryotic. Eukaryotes are a group of organisms that encompasses animals, plants and fungi. In addition to large, multicellular organisms, eukaryotes also include many unicellular organisms, such as the causal agent of malaria and the photosynthetic symbiont of corals. Compared to prokaryotes, which comprise all non-eukaryotic cellular life forms, eukaryotes are characterized by their intracellular complexity.

We can presently access the genomes of more species than ever before, both of eukaryotes and prokaryotes. Extensive analysis of these genomes has taught us that eukaryotic genomes evolve under completely different dynamics than prokaryotic genomes. As a result, these analyses have provided a framework to study the evolution of specific proteins encoded by these genomes. In return, small-scale protein evolution studies can uncover new paradigms in genome evolution.

Protein evolution studies are regularly used to get a glimpse of the evolution of the cellular processes they are involved in. In the case of eukaryotes, proteins subject to such analyses often are those that participate in the intricate cellular structures, processes and machineries that separate eukaryotes from prokaryotes. These studies aim to answer questions like: How did this cellular structure originate? How did it differentiate in current-day species? Moreover, evolutionary signatures of proteins can contribute to our understanding of their cellular or molecular function.

One of these typical eukaryotic features is the mitotic spindle. This is a complex machinery that eukaryotes use to segregate their duplicated DNA during cell division. The mitotic spindle pulls apart duplicated chromosomes towards the opposite ends of the cell, after which the cell is ready to divide. The spindle consists of hundreds of proteins. Among these are the pulling elements known as microtubules and the protein complex that links microtubules to the chromosomes. This latter complex is known as the kinetochore, and it is essential for proper cell division.

In this thesis, I describe three studies on the evolution of the kinetochore, and one on the generic origins of eukaryotic proteins. In this chapter I elaborate on the domain of eukaryotes and describe what we currently know about its origins and its evolutionary dynamics. I supply background information on the eukaryotic cell cycle and chromosome segregation, and I detail the kinetochore constituents. I outline how comparative genomics aids in understanding the function and evolution of eukaryotic structures such as the kinetochore. I supply key definitions essential to the evolutionary interrogation of proteins.

The diversity and origins of eukaryotes

Eukaryotes form a highly diverse group of organisms that all descended from a single common ancestor, the Last Eukaryotic Common Ancestor, or LECA. The Eukarya (or Eukaryota) taxon is currently considered one of the three domains of life, next to Bacteria and Archaea (Figure 1A, [42]). Eukaryotes differ greatly in their morphology, life cycles, lifestyles and habitats. They can be either uni- or multicellular, autotrophs or heterotrophs, reproduce primarily sexually or clonally, and live freely, as obligate pathogens or as obligate symbionts. Nevertheless, they share a plethora of features on the cellular level, which distinguish them from prokaryotes. Eukaryotes are named after their nucleus (‘eu’ means ‘true’, ‘karyon’ means ‘kernel’), a membrane-bound compartment that encircles most of their DNA, which itself is packed into linear chromosomes. However, eukaryotic cells have many other characteristics, such as the membranous compartments of the ER, Golgi and lysosome, intracellular vesicles that serve as transporters, mitochondria, post-transcriptional processing of mRNAs, an intricate cytoskeleton consisting of different building blocks and a large cell size (on average 1000 times larger than the average prokaryote).

Eukaryotes are often taxonomically categorized into five different supergroups (Figure 1B, [29]). The supergroup of Opisthokonta consists of animals and their unicellular relatives, such as choanoflagellates, and fungi and their relatives. The supergroup of Amoebozoa consists mainly of unicellular organisms, but also encompasses aggregative multicellular ones such as slime moulds. The Archaeplastida are mostly photosynthetic, plastid-bearing organisms, like unicellular red and green algal species and multicellular land plants. These contain primary plastids that were derived by endosymbiosis of a cyanobacterium into a eukaryotic cell. The Excavata are largely unicellular and include plastid-bearing species as well as various parasites and free-living heterotrophs. The SAR supergroup is composed of Stramenopila, Alveolata and Rhizaria and consists mainly of unicellular organisms, such as ciliates, but also includes multicellular, plastid-bearing species. Most non-Archaeplastids that harbor plastids acquired these secondarily, by endosymbiosis of a cell containing a primary or secondary plastid [47].

Although these supergroups provide a useful framework to study (genomic) diversity and evolution, they do not form solid units: both the supergroups themselves and their members are being revised regularly [32]. Such revisions are often required due to newly available genomes, particularly those that represent newly identified taxa [49, 50]. Moreover, how these supergroups are related to one another is under continuous debate. While the Opisthokonta together with the Amoebozoa and some other smaller taxa convincingly form a monophyletic group, whether the Excavata, Archaeplastida and SAR also do is an open question. This question is directly related to the root of the eukaryotes. For the position of this root, broadly two models can be distinguished. In the first, LECA diverged into an Opisthokonta/Amoebozoa lineage and an Excavata/Archaeplastida/SAR lineage (the Opimoda-Diphoda model, [31]). In the second, LECA diverged into (a subset of) Excavata and all other eukaryotic lineages (the Excavata-Neozoa model, [33]). In this thesis, I will assume the mentioned five supergroups are related as proposed by the Opimoda-Diphoda model. However, the implications of my results would not be substantially different under the Excavata-Neozoa model.

[Figure 1. Panel A: tree of life with Bacteria, Archaea (including Asgard) and Eukarya. Panel B: eukaryotic tree with Opisthokonta, Amoebozoa, Excavata, Archaeplastida and SAR.]

Figure 1. Phylogenetic positions of eukaryotes and eukaryotic supergroups. A. The tree of life with the three domains of life (Bacteria, Archaea and Eukarya). The star indicates the probable root of the tree of life. Eukarya evolved from within the Archaea with the Asgard group as their most closely related sister clade. The Asgard represent one of the superphyla within the Archaea [9]. B. The eukaryotic tree of life with the 5 eukaryotic supergroups. The square indicates one of the hypothetical eukaryotic roots, largely in line with the Opimoda-Diphoda model [31], in which the Amoebozoa and the Opisthokonta comprise the Opimoda, and the Excavata, Archaeplastida and SAR the Diphoda. The Excavata that are being studied in this thesis likely are positioned as shown here, however, possibly not all Excavata do: The Excavata might not be monophyletic since some may group with the Opimoda.

Genome evolution in eukaryotes

Over the last decade, genomes of many diverse eukaryotes have been sequenced. These genomes provide much phylogenetic information that improved the eukaryotic species phylogeny, due to which for example the Amoebozoa could be related to the Opisthokonta (Figure 1B) [52]. By tracing homologous sequences across genomes, we now have a much better understanding of how eukaryotic genomes evolve. Such studies, which are collectively called ‘comparative genomics’ (see below), have indicated that eukaryotic genomes evolve primarily via gene duplication, including whole-genome duplication, and gene loss [53]. Unlike prokaryotes, eukaryotes do not seem frequently involved in horizontal gene transfer (HGT), which possibly explains why duplication and loss might play such important roles in eukaryotic genome evolution. Nevertheless, genes are being transferred horizontally towards and between eukaryotes, albeit at a lower frequency [54]. From or to which eukaryotes genes get transferred, how often they do, and how important these transfers are for eukaryotic genome evolution, is being hotly debated (see also Chapter 7).

Duplication and loss events may occur as part of a more general evolutionary process that shapes genomes. Genome evolution processes have been described in various theories, such as constructive neutral evolution [55] and inflation-streamlining [56]. The first theory seeks to explain why some genomes are more complex than others (‘constructive’), while they encode very similar functions (‘neutral’). Why would some species have six genes cooperating to accomplish a given task, while another one only has three? According to the constructive neutral evolution theory, a new gene (for example acquired through duplication) might start to function in a given process, e.g. by incorporating into a protein complex. Initially, it is dispensable for this process. However, other genes involved in this process might become mutated. Such a mutation could be tolerated in the presence of the new gene, but not in its absence. As a result, the new gene becomes essential, and the process has become more complex, although its outcome has not changed. Because the new gene cannot be lost, one also speaks of ‘irremediable complexity’ [57]. On a longer time-scale, although initially neutral, the increased complexity may subsequently form a substrate for the evolution of novel features. Furthermore, on a longer time-scale the increased complexity may not be completely irremediable, as we observed, for example, that many genes that contributed to complexity were lost in recent lineages (Chapter 3).

The inflation-streamlining theory posits that at the base of many species-rich clades, such as the eukaryotes, animals and plants, the genome complexified rapidly (‘inflation’). Subsequently, while lineages diverge, they gradually lose some genes that were present in their ancestor (‘streamlining’), sometimes in a differential manner. As a result of differential loss, we are still able to infer how complex this ancestor was. What may drive this evolutionary process? Possibly, the initial genome expansion is partially adaptive, for example through an increase in gene dosage, and partially neutral, for example in case of genes that co-duplicate in the same genomic region. These neutral expansions provide building blocks for adaptations, which in turn might make these genomically expanded lineages successful [58]. Subsequently, genomes get streamlined either via neutral loss, or via positive selection for a smaller genome. Likewise, after a whole-genome duplication (the ‘inflation’), lineages seem to evolve by diversifying and losing many of the ancestral genes, sometimes reciprocally [59, 60]. The inflation-streamlining theory may capture what is often observed in large-scale phylogenomics reconstructions: many ancestors of major taxa, such as LECA, appear to have possessed surprisingly many genes, and these genes were often lost in various more recent lineages [56]. This is what we also observe when studying a specific cellular structure such as the kinetochore (Chapter 3). However, the size of ancestral genomes may also be overestimated, particularly if the frequency of HGT is underestimated. Of note, although the constructive neutral evolution and inflation-streamlining theories predict different observations in the evolution of the gene contents of species, they are not necessarily mutually exclusive. Among eukaryotes, for example within the animal lineages, we observe the footprints of both processes [61].
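The way differential loss still allows ancestral gene content to be inferred can be illustrated with a small sketch. The species names and gene sets below are invented for illustration, and the rule used (a gene is assigned to the ancestor if it occurs on both sides of the assumed root) is a deliberately simple assumption that ignores HGT; it is not the reconstruction method used in this thesis.

```python
# Toy illustration: inferring ancestral gene content from differential loss.
# All species names and gene sets are hypothetical.

# Two daughter lineages of the ancestor, each with extant species.
lineage_a = {
    "species_1": {"g1", "g2", "g3"},
    "species_2": {"g1", "g4"},
}
lineage_b = {
    "species_3": {"g2", "g4", "g5"},
    "species_4": {"g1", "g3", "g6"},
}


def genes_in_lineage(lineage):
    """Return all genes observed in at least one species of a lineage."""
    genes = set()
    for gene_set in lineage.values():
        genes |= gene_set
    return genes


def infer_ancestral_genes(side_a, side_b):
    """Assign a gene to the ancestor if it occurs on both sides of the root.

    Assumes vertical inheritance only (no horizontal gene transfer), so the
    ancestor is overestimated if transfers did occur.
    """
    return genes_in_lineage(side_a) & genes_in_lineage(side_b)


ancestor = infer_ancestral_genes(lineage_a, lineage_b)
all_species = list(lineage_a.values()) + list(lineage_b.values())
largest_extant = max(len(gene_set) for gene_set in all_species)

print(sorted(ancestor))   # genes inferred for the ancestor
print(largest_extant)     # size of the largest extant gene set
```

In this toy data set no single species retains more than three of the genes, yet four genes are inferred for the ancestor, because different lineages lost different genes.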

The origins of eukaryotes

Eukaryotic cells arose from prokaryotes approximately 2 billion years ago [1, 2]. Given the many ways in which eukaryotes differ from prokaryotes, this transition famously has been described as “the greatest single evolutionary discontinuity to be found in the present-day world” [62], although its uniqueness has been challenged [63]. The eukaryotic cell originated between the divergence of the eukaryotic lineage from its prokaryotic ancestors (referred to as the first eukaryotic common ancestor, or FECA) and the eukaryotic species that gave rise to all current-day eukaryotic life (LECA), a process called ‘eukaryogenesis’ (Chapter 5, Figure 1). To shed light on the series of events that comprised eukaryogenesis, it is essential to pinpoint from which prokaryotes the eukaryotes descended. Hence, over the last decades, many studies focused on identifying where eukaryotes are positioned in the tree of life. These have discovered that eukaryotes evolved from within the Archaea [7, 8]. Recently, the archaeal origin of the eukaryotes gained further support from metagenomics studies that identified the closest archaeal relatives of eukaryotes so far, i.e. the Asgard superphylum [9, 10]. This is not the complete story, however: it has long been recognized that mitochondria, the ATP-generating organelles of eukaryotic cells, emerged from the endosymbiosis of a bacterium [17]. Many consider this endosymbiotic event a major or even a key contribution to the evolution of the eukaryotic cell. Unsurprisingly, many efforts have been made to identify the position of this bacterium in the tree of life as well. It is widely recognized to belong to the class of Alphaproteobacteria and recently it was shown to diverge deeply within the alphaproteobacterial species tree, before the diversification of most alphaproteobacterial clades [64]. Hence, eukaryotes stem from an Asgard-related host cell that incorporated an early-branching alphaproteobacterium.

While the source of the mitochondria is more or less resolved, those of the other typical eukaryotic features are not. How, why and when did features like the endomembrane system, post-transcriptional processing, linear chromosomes with telomeres, meiotic sex, and the eukaryotic cytoskeleton evolve? And how is their evolution related to the mitochondrial endosymbiosis? Did this endosymbiosis trigger the autogenous evolution of all other eukaryotic features, or did the alphaproteobacterium enter the Asgard-related host when these were already present? Some of these features, or at least their genetic building blocks, might indeed have been present already; the genomes of Asgard species contain various genes previously considered to be eukaryote-specific, including some that in eukaryotes have roles in for example the endomembrane system [9, 10]. Since hitherto we cannot culture Asgard cells, we cannot investigate whether these genes fulfill a similar role, i.e. assess whether Asgard cells have something like a (primordial) endomembrane system for which they use these genes. Using phylogenomics, Pittis & Gabaldón found additional evidence for a relatively late entry of the mitochondria, as the host genome might have already encoded various genes playing roles in typical eukaryotic processes [65]. While this was one of the few attempts to systematically assess how complex the host already was, their method was strongly criticized and many continue to consider it more likely that mitochondrial entry was the key event that triggered the evolution of all other typical eukaryotic features [66, 67]. Mitochondria would have provided an ATP surplus that allowed for an increase in genome and cell size, or would have driven the evolution of the nucleus as a way to cope with the harmful effects of spliceosomal introns [68-70]. A third scenario, coined ‘inside-out’, proposes concomitant evolution of mitochondria and the nucleus [71].

The emergence of eukaryotic cellular complexity coincided with an increase in genome complexity: the genome of a typical eukaryote contains approximately four times as many genes as a prokaryotic genome [68]. Next to the genes that were already present in the common ancestor of the eukaryotes and their archaeal relatives, genes from the alphaproteobacterial endosymbiont also contributed to the eukaryotic genome. Moreover, genes from various other Bacteria likely entered the pre-eukaryotic lineage via horizontal gene transfer [72]. Such horizontally transferred genes include for example certain genes involved in the nuclear pore [73]. Although various previously ‘eukaryote-specific’ genes were detected in Archaea, other eukaryotic genes still appear to have no homologs in prokaryotes [74, 75]. These likely evolved de novo between FECA and LECA. Finally, genes (with and without prokaryotic ancestry) duplicated frequently during eukaryogenesis, thus duplications also contributed to the gene complement of eukaryotes [76]. In Chapters 5 and 6, I examine how these duplications contributed to cellular complexity of eukaryotes and what sort of genes duplicated.

Cell division and chromosome segregation

Cell cycle

Eukaryotes differ from prokaryotes in the way they regulate and execute cell division. Prokaryotes seem to replicate and segregate their genomes, and divide their cells, in diverse manners [27, 77]. Most eukaryotes have conserved the overall layout of these processes, which is referred to as the eukaryotic cell cycle. The cell cycle is divided into two parts: interphase, which consists of the G1, S and G2 phases, and mitosis, referred to as the M phase. During G1 and G2, cells primarily grow. During S (‘synthesis’) phase, the nuclear DNA replicates, resulting in duplicated chromosomes that consist of two sister chromatids. These are separated during the M phase, which usually ends with the actual cell division, a.k.a. cytokinesis. Transitions from one phase to another are regulated by kinases (cyclin-dependent kinases, or CDKs) [78], whose essential cofactors (cyclins) are regulated at the transcriptional and translational levels, and by protein degradation. For example, CDK1 and cyclin B collaborate to initiate mitosis. In M phase, sister chromatid separation is regulated by the anaphase-promoting complex/cyclosome (APC/C, see below).

The mitotic machinery

In S phase, sister chromatids of duplicated chromosomes are ‘glued’ to each other by cohesin, a protein complex that encircles them in a ring-like structure. In M phase, when the sister chromatids segregate, cohesin needs to be removed. In order to ensure that each daughter cell receives a complete set of chromosomes, the chromatids need to be segregated equally. Segregation is therefore a strictly regulated process, and missegregation can have detrimental effects on organismal fitness [79]. Chromosomes are segregated by the mitotic spindle (Figure 2A). This spindle consists of four elements: (1) microtubule-organizing centers (MTOCs), which are typically located at opposite ends of the cell, and form the ‘poles’ of the spindle, (2) microtubules, which emanate from these MTOCs, forming the spindle ‘fibers’, (3) kinetochores, which connect the spindle microtubules to the chromosomes, and (4) the duplicated chromosomes themselves, consisting of two sister chromatids bound by cohesin. The sister chromatids are pulled apart by spindle microtubules. In human cells, the midzone of this spindle forms the place where later the cytoplasm will divide during cytokinesis, but eukaryotes differ in how they organize cytokinesis [80]. As stated, for sister chromatid separation, cohesin needs to be removed. This removal only occurs after all chromosomes have their sister kinetochores stably attached to microtubules from opposite spindle poles, referred to as ‘bioriented attachment’. Cohesin is removed through cutting of its ring-like structure by separase. Separase is activated by the APC/C, which itself is activated only when stable, bioriented attachments have been established for all chromosomes. As a result, sister chromatids are only allowed to separate if they are connected to microtubules from the opposite spindle poles. The surveillance mechanism that monitors the kinetochore attachment status is called the spindle assembly checkpoint (SAC) [81]. The SAC directly senses the attachment status of kinetochores: SAC proteins localize to kinetochores if these are unattached, and from these kinetochores they emit a signal that inhibits the APC/C.

As far as is known, all eukaryotes use microtubules to separate their nuclear chromosomes, which is why LECA probably did so as well. However, eukaryotes differ in the ways they organize their spindle, of which I will describe a few examples here. In metazoan lineages the mitotic MTOCs consist of centrosomes. At the beginning of M phase, the nuclear membrane disassembles (‘open mitosis’), allowing the microtubules from the centrosomes to connect to the chromosomes. In contrast, in budding yeasts the MTOCs are so-called spindle pole bodies, which are embedded in the nuclear membrane. In these species, the nuclear membrane stays intact during mitosis (‘closed mitosis’). In dinoflagellates, a group within the Alveolata, the nucleus stays intact as well, but the spindle forms in the cytoplasm, and connects to the nuclear envelope-embedded kinetochore [82]. Next to these examples of ‘open’ and ‘closed’ mitosis, intermediate forms also exist. For example, the fungus Aspergillus nidulans has a semi-open mitosis, in which the nuclear pores disassemble but the envelope does not [83]. Moreover, in human, kinetochores connect to 15-20 microtubules [84], whereas in the amoebozoan Dictyostelium discoideum they bind 2-3 [85], and in budding yeasts only a single one [86]. The eukaryotic spindle morphologies are thus highly different [87, 88], due to which it is difficult to infer what sort of mitosis LECA had [89].

[Figure 2. Panel A: mitotic spindle, showing the microtubule-organizing center (MTOC), spindle microtubules, kinetochores, sister chromatids and poleward movement. Panel B: human kinetochore, showing centromeric DNA, the inner centromere, the inner kinetochore (KT) and the outer KT, with proteins grouped into protein complexes: Chromosomal Passenger Complex (CPC), Constitutive Centromere-Associated Network (CCAN), Mis12 Complex (Mis12-C), Ndc80 Complex (Ndc80-C) and Knl1 Complex (Knl1-C) (KMN network), Rod-Zwilch-ZW10-Spindly Complex (RZZS*), Spindle Assembly Checkpoint (SAC), TRIP13-p31comet, Astrin-SKAP and Ska Complex (Ska-C).]

Figure 2. Illustrations of the eukaryotic mitotic spindle and the human kinetochore. A. The mitotic spindle in eukaryotes separates sister chromatids of duplicated chromosomes through the pulling forces of spindle microtubules. Spindle microtubules depolymerize at the chromosome-bound end, due to which they shorten and pull the chromatids along towards the microtubule-organizing centers (MTOCs). B. Kinetochores form the attachment site of the microtubules to the chromosomes. The human kinetochore is shown here, and its proteins are colored according to the protein complex they belong to. The budding yeast kinetochore is largely similar. Notable differences include the RZZ (which yeast lacks), the Ska complex (which yeast also lacks, while instead it has the Dam1 complex - see Chapter 4) and the inner kinetochore Nkp complex and Csm1 protein (which yeast has, but human lacks).

Eukaryotic chromosomes have specialized regions, called centromeres, that mediate their separation [90]. The centromere is the spot where the kinetochore is assembled and where the chromosomes connect to microtubules. In eukaryotic species with a characterized centromere, the centromere is specified by its alternative nucleosome composition. A ‘regular’ nucleosome consists of DNA that is wrapped around two copies each of the proteins H2A, H2B, H3, and H4. A centromeric nucleosome contains CenpA (or ‘CenH3’) instead of H3. Although the centromere is an important structure for cell division, centromeric DNA sequence and centromere size are not conserved [91, 92]. In fact, centromeric DNA evolves tremendously fast [93] and has been observed to expand and contract. This dynamic nature of the centromere is referred to as the ‘centromere paradox’ [94]. It may be explained by adaptive evolution towards a stronger centromere that binds more microtubules in species with unequal female meiosis, a process called ‘meiotic drive’. Meiotic drive may also affect the evolution of centromere-binding proteins, such as CenpA [95] (see also Chapter 3). Across many species, however, centromeres are composed of tandem repeats [96]. The DNA sequences of these regions are generally not sufficient to define the centromere. Instead, centromeres are defined epigenetically, which may partially explain why the sequences are relatively free to evolve. Budding yeasts form a notable exception: their centromeres are defined by a short, specific DNA sequence of approximately 120 base pairs [97]. In contrast to these ‘point’ centromeres of budding yeast, most species have ‘regional centromeres’, spanning a larger region of the chromosome. Yet other species, such as nematodes, certain insects and the sedge Rhynchospora pubera, have centromeres that are dispersed along the full length of their chromosomes. These are called ‘holocentromeres’. Holocentromeres are found to have different configurations [98, 99], varying from dispersed point centromeres [100] to CenpA-lacking centromeres [101] to multiple, interspersed satellite regions [99].

Kinetochore & the inner centromere

The kinetochore connects the sister chromatid to spindle microtubules at the site of the centromere (Figure 2B). In addition, it serves as a signaling platform for the SAC. The kinetochores of human and budding yeast have been studied thoroughly, due to which we know their proteins [102]. The kinetochore is subdivided into the inner kinetochore, which is closely associated with the centromere, and the outer kinetochore, which captures spindle microtubules and activates SAC signaling. Below, I will describe these regions and the proteins that constitute them. I will also describe inner centromeric proteins, which contribute to the regulation of microtubule attachment. Unless mentioned otherwise, I will primarily describe the most important proteins of the kinetochore that human and yeast share, and that are likely also part of the kinetochores of other eukaryotic clades. In Chapters 3 and 4, the conservation and divergence of these kinetochore proteins across eukaryotes will be examined and discussed in more detail.

The inner kinetochore

The typical CenpA-containing nucleosomes recruit the Constitutive Centromere-Associated Network (CCAN), which is the main constituent of the inner kinetochore. The CCAN bridges the centromeric chromatin to the microtubule-binding site formed by the outer kinetochore. The CCAN consists of CenpC and four protein complexes: CenpT-W-S-X, CenpO-P-Q-U (plus CenpR in human), CenpL-N and CenpH-I-K-M. The CenpA-containing nucleosomes are recognized by CenpC and CenpN [103]. CenpC serves as a scaffold within the inner kinetochore. Furthermore, CenpC recruits, together with CenpT, the microtubule-binding KMN network of the outer kinetochore (Knl1-C, Mis12-C and Ndc80-C, see below) [104, 105]. In addition to CenpC and CenpT, the subcomplex CenpQ-U has been shown to interact with the outer kinetochore KMN network, possibly serving as an extra bridge [106]. CenpU also has a role in recruiting Plk, a protein kinase that operates in the SAC [107]. In addition, the yeast CCAN contains Nkp1 and Nkp2, which constitute the Nkp complex. Whether these proteins have a specific role in the CCAN is currently unknown [108].

The outer kinetochore & the SAC

Kinetochores bind microtubules primarily via the KMN network. This network consists of three complexes: Knl1-C (Knl1-Zwint-1), Mis12-C (Mis12-Nnf1-Dsn1-Nsl1) and Ndc80-C (Ndc80-Nuf2-Spc25-Spc24). The KMN network is recruited by CenpC via Mis12 and by CenpT via Spc24 and Spc25 [109]. Ndc80 is the main microtubule-binding protein of the kinetochore. If Ndc80 is not bound by microtubules, it activates the SAC. It does so by binding to Mps1. Mps1 is an outer kinetochore kinase that governs the SAC. If recruited to the kinetochore, Mps1 phosphorylates Knl1. Phosphorylated Knl1 recruits Bub3 and MadBub (BubR1 in human, Mad3 in yeast). Together, Bub3, MadBub and Mad2, which is recruited by Mad1, bind Cdc20. These four proteins form the Mitotic Checkpoint Complex, or MCC. The MCC effectuates the SAC. In an unbound state, Cdc20 activates the APC/C, but as part of the MCC it inhibits the APC/C. Hence, as long as no microtubule is bound, the kinetochore keeps the APC/C inactive. Consequently, the sister chromatids are not allowed to separate (see above). Once Ndc80 is attached to microtubules, it no longer binds Mps1, thus no more MCC is being produced and the APC/C becomes activated by free Cdc20. The human outer kinetochore also contains the Rod-Zwilch-ZW10 (RZZ) complex, which recruits Mad1 and Spindly. The latter protein serves to ‘clean’ the kinetochore from SAC proteins after Ndc80 has been captured by microtubules. Although Ndc80 is the primary microtubule-binding protein, microtubules also interact with Knl1, with the three-subunit Ska complex (human) and the ten-subunit Dam1 complex (budding yeast). These latter two complexes serve to stabilize the kinetochore-microtubule interactions (see also Chapter 4).
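The checkpoint logic described above can be summarized in a toy Boolean model. This is only a sketch of the qualitative chain of events in the text (unattached Ndc80, Mps1 recruitment, Knl1 phosphorylation, MCC formation, APC/C inhibition); the function names are invented for illustration, and the model ignores kinetics, biorientation and the error correction by the inner centromere.

```python
# Toy Boolean sketch of the spindle assembly checkpoint (SAC).
# Function names are hypothetical; this is not a quantitative model.


def kinetochore_produces_mcc(ndc80_bound_to_microtubule):
    """Return True if this kinetochore contributes to MCC formation."""
    # Unattached Ndc80 binds Mps1 at the kinetochore.
    mps1_recruited = not ndc80_bound_to_microtubule
    # Recruited Mps1 phosphorylates Knl1, which recruits Bub3 and MadBub;
    # together with Mad2 these capture Cdc20 into the MCC.
    knl1_phosphorylated = mps1_recruited
    return knl1_phosphorylated


def apc_c_active(attachment_states):
    """The APC/C is active only if no kinetochore produces MCC."""
    mcc_present = any(kinetochore_produces_mcc(s) for s in attachment_states)
    return not mcc_present


def sisters_may_separate(attachment_states):
    """Active APC/C activates separase, which cuts the cohesin ring."""
    return apc_c_active(attachment_states)


# One entry per kinetochore: True means Ndc80 is bound to a microtubule.
print(sisters_may_separate([True, True, True]))
print(sisters_may_separate([True, False, True]))
```

The second call illustrates the key property of the checkpoint: a single unattached kinetochore is enough to keep the APC/C inactive and to block sister chromatid separation.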

The inner centromere

Although not part of the kinetochore, proteins of the inner centromere are critical for kinetochore function. The inner centromere is the chromatin region between the sister kinetochores. The inner centromere proteins that impact kinetochore function are those that form the Chromosomal Passenger Complex (CPC: Aurora, Incenp, Borealin, Survivin) and Sgo [110, 111]. These proteins destroy erroneous microtubule-kinetochore attachments, such as two sister kinetochores being attached to microtubules from the same spindle pole [112]. They do so via the kinase module of the CPC, the protein kinase Aurora B, which phosphorylates Ndc80 to diminish its affinity for microtubules. Aurora B additionally phosphorylates subunits of the Ska and Dam1 complexes [113], which, as a result, detach from the microtubules [114, 115].

In this introduction, I outlined the most important and shared components of the human and yeast kinetochores, which, compared to those of other eukaryotic species, have been studied in great detail. As shown in Chapters 3 and 4, the kinetochores of other eukaryotes differ strongly. In fact, although not identical, the human and yeast kinetochores are relatively similar. I cannot exclude that other eukaryotes contain yet other important kinetochore components, which we, given the lack of experimental data in these species, do not know of yet. The kinetochores of trypanosomes (Excavata) may contain (some) other components [116], but I consider these likely to be derived and not shared with other eukaryotes, which is why I do not describe them here (see also Chapter 7).

Comparative genomics

In the work I describe in this thesis, I used comparative and evolutionary genomics to map the diversity and ancestry of eukaryotic kinetochores, and to uncover the processes that gave rise to the genome of LECA. In Chapter 2, I argue that specific research questions require specific approaches and I delineate my perspective on how to study the evolution of a single protein. Here I provide some key definitions and explain how comparative genomics can be used to detect co-evolution, and how this in turn can be used to predict gene functions.

Over the last decades, the life sciences have become increasingly influenced by genomic data. Whereas initially most sequenced genomes were prokaryotic, because these are small and relatively simple to sequence, now more and more eukaryotic genomes are becoming available. Not only has the number of sequenced eukaryotic genomes increased, but also their diversity. The amount of eukaryotic genome data will likely continue to increase, since sequencing no longer necessitates culturing and since new sequencing projects intend to cover a yet wider, and more representative, genomic diversity [117-119]. By analyzing genomes, we can identify genes or proteins that share common ancestry (‘homologs’), both in different species and within a single species. Homologs shed light on the evolutionary history of proteins and can be used to predict protein functions by transferring functional characterizations across homologs [120]. As such, comparative genomics has proven to be a powerful tool to uncover large-scale evolutionary events, to guide experimental biology and to construct or improve the species tree, to name a few applications [121]. Unsurprisingly, as the amount of genome information grows, so do the number and quality of tools that are used to identify and analyze homologs.

Homology, orthologs, paralogs, analogs

As mentioned, proteins that share common ancestry are called ‘homologs’. Homology does not just apply to proteins (or genes; in the remainder of this chapter the two terms can be used interchangeably). It may apply to any pair of biological entities that descend from a single ancestral feature. In comparative genomics, genes/proteins are recognized to be homologous if their aligned (DNA/amino acid) sequences are sufficiently similar. Homology is a Boolean trait: either two features are homologous, or they are not. Hence, saying that “proteins A and B are 60% homologous” is strictly speaking incorrect. Often, this phrase is actually intended to mean that 60% of their DNA/amino acid positions are identical or similar. Homology is also a transitive trait, which means that if sequences (in the narrow meaning of: a sequence of nucleotides or amino acids, not necessarily a full protein or gene) A and B are homologous, and B and C are homologous, A and C are homologous as well. We discriminate various types of homology relationships between proteins: homologous proteins might be orthologs or paralogs, depending on which sort of event split them since their common ancestor (see Chapter 2, Box 2) [122]. If two proteins were derived from speciation of an ancestral species, they are orthologous. If they were derived from a gene duplication within a given ancestral species, they are paralogous. Moreover, some homologous proteins underwent a horizontal gene transfer since their last common ancestor. These are sometimes called ‘xenologs’, although logically these can also be classified as either orthologous or paralogous.
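To make the distinction between the qualitative statement ‘homologous’ and a quantitative measure such as percent identity concrete, the following minimal sketch computes the percentage of identical positions for two already aligned sequences. The sequences are hypothetical, and the choice to ignore columns that contain a gap is an assumption; other denominators (for example the full alignment length) are also in use.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns without a gap in either sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Hypothetical aligned sequences, for illustration only
seq_a = "MKT-LLVAG"
seq_b = "MKSALLVSG"
print(f"{percent_identity(seq_a, seq_b):.1f}% identical positions")
```

The resulting number describes sequence identity; whether the two sequences are homologous remains a yes-or-no inference that is drawn from such similarity.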

Why is it important to discriminate between orthologs and paralogs, rather than simply using ‘homologs’? To answer this question, we have to consider how we think the function of a protein evolves after either a speciation or a duplication. Ohno, in his famous book Evolution by Gene Duplication, proposed that after a gene duplicates into two paralogs, one of these paralogs is released from the evolutionary pressure to conserve the function of the ancestral, pre-duplication protein (free from ‘purifying selection’) [123]. Hence, its sequence may diverge and it may thereby acquire a new function. In other words: it may ‘neofunctionalize’. As a result, duplication paves the way for molecular and functional innovation. Since Ohno’s publication, many have observed that paralogs may not just ‘neofunctionalize’ but also ‘subfunctionalize’, if the function of the ancestral protein is divided among the duplication-derived paralogs [124]. Nevertheless, subfunctionalization still implies that paralogs often do not retain the (entire) ancestral function. In contrast, orthologs are expected to have identical, or at least biologically equivalent, functions. This model about the alternative ways in which a protein’s function evolves after either a duplication or a speciation is known as ‘the ortholog conjecture’ [125]. The ortholog conjecture has been challenged by arguments and observations about an equally strong or stronger functional correlation between paralogs compared to orthologs [126, 127]. Nevertheless, this conjecture is commonly used to rationalize why the function of a characterized protein in a given species may be transferred to an uncharacterized protein in another species: this transfer is considered justified if these proteins are orthologs, but not if they are paralogs.
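The distinction between orthologs and paralogs can be read from a gene tree in which each internal node is annotated as a speciation or a duplication: the event at the last common ancestor node of two genes determines their relationship. The sketch below illustrates this on a hypothetical tree; the `Node` class, the gene names and the assumption that events are already annotated are all illustrative, and in practice these annotations have to be inferred by reconciling the gene tree with the species tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                    # gene name (leaf) or free-text label (internal node)
    event: Optional[str] = None   # "speciation" or "duplication"; None for leaves
    children: List["Node"] = field(default_factory=list)


def leaf_labels(node: Node) -> set:
    """Collect the gene names below a node."""
    if not node.children:
        return {node.label}
    labels = set()
    for child in node.children:
        labels |= leaf_labels(child)
    return labels


def relationship(root: Node, gene_a: str, gene_b: str) -> str:
    """Classify two homologs by the event at their last common ancestor node."""
    if gene_a == gene_b or not {gene_a, gene_b} <= leaf_labels(root):
        raise ValueError("need two different genes that are both in the tree")
    node = root
    while True:
        for child in node.children:
            if {gene_a, gene_b} <= leaf_labels(child):
                node = child  # both genes sit below this child: descend
                break
        else:
            break  # no single child holds both genes: node is their last common ancestor
    return "orthologs" if node.event == "speciation" else "paralogs"


# Hypothetical gene tree: one ancient duplication followed by a speciation in each copy
toy_tree = Node("pre-duplication gene", "duplication", [
    Node("copy alpha", "speciation", [Node("human_Ya"), Node("amoeba_Ya")]),
    Node("copy beta", "speciation", [Node("human_Yb"), Node("amoeba_Yb")]),
])

print(relationship(toy_tree, "human_Ya", "amoeba_Ya"))  # orthologs
print(relationship(toy_tree, "human_Ya", "human_Yb"))   # paralogs
print(relationship(toy_tree, "human_Ya", "amoeba_Yb"))  # paralogs
```

Note that the last pair consists of genes from different species that are nonetheless paralogs, which is exactly the situation in which transferring a function across species would not be supported by the ortholog conjecture.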

While orthologs may have biologically equivalent functions, functionally similar proteins may not be orthologous. In fact, they might not even be homologous: many studies have reported proteins with similar functions but without detectable homology [128, 129]. Such proteins are called ‘analogs’. Analogs might arise if one protein, for example encoded by a novel gene, takes over the function of another. As a result, the latter may become dispensable and be lost. We refer to this process as ‘non-homologous displacement’. In some cases, proteins initially thought to be analogs turned out to be homologs. Possibly, their shared ancestry was not detected because their sequences diverged extensively since their common ancestor. In these cases, their solved crystal structures may show the similarity that points to their homology.

The terms ‘homologs’, ‘orthologs’ and ‘paralogs’ are used to describe the relationship between two genes or proteins. However, comparative genomics is about studying many species, and therefore many proteins, at the same time. For example, a common comparative genomics question is: which species have a particular protein? To answer this question, we aim to establish a set of homologous sequences that comprise an orthologous group (Chapter 2, Box 2). Such an orthologous group encompasses all proteins that were derived from a single protein in a selected last common ancestor, such as LECA or the last common ancestor of all animals. Within the orthologous group, two proteins can be paralogs. This is the case if they resulted from a gene duplication that occurred in a more recent lineage than the ancestral species we chose to define the orthologous group for. These proteins are called ‘inparalogs’. Two proteins that result from a gene duplication prior to the last common ancestor of choice are called ‘outparalogs’; they are in different orthologous groups. In this thesis, I mostly apply comparative genomics to the domain of eukaryotes. Therefore, I generally define the orthologous group as those sequences that were derived from a single protein in LECA. However, not all proteins I study date back to LECA. Some likely were invented more recently. In those cases, the orthologous group only contains sequences from species that descend from the lineage in which the protein was invented.

Phylogenetic profiling & detecting co-evolution

Comparative genomics often aims to find out which species have a given protein, that is, a member of an orthologous group, and which do not. This information can be used to predict this protein’s function based on its predicted (functional or physical) interactors. The presences and absences of a protein are captured by a so-called ‘phylogenetic profile’, which simply is a binary vector, consisting of e.g. ones and zeros, or ‘P’s (presences) and ‘A’s (absences) (Figure 3). Often species are ordered according to their position in the species tree. This phylogenetic profile is used to infer whether the protein of interest co-occurs with another protein across species. If two proteins have identical or very similar phylogenetic profiles, this suggests that they are functionally related, or maybe even interdependent. As a result, phylogenetic profiling can be used to predict functions for uncharacterized proteins in a ‘guilt-by-association’ approach [130], which has proven to be successful for many proteins [28]. Moreover, the phylogenetic profile of a protein can be compared to those of other biological entities, such as a cellular feature, a functional protein motif or a metabolic product. For example, many proteins involved in the centrosome have a phylogenetic profile that is very similar to that of the centrosome itself [131]. Conversely, this method can also be used to find out which (other) proteins are involved in a certain process. Phylogenetic profiles are informative not only when they are very similar, but also when they are very dissimilar [132]. If two proteins are both present in many species, but never co-occur, they might be functionally redundant. After all, if two proteins do exactly the same job, it does not make sense to have them both, especially if they cannot cooperate or alternate. Such proteins might be analogs.
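As a minimal sketch of how phylogenetic profiles can be compared, the example below encodes three hypothetical proteins as binary vectors over eight hypothetical species and computes two simple measures: the fraction of species in which two profiles agree, and whether two proteins ever co-occur. The profiles, function names and the choice of these two measures are assumptions made for illustration; published methods use a variety of similarity scores, and some also take the species tree into account.

```python
def profile_agreement(profile_p, profile_q):
    """Fraction of species in which two proteins are both present or both absent."""
    if len(profile_p) != len(profile_q):
        raise ValueError("profiles must cover the same species, in the same order")
    matches = sum(p == q for p, q in zip(profile_p, profile_q))
    return matches / len(profile_p)


def never_cooccur(profile_p, profile_q):
    """True if there is no species in which both proteins are present."""
    return not any(p and q for p, q in zip(profile_p, profile_q))


# Hypothetical profiles (1 = present, 0 = absent) over the same eight species
protein_a = [1, 1, 0, 1, 1, 0, 0, 1]
protein_b = [1, 1, 0, 1, 0, 0, 0, 1]
protein_c = [0, 0, 1, 0, 0, 1, 1, 0]

print(profile_agreement(protein_a, protein_b))  # high agreement: candidate co-evolution
print(profile_agreement(protein_a, protein_c))  # no agreement: candidate displacement
print(never_cooccur(protein_a, protein_c))
```

In this toy data set, proteins A and B agree in seven of eight species, whereas A and C agree in none and never co-occur, which corresponds to the similar and dissimilar profile types described above.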

Both similar and dissimilar phylogenetic profiles are the result of evolutionary processes. Similar profiles often indicate that proteins co-evolved, for example because if one of the proteins is lost in a given lineage, the other one cannot function any longer and is lost in this lineage as well. Dissimilar profiles might indicate displacement, which I mentioned above. Phylogenetic profiling hence also provides a means to study a gene’s (co-)evolutionary dynamics, albeit in an indirect manner. Co-evolution can also be studied more directly by comparing phylogenetic trees of proteins: if these are very similar, this points towards co-evolution [28]. In this approach, one is able to time when in evolution a protein was lost, but also when it for example duplicated, and whether this coincided with the loss or duplication of another protein. Thereby, one directly observes that these proteins co-evolved, and how they did. Moreover, co-evolution may not only occur at the level of proteins, but for example also at the level of residues [133], or between a protein and a protein module [134]. Such co-evolution can equally well provide functional cues.

[Figure 3 image: species tree of H. sapiens, C. owczarzaki, S. cerevisiae, F. alba, A. castellanii, D. discoideum, N. gruberi, T. brucei, E. siliculosus, P. falciparum, B. natans, A. thaliana, C. reinhardtii and G. sulphuraria, onto which the gains/presences and losses of proteins A, B and C are mapped, next to their phylogenetic profiles (pp). Correlating pp’s: collaboration? Anticorrelating pp’s: substitution?]

Figure 3. Phylogenetic profiling of proteins reflects co-evolution and might indicate functional interactions. Hypothetical proteins A, B and C were found to be present or absent in certain lineages, as indicated by their phylogenetic profiles (right side). Proteins A and B are often present or absent together, which may indicate that they depend on each other for their function and hence collaborate in current-day species, for example in a single pathway or in a protein complex. Proteins A and C, and B and C, are never both absent or both present in a single species. This might indicate that they perform the same functions in different species as analogous proteins. The phylogenetic profiles of proteins are the results of their evolutionary trajectories, which are represented on the species tree. From this evolutionary reconstruction, one can directly observe that proteins A and B co-evolve: they often get lost on the same branch. Moreover, loss of A and B may either predate or postdate the gain of protein C.

Scope and outline of this thesis

In the work described in this thesis, I used comparative genomics to study the evolution of eukaryotic proteins since and before the last eukaryotic common ancestor, specifically those proteins that constitute the eukaryotic kinetochore. I mainly applied comparative genomics on a small scale, that is, by manually studying the evolution of individual proteins. In Chapter 2, I set out how I approach this type of study. Thereby, this chapter serves as a guide for other researchers who want to investigate the evolution of a particular protein of interest. Moreover, this chapter reveals how I examined the evolution of various kinetochore proteins in Chapters 3, 4, and 5. In Chapter 3, I report the overall presences and absences of characterized kinetochore proteins across eukaryotic species. I try to explain which (co-)evolutionary dynamics are responsible for these presences and absences. This inventory revealed the unique presence-absence patterns of the analogous outer kinetochore complexes Ska and Dam1. In Chapter 4, I discuss our in-depth study on the evolution of these complexes. I propose they arose from ancient gene duplications and that Dam1 spread via horizontal gene transfer and displaced Ska in the recipient lineages. In the analysis performed for Chapter 3, we furthermore noticed that in addition to the Ska and Dam1 complexes, other kinetochore complexes also contain paralogous proteins. In Chapter 5, we uncover the ancient origins of kinetochore proteins, which revealed that the kinetochore is of mosaic origin and that duplication played an important role in its expansion during eukaryogenesis. In Chapter 6, I generalize the latter topic by addressing how gene duplications during eukaryogenesis contributed to the complex genome of LECA. I studied the prokaryotic origins and the duplication histories of LECA’s genes through phylogenomics. In Chapter 7, I end this thesis with a discussion on the evolutionary phenomena I encountered, their impact on kinetochore evolution and why investigating them can be a challenge.

2. Inferring the evolutionary history of your favorite protein: A guide for cell biologists

Jolien JE van Hooff, Eelco Tromer, Teunis JP van Dam, Geert JPL Kops and Berend Snel

Manuscript submitted

Summary

Van Hooff et al. explain how to approach the evolutionary analysis of individual proteins. By outlining different evolutionary scenarios, they provide an analytical scheme for molecular and cellular biologists who aim to study the evolution of their protein of interest.

Abstract

Comparative genomics has proven a fruitful approach to acquire many functional and evolutionary insights into core cellular processes. Such studies typically yield a set of sequences orthologous to the protein of interest. They lack a simple protocol, because different proteins have different evolutionary dynamics, and therefore demand different approaches. For the same reason, automatic approaches to establish sets of orthologs often fall short. We here discuss this challenge from a practical (what are the observations?) and conceptual (how do these indicate what happened in evolution?) viewpoint, with the aim to guide investigators who want to analyze the evolution of their protein of interest. We argue that one should first and foremost generate a scenario for the protein’s evolutionary dynamics, because it aids in choosing the appropriate strategy. By sharing how we draft, test and update such a scenario and how it directs our investigations, we hope to illuminate how to execute molecular evolution studies and how to interpret them.

Introduction

Historically, comparative and evolutionary analysis of genes and proteins has been a versatile and useful approach in molecular biology to guide experiments that investigate the cellular role of human proteins. Typically, their evolutionary interrogation is used to answer the following questions: What is the function of characterized homologs in other species, such as budding yeast? Which residues are conserved and therefore likely functionally important? For example, the telomere function of the human protein Rap1 was predicted from the telomere function of its yeast ortholog [135], and the multiple sequence alignment of S6 kinases identified the TOS motif, which is essential for TOR signaling [136]. Evolutionary analysis can also have more advanced applications (Table 1). The function of a protein might be predicted from ‘guilt-by-association’: if a protein consistently co-occurs with another protein across species, it is likely to play a role in the same process [137]. The structure of a protein might be predicted from a multiple sequence alignment: a multiple sequence alignment might reveal which residues co-evolve, and therefore likely are in close proximity in the protein’s 3D structure [138].
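The question of which residues are conserved can be illustrated with a minimal sketch that scores each column of a multiple sequence alignment by the fraction of sequences sharing the most common residue. The alignment below is hypothetical and does not represent real S6 kinase sequences; the scoring scheme is an assumption chosen for simplicity, and dedicated tools use more refined measures that, for instance, account for amino acid similarity and sequence weighting.

```python
from collections import Counter


def column_conservation(alignment):
    """Per column: fraction of all sequences that carry the most common residue."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_sequences = len(alignment)
    scores = []
    for i in range(len(alignment[0])):
        residues = [seq[i] for seq in alignment if seq[i] != "-"]  # ignore gaps
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / n_sequences)
    return scores


# Hypothetical alignment of a short motif region in four species
toy_alignment = [
    "MFDIDL",
    "AFEIDL",
    "SFDMDL",
    "-FDLEL",
]

scores = column_conservation(toy_alignment)
fully_conserved = [i for i, score in enumerate(scores) if score == 1.0]
print(scores)
print("fully conserved columns (0-based):", fully_conserved)
```

Columns with a score of 1.0 are invariant in this toy alignment; in a real analysis, such positions are candidates for functionally important residues, provided that the aligned sequences are truly orthologous.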

Table 1. Evolutionary examination of a protein serves a variety of purposes.

| Purpose | Example studies | Key challenges | Key observations |
| --- | --- | --- | --- |
| Function prediction of human proteins | The human protein Rap1 operates at telomeres [135] | Finding (pairs of) orthologs (Box 2) | Orthologs have biologically equivalent functions. |
| Identifying conserved functional domains/motifs | S6 kinase - TOS motif [136]; Nbs1/Ku80/ATRIP - ATM interaction motif [139] | Using correct sequences for the multiple sequence alignment | Functional sequence motifs are often conserved patches in unconserved regions. |
| Understanding function and evolution of biological features | CenH3 and holocentricity [101]; effector proteins, horizontal gene transfer and pathogenicity [140] | Retrieving information on biological (e.g. cellular) features | Protein loss and gain (e.g. via horizontal gene transfer) correlate to feature loss and gain. |
| Reconstructing evolution of function by ancestral sequence reconstruction | GKPID’s ability to bind Pins evolved via a single amino acid substitution [141] | Obtaining a high-quality gene tree | New versus old protein functions can be distinguished based on sequence information. |
| Function prediction by co-evolution or phylogenetic profiling | BBS proteins are involved in the cilium [142]; TRIP13 has conserved roles in both mitosis and meiosis [14] | Correct identification of presences as well as absences across species | Non-classical phylogenetic profiles predict analogy and bifunctionality. |
| Understanding functional divergence after gene duplication | MadBub duplicates diverged into a Bub-like and Mad-like protein through reciprocal domain/motif loss [143, 144] | Allowing for loss of domains in homology searches, obtaining a high-quality gene tree | Duplicate subfunctionalization might be more common than neofunctionalization. |
| Predicting gene functions or cellular make-up of non-model organisms | CAMSAP tracks the minus ends of microtubules in Trichomonas vaginalis and Tetrahymena thermophila [145] | Finding (pairs of) orthologs | Orthologs have biologically equivalent functions. |
| Phylogenetically classifying proteins that are part of a large protein family | Kinesins [146], histones [147], ATPases [148], myosins [149, 150], Rab GTPases [151] | Distinguishing different orthologous groups and identifying ancient duplications | Large-scale duplications before LECA. |
| Uncovering ancient origins of eukaryotic cellular features | Prokaryotic origins of various ‘eukaryotic signature proteins’ [10, 152-154] | Detecting remote homology | Characteristic eukaryotic features are encoded by genes shared with Archaea. |

In addition to helping to elucidate function, evolutionary analysis of a protein reveals the evolution of the processes it is involved in. This way, molecular evolutionary biologists study the origins of eukaryotic cellular complexity, just as classic evolutionary biologists study the origins of physiological innovations such as living on land, flight or warm-bloodedness. Studies that unraveled the evolutionary history of proteins have also profoundly altered biological paradigms. They revealed the large contribution of gene duplication to genome expansion and the pivotal role of promiscuous protein domains in shaping signaling networks [155, 156]. In some cases, protein evolutionary analyses yielded results that are relevant to both cellular and evolutionary biology. An example is MadBub. This protein independently duplicated at least 16 times in diverse eukaryotic lineages. Most duplicates diverged by reciprocally losing one of the two biochemical functions of the ancestral protein (‘subfunctionalization’). This observation predicted a function for one of the human duplicates, and it provided a (thus far) unique example of parallel duplication and subfunctionalization [143, 157]. In general, obtaining relevant functional and evolutionary insights requires a strong interplay between the two: projecting functional knowledge onto a protein can explain its evolutionary history, and the evolutionary history aids in understanding or predicting the functions of a protein across species.

We foresee that evolutionary analysis of single proteins will continue to pay off in the future. These analyses generally have the highest quality if performed manually, by studying proteins through in-depth inspection. Automated approaches often have certain biases that make them perform well for some proteins, but not for others [158]. Manual analysis allows one to apply a customized approach suitable to the protein of interest [159]. Cellular biologists would greatly benefit from doing such a manual analysis themselves. In fact, due to their knowledge of a protein and its functional regions, they frequently give evolutionary interpretations that complement those of bioinformaticians. In collaboration with cellular biologists, we studied diverse eukaryotic proteins, varying from chromatin remodelers and bZIP transcription factors to components of the flagellum and kinetochore [14, 160-162]. In our experience, non-bioinformaticians face various challenges regarding technical practicalities, knowledge of frequent genome evolutionary events and of the species tree, and, importantly, intuiting how complicated evolutionary histories are represented in a myriad of computational tools and sequence databases. If not overcome, such challenges may cause erroneous inferences such as incorrect, or incomplete, function prediction [159]. In this article we outline what we think are the most important principles for studying (or reading a study about) the evolutionary relationships and history of a protein. We argue that the most crucial yet overlooked skill is to be able to recognize, postulate and revise different evolutionary scenarios, a skill that is typically difficult to automate.

Box 1. What can happen to a protein during evolution? Most evolutionary interrogations are focused on, or at least include, determining what happened to a protein and its coding DNA sequence during its evolution across different lineages. By ‘this protein’ we here refer to the protein sequences that constitute an orthologous group (OG), and which protein sequences are ‘this protein’ depends on the level at which the OG is defined (see Box 2).

| Event | Definition | Frequency in eukaryotes compared to prokaryotes | Eukaryotic clades with elevated frequency | Proteins with elevated frequency |
| --- | --- | --- | --- | --- |
| Genesis/de novo gene origin | Birth of a coding sequence from non-coding sequence | Unknown | Parasitic lineages | Host-parasite interactions |
| Duplication | Duplication of a whole or partial coding sequence, either part of a small-scale or of a whole-genome duplication | High | Land plants (whole-genome duplications), animals | Regulation (signal transduction, transcription factors), metabolic enzymes, host-pathogen interaction |
| Loss | Pseudogenization and/or elimination of coding sequence | Low | Parasitic lineages, obligatory symbionts | Genes with few interaction partners, lower expression, higher mutation rates |
| Horizontal Gene Transfer (HGT) | Transfer of coding sequence whereby donor and acceptor are not in a parent-offspring relationship | Low | Fungi | Plant pathogenicity effectors, metabolic enzyme clusters |
| Fusion | Joining of two coding sequences into a single ORF | (Higher than fission) | Metazoa | Signal transduction |
| Sequence divergence | Mutations in coding sequence that alter the AA sequence | High | Intracellular parasites, Caenorhabditis elegans | Genes with fewer interaction partners, lower expression rates; host-pathogen interaction |

Studying the evolution of a protein: what do we mean?

Although evolutionary analysis of a protein can be applied in different ways (Table 1), it typically involves answering these two main questions: 1) When did this protein originate? 2) Which other current-day species have this protein? These questions come down to asking which events occurred during the evolutionary history of a protein, such as origination, loss and duplication. The evolutionary history also entails sequence divergence and the gain and loss of specific domains or functional protein motifs. For an overview of such evolutionary events, see Box 1. The resolved evolutionary history can for example be used to infer whether the protein is an ancient or a novel component of the cellular process it is involved in, or to predict whether, after a gene duplication, the protein is likely to have maintained its function. Note that, although these two main questions are seemingly straightforward, the devil is in the semantics. What do we mean by ‘origin’, or by ‘this protein’? As we illustrate in the next sections, we sometimes actually redefine the ‘origin’ and ‘this protein’ during the analysis.

It’s orthologs we (mostly) care about

Unraveling the evolutionary history of a gene entails finding the protein in other species. Note that, particularly at longer evolutionary distances, protein sequences are more informative than DNA sequences. When searching for a given protein in other species, we actually search for orthologs of this protein. Orthologs are homologous proteins that diverged due to speciation. For those not aware of the distinction between different types of homologs, we recommend reading Box 2 (‘Key definitions: Homology, orthology, paralogy, orthologous groups (OGs), inparalogs and outparalogs’). Why are we, and others, often specifically interested in the orthologs, rather than all homologs? The answer is found in the ‘ortholog conjecture’, which states that orthologs have biologically equivalent functions in different organisms, whereas paralogs have different functions due to divergence after gene duplication [122, 125]. After all, if paralogs did not diverge functionally in some way after gene duplication, they are redundant and therefore most often one copy is lost [163]. Proteins that descend from a single ancestral protein together constitute an orthologous group (OG, see also Box 2). For example, proteins may form an orthologous group if they descended from a single protein in the last eukaryotic common ancestor (LECA). In this example, the OG was defined on the level of ‘eukaryotes’, but in other cases one might choose a lower level, like ‘animals’ (single protein in the last common ancestor of all animals), or a higher level, like ‘all cellular life’ (single protein in the last universal common ancestor). For details on the OG composition and the relationships between members of the same OG and between homologous proteins of different OGs, see Box 2. For some proteins, establishing an OG requires making a gene tree (explained in ‘A quick (and dirty) guide to inferring the evolutionary history of a protein’), for other proteins the OG can be established without it. Whether required or not, a gene tree will reveal the evolutionary events that occurred to the protein of interest (Box 1), and feed into subsequent investigations.
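How the chosen level determines the composition of an OG can be sketched with a small gene tree in which each duplication node is marked as having occurred before or after the reference ancestor (for example LECA). The tree, the gene names and the `before_reference` flag are hypothetical and serve only as illustration; in a real analysis, the timing of a duplication has to be inferred from the species represented in the gene tree.

```python
def leaf_names(node):
    """Collect the gene names below a node of the (dictionary-based) gene tree."""
    if "children" not in node:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(leaf_names(child))
    return names


def orthologous_groups(node):
    """Split a gene tree into OGs relative to a chosen reference ancestor.

    A duplication before the reference ancestor separates OGs (outparalogs);
    a duplication after it stays inside one OG (inparalogs).
    """
    if node.get("event") == "duplication" and node.get("before_reference"):
        groups = []
        for child in node["children"]:
            groups.extend(orthologous_groups(child))
        return groups
    return [sorted(leaf_names(node))]


# Hypothetical gene tree: an ancient duplication gave rise to proteins X and Y,
# and protein X duplicated again more recently in the plant lineage.
toy_tree = {
    "event": "duplication", "before_reference": True, "children": [
        {"event": "speciation", "children": [
            {"name": "human_X"},
            {"event": "duplication", "before_reference": False, "children": [
                {"name": "plant_Xa"}, {"name": "plant_Xb"},
            ]},
        ]},
        {"event": "speciation", "children": [
            {"name": "human_Y"}, {"name": "plant_Y"},
        ]},
    ],
}

for group in orthologous_groups(toy_tree):
    print(group)
```

In this sketch, `plant_Xa` and `plant_Xb` end up in the same OG as inparalogs, whereas the X and Y proteins form separate OGs; marking the root duplication as more recent than the reference ancestor would instead merge everything into a single OG.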

Here we will focus on the evolutionary investigation of proteins involved in eukaryotic cell biology. Therein, we assume we search for orthologs of a given human protein in eukaryotes, and aim to infer the evolutionary events that happened since LECA (although we also touch upon what happened before). However, the same principles hold when studying e.g. a budding yeast protein, or studying the evolution of a protein since the last common ancestor of animals. We assume readers have general knowledge of the eukaryotic species tree (Box 3), sufficient to use more detailed resources like NCBI (https://www.ncbi.nlm.nih.gov/taxonomy) [15] or the Tree of Life Web Project (http://tolweb.org/tree/) [164]. Also, we assume readers have some basic experience with computational analyses, such as similarity searches with NCBI’s BLASTP [15] or HMMER’s phmmer [165] and multiple sequence alignments, such as Clustal Omega at the EMBL-EBI web interface [166]. Such skills provide a solid base to execute the research strategies that we propose. We will outline the initial steps of the analysis, their potential results, and what these results suggest about the protein’s evolution. The latter, in turn, will guide subsequent steps in the analysis. We will elaborate on how we iterate from initial results to a final hypothesis using different approaches, which we select based on the specific challenges posed by the protein of interest.
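For readers who run similarity searches locally rather than via a web interface, the sketch below shows one way to filter hits from BLAST’s standard 12-column tabular output (`-outfmt 6`) by E-value. The hit lines, sequence identifiers and the E-value cutoff of 1e-5 are hypothetical and chosen for illustration only; an appropriate cutoff depends on the protein and the database, and a significant hit indicates homology, not necessarily orthology.

```python
import csv
import io

# Column order of BLAST's default tabular output (-outfmt 6)
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def significant_hits(tabular_text, max_evalue=1e-5):
    """Return (subject id, E-value) pairs at or below the cutoff, best hit first."""
    reader = csv.DictReader(
        io.StringIO(tabular_text), fieldnames=BLAST_COLUMNS, delimiter="\t"
    )
    hits = []
    for row in reader:
        evalue = float(row["evalue"])
        if evalue <= max_evalue:
            hits.append((row["sseqid"], evalue))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical search results for one query protein
example_rows = [
    ["human_ProtX", "plant_hit2", "28.0", "180", "125", "5", "10", "189", "20", "199", "2e-08", "55.1"],
    ["human_ProtX", "yeast_hit1", "42.5", "200", "110", "3", "5", "204", "10", "209", "3e-40", "150"],
    ["human_ProtX", "amoeba_hit3", "24.0", "60", "44", "1", "50", "109", "70", "129", "0.5", "28.9"],
]
example_text = "\n".join("\t".join(row) for row in example_rows)

for subject, evalue in significant_hits(example_text):
    print(subject, evalue)
```

Hits that fall just above the cutoff, such as the third line in this example, are not necessarily non-homologous; they may be diverged homologs that require a more sensitive search.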

A quick (and dirty) guide to inferring the evolutionary history of a protein

We here describe how to infer the evolutionary history of a protein from the perspective of a cellular biologist studying a eukaryotic cellular process in humans. This researcher 34 Chapter 2 Inferring the evolutionary history of your favorite protein: A guide for cell biologists 35

Box 2. Key definitions: Homology, orthology, paralogy, orthologous groups (OGs), inparalogs and outparalogs.

Homology refers to an evolutionary relationship between two biological features. Features are homologous if they descend from a single common ancestral origin, regardless of whether they are morphological structures or molecules. Hence, proteins that do not share a common ancestor but that do share a similar function should not be referred to as ‘functional homologs’: those proteins are better named ‘analogs’. Homology is a qualitative trait, not a quantitative one, such that saying that “sequences A and B are 60% homologous” does not make sense. In addition, it is a transitive trait, which means that if sequences (in the narrow meaning of: a sequence of nucleotides or amino acids, not necessarily a full-length protein or gene) A and B are homologous, and

[Box 2 figure (graphic not reproduced). Panel A: species tree of Homo sapiens (Metazoa), Acanthamoeba castellanii (Amoebozoa), Naegleria gruberi (Excavata), Plasmodium falciparum (Alveolata) and Arabidopsis thaliana (Archaeplastida), with the evolution of proteins X, Y and Z and the pre-LECA duplications ‘1’ and ‘2’. Panel B: gene tree of the OGs of proteins X, Y and Z, rooted on other homologs (eukaryotic or prokaryotic). Legend: speciation, duplication, inparalogs, outparalogs (not all indicated), orthologs (not all indicated). LECA: last eukaryotic common ancestor; OG: orthologous group.]

may aim to perform such an analysis in order to, for example, make a multiple sequence alignment based on the correct sequences, or to find which other organisms have the protein of interest. A standard evolutionary analysis ideally consists of searching for

B and C are homologous, A and C are homologous as well. In molecular evolution of eukaryotes, the two most important ways in which two proteins can be related are by speciation (orthology) or by duplication (paralogy). Proteins are orthologs if they result from a species divergence, and they are paralogs if they result from a gene duplication in a given species. The terms ‘homologs’, ‘orthologs’ and ‘paralogs’ are used to describe the relationship between two genes or proteins. However, comparative genomics is often about studying many species, and therefore many proteins, at the same time. Therefore, we use the concept of the orthologous group (OG). An OG encompasses the set of proteins that descended from a single protein in a chosen last common ancestor, such as ‘the last eukaryotic common ancestor’ (LECA) or ‘the last common ancestor of all animals’. Within the orthologous group, two proteins can be paralogs if they resulted from a gene duplication that occurred in a more recent lineage than the ancestor we chose to define the orthologous group for. These proteins are called ‘inparalogs’. Two proteins that result from a gene duplication prior to the last common ancestor of choice are called ‘outparalogs’; they are in different orthologous groups.

A. The evolution of hypothetical proteins X (green), Y (violet) and Z (orange) before and after LECA, projected onto the species tree. Before LECA an ancestral protein duplicated twice (‘1’ and ‘2’) to give rise to three proteins in LECA. These duplications gave rise to the protein orthologous groups (OGs) X, Y, and Z. After LECA, protein X duplicated in the lineage leading to Arabidopsis thaliana, and protein Y duplicated before the common ancestor of Acanthamoeba castellanii and Homo sapiens. Various eukaryotic lineages lost a protein, e.g. human lost Z. The bullets behind the species indicate the supergroups to which they belong (Box 3). B. Gene tree of hypothetical proteins X, Y and Z, rooted on distant homologs. As a rule, the nodes in the tree that unite sequences from the same species are inferred to be duplication nodes, and the nodes that unite sequences from different species are mostly speciation nodes. However, when the protein topology differs from the species topology, ancient duplication nodes and lineage-specific losses may have to be inferred, although there are no sequences from the same species on both sides of these nodes [36]. Duplication nodes prior to LECA separate outparalogs; duplication nodes after LECA separate inparalogs. Proteins that belong to the same OG descend from a single protein in LECA (internal nodes ‘X’, ‘Y’ and ‘Z’). Note that within this figure, not all pairs of outparalogs and pairs of orthologs are indicated. For example, protein Y of A. thaliana and protein Z of A. thaliana are also outparalogs. The proteins Z of A. castellanii and Plasmodium falciparum form a pair of orthologs, and so do proteins Yα of H. sapiens and Y of P. falciparum. Since Yβ (H. sapiens) is also orthologous to Y of P. falciparum, human Yα and Yβ are called ‘co-orthologs’ to this P. falciparum protein.

homologous sequences, aligning these, using this multiple sequence alignment to infer a gene tree, and interpreting this tree in light of the species tree to reconstruct the evolutionary history of the protein (‘tree reconciliation’). While method sections in scientific

papers read like this, the analysis often requires more steps and not-so-straightforward decisions, which are not explicitly reported. In fact, most analyses will have gone through several iterations, while often only the last one is documented. In order to shed light on this process, we here discuss how we start and proceed with our analysis. In this process, we continuously take into account a scenario (a.k.a. draft hypothesis or model) for the evolution of the protein of interest. We do not provide a guide on how to use specific bioinformatics tools, but aim to provide practical stepping-stones and their conceptual underpinnings.

The draft hypothesis is based on initial observations

Inferring the evolutionary history of a protein starts with generating the observations. These basic observations provide information to draft a hypothesis, which in our case is the evolutionary scenario for the protein. Such a scenario should account for potential technical problems, because these problems might prevent us from directly deducing the ‘true’ evolutionary history of a protein from the results. We acquire the first basic observations by collecting information about the protein of interest, such as its domain architecture, functional sequence motifs, and suggested orthologs from experiments in other species or from databases with precomputed comparative genomics information, such as InParanoid [167], PANTHER [168] or Ensembl Compara [169]. Then, we perform one or a few similarity searches with the protein sequence of interest (‘the query’) using for example BLASTP or phmmer [165, 170] (Figure 1A). In order to obtain results that are feasible to interpret, we recommend selecting a subset of species, for example 20 species from diverse eukaryotic clades including at least human. In Box 4, we suggest such a species set. The hits we obtain are statistically likely to be homologous to our protein if they have a low E-value (e.g. below 0.001 or 0.01, see also ‘Life is more complicated: Exploring and validating grey zone hits’) [171]. The obtained homologs will tell us (1) which organisms contain homologs, and which do not, and (2) how many homologs are present in human (and other vertebrates or animals, if included in the search set). Together, the external information and these initial results enable us to draft an initial scenario for the evolutionary history of our protein of interest. This initial scenario can broadly be categorized into four classes. After having determined which scenario applies to our protein of interest, we can decide how to infer a gene tree (what sequences to use for it) and what sort of information we expect to retrieve from this tree. Moreover, some proteins demand more complicated strategies, which we discuss in ‘Life is more complicated’.
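The two initial observations described above (which species contain significant hits, and how many) can be tallied with a few lines of code. The following is a minimal sketch, assuming the search results are available as tab-separated lines in the default 12-column BLAST tabular layout and that sequence identifiers follow a hypothetical `<species>|<protein>` naming convention; the function names and the example cutoff are illustrative choices, not part of any tool.

```python
from collections import Counter

# Default column order of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hits(lines):
    """Parse tab-separated hit lines into dictionaries."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip empty lines and comment lines
        hit = dict(zip(FIELDS, line.split("\t")))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


def species_of(seq_id):
    """Assumed (hypothetical) ID convention: '<species>|<protein>'."""
    return seq_id.split("|", 1)[0]


def hits_per_species(hits, max_evalue=1e-3):
    """Count the hits per species that pass the E-value cutoff."""
    return Counter(species_of(h["sseqid"])
                   for h in hits if h["evalue"] <= max_evalue)
```

Species that are absent from the resulting counts either lack a detectable homolog or have one that falls in the grey zone; the counts for the query species hint at the presence of paralogs.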

Scenario 1 (‘easy’): Clear initial observations that generate a simple scenario

The first scenario is straightforward, and comes with clear-cut observations. If a single search gives us highly significant single hits across a range of species, these results indicate that the protein is relatively well conserved (Figure 1B). Moreover, the protein has a clear single point of origin: since likely no ancient duplications occurred, the protein can be safely inferred to have originated in or before the last common ancestor of the species containing the homologs, which then by definition are also all orthologs. Being present in eukaryotes that are most distantly related to human, such as Plasmodium falciparum or plants (see Box 3), thus implies an origin in or before LECA. If the protein is

[Figure 1 (graphic not reproduced). Panel A: a query sequence (PROTEIN_X_HUMAN) is searched against a proteome database containing Homo sapiens, Xenopus tropicalis, Drosophila melanogaster, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Arabidopsis thaliana, Plasmodium falciparum and Trichomonas vaginalis, shown with their species tree. Panel B, ‘Scenario 1: Easy’: hits (scale: highly to weakly similar), the resulting msa and gene tree, and the inferred protein origin.]

Figure 1. Outline of the initial analysis steps and their results in case of scenario 1: easy. A. After having collected basic protein information, the initial analytic step is to perform sequence similarity searches with a query sequence (here illustrated as a human sequence) against a proteome database. This database may consist of a subset of species, ideally species of which we know the phylogenetic relationships, as indicated by the species tree. The colored bullets indicate the supergroups to which the species belong, as in Box 3. Here we illustrated the proteome database as a subset of eukaryotic species. The search yields sequence similarity hits that inform about potential scenarios. After hit protein sequences are collected, a multiple sequence alignment (msa) and gene tree can be inferred. B. The easy scenario is hinted at by highly similar hits, mostly single hits in a variety of species. Likely the protein of interest is old and relatively well conserved at the sequence level. Not all species in the database might have it due to gene loss. As a result, the gene tree does not include all species. NB: The gene tree within the species tree in the left panel is an illustration of the protein’s evolutionary history. Unlike the gene tree in the right panel, the gene tree within the species tree has a root that indicates the protein’s origin and branches for lost protein sequences. The gene tree indicated here does not contain branch length values and support values, such as bootstrap percentages. Furthermore, while we here use the term ‘gene tree’, because the evolution occurs at the genetic level, it also describes the evolution of the protein as a consequence of that. Generally, amino acid sequences are used to infer this gene tree, particularly when studying evolution on longer evolutionary timescales. These remarks also apply to Figure 2A-C.

[Figure 2 (graphic not reproduced). For each scenario, the panels show the search hits (scale: highly to weakly similar) and the gene tree, with protein origin and protein duplication marked. A. Scenario 2: Taxonomically limited? B. Scenario 3: Lineage-specific duplication. C. Scenario 4: Ancient duplication.]

short (e.g. < 150 amino acids) or if the sequence divergence is somewhat high (as observed in the first search), we often use a single iteration of an iterative searching tool like PSI-BLAST or jackhmmer [165, 170]. Such an iterative search may help to find some additional orthologs in species that were not represented among the hits of the first search. Iterative searching tools make a ‘sequence profile’ of the protein, based on the multiple sequence alignment of the hits found in the first search. They then use this profile to search through the database again. Because this profile contains much more information than a single query sequence, it is more sensitive and able to detect divergent orthologous sequences. With such a profile-based iteration we might obtain a similar result as with a single search applied to a longer or less diverged sequence, implying the same, simple

Figure 2. Similarity search results and expected gene tree for three alternative scenarios. A. The taxonomically limited (?) scenario delivers few hits in the sequence similarity searches, only in species closely related to that of the query sequence. Although this might suggest a recent origin (upper left star), the protein might in fact be older (faded star below). This protein might evolve rapidly in different lineages, due to which sequence similarity is below significance and hence various orthologs are not seen in the similarity output. If no further, more advanced searches are executed, the gene tree includes only sequences from closely related species. B. The lineage-specific duplication scenario predicts highly similar, multiple hits in the query species and in species that are closely related. The gene tree aids in determining when this duplication occurred. In this case, this duplication took place in a common ancestor of vertebrates and fruit fly, after their divergence from fungi. This may have been the common ancestor of animals. C. The ancient duplication scenario predicts multiple hits across the species that are part of the database, in which one has a higher similarity to the query sequence than the other(s). In this scenario, sequences are hit from multiple OGs (Box 2) and a gene tree is needed to discriminate which sequences belong to which OG.

evolutionary scenario. An example of a protein that adheres to this straightforward scenario is Incenp, a protein that localizes to the inner centromere and is present as a single copy in the majority of eukaryotes [14].

Scenario 2 (‘taxonomically limited?’): Limited taxonomic distribution, protein recently invented or homologs that are difficult to detect?

We hypothesize the second scenario if we retrieve only hits in closely related organisms, with low sequence similarity (e.g. a <75% similarity hit in mouse when searching with a human protein sequence). At first sight this might indicate a recent origin of this protein. However, it more likely reflects the evolutionary history of an older protein that diverged rapidly and therefore does not allow us to detect the more distant orthologs in the first search (Figure 2A). Proteins that are typically prone to such detection failures are those having many non-globular regions (mainly coiled-coil or intrinsically disordered). In fact, many proteins are much older than suggested by simple similarity searches such as BLASTP. Under closer bioinformatic or biochemical scrutiny, these proteins often can be shown to have orthologs in more distantly related species than observed before [172]. Truly lineage-specific proteins are the exception, especially if examining proteins involved in core cellular processes. Advanced search strategies to cope with this scenario are discussed in ‘Life is more complicated’, under ‘Exploring and validating grey zone hits’.

Scenario 3 (‘lineage-specific duplication’): Multiple human sequences are highly similar and point to lineage-specific duplication(s)

We encounter the third scenario if we search with a human query protein and find hits in human with a higher similarity than hits in more distantly related species. Such results indicate that the protein underwent recent lineage-specific duplications, for example in the ancestor of animals (Figure 2B). Although this evolutionary scenario is easy to comprehend, it does not necessarily give an easy answer to the question “when did this protein originate”. If the functional differentiation between these duplicates is minimal, they likely both still have the function of the pre-duplication protein. Hits in species that diverged before this duplication (such as a hit in the choanoflagellate Salpingoeca rosetta in the example of a duplication in the metazoan ancestor, see Box 4) are also expected to perform this ancestral function, and therefore we would also consider these hits to be the “same” protein. However, if there was functional innovation in the lineage leading to the query protein (‘neofunctionalization’) [124], it makes more sense to consider those pre-duplication hits not to be the “same protein”. For example, haemoglobin evolved after duplication of an ancestral globin in the vertebrate lineage [173]. Hence, while globins clearly exist in other animals and eukaryotes, it is not very meaningful to posit that plants contain haemoglobin even though the plant globins are technically orthologous to it (see Box 2). In this case, we might opt to define the orthologous group not at the level of all eukaryotes, but at the level of vertebrates. To distinguish between these two alternative histories, we use functional information about the protein of interest, about the (human) paralog and about the pre-duplication orthologs to assess whether the protein of interest neofunctionalized. Furthermore, in order to pinpoint the timing of the duplication (was it indeed in the common ancestor of vertebrates, or maybe already in the ancestor of all chordates?), we need to make a gene tree.

Scenario 4 (‘ancient duplication’): Similar proteins in all eukaryotes likely are paralogs that result from an ancient duplication

In the fourth scenario, the search result contains homologs in human, but many hits in more distant eukaryotes are more similar to the query than this human sequence (Figure 2C). This suggests that, before LECA, one or more gene duplications resulted in two or more proteins that probably functionally diverged, and that each gave rise to a separate orthologous group (OG, see Box 2). After LECA, these proteins likely conserved their functions in different eukaryotic lineages. This scenario applies, for example, to tubulins and Rab GTPases [151, 174, 175]. In case of well-conserved folds, such as kinases, protein sequences that descended from different proteins in LECA are highly likely to be mingled in the similarity search output. In order to assign these sequences to the correct OG, and hence to establish the OG of the protein of interest, we have to make a gene tree.
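The four scenarios can be read as a small decision procedure on the initial search results. The sketch below is a deliberately crude illustration of that reasoning, not a validated classifier: it assumes that hits are given as (species, bit score) pairs without the query's self-hit, that the user supplies which species count as close relatives, and that comparing only the best scores is an acceptable simplification of ‘many hits are more similar’. The function name and labels are hypothetical.

```python
def draft_scenario(hits, query_species, close_relatives):
    """Suggest one of the four draft scenarios from significant hits.

    hits: list of (species, bitscore) pairs, excluding the query's self-hit.
    close_relatives: species regarded as closely related to the query
    species (a user-supplied assumption, e.g. other animals for human).
    """
    # Additional hits in the query species are candidate paralogs.
    paralog_scores = [s for sp, s in hits if sp == query_species]
    # Hits outside the query species and its close relatives.
    distant_scores = [s for sp, s in hits
                      if sp != query_species and sp not in close_relatives]

    if not distant_scores:
        return "2: taxonomically limited?"
    if not paralog_scores:
        return "1: easy"
    # Simplification: compare only the best distant hit and best paralog.
    if max(distant_scores) > max(paralog_scores):
        return "4: ancient duplication"
    return "3: lineage-specific duplication"
```

The returned label is only a draft hypothesis; as described in the text, it is the gene tree that tests whether the scenario was drafted correctly, and real proteins may combine several scenarios.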

How to make and use gene trees under the different scenarios?

Although particularly important for the third and fourth scenario, we also make and interpret (‘reconcile’) gene trees for proteins we assign to the other scenarios. Note that while we here use the term ‘gene trees’ (this term highlights that they model the evolution of a gene, as opposed to the ‘species tree’), for longer evolutionary distances

Box 3. Eukaryotic phylogeny

Eukaryotic cells are generally estimated to have arisen from prokaryotes approximately 2 billion years ago [1, 2]. All eukaryotes descend from a single common ancestor, the so-called ‘last eukaryotic common ancestor’, or LECA. Eukaryotes evolved from within the Archaea [7, 8], likely from the newly identified Asgard superphylum [9, 10]. Alphaproteobacteria also had a role in eukaryotic evolution: an alphaproteobacterium became the endosymbiont of this Asgard-related archaeal lineage and would evolve into the mitochondrion [17]. Eukaryotes often are taxonomically categorized into five different supergroups [29]: Opisthokonta (animals and fungi), Amoebozoa, Archaeplastida (algae and land plants), SAR (Stramenopila, Alveolata and Rhizaria) and Excavata. The exact position of the root of eukaryotes and the definitions of the supergroups are still under debate [31-33], but these uncertainties do not compromise the ideas we put forward in this work.

[Box 3 figure (graphic not reproduced): species tree with LUCA, Bacteria, Archaea, LECA and the supergroups Opisthokonta, Amoebozoa, Archaeplastida, SAR and Excavata.]

Species tree showing the position of the eukaryotes in the tree of life. According to the two primary domains model, the last universal common ancestor (LUCA) diverged to give rise to Bacteria and Archaea, and the latter gave rise to the pre-eukaryotic host of the pre-mitochondrial endosymbiont [16]. Current-day eukaryotes descend from a single common ancestor, called the last eukaryotic common ancestor (LECA). After LECA, eukaryotes diverged quickly into the major lineages that we refer to as ‘supergroups’: Opisthokonta, Amoebozoa, Archaeplastida, SAR (Stramenopila-Alveolata-Rhizaria) and Excavata. LUCA: last universal common ancestor; LECA: last eukaryotic common ancestor; SAR: Stramenopila-Alveolata-Rhizaria.

we typically use protein (amino acid) sequences to make such a gene tree. We make gene trees for two reasons. First, by making a tree we test whether the proposed scenario was actually the correct one. The lineage-specific and ancient duplication scenarios will clearly yield very different gene trees (Figure 2B,C, right-sided panels). Second, the gene tree uncovers the evolutionary history of the protein in greater detail. Importantly, when making the gene tree, we should be sure that the sequences that we input are in fact homologous to one another: only homologous protein sequences can be aligned and phylogenetically scrutinized. Although we use the gene tree to test the predicted scenario, our predicted scenario itself determines how we execute the gene tree analysis. In the simple and taxonomically limited scenarios, we simply input all homologs found, or, if the database is very large, we select a subset that represents the diversity of the hits and the species in which they are found. In the lineage-specific duplication scenario, if we think a gene duplicated in e.g. animals, we include all sequences from organisms that diverged just before (e.g. the choanoflagellate S. rosetta) and after (e.g. the sponge Amphimedon queenslandica) the hypothesized duplication. Conversely, under the ancient duplication scenario we include two or more sequences from various more distant genomes, while ensuring that each eukaryotic supergroup is well-represented [32]. After having constructed the tree, we compare it to the initially proposed scenario in order to assess whether it was drafted correctly. Moreover, we use the tree to explain in detail what happened during the evolution of this protein (Box 1). When, and in which ancestral lineages, did it duplicate, get lost or even get horizontally transferred? This process of interpreting the gene tree by comparing it to the species tree is called tree reconciliation, on which detailed guides can be found in bioinformatic textbooks [176, 177]. Tree reconciliation can be quite challenging, particularly if the gene tree contains errors (see ‘Life is more complicated: Gene trees may be incorrect’). New computational methods are being developed regularly in order to facilitate tree reconciliation [178].
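A first, rough step in reading a gene tree is to label each internal node as a duplication or a speciation. The sketch below applies the simple rule described in Box 2: a node whose two subtrees share a species is called a duplication, and any other node a speciation. It is a minimal illustration, assuming a rooted binary tree written as nested tuples and a hypothetical `<species>|<protein>` leaf naming; it does not attempt full reconciliation, so ancient duplications that are only visible through incongruence with the species tree are not detected.

```python
def leaf_species(leaf):
    """Assumed (hypothetical) leaf naming: '<species>|<protein>'."""
    return leaf.split("|", 1)[0]


def label_events(node, events):
    """Post-order walk; returns the leaves under `node` and records events.

    A node is called a duplication if its two subtrees share a species,
    and a speciation otherwise (the species-overlap rule).
    """
    if isinstance(node, str):
        return [node]  # a leaf
    left, right = node
    left_leaves = label_events(left, events)
    right_leaves = label_events(right, events)
    shared = ({leaf_species(leaf) for leaf in left_leaves}
              & {leaf_species(leaf) for leaf in right_leaves})
    events.append(("duplication" if shared else "speciation",
                   sorted(left_leaves + right_leaves)))
    return left_leaves + right_leaves


# Toy tree loosely modelled on OG X of Box 2 (illustrative only).
toy_tree = (("homo_sapiens|X", "acanthamoeba_castellanii|X"),
            ("arabidopsis_thaliana|Xa", "arabidopsis_thaliana|Xb"))
toy_events = []
label_events(toy_tree, toy_events)
```

In the toy tree, the node uniting the two A. thaliana sequences is labelled a duplication (separating inparalogs), while the other two nodes are labelled speciations.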

Life is more complicated

While executing the evolutionary analysis as described above, we often encounter challenges in either the data or in the analysis itself. We consider it important to be aware of these issues, in order to recognize them and apply the appropriate strategies to cope with them, or to make the appropriate disclaimers in a manuscript.

Multi-domain proteins

The protein of interest may consist of multiple domains, which complicates the analysis if each of these domains has its own evolutionary history. The latter is quite common, given that domains seem to fuse more often than they split [179, 180], and the evolution of new domain architectures has been relatively frequent in metazoa [181]. Different evolutionary histories may be hinted at by inconsistencies in the similarity search results, for example if hits from different taxa align to different regions of the query. If the domains indeed have different histories, we suggest three ways to investigate and represent the evolutionary history of the protein. First, if all domains are relevant, all deserve a thorough investigation and representation of their evolutionary history. Second, if one domain can functionally be considered the main domain and the other an accessory domain, we primarily perform a detailed analysis of the main domain. Afterwards, we may infer when in evolution the accessory domain was acquired and/or lost. This is the way others and we represented the evolution of the kinetochore protein kinase Mps1, which in human is joined to a TPR domain (Figure 4). We also approached the evolution of human RasGRP proteins this way, whereby the Cdc25 homology domain, the main domain, fused to different other domains in different lineages [182]. Third, sometimes the domain combination itself is relevant for the biological function of interest. In this case, although the composite domains each have their own possibly byzantine history, “this protein” arose only after they fused, and therefore we focus on determining the phylogenetic time point of this fusion and on the evolution thereafter. Although orthologous sequences in species that diverged before the fusion likely exist (having only one of the domains), these are probably not very informative. An example of this third case is R-spondin, a protein of the Wnt signaling pathway, which only came into existence after two Fu domains joined a TSP1 domain in the ancestor of deuterostomes [183].

Exploring and validating grey zone hits

As discussed in the taxonomically limited scenario, quite often we expect a protein to be widely present, e.g. because it is involved in a core cellular process, but end up with a set of sequences from only a limited set of species. This often occurs if the protein under investigation is largely unstructured, and therefore its amino acid sequence is poorly conserved. In this situation, we often scrutinize hits with a higher E-value than the default cutoff, which we call ‘grey zone’ hits. Such a grey zone hit may be supported by additional, less conventional similarity searches. For example, we may search iteratively during multiple rounds, thereby starting the search not with the initial (e.g. human) query sequence, but with another trusted ortholog that may serve as a ‘bridge’. In our experience, searching with another sequence occasionally yields orthologs in many species that previously had appeared to lack them. We select such ‘bridging’ queries by the taxonomic position of their species. To verify whether indeed no ortholog exists in a given species, we search with a trusted ortholog from a species that is closely related, for example from another ascomycete in the case of budding yeast. Sometimes, newly found sequences themselves will trace even other, previously undetected sequences (homology is a transitive feature, Box 2). For example, in order to establish whether the Sos7 protein (fission yeast) is orthologous to the Kre28 protein (budding yeast), we searched with various other ascomycete fungal hits of Sos7, and this indeed brought us to Kre28 (Figure 3). Similarly, stepping from one orthologous sequence to another also helped us to identify orthology between Sos7 and Zwint-1 (human).
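Because homology is transitive, a series of bridging searches can be viewed as a path through a graph in which each sequence points to the sequences it retrieves. The sketch below, a minimal illustration with hypothetical sequence names, uses a breadth-first search to find the shortest chain of hits linking a starting sequence to a target. A chain of this kind supports homology only; whether the linked sequences are orthologs still has to be judged from the species involved and, where needed, a gene tree.

```python
from collections import deque


def bridging_path(hit_graph, start, target):
    """Breadth-first search for a chain of hits linking start to target.

    hit_graph maps each query sequence to the sequences it retrieved
    (at whatever significance threshold the user trusts).
    Returns the shortest chain as a list, or None if no chain exists.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for hit in hit_graph.get(path[-1], ()):
            if hit not in seen:
                seen.add(hit)
                queue.append(path + [hit])
    return None


# Toy search results with hypothetical sequence names.
toy_hits = {
    "seq_A": ["seq_B"],
    "seq_B": ["seq_A", "seq_C"],
    "seq_C": ["seq_D"],
}
```

Here `bridging_path(toy_hits, "seq_A", "seq_D")` walks from seq_A via the bridging sequences seq_B and seq_C to seq_D, although seq_A never retrieves seq_D directly.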

We regularly attempt to increase confidence in the grey zone hit with other information, for example from the actual alignment: does the grey zone hit contain certain functionally characterized sequence motifs? It may be truncated, due to which scores are lower than expected. Other information that is worth checking includes whether the grey zone hit is a bidirectional best hit to the query, whether it has been reported to execute the same function, or whether it interacts with a protein that is an ortholog of the query’s interaction partner [184]. The grey zone hit may have a similar secondary or higher-level structure as the query sequence, which can be assessed with various tools. Using this multimodal approach we were able to determine a much broader species distribution of the RPGRIP1 proteins, which are crucial components of the ciliary transition zone [185], than previously found [186].
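The bidirectional best hit check mentioned above is easy to express in code. The sketch below assumes that two searches have already been run (query against the candidate's proteome, and candidate against the query's proteome) and that each result is a list of (subject ID, bit score) pairs; the function names and identifiers are hypothetical.

```python
def best_hit(hits):
    """Return the subject ID with the highest bit score, or None."""
    return max(hits, key=lambda h: h[1])[0] if hits else None


def is_bidirectional_best_hit(query_id, candidate_id,
                              forward_hits, reverse_hits):
    """Check whether query and candidate are each other's best hit.

    forward_hits: (subject_id, bitscore) pairs from searching the
        candidate's proteome with the query.
    reverse_hits: the same from searching the query's proteome with
        the candidate.
    """
    return (best_hit(forward_hits) == candidate_id
            and best_hit(reverse_hits) == query_id)
```

A positive result adds confidence to a grey zone hit but does not prove orthology, and a negative result can simply reflect a lineage-specific duplication in one of the two species.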

[Figure 3 (graphic not reproduced). Species tree with animals and related species (Homo sapiens (Zwint-1), Oncorhynchus mykiss, Strongylocentrotus purpuratus, Capsaspora owczarzaki), fungi with the ascomycetes labelled (Saccharomyces cerevisiae (Kre28), Zygosaccharomyces rouxii, Wickerhamomyces ciferrii, Bipolaris oryzae, Schizosaccharomyces pombe (Sos7), Trichosporon asahii) and other eukaryotic species; numbers indicate the order of the searches. Legend: Kre28-like, Sos7-like and Zwint-1-like proteins; single similarity search, iterative similarity search, similarity search with E-value > 0.001.]

Figure 3. Orthology established via bridging sequences. Zwint-1 (human), Kre28 (budding yeast) and Sos7 (fission yeast) are orthologous to one another and belong to a single OG. This conclusion was based on successive similarity searches that indicated homology between various Zwint-1-like, Kre28-like and Sos7-like proteins of different fungal, animal and animal-related species [14]. Starting with the Sos7 protein (fission yeast, Schizosaccharomyces pombe), the ordered searches indicate which ‘bridging’ sequences were used to establish orthology to Kre28 and Zwint-1. Note that in one instance, iterative homology searching was needed, and in two others, hits with E-values higher than 0.001 were included, which can be considered grey zone hits. The species tree indicates which species’ sequences were used to establish this OG.

Iterative searches may be risky

Note that sometimes, iterative searches do not converge at all and instead continue to return new hits. In this case, non-homologous sequences might get included, a risk that typically applies to, e.g., highly charged proteins and coiled-coil proteins [187-189]. Such protein sequences are prone to evolve convergently due to compositional bias in their amino acid frequencies or simple, repeated sequence motifs. In fact, inferring common descent for such sequences is notoriously difficult [187], and therefore we think that sets of orthologs that are only based on sequence similarity of coiled-coil regions should be treated with suspicion. However, if a candidate sequence shares a coiled-coil region with the protein of interest in addition to other putative homologous regions, the coiled-coil may serve as additional evidence for their homology.
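One practical guard against runaway iterations is to cap the number of rounds and to record whether the hit set stabilized. The sketch below illustrates that idea in the abstract: `search` stands for any user-supplied function (hypothetical here) that takes the current set of sequence identifiers and returns the identifiers retrieved in the next round. It does not call PSI-BLAST or jackhmmer itself.

```python
def iterate_search(search, seed, max_rounds=5):
    """Repeat a profile-style search until the hit set stops growing.

    `search` is a user-supplied callable (hypothetical here) that takes
    the current set of sequence IDs and returns the IDs it retrieves.
    Returns (hits, converged).
    """
    hits = {seed}
    for _ in range(max_rounds):
        new_hits = hits | set(search(hits))
        if new_hits == hits:
            return hits, True  # no new sequences: the search converged
        hits = new_hits
    return hits, False  # still growing after max_rounds: inspect by hand
```

A result with `converged` equal to `False` is a cue to inspect the newly added sequences by hand, for instance for compositional bias or coiled-coil regions, before trusting them as homologs.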

Gene trees may be incorrect

In each of the scenarios discussed, the evolutionary analysis is aided by making a gene tree, and a gene tree is particularly essential in the scenarios involving duplications. What sort of problems might a gene tree encounter? Quite regularly, the tree suffers from poor statistical support, as indicated by low bootstrap values, approximate likelihood-based measures or posterior probabilities. These undermine the tree’s reliability as a representation of the proteins’ evolutionary history. Another, often coinciding reason for distrusting the tree could be that it is hard to reconcile with the species tree. Such reconciliation might require an inconceivable number of duplications and losses, or multiple horizontal gene transfers. Note that, even for proteins that are present in single copy in most of the species studied, the gene tree in fact seldom exactly follows the species tree [190]. Problematic trees may arise if the aligned sequences are not homologous across the full length, if sequences are very short and hence contain little information, or if they are highly divergent, which may lead to so-called long-branch attraction (the tendency of rapidly evolving sequences to cluster together in the gene tree, although they are in reality not closely related) [191, 192]. The first problem can easily be solved by selecting the homologous positions of the multiple sequence alignment, for which specific tools also exist [193]. The other problems are more difficult to overcome, although it may help to add more sequences, particularly ones that increase the phylogenetic diversity. Moreover, sequences that align poorly and/or have very long branches could be removed from the multiple sequence alignment, and possibly replaced with a more conserved ortholog from a closely related species.
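As an illustration of selecting alignment positions, the sketch below removes columns that consist mostly of gaps. This is only a crude proxy for "homologous positions" and is not the method of the dedicated tools cited above; the function name `trim_alignment`, the threshold `max_gap_fraction` and the example sequences are assumptions made for this example.

```python
from typing import List


def trim_alignment(seqs: List[str], max_gap_fraction: float = 0.5) -> List[str]:
    """Keep only alignment columns whose gap fraction is <= max_gap_fraction."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    # Indices of the columns that are sufficiently occupied.
    keep = [
        i for i in range(length)
        if sum(s[i] == "-" for s in seqs) / len(seqs) <= max_gap_fraction
    ]
    return ["".join(s[i] for i in keep) for s in seqs]


# Made-up aligned sequences; '-' marks a gap.
aligned = ["MK-LV--A", "MKTLV--A", "MR-LIQ-A"]
trimmed = trim_alignment(aligned)
```

Here the three columns in which at least two of the three sequences have a gap are dropped, and the remaining five columns are kept for tree building.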

Mixed scenarios, multiple problems

In contrast to the relatively clear outlines of the previously discussed scenarios, the reality might be a mix. For example, in investigating the evolution of the plant Polycomb repressive complex 1 (PRC1) component EMF1, through sensitive sequence analysis we first discovered multiple flowering plant-specific duplicates before we discovered divergent orthologs in gymnosperms [194]. Here, lineage-specific duplications and ‘invisible’ orthologs thus conspired to make the previous inferences in the literature flawed. Moreover, even proteins with common domains, such as certain kinases, might have diverged extensively, due to which the initial gene tree is very likely to lack orthologs. As a result, some lineages may incorrectly appear to have lost the protein. In this case, again, iterative similarity searching with a sequence profile might help to find sequences that should be added to the tree. Such a search is most sensitive if it starts with an OG-specific profile, i.e. a profile that is based on the multiple sequence alignment of the OG that was already identified from an earlier tree.

Truly absent, truly present?

Quite often, an evolutionary analysis yields unexpected absences of orthologs in certain species, or unexpected presences. Are these unexpected observations true or not? Absences may be false if the similarity searches fail to detect homologous sequences, or if the predicted proteome or assembled genome is incomplete. We assessed that protein-coding sequences quite frequently are not, or incompletely, predicted [195]. It might therefore pay off to search for homologous sequences in the (six-frame translated) genomic DNA. Less frequently, unexpected presences are observed, for example if a protein that seemed eukaryote-specific is found in a few bacteria. These presences could be true, for example resulting from a eukaryote-to-bacterium horizontal gene transfer. They may also be false, for example if the bacterial genome is contaminated with eukaryotic sequences [196]. It is not always easy to distinguish between these, because A) contamination can be difficult to prove, and B) although horizontal transfer is rare in eukaryotes, it does occur [197].
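A six-frame translation, as mentioned above, can be sketched in a few lines. The example assumes the standard genetic code and ignores introns, so it only illustrates the principle; established search tools perform this translation internally. The function names and the input sequence are chosen for this example.

```python
from typing import Dict, List

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> List[str]:
    """Return three forward-strand frames followed by three reverse-strand frames."""
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


frames = six_frame_translation("ATGGCCTAA")
```

The resulting peptide strings (with `*` marking stop codons) can then be used as a search database when a gene prediction is suspected to be missing.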

When and from where did my protein originate?

Above, we referred to ‘the origin of a protein’ mostly as the position on the species tree at which the protein came into existence. Logically, errors in the proposed evolutionary history of a protein, as discussed above, may also result in errors in the inferred origin. A protein may be estimated too young, due to undetected homologs in more distantly related species, which is mainly observed in the taxonomically limited scenario. In some instances, the age of a protein might simply be underestimated because crucial genomes that point to an older age are not (yet) available. For example, the TPR domain of the protein kinase Mps1 had been inferred to have originated in the ancestor of deuterostomes, and its presence in the oomycete Albugo laibachii was regarded as contamination or horizontal gene transfer [198]. However, when additional species were included, this TPR domain was also found in various other clades across the eukaryotic tree of life (Figure 4). Hence, it likely was already present in LECA but lost in multiple eukaryotic lineages in parallel [199]. In other cases, a protein may be estimated too old if a wide taxonomic distribution is actually due to horizontal gene transfer. Incorrect estimation of a protein’s age may of course also result from an incorrect species tree. Sometimes we are not just interested in the time point of origin, but also in the ‘source’ of its origin: where did the protein come from? Was it invented de novo? Did it arise by duplication, and if so, what may have been the function of the pre-duplication protein? Does the protein have prokaryotic homologs, and may it thereby be derived from these prokaryotes? Determining this ‘source’ may not be so straightforward. For example, some previously identified eukaryote-specific proteins turned out to have homologs in prokaryotes after all (Table 1), and other core eukaryotic proteins may seem to be invented de novo just prior to LECA, but do in fact have eukaryotic homologs, hence they arose by a pre-LECA duplication. In these cases, after speciation (from prokaryotes) or duplication (in the pre-LECA lineage), the sequences evolved rapidly, due to which sequence similarity searches sometimes fail to uncover these distantly related homologs.

[Figure 4: two species trees. Left panel: Mps1-TPR study with limited species set. Right panel: Mps1-TPR study with expanded species set. Legend: gain/presence of the Mps1 protein; loss of the Mps1 protein; gain/presence of the TPR domain; loss of the TPR domain; LECA: last eukaryotic common ancestor.]

Figure 4. More ancient origin of a protein’s domain after addition of species. Left panel: Evolutionary analysis of the protein kinase Mps1 initially showed that this kinase was likely ancient, present in LECA, and seemed to have fused to a TPR domain recently. This TPR domain was inferred to have fused to the kinase in the common ancestor of deuterostomes [198]. The presence of this domain in the Mps1 protein of Albugo laibachii was hypothesized to result from contamination of the genome or from horizontal gene transfer. Right panel: After genomes of other species were included in the analysis, various early branching species turned out to possess the TPR domain in their Mps1 proteins [199]. Hence, the Mps1 protein of LECA likely already had this domain. As a result, it must have been lost in various eukaryotic clades. The presented phylogenies are species trees with the branches colored according to the eukaryotic supergroups to which the species belong. Purple: Opisthokonta (animals and fungi), blue: Amoebozoa, green: Archaeplastida (land plants, algae), red: SAR (Stramenopila, Alveolata and Rhizaria), orange: Excavata (see Box 3). The species that were important for inferring the ancient origin of Mps1’s TPR domain are indicated in bold.
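The effect of species sampling on the inferred origin can be illustrated with a small sketch. Assuming vertical inheritance only, the origin is placed at the most specific clade shared by all species in which the protein or domain was detected. The species names and the simplified lineages below are hypothetical placeholders, not the data behind Figure 4.

```python
from typing import Dict, List, Sequence


def inferred_origin(
    lineages: Dict[str, Sequence[str]], species_with_hit: List[str]
) -> str:
    """Return the most specific clade shared by all species with a hit.

    Each lineage lists clades from the root downwards. Horizontal gene
    transfer and contamination are ignored (an assumption of this sketch).
    """
    paths = [lineages[species] for species in species_with_hit]
    origin = None
    for ranks in zip(*paths):
        if len(set(ranks)) != 1:
            break  # the lineages diverge from this rank onwards
        origin = ranks[0]
    if origin is None:
        raise ValueError("no species with a hit, or no shared clade")
    return origin


# Hypothetical species with simplified lineages (illustration only).
LINEAGES = {
    "deuterostome_A": ("Eukaryota", "Opisthokonta", "Metazoa", "Deuterostomia"),
    "deuterostome_B": ("Eukaryota", "Opisthokonta", "Metazoa", "Deuterostomia"),
    "oomycete_C": ("Eukaryota", "SAR", "Stramenopila"),
}

limited = inferred_origin(LINEAGES, ["deuterostome_A", "deuterostome_B"])
expanded = inferred_origin(
    LINEAGES, ["deuterostome_A", "deuterostome_B", "oomycete_C"]
)
```

With the limited set the origin falls in the deuterostome ancestor, whereas a single additional, distantly related species moves it to the root of eukaryotes, unless that presence is explained by contamination or horizontal transfer.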

Using a tailor-made database yields more comprehensible and more interesting results

Since public databases nowadays contain vast amounts of sequence data, it is computationally and practically quite a challenge to interpret the results of a simple similarity search. Furthermore, this output does not reveal which species lack a hit. For this reason, in the previous section we recommended selecting a subset of species as a search database (see also Box 4). As an alternative, we prefer to use a tailored, local proteome database of which we know the species and the (approximate) species tree beforehand. Although it may seem quite some work, we think it does pay off if one intends to study the evolution of multiple proteins. An in-house database also facilitates the detection of co-evolution among different proteins (Table 1). The database may be revised or recompiled if a protein turns out to have a particularly interesting function or evolutionary history in a specific clade. Various studies demonstrated that zooming in on the species tree and adding species at key phylogenetic positions yields highly interesting patterns, such as when the protein turns out to be present only in species having a certain (cellular) biological feature. A nice example is provided by the centromeric histone variant CenH3, a protein that was found to be absent from species that have a specific type of centromere (a so-called holocentromere, which runs along the length of the chromosome), in the lineage of insects [101].

Box 4. A species selection for eukaryotes

We consider it useful to work with a selection of genomes as a search database, because this often gives more comprehensible results. Using such a subset prevents the sequence similarity search output from showing only closely related species, which may be the case in large databases such as ‘nr’, the non-redundant protein database from NCBI [15]. A smaller selection also facilitates checking for species that lack hits. We always suggest including the species of the query in this database, because the search results can then reveal lineage-specific or ancient duplications. Moreover, we suggest including species that represent the diversity of the taxon that one studies, such as eukaryotes. Thereby, one should make sure that all major clades are represented by multiple species. Ideally, one includes for each of these major clades species that do not have strongly reduced genomes. For this reason, we recommend, for example, avoiding having only pathogenic species represent a given clade. Such species often lack orthologs of the protein of interest, and may therefore lead to erroneous conclusions about, for example, the age of a protein (‘When and from where did my protein originate?’). Moreover, it may be helpful to include model organisms if these harbor additional experimental data, which often applies to budding and fission yeast, and sometimes to Arabidopsis thaliana. Experimental data may give hints about conservation of function, and/or indicate candidate orthologs (‘Exploring and validating grey zone hits’). Although for specific proteins other subsets may be more useful (‘Using a tailor-made database yields more comprehensible and more interesting results’), as an initial search database we suggest the species in this table. Of course, if any species not in this list is likely to be relevant for the protein of interest, one may add it or let it replace closely related species.
The suggested species can be selected when using BLASTP or phmmer online, or their proteome sequences can be downloaded from NCBI when running similarity searches locally. The species are colored according to the supergroup to which they belong (Box 3).

Species                       Taxon             Eukaryotic supergroup  Taxonomy ID
Homo sapiens                  Chordata          Opisthokonta           9606
Xenopus tropicalis            Chordata          Opisthokonta           8364
Drosophila melanogaster       Arthropoda        Opisthokonta           7227
Salpingoeca rosetta           Choanoflagellida  Opisthokonta           946362
Saccharomyces cerevisiae      Ascomycota        Opisthokonta           4932
Spizellomyces punctatus       Chytridiomycota   Opisthokonta           109760
Thecamonas trahens            Apusozoa          -                      529818
Acanthamoeba castellanii      Longamoebia       Amoebozoa              5755
Dictyostelium discoideum      Mycetozoa         Amoebozoa              44689
Amborella trichopoda          Streptophyta      Archaeplastida         13333
Klebsormidium flaccidum       Streptophyta      Archaeplastida         3175
Chlamydomonas reinhardtii     Chlorophyta       Archaeplastida         3055
Bathycoccus prasinos          Chlorophyta       Archaeplastida         41875
Cyanidioschyzon merolae       Rhodophyta        Archaeplastida         45157
Ectocarpus siliculosus        Stramenopila      SAR                    2880
Phytophthora infestans T30-4  Stramenopila      SAR                    403677
Plasmodium falciparum         Alveolata         SAR                    5833
Paramecium tetraurelia        Alveolata         SAR                    5888
Plasmodiophora brassicae      Rhizaria          SAR                    37360
Naegleria gruberi             Discoba           Excavata               5762
Bodo saltans                  Discoba           Excavata               75058
Giardia intestinalis          Metamonada        Excavata               5741
Trichomonas vaginalis         Metamonada        Excavata               5722
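One practical benefit of a local database with a known species list is that absences become explicit. The sketch below turns a list of hit species into a presence/absence profile and reports the species without a hit. The species are a subset of those suggested in Box 4; the hits are invented for illustration, and the function names are assumptions of this example.

```python
from typing import Dict, Iterable, List


def presence_profile(
    database_species: Iterable[str], hit_species: Iterable[str]
) -> Dict[str, bool]:
    """Map every species in the database to True (hit found) or False."""
    hits = set(hit_species)
    return {species: species in hits for species in database_species}


def species_without_hit(profile: Dict[str, bool]) -> List[str]:
    """List the species that lack a hit, in database order."""
    return [species for species, present in profile.items() if not present]


# A few species from the suggested selection, in species-tree order.
DATABASE = [
    "Homo sapiens",
    "Saccharomyces cerevisiae",
    "Dictyostelium discoideum",
    "Naegleria gruberi",
]
# Hypothetical search result: the species of each hit (duplicates = paralogs).
hits = ["Homo sapiens", "Saccharomyces cerevisiae", "Homo sapiens"]

profile = presence_profile(DATABASE, hits)
missing = species_without_hit(profile)
```

Species reported as missing are candidates for closer inspection, for example by searching their genomic DNA directly, before a loss is concluded.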

Conclusions & Outlook

With this article, we hope to have given relevant insights into our evolutionary analysis approaches, and we hope that these insights serve as a guide for doing this analysis. Essentially, it most often is a process that entails many feedback loops that revise the initial evolutionary scenario. This process is often unfinished: new data that become available, such as a resolved 3D structure or newly sequenced genomes, may alter the scenario, as may technological advances, such as in homology detection and improved tree building models and algorithms. The manually solved evolutionary history should therefore be considered the best estimate at the moment, not necessarily the definitive one. To address this uncertainty, we suggest that authors report doubtful cases: which set of orthologs possibly contains false positives or false negatives, and why? Which evolutionary events may require reexamination?

Various studies demonstrated that new information or new tools revise the evolutionary history of a protein. For example, genomes of Archaea showed that proteins previously labeled eukaryote-specific are in fact older than eukaryotes [200]. Very likely, certain scientific trends, such as the increasing diversity of available genomes and the rise of evolutionary cell biology, will bring in information that alters existing scenarios for the evolution of certain proteins. Various evolutionary biologists called for a better representation of eukaryotic diversity by studying and sequencing non-parasitic species, including unicellular heterotrophs [201, 202]. Indeed, the first sequenced non-parasitic excavate (Box 3), Naegleria gruberi, turned out to contain many more genes present in other eukaryotic lineages than parasitic excavates do, and as a result many of these genes could be inferred to have been present in LECA [203]. In light of the up-and-coming field of ‘evolutionary cell biology’, more non-model organisms will be studied on the cellular and molecular level, which allows us to validate our evolutionary predictions. Does the predicted ortholog indeed fulfill the same function in this organism, as postulated by the ortholog conjecture? Or, if no ortholog was found in the genome of this species, does it have another, analogous, protein fulfilling this role? By answering these and related questions, evolutionary cell biology will shed light on the association between the evolution of proteins and the evolution of function.

The joint efforts of many researchers have generated detailed hypotheses on the evolution of a wide array of proteins. Likely, many cell biologists would be interested in quickly retrieving this information. Unfortunately, these data are now often hidden in research articles. If we and others shared manually defined orthologs on a wider, homogeneously formatted platform, these would be easier to access. This does not necessarily need to be a new platform: maybe our research community could help improve existing databases such as EggNOG, PANTHER or TreeBASE by adding the knowledge of the proteins we studied [34, 168, 204, 205].

Acknowledgments

We kindly thank Carlos Sacristan, Simona Antonova and Mathilde Galli for their critical and useful feedback on our manuscript.

Author contributions

BS designed the manuscript. JJEH, ET, TJPD, GJPLK and BS wrote the manuscript.


3. Evolutionary dynamics of the kinetochore network as revealed by comparative genomics

Jolien JE van Hooff, Eelco Tromer, Leny M van Wijk, Berend Snel# and Geert JPL Kops#

# joint senior authors

EMBO Reports, 2017

Abstract

During eukaryotic cell division, the sister chromatids of duplicated chromosomes are pulled apart by microtubules, which connect via kinetochores. The kinetochore is a multiprotein structure that links centromeres to microtubules, and that emits molecular signals in order to safeguard the equal distribution of duplicated chromosomes over daughter cells. Although microtubule-mediated chromosome segregation is evolutionarily conserved, kinetochore compositions seem to have diverged. To systematically inventory kinetochore diversity and to reconstruct its evolution, we determined orthologs of 70 kinetochore proteins in 90 phylogenetically diverse eukaryotes. The resulting ortholog sets imply that the last eukaryotic common ancestor (LECA) possessed a complex kinetochore and highlight that current-day kinetochores differ substantially. These kinetochores diverged through gene loss, duplication and, less frequently, invention and displacement. Various kinetochore components co-evolved with one another, albeit in different manners. These co-evolutionary patterns improve our understanding of kinetochore function and evolution, which we illustrated with the RZZ, TRIP13, the MCC and some nuclear pore proteins. The extensive diversity of kinetochore compositions in eukaryotes poses numerous questions regarding evolutionary flexibility of essential cellular functions.

Key words: kinetochore, co-evolution, eukaryotic diversity, gene loss, evolutionary cell biology

Introduction

During mitotic cell division, eukaryotes physically separate duplicated sister chromatids using microtubules within a bipolar spindle. These microtubules pull the sister chromatids in opposite directions, toward the spindle poles from which they emanate [206]. Current knowledge indicates that all eukaryotes use microtubules for chromosome separation, suggesting that the last eukaryotic common ancestor (LECA) also did. Microtubules and chromatids are connected by the kinetochore, a multi-protein structure that is assembled on the centromeric chromatin [207, 208]. Functionally, the kinetochore proteins can be subdivided into three main categories: proteins that connect to the centromeric DNA (inner kinetochore), proteins that connect to the spindle microtubules (outer kinetochore), and proteins that perform signaling functions at the kinetochore in order to regulate chromosome segregation. These signaling functions consist of the spindle assembly checkpoint (SAC), which prevents sister chromatids from separating before all have stably attached to spindle microtubules, and attachment error correction, which ensures that these sister chromatids are attached by microtubules that emanate from opposite poles. Together, the SAC and error correction machineries ensure that both daughter cells acquire a complete set of chromosomes.

Although microtubule-mediated chromosome segregation is conserved across eukaryotes, their mitotic mechanisms differ. For example, some species, such as those in animal lineages, disassemble the nuclear envelope during mitosis (‘open mitosis’), while others, such as yeasts, completely or partially maintain it (‘(semi-)closed mitosis’) [209]. Species differ also in their kinetochore composition, both in the inner and in the outer kinetochore. For example, Drosophila melanogaster and Caenorhabditis elegans lack most components of the constitutive centromere-associated network (CCAN), a protein network in the inner kinetochore. In the outer kinetochore, diverse species employ either the Dam1 (e.g. various Fungi, Stramenopila and unicellular relatives of Metazoa) or the Ska complex (most Metazoa and Viridiplantae and some Fungi) for tracking depolymerizing microtubules [210]. The kinetochore of the Excavate species Trypanosoma brucei mostly consists of proteins that do not seem homologous to the ‘canonical’ kinetochore proteins [113, 116]. Studying the evolution of kinetochore proteins revealed how kinetochore diversity was shaped by different modes of genome evolution: the inner kinetochore CenpB-like proteins were recurrently domesticated from transposable elements [211], the outer kinetochore protein Knl1 displays recurrent repeat evolution [212], SAC proteins Bub1/BubR1/Mad3 (MadBub) duplicated and subfunctionalized multiple times in eukaryotic evolution [143, 144] and the SAC protein p31comet was recurrently lost [213].

Prior comparative genomics studies reported on kinetochore compositions in eukaryotes [213, 214]. These studies raised various questions, such as: Are kinetochores in general indeed highly diverse? How often do kinetochore proteins evolve in a recurrent manner in different lineages? How frequent is loss of kinetochore proteins? Does the kinetochore consist of different evolutionary modules? To address these and other questions, we studied the eukaryotic diversity of the kinetochore by scanning a large and diverse set (90) of eukaryotic genomes for the presence of 70 kinetochore proteins. We deduced the kinetochore composition of LECA and shed light on how, after LECA, eukaryotic kinetochores diversified. To understand this evolution functionally, we detected co-evolution among kinetochore complexes, proteins and short linear motifs: co-evolving kinetochore components are likely functionally interdependent. Furthermore, we found that certain species contain as yet unexplained kinetochore compositions, such as absences of proteins that are crucial in model organisms. We nominate such species for further investigation into their mitotic machineries.

Results

Eukaryotic diversity in the kinetochore network

We selected 70 proteins that compose the kinetochore (see Materials and Methods). For comparison, we also included proteins that constitute the Anaphase-Promoting Complex/Cyclosome (APC/C), which is targeted by kinetochore signaling. We identified orthologous sequences of these kinetochore and APC/C proteins in 90 diverse eukaryotic lineages by performing in-depth homology searches. Our methods were aimed at maximizing detection of a protein’s orthologs even if it evolves rapidly, which is the case for many kinetochore proteins (as we discuss below). The resulting sets of orthologous sequences are available (Dataset 1). We projected the presences and absences of proteins (‘phylogenetic profiles’) across eukaryotes (Figure 1, Materials and Methods). In spite of our thorough homology searches, for some proteins the ortholog in a given species might have diverged too extensively to recognize it, resulting in a ‘false’ absence. We however think that, globally, our analysis gives an accurate representation of kinetochore proteins in eukaryotes (Discussion).

[Figure 1: presence/absence matrix. Columns: the 90 species, grouped by eukaryotic supergroup (Opisthokonta, Amoebozoa, Excavata, Stramenopila-Alveolata-Rhizaria, Archaeplastida). Rows, in clustering order (● = inferred to have been present in LECA): Mad2 ●, MadBub ●, Borealin ●, Cdc20 ●, Spc24 ●, Zwint-1 ●, Mad1 ●, Mps1 ●, Knl1 ●, Nnf1 ●, Mis12 ●, Nsl1 ●, Dsn1 ●, Bub3 ●, ZW10 ●, CenpX ●, CenpS ●, Spc25 ●, CenpC ●, CenpA ●, Ndc80 ●, Nuf2 ●, Zwilch ●, Rod ●, Spindly, ARHGEF17, Survivin ●, Cep57 ●, Plk ●, CenpI ●, CenpN ●, CenpP ●, CenpH ●, CenpK ●, CenpT ●, CenpL ●, CenpO ●, CenpW ●, CenpU, CenpQ, CenpF, CenpM ●, Astrin, SKAP, CenpR, Apc15 ●, Sgo ●, Ska3 ●, Ska1 ●, Ska2 ●, BugZ ●, TRIP13 ●, p31comet ●, CenpE ●, Aurora ●, Dad2, Duo1, Spc19, Hsk3, Spc34, Dad3, Dad4, Ask1, Dam1, Dad1, Ctf13, Ndc10, Cep3, Incenp ●, Skp1 ●.]

Figure 1. The kinetochore network across 90 eukaryotic lineages. Presences and absences (“phylogenetic profiles”) of 70 kinetochore proteins in 90 eukaryotic species. Top: Phylogenetic tree of the species in the proteome set, with colored areas for the eukaryotic supergroups. Left side: Kinetochore proteins clustered by average linkage based on the pairwise Pearson correlation coefficients of their phylogenetic profiles. Protein names have the same colors if they are members of the same complex. Proteins inferred to have been present in LECA are indicated (●). The orthologous sequences (including sets of APC/C subunits, NAG, RINT1, HORMAD, Nup106, Nup133, Nup160) are available as FASTA files in Dataset 1, allowing full usage of our data for further evolutionary cell biology investigations.
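The similarity measure behind such profile clustering can be sketched as follows. A phylogenetic profile is encoded as a list of 1 (present) and 0 (absent) over a fixed species order, and two proteins are compared with the Pearson correlation coefficient. The profiles below are made up for illustration, and the clustering step itself (average linkage on these correlations) is not shown.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[int], y: Sequence[int]) -> float:
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("profiles must be non-empty and of equal length")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A protein present in all or in no species has no defined correlation.
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / sqrt(var_x * var_y)


# Hypothetical profiles over six species (1 = present, 0 = absent).
protein_a = [1, 1, 0, 0, 1, 0]
protein_b = [1, 1, 0, 0, 1, 0]  # identical distribution to protein_a
protein_c = [0, 0, 1, 1, 0, 1]  # exactly complementary to protein_a
protein_d = [1, 0, 1, 0, 1, 0]  # partially overlapping with protein_a

r_ab = pearson(protein_a, protein_b)
r_ac = pearson(protein_a, protein_c)
r_ad = pearson(protein_a, protein_d)
```

Proteins that are gained and lost together obtain coefficients close to 1, whereas mutually exclusive distributions, as expected for analogous complexes that displace one another, give values close to -1.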

[Figure EV1: presence/absence matrix. Columns: the 90 species, grouped by eukaryotic supergroup (Opisthokonta, Amoebozoa, Excavata, Stramenopila-Alveolata-Rhizaria, Archaeplastida). Rows, in clustering order: Cdh1, Apc8, Apc6, Apc2, Apc3, Apc5, Apc4, Apc13, Apc7, Apc1, Apc16, Apc15, Apc12, Cdc20, Apc10, Apc11, Apc9.]

Figure EV1. Anaphase-promoting complex/cyclosome (APC/C) subunits across 90 eukaryotic lineages. Presences and absences (“phylogenetic profiles”) of APC/C subunits in 90 eukaryotic species. Top: Phylogenetic tree of the species in the genome set, with colored areas for the eukaryotic supergroups. Left side: APC/C proteins clustered by average linkage based on the pairwise Pearson correlation coefficients of their phylogenetic profiles. The orthologous sequences are available as FASTA files in Dataset 1, allowing full usage of our data for further evolutionary cell biology investigations.

We inferred the evolutionary histories of the proteins by applying Dollo parsimony, which allows only for a single invention and infers subsequent losses based on maximum parsimony. Of the 70 kinetochore proteins, 49 (70%) were inferred to have been present in LECA (Figure 1, Figure 2A, C). CenpF, Spindly and three subunits of the CenpO/P/Q/R/U complex probably originated more recently. The Dam1 complex likely originated in early fungal evolution and may have propagated to non-fungal lineages via horizontal gene transfer [210].
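Dollo parsimony, as applied in this chapter, can be sketched on a toy species tree. The single gain is placed at the last common ancestor of all species that have the protein, and every maximal clade below that ancestor without the protein counts as one loss. The nested-tuple tree, the species labels A to F and the function names are assumptions of this illustration, not the implementation used for the analysis.

```python
from typing import Set, Tuple, Union

# A tree is either a species name (leaf) or a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def leaves(tree: Tree) -> Set[str]:
    """Return the set of species names below a node."""
    if isinstance(tree, str):
        return {tree}
    names: Set[str] = set()
    for child in tree:
        names |= leaves(child)
    return names


def count_losses(node: Tree, present: Set[str]) -> int:
    """Count maximal clades below node in which the protein is absent."""
    if not leaves(node) & present:
        return 1  # one loss on the branch leading to this clade
    if isinstance(node, str):
        return 0
    return sum(count_losses(child, present) for child in node)


def dollo(tree: Tree, present: Set[str]) -> Tuple[Set[str], int]:
    """Return the species of the clade of origin and the number of losses."""
    if not leaves(tree) & present:
        raise ValueError("the protein is not present in any species of the tree")
    node = tree
    # Descend while all presences sit within a single child clade.
    while not isinstance(node, str):
        holders = [child for child in node if leaves(child) & present]
        if len(holders) != 1:
            break
        node = holders[0]
    return leaves(node), count_losses(node, present)


# Hypothetical species tree with six species.
TOY_TREE: Tree = ((("A", "B"), "C"), ("D", ("E", "F")))

origin_clade, losses = dollo(TOY_TREE, {"A", "B", "E"})
```

In this toy case the protein is traced back to the root of the tree and three independent losses are inferred (in C, D and F); averaging such loss counts over proteins gives numbers comparable to those reported for the kinetochore and the APC/C.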

[Figure 2: kinetochore schematics in four panels. A: Homo sapiens. B: Tetrahymena thermophila. C: Saccharomyces cerevisiae. D: Cryptococcus neoformans. Each panel is divided into outer kinetochore, inner kinetochore, centromeric DNA and inner centromere. Legend: protein in LECA, colored by occurrence frequency (0-0.4, 0.4-0.6, 0.6-0.8, 0.8-1.0); protein not in LECA; protein present; protein absent.]

Figure 2. Kinetochores of model and non-model species. A. The human kinetochore. The colors of the proteins indicate if they were inferred to be present in LECA and their occurrence frequency across eukaryotes (see Materials and Methods). B. The predicted kinetochore of Tetrahymena thermophila projected onto the human kinetochore. C. The budding yeast kinetochore. Similar to panel (B). D. The predicted kinetochore of Cryptococcus neoformans projected onto the budding yeast kinetochore.

Kinetochore proteins are less conserved than APC/C subunits (Figure EV1, Appendix Table S1, [215]). Species on average possess 48% of the kinetochore proteins, compared to 70% of the APC/C subunits. Species that we predict to contain relatively few kinetochore proteins include Tetrahymena thermophila (Figure 2B) and Cryptococcus neoformans (Figure 2D). Some kinetochore proteins are absent from many different lineages, likely resulting from multiple independent gene loss events. We counted losses of kinetochore and APC/C proteins during post-LECA evolution using Dollo parsimony. On average, kinetochore proteins were lost 16.5 times since LECA, while APC/C proteins were lost 13.1 times (not significantly different for kinetochore vs. APC/C). Our homology searches hinted that some kinetochore proteins also evolve rapidly at the sequence level. The kinetochore proteins indeed have relatively high dN/dS values, a common measure of sequence evolution: when comparing mouse and human gene sequences, kinetochore proteins scored an average dN/dS of 0.24, compared to 0.06 for the APC/C proteins (p=0.0016) and 0.15 for all human proteins (p=4.8e-5). The loss frequency and sequence evolution seem to be correlated, suggesting a common underlying cause for poor conservation (Figure EV2, Discussion). Overall, the kinetochore seems to evolve more flexibly than the APC/C.

[Figure EV2: two scatter plots with one labeled point per kinetochore or APC/C protein. A: dN/dS human–mouse (y axis, 0.0 to 0.6) against number of losses (x axis, 0 to 35). B: % identity human–mouse (y axis, 40 to 100) against number of losses (x axis, 0 to 35).]

Figure EV2. Loss frequencies and sequence evolution of kinetochore and APC/C proteins. A, B. Scatter plots for loss frequencies and dN/dS values (A) and percent identity (B) of human–mouse orthologs for the kinetochore and APC/C proteins that were inferred to have been present in LECA. Loss frequencies and dN/dS values positively correlate (P = 3.9e-5, Spearman correlation), whereas loss frequencies and percent identity negatively correlate (P = 0.0005, Spearman correlation).
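The loss counts above rest on Dollo parsimony, which assumes that a protein can be gained only once and can afterwards only be lost. The counting rule can be illustrated with a minimal Python sketch. It is not the pipeline used for the analysis: it assumes that the protein was already present in LECA (the root of the tree), it represents the species tree as nested tuples, and the five-species tree and species labels are hypothetical.

```python
def count_dollo_losses(tree, present):
    """Count losses of a protein that is assumed to be present at the root.

    `tree` is a nested tuple whose leaves are species names (strings).
    `present` is the set of species in which the protein was detected.
    Under Dollo parsimony the protein cannot be regained, so every maximal
    clade without any presence corresponds to exactly one loss.
    """
    def visit(node):
        # Returns (clade_contains_a_presence, losses_inside_the_clade).
        if isinstance(node, str):
            return (node in present, 0)
        results = [visit(child) for child in node]
        if not any(has for has, _ in results):
            # Whole clade lacks the protein: the loss is counted further up.
            return (False, 0)
        losses = sum(n for _, n in results)
        # Each child clade without the protein is one loss on its branch.
        losses += sum(1 for has, _ in results if not has)
        return (True, losses)

    has, losses = visit(tree)
    # If no species has the protein, it was lost once, at the root.
    return losses if has else 1


# Hypothetical toy tree and presence set (illustration only).
toy_tree = ((("sp_a", "sp_b"), "sp_c"), ("sp_d", "sp_e"))
print(count_dollo_losses(toy_tree, {"sp_a", "sp_d", "sp_e"}))
```

In this toy case the protein is absent from sp_b and from sp_c, which are not sister species, so two independent losses are inferred.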

We not only mapped the presences and absences of kinetochore proteins, we also counted their copy number in each genome (Figure EV3). As observed before, MadBub and Cdc20 are often present in multiple copies. These proteins likely duplicated in different lineages and subsequently the resulting paralogs subfunctionalized [143, 144, 213]. CenpE, Rod, Survivin, Sgo and the mitotic kinases Aurora and Plk also have elevated copy numbers. Possibly these proteins also underwent (recurrent) duplication and subfunctionalization, as for example suggested for Sgo: in the lineages of Schizosaccharomyces pombe, Arabidopsis thaliana and mammals, Sgo duplicated and likely subsequently subfunctionalized in a recurrent manner [216-218].

[Figure EV3: heatmap of copy numbers (color scale: 0, 2, 4, 6, 8, ≥10). Columns: the 90 eukaryotic species, ordered by the species tree and grouped by supergroup (Opisthokonta, Amoebozoa, Excavata, Stramenopila-Alveolata-Rhizaria, Archaeplastida). Rows: Aurora, Incenp, Survivin, Borealin, Sgo, CenpA, Ndc10, Ctf13, Cep3, Skp1, CenpT, CenpW, CenpS, CenpX, CenpC, CenpL, CenpN, CenpH, CenpI, CenpK, CenpM, CenpO, CenpP, CenpQ, CenpR, CenpU, Plk, Mis12, Nnf1, Dsn1, Nsl1, Cep57, SKAP, Astrin, Knl1, Zwint-1, Spc24, Spc25, Nuf2, Ndc80, Mps1, ARHGEF17, Ska1, Ska2, Ska3, Dam1, Duo1, Dad1, Dad2, Dad3, Dad4, Hsk3, Ask1, Spc19, Spc34, Rod, Zwilch, ZW10, Spindly, MadBub, Bub3, BugZ, CenpF, CenpE, Mad1, Mad2, Cdc20, Apc15, p31comet, TRIP13.]

Figure EV3. Copy numbers of kinetochore proteins. Heatmap indicating the copy numbers of each kinetochore protein in the 90 eukaryotic lineages. Please note that these copy numbers might contain some over- and underestimates due to unpredicted or imperfectly predicted genes and database errors.

Co-evolution within protein complexes of the kinetochore

Subunits of a single kinetochore complex tend to co-occur across genomes: they have similar patterns of presences and absences (‘phylogenetic profiles’, Figure 1A). Such co-occurring subunits likely co-evolved as a functional unit [28]. To quantify how similar phylogenetic profiles are, we calculated the Pearson correlation coefficient (r) for each kinetochore protein pair. We defined a threshold of r=0.477 for protein pairs likely to be interacting, based on the scores among established interacting kinetochore pairs (Appendix Figure S1). All pairwise scores were used to cluster the proteins (Figure 1 including Dataset 1 and Dataset 2) and to visualize the proteins using t-Distributed Stochastic Neighbor Embedding (t-SNE, Appendix Figure S2) [12]. Many established interacting proteins correlate well and, as a result, cluster together and are in close proximity in our t-SNE map. Examples include the SAC proteins Mad2 and MadBub, centromere proteins (CENPs) located in the inner kinetochore (discussed below), the Ska complex and the Dam1 complex. Such complexes, with subunits having highly similar phylogenetic profiles, evolved as a functional unit.
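The profile comparison described above can be illustrated with a short Python sketch. It encodes each phylogenetic profile as a list of 1 (present) and 0 (absent) per species, computes the Pearson correlation coefficient, and applies the threshold of r=0.477 from the text. The function names and the three eight-species profiles are hypothetical and serve only as an illustration; they are not data from this study.

```python
from math import sqrt

# Threshold for likely interacting pairs, as stated in the text.
INTERACTION_THRESHOLD = 0.477


def pearson(x, y):
    """Pearson correlation coefficient of two equally long 0/1 profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def likely_interacting(x, y, threshold=INTERACTION_THRESHOLD):
    """True if the profile correlation reaches the chosen threshold."""
    return pearson(x, y) >= threshold


# Hypothetical toy profiles over eight species (1 = present, 0 = absent).
protein_a = [1, 1, 0, 0, 1, 0, 1, 0]
protein_b = [1, 1, 0, 0, 1, 0, 1, 1]  # nearly identical to protein_a
protein_c = [0, 0, 1, 1, 0, 1, 0, 1]  # exact inverse of protein_a

print(round(pearson(protein_a, protein_b), 3))
print(round(pearson(protein_a, protein_c), 3))
```

Profiles that differ in a single species still correlate well above the threshold, whereas the inverse profile yields r=-1, the pattern expected for functional analogs.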

While some kinetochore proteins have highly similar phylogenetic profiles, others lack similarity, pointing to a more complex interplay between evolution and function. First, two proteins might have strongly dissimilar, or inverse, phylogenetic profiles, potentially because they are functional analogs [132]. In the kinetochore network, phylogenetic dissimilarity is observed for proteins of the Dam1 complex and of the Ska complex, which are indeed analogous complexes [210, 219, 220]. Second, proteins that do interact in a complex might nevertheless have little similarity in their phylogenetic profiles. Such a complex did not evolve as a functional unit, either because its subunits started to interact only recently [221], or because one of its subunits serves a non-kinetochore function and thus also co-evolves with non-kinetochore proteins [222]. An example of a potentially recently emerged interaction is BugZ-Bub3: these proteins form a kinetochore complex in human [223, 224], but have little similarity in their phylogenetic profiles, as measured by their low correlation (r=0.187). In general, BugZ’s phylogenetic profile differs from those of other kinetochore proteins, hence this protein might have been added to the kinetochore recently [225, 226]. An example of a kinetochore protein that co-evolves with non-kinetochore proteins is ZW10, which joins Rod and Zwilch in the RZZ complex. The phylogenetic profile of ZW10 is dissimilar from those of Rod and Zwilch (r=0.218 for Rod, r=0.236 for Zwilch), while those are very similar to each other (r=0.859, Figure 3), due to ZW10 being present in various species that lack Rod and Zwilch. In those species, ZW10 might not localize to the kinetochore but perform only its role in vesicular trafficking, in a complex with NAG and RINT1 (NRZ complex [227]). Indeed, the ZW10 phylogenetic profile is much more similar to those of NAG (r=0.644) and RINT1 (r=0.512) than to those of Rod and Zwilch. Hence, ZW10 more strongly co-evolves with NAG and RINT1.
The Rod and Zwilch phylogenetic profiles are similar to that of Spindly (r=0.730 for Rod, r=0.804 for Zwilch), a confirmed RZZ-interacting partner [228-230]. These similarities argue for an evolutionary ‘Rod-Zwilch-Spindly’ (RZS) module, rather than an RZZ module.

The phylogenetic profiles of kinetochore proteins shed new light on these proteins’ (co-)evolution and on their function, examples of which are discussed in detail below.

The CCAN evolved as an evolutionary unit that is absent from many lineages

The kinetochore connects the centromeric DNA, mainly via CenpA, to the spindle microtubules, mainly via Ndc80. In human and yeast, CenpA and Ndc80 are physically linked by the constitutive centromere-associated network (CCAN, reviewed in [231]). Physically, the CCAN comprises multiple protein complexes (Figure 2). Evolutionarily, however, it comprises a single unit, as the majority of CCAN proteins have highly similar phylogenetic profiles (Figure 1, average r=0.513). Four CCAN proteins are very different from the others: CenpC, CenpR, CenpX and CenpS. CenpC is widely present and is sufficient to assemble at least part of the outer kinetochore in D. melanogaster and humans [104, 232]. CenpR seems a recent gene invention in animals. CenpX and CenpS have a more ubiquitous distribution compared to other CCAN proteins, possibly due to their non-kinetochore role in DNA damage repair [233, 234].

Our study confirmed that most CCAN proteins have no (detectable) homologs in C. elegans and D. melanogaster. The CCAN is not only absent from these model species, but also from many other lineages, such as various animals and fungi, and all Archaeplastida. Because the CCAN is found in three out of five eukaryotic supergroups, it likely was present in LECA, and subsequently lost multiple times in diverse eukaryotic lineages. Alternatively, the CCAN was invented more recently and horizontally transferred among eukaryotic supergroups. However, under both scenarios the CCAN was recently lost in various lineages, for example in the basidiomycete fungi: while Ustilago maydis has retained the CCAN, its sister clade C. neoformans eliminated it (Figure 2D). The finding that most of the CCAN (with the exception of CenpC) is absent in many eukaryotic lineages poses questions about kinetochore architectures in these species. Since they generally possess a protein binding to the centromeric DNA (CenpA, see Figure EV4 for details on identifying the orthologs of CenpA) and a protein binding to the spindle microtubules (Ndc80), their kinetochore is not wholly unconventional. Is the bridging function of the CCAN simply dispensable, as proposed for D. melanogaster [235], or is it carried out by other, non-homologous protein complexes? In order to answer these questions, the kinetochores of diverse species that lack the CCAN should be experimentally examined in more detail.

[Figure EV4: gene phylogeny of histone H3 homologs, with the centromeric H3 (CenpA) cluster indicated. Tree scale: 0.1. Bootstraps: 1-100.]

Figure EV4. Gene phylogeny of histone H3 homologs. To find the putative orthologs of CenpA, we first aligned candidate orthologous sequences, which were experimentally identified centromeric H3 variants in divergent species (indicated with a pink branch in this phylogeny). From this alignment, we constructed a profile HMM and performed multiple HMM searches through our local proteome database. From these searches, we selected 831 sequences (belonging to the histone H3 family), aligned these and constructed the gene phylogeny, which is presented in this figure (see also Materials and Methods). We rooted the phylogeny on the cluster that contained all of these experimentally identified centromeric H3 variants and some additional sequences that, based on best blast hits, were also likely to be orthologous to CenpA. The cluster did not contain the candidate orthologs in Toxoplasma gondii [18]. We do not know whether this is due to an error in the gene phylogeny, or to parallel invention of a centromeric H3 variant in this species, which would mean that it is not orthologous to CenpA. Nevertheless, we included these sequences in the orthologous group. The candidate centromeric H3 variants that are part of the CenpA cluster include sequences from all five eukaryotic supergroups: Homo sapiens [30], Saccharomyces cerevisiae [37], Drosophila melanogaster [38], Caenorhabditis elegans [40], Schizosaccharomyces pombe [41] (Opisthokonta), Dictyostelium discoideum [43] (Amoebozoa), Arabidopsis thaliana [44] (Archaeplastida), Tetrahymena thermophila [45], Plasmodium falciparum [46] (SAR), Giardia intestinalis [48] and Trichomonas vaginalis [51] (Excavata). The original gene tree in newick format is provided (Dataset 3).

Absence of co-evolution between RZS and its putative kinetochore receptor Zwint-1

Various studies suggested that the RZZ/RZS complex is recruited to the kinetochore primarily by Zwint-1. Zwint-1 itself localizes to the kinetochore by binding to Knl1 [236, 237]. We compared the phylogenetic profile of Zwint-1 to the profiles of these interaction partners: RZZ/RZS and Knl1 (Figure 3). While we searched for orthologs of Zwint-1, we concluded that Zwint-1, Kre28 (S. cerevisiae) and Sos7 (S. pombe) likely belong to the same orthologous group [238, 239], collectively referred to as ‘Zwint-1’. Although these sequences are only weakly similar, they can be linked by multidirectional homology searches (Appendix Figure S3).

[Figure 3: left, presence/absence matrix for NAG, ZW10, RINT1, Zwint-1, Knl1, Zwilch, Rod and Spindly across species (labeled groups: Metazoa, Fungi, Stramenopila, Alveolata, Embryophyta, Chlorophyta). Right, heatmap of the pairwise Pearson correlation coefficients r (scale 0 to 1, with threshold t marked) for the same eight proteins. Legend: present, absent.]

Figure 3. Phylogenetic profiles of the Rod–Zwilch–ZW10 (RZZ) complex, its mitotic interaction partners (Knl1, Zwint-1, and Spindly), and ZW10’s interphase interaction partners in the NRZ (NAG and RINT1) complex. Presences and absences across eukaryotes of the RZZ subunits, Spindly, Zwint-1, and Knl1, and of the NRZ subunits, NAG and RINT1. Colored areas indicate eukaryotic supergroups as in Figure 1. Right side: Pairwise Pearson correlation coefficients (r) between the phylogenetic profiles including a heatmap. The indicated threshold t represents the value of r for which we found a sixfold enrichment of interacting protein pairs (see Appendix Fig S1). See also Appendix Fig S3 for the procedure by which homology between Zwint-1, Sos7, and Kre28 was detected.

Our set of 90 species contains many species that possess a Zwint-1 ortholog (36 species), but lack RZS, and vice versa (11 species, -0.065 < r < 0). This lack of correlation strongly suggests that, at least in a substantial number of lineages, RZZ/RZS is not recruited to kinetochores by Zwint-1, but by another, yet unidentified factor. Support for this inference was recently presented in studies using human HeLa cells [240, 241]. Compared to RZS, the phylogenetic profile of Zwint-1 is more similar to that of Knl1 (Figure 3, r=0.506), and of Spc24 and Spc25 (Figure 1, r=0.529 for Spc24, r=0.499 for Spc25), two subunits of the Ndc80 complex that are located in close proximity to Knl1-Zwint-1 [242]. Perhaps Zwint-1 stabilizes the largely unstructured protein Knl1 [241], thereby indirectly affecting the recruitment of RZZ/RZS.
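The relation between such discordant species counts and a low r can be made explicit: for presence/absence profiles, the Pearson correlation coefficient equals the phi coefficient of the 2x2 table of co-occurrences. The Python sketch below tabulates the four counts and derives phi from them. The function names and the toy profiles are hypothetical; the counts reported in the text are not reproduced here.

```python
from math import sqrt


def contingency(x, y):
    """Return (both, only_x, only_y, neither) counts for two 0/1 profiles."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    only_x = sum(1 for a, b in zip(x, y) if a and not b)
    only_y = sum(1 for a, b in zip(x, y) if b and not a)
    neither = sum(1 for a, b in zip(x, y) if not a and not b)
    return both, only_x, only_y, neither


def phi(x, y):
    """Phi coefficient, identical to Pearson's r for binary profiles."""
    n11, n10, n01, n00 = contingency(x, y)
    denominator = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denominator == 0:
        raise ValueError("phi is undefined when a profile is constant")
    return (n11 * n00 - n10 * n01) / denominator


# Hypothetical toy profiles over eight species (1 = present, 0 = absent).
profile_x = [1, 1, 1, 1, 0, 0, 0, 0]
profile_y = [1, 1, 0, 0, 1, 1, 0, 0]

print(contingency(profile_x, profile_y))
print(phi(profile_x, profile_y))
```

When the discordant species (only_x and only_y) are as numerous as the concordant ones, as in this toy case, the numerator vanishes and r is zero.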

Higher-order co-evolution between the AAA+ ATPase TRIP13 and HORMA domain proteins

SAC activation and SAC silencing are both promoted by the AAA+ ATPase TRIP13. TRIP13 operates by using the HORMA domain protein p31comet to structurally inactivate the SAC protein Mad2, also a HORMA domain protein (Figure 4A). Since the SAC requires Mad2 to continuously cycle between inactive and active conformations, TRIP13 enables SAC signaling in prometaphase. In metaphase, however, when no new active Mad2 is generated, TRIP13 stimulates SAC silencing [243-245]. The TRIP13 ortholog of budding yeast, Pch2, probably has a molecularly similar function in meiosis: Pch2 is proposed to bind oligomers of the HORMA domain protein Hop1 (HORMAD1 and HORMAD2 in mammals, hereafter referred to as ‘HORMAD’) and to structurally rearrange one copy within the oligomer, resulting in its redistribution along the chromosome axis. HORMAD, p31comet and Mad2 are homologous and belong to the family of HORMA-domain proteins that also includes Rev7 [246] and autophagy-related proteins Atg13 and Atg101 [247, 248]. All of these proteins likely descend from an ancient HORMA-domain protein that duplicated before LECA.

[Figure 4: A, schematic of TRIP13 (N-terminal domain, NTD, and AAA+ domain) acting on Mad2, via p31comet, and on HORMAD while hydrolyzing ATP to ADP. B, presence/absence matrix for TRIP13, p31comet, HORMAD and Mad2 across species (labeled groups: Metazoa, Fungi, Stramenopila, Alveolata, Embryophyta, Chlorophyta). C, numbers of species with concordant and discordant presences for TRIP13 versus p31comet (r=0.526), versus HORMAD (r=0.517) and versus the joint HORMAD and p31comet profile (r=0.766).]

Figure 4. The co-evolutionary patterns of the multifunctional protein TRIP13. A. Model for the mode of action of TRIP13 as recently suggested [11]. By hydrolyzing ATP, TRIP13 would change the conformation of HORMAD and Mad2 from closed to open, the latter via binding to co-factor p31comet, which forms a heterodimer with Mad2. TRIP13 has a C-terminal AAA+ ATPase domain (AAA+) and an N-terminal domain (NTD) and forms a hexamer [20]. B. Presences and absences of TRIP13 and of its interaction partners p31comet and HORMAD. Colored areas indicate eukaryotic supergroups as in Figure 1. C. Numbers of lineages in which TRIP13 is present or absent, compared to the presences of p31comet, HORMAD or their joint presences. Also the Pearson correlation coefficient of the phylogenetic profiles as in (B) is given.

Although the TRIP13 phylogenetic profile is relatively similar to both the profiles of p31comet (r=0.526) and HORMAD (r=0.517), TRIP13 does not co-occur with these proteins in multiple species (Figure 4B). These exceptions to the co-occurrences of TRIP13/p31 and TRIP13/HORMAD can be explained by the dual role of TRIP13, which is to interact with both p31comet and with HORMAD. If we combine profiles of p31comet and HORMAD, the similarity with TRIP13 increases: the joint p31comet and HORMAD profile strongly correlates with the TRIP13 profile (r=0.766, Figure 4C). TRIP13 was indeed expected to co-evolve with both of its interaction partners, as has been demonstrated for other multifunctional proteins [222]. Based on the phylogenetic profiles, we conclude that TRIP13 is only retained if at least p31comet or HORMAD is present (with the exception of the diatom Phaeodactylum tricornutum). We predict that TRIP13-containing species that lost p31comet but retained HORMAD, such as S. cerevisiae and Acanthamoeba castellanii, only use TRIP13 during meiosis and not in mitosis.

The phylogenetic profiles of SAC proteins predict a role for nuclear pore proteins in the SAC

Because similar phylogenetic profiles reflect the functional interaction of proteins, similar phylogenetic profiles also predict such interactions. We applied this rationale by comparing the phylogenetic profiles of the kinetochore proteins (Figure 1) to those of proteins of the genome-wide PANTHER database in search of unidentified connections. PANTHER is a database of families of homologous proteins from complete genomes across the tree of life. We assigned all proteins present in our eukaryotic proteome database to these homologous families (see Materials and Methods). For each kinetochore protein in Figure 1, we listed the 30 best matching (with the highest Pearson correlation coefficient) families in PANTHER, and screened which PANTHER families occur often in these lists (Appendix Table S3). Within this list, we considered the nuclear pore protein Nup160 an interesting candidate, because it is part of the Nup107-Nup160 nuclear pore complex that localizes to the kinetochore [249, 250]. The phylogenetic profile of Nup160 (as defined by PANTHER) was particularly similar to that of the SAC protein MadBub (r=0.718). In order to improve the phylogenetic profile of Nup160, we manually determined the orthologous group of Nup160 in our own proteome dataset. We also determined those of Nup107 and Nup133, two other proteins of the Nup107-Nup160 complex. The Nup160, Nup133 and Nup107 phylogenetic profiles strongly correlated to those of SAC proteins MadBub (0.541 < r < 0.738) and Mad2 (0.528 < r < 0.715, Figure 5), even more strongly than these three nuclear pore proteins correlated with one another (0.475 < r < 0.601). Furthermore, Nup160, Nup133 and Nup107 correlate better with MadBub and Mad2 than these SAC proteins do with the other SAC proteins (MadBub: average r=0.563, Mad2: average r=0.511) and far better than these SAC proteins do with all kinetochore proteins (MadBub: average r=0.290, Mad2: average r=0.239). While previous studies have shown that the Nup107-Nup160 complex localizes to the kinetochore in mitosis, our analysis in addition suggests that these proteins may function in the SAC and that they potentially interact with Mad2 and MadBub.
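The screening step described above (taking, per kinetochore protein, the 30 best-correlating PANTHER families and then asking which families recur across these lists) can be sketched in Python as follows. The sketch assumes that the correlations have already been computed and stored in a nested dictionary; the function name, the protein and family labels and the scores are all hypothetical.

```python
from collections import Counter


def recurrent_families(scores, k=30):
    """Rank families by how often they appear in the per-protein top-k lists.

    `scores[protein][family]` holds the profile correlation between a
    kinetochore protein and a protein family.
    """
    counts = Counter()
    for family_scores in scores.values():
        # Families sorted by descending correlation, truncated to the best k.
        best = sorted(family_scores, key=family_scores.get, reverse=True)[:k]
        counts.update(best)
    return counts.most_common()


# Hypothetical toy scores for two proteins and three families.
toy_scores = {
    "protein_a": {"family_1": 0.70, "family_2": 0.20, "family_3": 0.50},
    "protein_b": {"family_1": 0.60, "family_2": 0.65, "family_3": 0.10},
}

print(recurrent_families(toy_scores, k=2))
```

With k=2, family_1 is the only family that appears in both top lists, so it ranks first as a candidate for follow-up.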

[Figure 5: clustered heatmap of the pairwise Pearson correlation coefficients r (color scale 0.2 to 1, with threshold t marked) for Bub3, Cdc20, Mad1, Mps1, Nup107, Nup160, Nup133, Mad2 and MadBub.]

Figure 5. Correlations between proteins of the Nup107-160 complex and proteins of the SAC. Heatmap indicating the pairwise Pearson correlation coefficients (r) of the phylogenetic profiles of proteins of the Nup107-160 complex and of the SAC. The clustering (average linkage) on the left side of this heatmap was also based on these correlations. The indicated threshold t represents the Pearson correlation coefficient for which we found a 6-fold enrichment of interacting protein pairs (see Appendix Figure S1).

[Figure EV5: A, phylogeny of Viridiplantae (Streptophyta and Chlorophyta), with each species marked as having the canonical MIM, the land plant MIM or a ‘transition’ MIM in Mad1, and with the mutation of the canonical MIM marked on the tree. B, sequence logos of the Mad1 MIM and the Cdc20 MIM, with the MIM definition below. C, co-occurrence counts for Mad2 and Mad1 (left, r=0.492) and for Mad2 and the Mad1 MIM (right, r=0.609). D, co-occurrence counts for Mad2 and Cdc20 (left, r=0.440) and for Mad2 and the Cdc20 MIM (right, r=0.519). Legend: X, protein absent; MIM.]

Figure EV5. Evolution of the Mad2-interacting motif (MIM) in green plants and co-occurrences of Mad2 with the MIM under a less strict motif definition. A. Viridiplantae (green plants) phylogeny [3] and the occurrences of the canonical MIM or the ‘land plant’ MIM in Mad1 orthologs of the associated species. *Species lacking an aligned MIM, possibly caused by incomplete gene prediction of Mad1 orthologs. B. The sequence logos of the MIMs of Mad1 (upper panel) and Cdc20 (lower panel) based on the alignments of the motifs present in the right-sided panels of C and D. Below is indicated the required amino acid sequence of the MIM (+: positive residue, Φ: hydrophobic residue, P: proline). In contrast to Figure 6, the MIM is considered present if it agrees with the pattern [ILV](2)X(3,7)P or [RK][ILV](2), in order that the land plant motif suffices. C, D. Left side: numbers of presences and absences of Mad2 in 90 eukaryotic species and its interaction partners Mad1 (C) and Cdc20 (D). Right side: frequencies of Mad2 and MIM (according to definition in B) occurrences in species having Mad1 (C) or Cdc20 (D), respectively. Also the Pearson correlation coefficients (r) for the corresponding phylogenetic profiles are shown.

[Figure 6: A, sequence logos of the Mad1 MIM and the Cdc20 MIM, with the MIM definition below. B, co-occurrence counts for Mad2 and Mad1 (left, r=0.492) and for Mad2 and the canonical Mad1 MIM (right, r=0.512). C, co-occurrence counts for Mad2 and Cdc20 (left, r=0.440) and for Mad2 and the canonical Cdc20 MIM (right, r=0.755). Legend: X, protein absent; Mad2-interacting motif (MIM).]

Figure 6. Phylogenetic co-occurrence of Mad2 with its interaction partners Mad1 and Cdc20 and their Mad2-interacting motifs (MIMs). A. The sequence logos of the MIMs of Mad1 (upper panel) and Cdc20 (lower panel) based on the multiple sequence alignments of the motifs. Below is indicated the required amino acid sequence of the MIM (+: positive residue, Φ: hydrophobic residue, P: proline) which is restricted by the pattern [RK][ILV](2)X(3,7)P. B, C. Left side: numbers of presences and absences of Mad2 in 90 eukaryotic species and its interaction partners Mad1 (B) and Cdc20 (C). Right side: frequencies of Mad2 and canonical MIM occurrences in species having Mad1 (B) or Cdc20 (C), respectively. Also the Pearson correlation coefficients (r) for the corresponding phylogenetic profiles are shown.

The Mad2-interacting motif (MIM) in Mad1 and Cdc20 is coupled to Mad2 presence

While interacting proteins are expected to co-evolve at the protein-protein level, as exemplified by many complexes within the kinetochore, interacting proteins might also co-evolve at different levels, such as protein-motif. Co-evolution between a protein and a protein motif has been incidentally detected before, for example in the case of CenpA and its interacting motif in CenpC [251] and in the case of MOT1 and four critical phenylalanines in TBP [160]. We here explore potential co-evolution of Mad2 with the protein motif it interacts with in Cdc20 and Mad1: the Mad2-interacting motif (MIM). Both the Mad2-Mad1 and the Mad2-Cdc20 interactions operate in the SAC [252, 253]. We defined the phylogenetic profiles of the MIM in Mad1 and Cdc20 [254, 255] (Figure 6A) by inspecting the multiple sequence alignments of Mad1 and Cdc20. These alignments revealed that the MIM is found at a similar position across the Mad1 and Cdc20 orthologs, hence the motif likely predates LECA in both these proteins. Notable differences exist between the MIMs of Cdc20 and Mad1, which could reflect differences in binding strength to Mad2.

The phylogenetic profiles of Mad2 and of the MIM in Cdc20 or Mad1 orthologs correlated more strongly than those of the full-length proteins (Figure 6B,C). In particular, species lacking Mad2, but having Mad1 and/or Cdc20, never contained the canonical MIM in either their Cdc20 or their Mad1 sequences (hypergeometric test: p < 10⁻⁴ and p < 10⁻⁹ for Mad1 and Cdc20, respectively). Such species hence likely lost Mad2 and subsequently lost the MIM in Mad1 and Cdc20, because it was no longer functional. Moreover, absence of the MIM in Mad1/Cdc20 supports that in these species Mad2 is indeed absent. While we expected to only find a MIM in species that actually have Mad2, we also expected the reverse: that species that have Mad2 also have a MIM in their Mad1/Cdc20. This is however not the case, most notably for Mad1: many lineages (14) have both Mad1 and Mad2 but lack the MIM in Mad1. A substantial fraction (six) of this group belongs to the land plant species that have a somewhat different motif in Mad1 that is conserved within this lineage (Figure EV5A). This altered land plant motif might mediate the Mad1-Mad2 interaction, which has been reported in A. thaliana [256]. If we consider this plant motif to be a ‘valid’ MIM, the Mad1-MIM and Mad2 correlate substantially better (Figure EV5B-D). Overall, under both motif definitions the protein-motif correlations are higher than the protein-protein correlations. Hence, including protein motifs can expose that interaction partners co-evolve, albeit at a different level, and may help to predict functional interactions between proteins de novo.
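The two motif definitions used here, the canonical pattern [RK][ILV](2)X(3,7)P (Figure 6) and the less strict pattern [ILV](2)X(3,7)P or [RK][ILV](2) (Figure EV5), translate directly into regular expressions. The Python sketch below shows that translation. It is a simplification: it scans a whole sequence, whereas the profiles in this study were defined by inspecting the aligned motif position, and the peptide strings are hypothetical toy examples rather than real Mad1 or Cdc20 sequences.

```python
import re

# Canonical MIM: positive residue, two hydrophobic residues,
# three to seven arbitrary residues, then a proline.
CANONICAL_MIM = re.compile(r"[RK][ILV]{2}.{3,7}P")

# Less strict definition that also accepts the land plant motif.
RELAXED_MIM = re.compile(r"[ILV]{2}.{3,7}P|[RK][ILV]{2}")


def has_mim(sequence, pattern=CANONICAL_MIM):
    """True if the motif pattern occurs anywhere in the protein sequence."""
    return pattern.search(sequence) is not None


# Hypothetical toy peptides (not real protein sequences).
toy_peptides = {
    "canonical": "AAKILSSSSPAA",    # K, IL, four residues, P
    "no_positive": "AAEILSSSSPAA",  # lacks the leading R/K
    "no_proline": "AAKILAAA",       # lacks the trailing P
}

for name, peptide in toy_peptides.items():
    print(name, has_mim(peptide), has_mim(peptide, RELAXED_MIM))
```

The second and third peptides fail the canonical definition but pass the relaxed one, which illustrates why the choice of motif definition changes the MIM profile and hence its correlation with Mad2.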

Discussion

Our evolutionary analyses revealed that since LECA, the kinetochores of different lineages strongly diverged by different modes of genome evolution: kinetochore proteins were lost, duplicated and/or invented, or diversified on the sequence level. In addition to straightforward protein-protein co-evolution, we found alternative evolutionary relationships between proteins that hint at a more complex interplay between evolution and function. Some established interacting proteins have not co-evolved (Zwint-1 and RZS, Bub3 and BugZ), which has previously been shown for other interaction partners to reflect evolutionary flexibility [221]. Lack of co-evolution may also reflect that a protein has multiple different functions, for which it interacts with different partners. The phylogenetic profile of such a multifunctional protein differs from either of its interaction partners, and instead is similar to the combined profiles of its interaction partners [222], as we showed for HORMAD and p31comet with TRIP13. Some co-evolutionary relationships predicted novel protein functions, such as nuclear pore proteins operating in the SAC, which should be confirmed with experiments. Finally, not only proteins, but also functional protein motifs co-evolved with their interaction partner, as we found for Mad2 and the MIMs in Cdc20/Mad1. Probably, including more proteins and (known and de novo predicted) motifs/domains will not only improve the correlation between known interaction partners, but will also enhance predicting yet unknown interactions and functions.

While we carefully curated the orthologous groups of each of the kinetochore proteins, their phylogenetic profiles might contain some false positives and/or false negatives: incorrectly assigned presences (because a protein sequence in fact is not a real ortholog) and incorrectly assigned absences (because a species does contain an ortholog, but we did not detect it). For the majority of kinetochore proteins, we estimate the chance of false negatives to be larger than that of false positives, mainly because they are likely vulnerable to homology detection failure, given that their sequences evolve so rapidly (Appendix Table S1, Results). Such false negatives of a particular protein will result in falsely inferred gene loss events. A failure to detect homology might therefore also cause sequence divergence to correlate with loss frequency (Figure EV2). Specific examples of suspicious absences (potential false negatives) include the inner centromere protein Borealin in S. cerevisiae and the KMN network proteins Spc24, Spc25, Nsl1/Dsn1 in D. melanogaster and C. elegans, and possibly Ndc80 in T. brucei, since functional counterparts of these proteins have been characterized in these species [113, 257-262]. Moreover, species that we predicted to have very limited kinetochore compositions, such as T. thermophila (Figure 2B), might actually contain highly divergent orthologs that we could not detect. If such a species’ kinetochore were examined biochemically, its undetected orthologs might be uncovered. Although the phylogenetic profiles of the kinetochore proteins presented here might contain some such errors, we think that our manual curation of the orthologous groups (see Materials and Methods) yields an accurate global representation of the presences and absences of these proteins among eukaryotes. We think this accuracy is supported by the high similarity of phylogenetic profiles of interacting proteins.

The set of kinetochore proteins we studied here is strongly biased towards yeast and animal lineages; lineages that are relatively closely related on the eukaryotic tree of life. This bias is due to the extensive experimental data available for these lineages. Highly different kinetochores might exist, such as the kinetochore of T. brucei [113, 116]. If in the future we know the experimentally validated kinetochore compositions of a wider range of eukaryotic species, we could sketch a more complete picture of kinetochore evolution and could potentially expand and improve our functional predictions.

Since the kinetochore seems highly diverse across species, several questions arise. Is the kinetochore less conserved than other core eukaryotic cellular systems/pathways, as comparing it to the APC/C suggested? And if so, why is it allowed to be less conserved, or are many of the alterations adaptive to the species? Why do certain lineages (such as multicellular animals and plants) contain a particular kinetochore submodule (such as the Ska complex) while others (such as most fungi) lack it, or have an alternative system (such as Dam1)? Do these genetic variations among species have functional consequences for kinetochore-related processes in their cells? To answer such questions, our dataset should be expanded with specific (cellular) features and lifestyles, when this information becomes available for the species in our genome dataset. Together with biological and biochemical analyses of processes in unexplored species, an expanded dataset may reveal the true flexibility of the kinetochore in eukaryotes and show how chromosome segregation is executed in diverse species. The comparative genomics analysis that we presented here provides a starting point for such an integrated approach into studying kinetochore diversity and evolution, since it allows for informed decisions about which species to study.

Materials and Methods

Constructing the proteome database

To study the occurrences of kinetochore genes across the eukaryotic tree of life, we constructed a database containing the protein sequences of 90 eukaryotic species. This size was chosen because we consider it to be sufficiently large to represent eukaryotic diversity, but also small enough to allow for manual detection of orthologous genes. We selected the species for this database based on four criteria. First, the species should have a unique position in the eukaryotic tree of life, in order to obtain a diverse set of species. Second, if available we selected two species per clade, which facilitates the detection of homologous sequences and the construction of gene phylogenies. Third, widely used model species were preferred over other species. Fourth, if multiple proteomes and/or proteomes of different strains of a species were available, the most complete one was selected. Completeness was measured as the percentage of core KOGs (248 core eukaryotic orthologous groups [263]) found in that proteome. If multiple splice variants of a gene were annotated, the longest protein was chosen. A unique protein identifier was assigned to each protein, consisting of 4 letters and 6 numbers. The letters combine the first letter of the genus name with the first three letters of the species name. The versions and sources of the selected proteomes can be found in Appendix Table S2.
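The identifier scheme and the splice-variant rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for this study: the function names, the upper-casing of the letters and the zero-padded running number are hypothetical, since the text only specifies four letters followed by six numbers.

```python
from typing import Dict, List


def species_prefix(genus: str, species: str) -> str:
    """First letter of the genus plus the first three letters of the species name.

    Upper-casing is an assumption; the text does not state the letter case.
    """
    return (genus[0] + species[:3]).upper()


def protein_identifier(genus: str, species: str, number: int) -> str:
    """Four letters followed by six digits (zero-padding is an assumption)."""
    if not 0 <= number <= 999999:
        raise ValueError("number must fit in six digits")
    return f"{species_prefix(genus, species)}{number:06d}"


def longest_variant(variants: Dict[str, List[str]]) -> Dict[str, str]:
    """Keep the longest protein sequence among the splice variants of each gene."""
    return {gene: max(seqs, key=len) for gene, seqs in variants.items()}
```

For example, under these assumptions `protein_identifier("Homo", "sapiens", 42)` gives `HSAP000042`.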

Ortholog detection

The set of kinetochore proteins we studied was selected based on three criteria: (1) localizing to the kinetochore, (2) being present in at least three lineages and (3) having an established role, supported by multiple studies, in the kinetochores and/or kinetochore signaling in human or in budding yeast. We applied a procedure comprising two different methods to find orthologs for the kinetochore proteins in our set within our database of 90 eukaryotic proteomes, and the same procedure was followed for determining orthologs of the APC/C proteins, NAG, RINT1, Nup107, Nup133, Nup160 and HORMAD. The method of choice depended on whether or not it was straightforward to find homologs across different lineages for a specific protein. In both methods, initial searches started with the human sequence, or, if the protein is not present in humans, with the budding yeast sequence.

Method 1. If many homologs were easily found, the challenge was to distinguish orthologs from outparalogs. Here we defined an orthologous group as comprising proteins that result from speciation events and that can be traced back to a single gene in LECA, whereas outparalogs are related proteins that resulted from a pre-LECA duplication. For example, Cdc20 and Cdh1 are homologous proteins, both having their own orthologous groups among the eukaryotes. They resulted from a duplication before LECA, therefore members of the Cdc20 and Cdh1 groups are outparalogs to each other. To find homologs, we used blastp online to search the non-redundant protein sequences (nr) database [264]. We aligned the sequences found with MAFFT [265] (version v7.149b, option einsi, or linsi in case of expected different architectures) to make a profile HMM (www.hmmer.org, version HMMER 3.1b1). If the homologs are known to share only a certain domain, that domain was used for the HMM, otherwise we used the full-length alignment. This HMM was used as input for hmmsearch to detect homologs across our own database of 90 eukaryotic proteomes. From the hits in this database, we took a substantial number of the highest scoring hit sequences, up to several hundreds. We aligned the hit sequences using MAFFT and trimmed the alignment with trimAl [266] (version 1.2, option automated1).
Subsequently, RAxML version 8.0.20 [267] was used to build a gene tree (settings: varying substitution matrices, GAMMA model of rate heterogeneity, rapid bootstrap analysis of 100 replicates). We interpreted the resulting gene tree by comparing it to the species tree and thereby determined which clusters form orthologous groups. These orthologous groups were identified by finding the cluster that contained sequences from a broad range of eukaryotic species and had a sister cluster that also has sequences from this broad range of species. The cluster that contained the initial human query sequence was the orthologous group of interest, while the sister cluster is the group of outparalogs. In our search for orthologs of CenpA, we applied this first method. CenpA is part of the large family of histone H3 proteins and has long been recognized to diverge rapidly, which makes it a challenge to reconstruct CenpA’s evolution [268]. We determined this orthologous cluster with the help of experimentally identified centromeric histone H3 variants in a wide range of species and we included two Toxoplasma gondii sequences that were not part of this cluster. For details, see Figure EV4. The tree in this figure was visualized using iTOL [269].

Method 2. If homologs were not easily found, no outparalogs were obtained by these searches and hence the homologs defined the orthologous group. For these cases we used a different strategy to find the orthologous group in our database. Iterative searching methods (jackhmmer and/or psi-blast) were applied to find homologs across the nr and UniProt databases [270]. In specific cases we cut the initial query sequence, for example to remove putative coiled-coil regions. If a protein returned very few hits, we tried to expand the set of putative homologous sequences by using some of the initially obtained hits as a query. If candidate orthologous proteins were reported in experimental studies in species other than human or budding yeast, but not found by initial searches, we specifically searched using those as a query. If this search yielded hits overlapping with previous searches, these candidate orthologous sequences were added to the set of hits. The sequences in this set were aligned to obtain a refined profile HMM. In addition, we searched for conserved motifs in the hit sequences using MEME [271] (version 4.9.0), which aided in recognizing conserved positions that could characterize the homologs. The obtained profile HMM was used to search for homologs across our local database. The resulting hits were checked for motifs identified by MEME and used as queries in online (iterative) homology searches to check whether we retrieved sequences already identified as orthologous. Based on this evaluation of individual hits, we defined a scoring threshold for the hmmsearch with this profile HMM and searched our database until no new hits were found. The resulting set of sequences was the orthologous group of interest. The sequences of the orthologous groups can be found in Dataset 1.

Calculating correlations between phylogenetic profiles

In order to study the co-evolution of the kinetochore proteins and to infer potential functional relationships of these genes based on co-evolution, we derived the phylogenetic profiles of these genes. The phylogenetic profile of a gene is a list of its presences and absences across our set of 90 eukaryotic genomes based on the composition of the orthologous groups. The phylogenetic profile consists of a string of 90 characters containing a “1” if the gene is present in a particular species (either single- or multi-copy), and a “0” if it is absent. To reveal whether two genes often co-occur in species, we measured how similar their phylogenetic profiles were using the Pearson correlation coefficient [4]. All pairwise scores can be found in Dataset 2. To identify pairs of proteins that potentially have a functional association, we applied a threshold of r=0.477. Appendix Figure S1 clarifies why the Pearson correlation coefficient was chosen and how the threshold was set. The Pearson correlation coefficients of all gene pairs were converted into distances (d = 1-r) and the genes were clustered based on their phylogenetic profiles using average linkage. The Pearson correlation coefficients were also used to map the kinetochore proteins in 2D by Barnes-Hut t-SNE (Appendix Figure S2) [12].
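The profile comparison described above can be illustrated with a small, dependency-free sketch. It computes the Pearson correlation coefficient between two presence/absence strings, converts it into the distance d = 1 - r, and applies the threshold r = 0.477 given in the text. The function names and the toy four-species profiles are hypothetical; the average-linkage clustering and the t-SNE mapping are not shown.

```python
from math import sqrt
from typing import Dict, Tuple

THRESHOLD = 0.477  # threshold on r taken from the text


def pearson(profile_a: str, profile_b: str) -> float:
    """Pearson correlation between two equally long strings of '0' and '1'."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same species")
    a = [int(c) for c in profile_a]
    b = [int(c) for c in profile_b]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A gene present in all or in no species has no defined correlation.
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_a * var_b)


def distance(profile_a: str, profile_b: str) -> float:
    """Distance d = 1 - r, as used for clustering."""
    return 1.0 - pearson(profile_a, profile_b)


def associated(profile_a: str, profile_b: str) -> bool:
    """True if the pair reaches the correlation threshold."""
    return pearson(profile_a, profile_b) >= THRESHOLD


def pairwise_distances(profiles: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """Distances for all unordered gene pairs, keyed by sorted gene names."""
    names = sorted(profiles)
    return {(a, b): distance(profiles[a], profiles[b])
            for i, a in enumerate(names) for b in names[i + 1:]}
```

With real data each string would have 90 characters, one per species, in a fixed species order.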

Detecting the MIM in Mad1 and Cdc20 orthologs

We made multiple sequence alignments of the Cdc20 and Mad1 orthologous groups using MAFFT (option einsi). We used these alignments to search for the Mad2-interacting motif (MIM). The typical MIM is defined by [KR][IVL](2)X(3,7)P for both Mad1 and Cdc20 [254, 255], but we also used an alternative definition: [ILV](2)X(3,7)P or [RK][ILV](2). We inferred that the location of the motif in the protein is conserved in Mad1 as well as in Cdc20, because the position of the MIM in the multiple sequence alignments was the same in highly divergent species (e.g. plants and animals). For all orthologous sequences, we checked whether the motif, either the typical MIM (Figure 6) or the alternative MIM (Figure EV5), was present at these conserved positions.
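The two motif definitions above translate directly into regular expressions. The sketch below is an illustration only: the function name, the optional alignment window and the toy sequences are hypothetical, and in the actual analysis the motif was checked at the conserved alignment positions rather than by scanning arbitrary sequence.

```python
import re

# [KR][IVL](2)X(3,7)P: K or R, two of I/V/L, three to seven residues, then P.
TYPICAL_MIM = re.compile(r"[KR][IVL]{2}[A-Z]{3,7}P")
# Alternative definition: [ILV](2)X(3,7)P or [RK][ILV](2).
ALTERNATIVE_MIM = re.compile(r"[ILV]{2}[A-Z]{3,7}P|[RK][ILV]{2}")


def has_motif(aligned_sequence, pattern, start=0, end=None):
    """Check a motif inside one window of an aligned sequence.

    `start` and `end` are alignment columns delimiting the conserved region
    (hypothetical parameters); gap characters are removed before matching.
    """
    window = aligned_sequence[start:end].upper().replace("-", "")
    return pattern.search(window) is not None
```

For instance, the made-up sequence `AAKILSSSSPAA` satisfies both definitions, whereas `AAILSSSSPAA` only satisfies the alternative one.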

Finding novel proteins functioning in the kinetochore

To find new proteins performing essential roles at the kinetochore by phylogenetic profiling, a reference protein set was needed. This reference set was based on the protein families present in PANTHER. More specifically, we assigned the proteins within our proteome database of 90 eukaryotic species to PANTHER (sub)families [272] (version 10). This assignment was done by applying hmmscan to the protein sequences of our database, using the complete set of PANTHER family and subfamily HMMs as a search database. Each protein was assigned to the PANTHER (sub)family to which it had the highest hit, if significant. If a protein was assigned to a subfamily, it was also assigned to the full family to which that subfamily belongs. For each PANTHER (sub)family, a phylogenetic profile was constructed and compared to the phylogenetic profiles of the kinetochore proteins. For each kinetochore protein, the best 30 matches of PANTHER (sub)families were selected. The proteins often occurring in these top lists can be found in Appendix Table S3.
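The final counting step, selecting the best 30 matches per kinetochore protein and tallying how often each PANTHER (sub)family appears in those lists, can be sketched as below. The nested-dictionary input format and the function names are assumptions made for illustration; the correlation values would come from the profile comparison described earlier.

```python
from collections import Counter
from typing import Dict, List


def top_matches(scores: Dict[str, float], k: int = 30) -> List[str]:
    """Names of the k (sub)families with the highest correlation."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def top_list_frequency(correlations: Dict[str, Dict[str, float]],
                       k: int = 30) -> Counter:
    """Count in how many kinetochore proteins' top-k lists each family occurs.

    `correlations` maps kinetochore protein -> {PANTHER family -> r}.
    """
    counts: Counter = Counter()
    for scores in correlations.values():
        counts.update(top_matches(scores, k))
    return counts
```

Calling `top_list_frequency(correlations).most_common()` then lists the families in order of decreasing frequency, which corresponds to the "Frequency (top 30)" column of Appendix Table S3.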

Comparing diversity of kinetochore and APC/C proteins

For the kinetochore and APC/C proteins in this dataset, we calculated their occurrence frequencies and entropies across 90 eukaryotic species. The entropy reflects a protein’s diversity of presences and absences across species: a protein that is present in half of the species has the highest entropy. We also calculated and compared all pairwise Pearson correlation coefficients of the phylogenetic profiles for both of these protein datasets. To assess how complete the kinetochores and APC/C complexes of the species in our dataset are, we calculated the percentage of present kinetochore proteins in species having Ndc80 and CenpA (because those species are expected to have a kinetochore), and we calculated the percentage of present APC/C proteins in species having the main APC/C enzyme Apc10. Loss frequencies were inferred from Dollo parsimony for all kinetochore and APC/C proteins inferred to have been present in LECA. Transitions (also a measure for the evolutionary dynamics of proteins) were measured for each protein by counting all changes in state (so from present to absent, or from absent to present) along a phylogenetic profile. Since the ordering of the species in the phylogenetic profile is an indication of their relatedness, these transitions are expected to reflect the evolutionary flexibility of proteins as well. dN/dS and percent identity scores for human and mouse sequences were derived from Ensembl [273] (downloaded via Ensembl BioMart on November 24, 2016). If multiple one-to-one orthologs for a single orthologous group/family exist, the average dN/dS or percent identity was taken. The results of these kinetochore-APC/C comparisons can be found in Appendix Table S1.
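Three of the per-protein measures above (occurrence frequency, entropy and transitions) can be computed directly from a presence/absence string. The sketch below assumes the entropy is the binary Shannon entropy in bits, which is maximal for a protein present in half of the species, as the text describes. It returns raw transition counts; any normalization behind the values reported in Appendix Table S1 is not specified here and is therefore left out. The function names are hypothetical.

```python
from math import log2


def frequency(profile: str) -> float:
    """Fraction of species in which the protein is present."""
    return profile.count("1") / len(profile)


def entropy(profile: str) -> float:
    """Binary Shannon entropy (bits) of the presence/absence distribution."""
    p = frequency(profile)
    if p == 0.0 or p == 1.0:
        return 0.0  # no diversity: always absent or always present
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def transitions(profile: str) -> int:
    """Number of state changes between neighbouring species in the profile.

    Assumes the profile is ordered so that neighbours are closely related.
    """
    return sum(1 for a, b in zip(profile, profile[1:]) if a != b)
```

A profile such as `110010` has a frequency of 0.5, an entropy of 1 bit and three transitions.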

Acknowledgments

We thank the members of the Kops and Snel labs for critical reading and helpful discussion on the manuscript. We thank John van Dam for his contribution to compiling the eukaryotic genome database. This work was supported by the UMC Utrecht and the Netherlands Organisation for Scientific Research (NWO-Vici 865.12.004 to GK).

Author contributions

BS and GK designed the research. JH and ET performed the research. LW contributed the eukaryotic genome database. JH, BS and GK analyzed the data and wrote the paper.

Supplementary Material

Appendix Table S2 and Datasets 1-3 can be found online: http://bioinformatics.bio.uu.nl/jolien/thesis/chapter3_eukaryotic_kinetochore_evolution/

Appendix Table S1. Measures of protein diversity in the set of kinetochore and APC/C proteins. Scores present the average across the proteins, except in the case of completeness (average across species). Statistical validity was assessed by performing an unpaired, two-sided t-test.

Diversity feature (p-value kinetochore vs. APC/C)    Kinetochore (average)    APC/C (average)
Frequency (p=0.0051)                                 0.463                    0.689
Entropy (p=0.0248)                                   0.731                    0.578
Pearson correlation coefficient (p=0.0006)           0.219                    0.267
Completeness (p=6.98E-13)                            0.485                    0.701
Losses (Dollo parsimony, p=0.146)                    16.4                     13.1
Transitions (p=0.672)                                0.173                    0.184
% Identity (human-mouse, p=0.0016)                   74.8%                    89.2%
dN/dS (human-mouse, p=1.80e-5)                       0.245                    0.059

Appendix Table S3. Phylogenetic profiles of the kinetochore proteins were compared to PANTHER10 (sub)families. The PANTHER10 (sub)families that have similar phylogenetic profiles as many kinetochore proteins (as measured by the Pearson correlation coefficient, indicated by their frequency in the top 30 of each kinetochore protein) are shown here.

PANTHER10 (sub)family; Frequency (top 30); Pearson correlation coefficient (r, average); Protein; Information from
PTHR11444.SF3|ARGININOSUCCINATE LYASE; 12; 0.491; ARGININOSUCCINATE LYASE; human
PTHR28080|FAMILY NOT NAMED; 12; 0.486; Pex3: Peroxisomal Biogenesis Factor 3; human
PTHR12309.SF12|CENTROMERE PROTEIN N; 12; 0.770; CenpN; human
PTHR14582|FAMILY NOT NAMED; 11; 0.683; CenpO; human
PTHR23342.SF0|N-ACETYLGLUTAMATE SYNTHASE, MITOCHONDRIAL; 11; 0.621; NAGS: N-Acetylglutamate Synthase; human
PTHR28262|FAMILY NOT NAMED; 10; 0.773; Spc19; human
PTHR12856|TRANSCRIPTION INITIATION FACTOR IIH-RELATED; 10; 0.568; GTF2H1: General Transcription Factor IIH Subunit 1; human
PTHR14401|FAMILY NOT NAMED; 10; 0.715; CenpK; human
PTHR10606.SF39|6-PHOSPHOFRUCTO-2-KINASE/FRUCTOSE-2,6-BISPHOSPHATASE YLR345W-RELATED; 10; 0.773; Similar to 6-phosphofructo-2-kinase enzymes; yeast
PTHR14778|FAMILY NOT NAMED; 10; 0.642; Dsn1; human
PTHR31749|FAMILY NOT NAMED; 10; 0.624; Nsl1; human
PTHR18460|UNCHARACTERIZED; 10; 0.544; TTI1: TELO2 Interacting Protein 1; human
PTHR34832|FAMILY NOT NAMED; 10; 0.636; CenpW; human
PTHR10555.SF136|VACUOLAR PROTEIN SORTING-ASSOCIATED PROTEIN 17; 10; 0.773; Vps17; yeast
PTHR28017|FAMILY NOT NAMED; 10; 0.773; Dad3; yeast
PTHR21286|NUCLEAR PORE COMPLEX PROTEIN NUP160; 10; 0.533; Nup160; human
PTHR31382.SF4|NA(+)/H(+) ANTIPORTER; 9; 0.764; NHA1: Na+/H+ antiporter; yeast
PTHR24343.SF137|SERINE/THREONINE-PROTEIN KINASE RTK1-RELATED; 9; 0.761; RTK1; yeast
PTHR11689.SF93|ANION/PROTON EXCHANGE TRANSPORTER GEF1; 9; 0.761; Gef1; yeast
PTHR11266.SF8|MPV17-LIKE PROTEIN 2; 9; 0.592; MPV17L2: Mitochondrial Inner Membrane Protein Like 2; human
PTHR31740.SF2|CENTROMERE PROTEIN L; 9; 0.667; CenpL; human
PTHR28051.SF1|RESISTANCE TO GLUCOSE REPRESSION PROTEIN 1; 9; 0.709; REG1: Regulatory subunit of type 1 protein phosphatase Glc7p; yeast
PTHR28113|FAMILY NOT NAMED; 9; 0.799; Dam1; yeast
PTHR23139.SF60|CENTROMERE PROTEIN I; 9; 0.661; CenpI; human
PTHR12064.SF29|PROTEIN MAM3; 9; 0.756; Mam3; yeast
PTHR28036|FAMILY NOT NAMED; 9; 0.787; Dad2; yeast
PTHR23168|MITOTIC SPINDLE ASSEMBLY CHECKPOINT PROTEIN MAD1 MITOTIC ARREST DEFICIENT-LIKE PROTEIN 1; 9; 0.550; Mad1; human
PTHR28662|FAMILY NOT NAMED; 9; 0.655; CenpH; yeast
PTHR28077|FAMILY NOT NAMED; 9; 0.731; Kei1: Kex2-cleavable protein Essential for Inositol phosphorylceramide synthesis; yeast
PTHR11365.SF11|SUBFAMILY NOT NAMED; 9; 0.731; OXP1: OxoProlinase; yeast

[Figure: enrichment (y-axis) plotted against coverage (x-axis) for each measure. Phylogeny-insensitive measures: chance co-occurrence probability distribution, Jaccard index, mutual information, Pearson correlation coefficient. Phylogeny-sensitive measures (Dollo parsimony): Dollo Fisher's exact, Differential Dollo, Dollo overall.]

Appendix Figure S1. Performance of various measures that compare phylogenetic profiles in predicting physically interacting proteins. Various metrics quantify the similarity between phylogenetic profiles, such as Pearson correlation coefficient, Hamming distance, chance co-occurrence probability distribution, Jaccard index, mutual information [4, 5] and various phylogeny-sensitive measures such as those based on Dollo parsimony [28]. We compared these metrics by assessing how well they return known physically interacting genes. A set of physically interacting proteins was obtained for our proteins of interest (kinetochore proteins) using the BioGRID [35]. For each metric, we calculated the enrichment of these confirmed interacting protein pairs among pairs having a given phylogenetic profile similarity score (converted into the coverage of all possible protein pairs at that similarity score). Across most scores, the Pearson correlation coefficient returns the highest number of interacting pairs. For this Pearson correlation coefficient (r), the threshold t was set at the r value that yields 6-fold enrichment of interacting pairs relative to pairs for which no interaction is observed.

[Figure: two-dimensional t-SNE map of the kinetochore proteins, with each protein name placed at its map position. Axes: t-SNE dimension 1 (x) and t-SNE dimension 2 (y).]

Appendix Figure S2. t-SNE map of kinetochore proteins. The kinetochore proteins were visualized using a Barnes-Hut implementation of t-Distributed Stochastic Neighbor Embedding (t-SNE) [12] based on their pairwise distances measured by the Pearson correlation coefficient of the phylogenetic profiles. The protein names are colored according to their complex memberships, identical to Figure 1.

[Figure: scheme of single and iterative homology searches (arrows, labeled with e-values) linking Schizosaccharomyces pombe (Sos7, U3H042), Saccharomyces cerevisiae (Kre28, Q04431) and Homo sapiens (Zwint-1, O95229) via sequences from Bipolaris oryzae (W6YUM4), Trichosporon asahii (J5Q385), Wickerhamomyces ciferrii (K0KQQ5), Capsaspora owczarzaki (A0A0D2WQB7), Zygosaccharomyces rouxii (C5DZ48), Strongylocentrotus purpuratus (W4Z8J2) and Oncorhynchus mykiss (A0A060WEI3). Color legend: Kre28-like, Sos7-like, Zwint-1-like.]

Appendix Figure S3. Establishing homology between Zwint-1, Sos7 and Kre28. Sequences that link Sos7 (Schizosaccharomyces pombe), Kre28 (Saccharomyces cerevisiae) and Zwint-1 (Homo sapiens) through homology searches (arrows) with corresponding e-values (searches against the UniProtKB database performed online on October 1, 2015, http://www.ebi.ac.uk/Tools/hmmer/), indicated by species and UniProt IDs. Colors indicate the protein to which each sequence is most similar. Of the sequences indicated here, the hit in Capsaspora owczarzaki, in addition to Sos7, Kre28 and Zwint-1, is in the proteome database used in this study. We used sequences in additional species to connect sequences in the database, as indicated by this scheme.

4 Unique phylogenetic distributions of the Ska and Dam1 complexes support functional analogy and suggest multiple parallel displacements of Ska by Dam1

Jolien JE van Hooff, Berend Snel# and Geert JPL Kops#

# joint senior authors

Genome Biology and Evolution, 2017

Abstract

Faithful chromosome segregation relies on kinetochores, the large protein complexes that connect chromatin to spindle microtubules. Although human and yeast kinetochores are largely homologous, they track microtubules with the unrelated protein complexes Ska (Ska-C, human) and Dam1 (Dam1-C, yeast). We here uncovered that Ska-C and Dam1-C are both widespread among eukaryotes, but in an exceptionally inverse manner, supporting their functional analogy. Within the complexes, all Ska-C and various Dam1-C subunits are ancient paralogs, showing that gene duplication shaped these complexes. We examined various evolutionary scenarios to explain the nearly mutually exclusive patterns of Ska-C and Dam1-C in present day species. We propose that Ska-C was present in the last eukaryotic common ancestor, that subsequently Dam1-C displaced Ska-C in an early fungus and got horizontally transferred to diverse non-fungal lineages, displacing Ska-C in these lineages too.

Key words: kinetochore, analogs, gene displacement, horizontal gene transfer, gene duplication, protein complex evolution

Main text

Distributions of Ska-C and Dam1-C are wide, phylogenetically coherent and inversely correlated

During eukaryotic cell division, duplicated sister chromatids are separated by the microtubules of the mitotic spindle. These microtubules connect to the sister chromatids via kinetochores, large protein structures that assemble onto the centromeric DNA [207]. Microtubules depolymerize to pull sister chromatids apart, while maintaining their connection to the kinetochore. Kinetochores track these depolymerizing microtubules using the Ska complex (Ska-C, three subunits) in human and the Dam1 complex (Dam1-C, ten subunits) in yeast (Figure 1A) [207]. While the human and yeast kinetochores are largely homologous, these complexes instead seem analogous. This raises the question when and how these complexes were invented and whether kinetochores of other eukaryotic species may use homologous complexes to track microtubules.

To trace the evolutionary histories of Dam1-C and Ska-C, we determined the occurrences (‘phylogenetic profiles’) of their subunits across the eukaryotic tree of life. We expected that microtubule-tracking complexes are broadly present in eukaryotic lineages, because microtubule-based chromosome segregation is conserved in eukaryotes [209]. Indeed, Ska-C subunits had been detected also in non-metazoan genomes and Dam1-C subunits had been detected in non-fungal ones [274]. To search for orthologs of Ska-C and Dam1-C subunits as well as of Ndc80, their interactor at kinetochores, we constructed a proteome database of 102 diverse eukaryotic species. This database was enriched for lineages reported to contain Dam1-C, in order to facilitate finding homologs of this apparently less abundant complex (see Materials and Methods). Finding homologs of subunits of these complexes is complicated, since their sequences are highly divergent and since the Dam1-C subunit sequences are short. Therefore, we performed rigorous homology searches and de novo gene prediction (see Materials and Methods). We detected Ska-C subunits in all and Dam1-C subunits in four out of five eukaryotic supergroups (Figure 1B). Only the Dam1-C subunit Spc19 seems restricted to Fungi.

Within each complex the subunits had highly similar phylogenetic profiles. This similarity indicates that each complex evolved as a single evolutionary unit and reflects the interdependencies of their subunits [28, 130]. Despite these similarities, various species lack subunits. These absences may be due to severe sequence divergence escaping our homology detection (i.e. false negatives: see Supplementary Text), or they might indicate that functional complexes can consist of a subset of the subunits or have incorporated other proteins. Moreover, 19 species that contain an Ndc80 ortholog (suggesting they use microtubule-based chromosome segregation) lack Ska-C as well as Dam1-C subunits. Whether these species do not need a microtubule-tracking complex at the kinetochore or whether they contain yet other, non-homologous complexes is unknown but of great interest for further investigation.

Although most species have Ska-C or Dam1-C (74% (75/102), defined as at least one Ska-C subunit or at least three Dam1-C subunits), very few have both (7% (7/102)) (Figure 1B). To quantify how dissimilar the phylogenetic profiles of the complexes are, we calculated the Pearson correlation coefficient (r) between any two subunits [4]. As expected, the intra-complex correlations were high (Ska-C: 0.72 < r < 0.81, Dam1-C: 0.51 < r < 0.91, Supplementary Table S1). Strikingly, however, the inter-complex correlations were negative (-0.38 < r < -0.19) (Figure 1C). We estimated that such negative correlations are only found in 1.6% of all possible protein pairs in a genome-wide screen (Supplementary Text). This strong negative correlation suggests that Ska-C and Dam1-C are disfavored to co-occur in a species. It furthermore supports functional analogy of the complexes and predicts that their kinetochore functions are conserved across eukaryotes [132].
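The presence criterion stated above (at least one Ska-C subunit, or at least three Dam1-C subunits) can be written down as a small classification rule. The sketch below is illustrative only: the function name and the representation of a species as a set of detected subunit names are assumptions.

```python
SKA_SUBUNITS = ("Ska1", "Ska2", "Ska3")
DAM1_SUBUNITS = ("Dam1", "Duo1", "Dad1", "Dad2", "Dad3",
                 "Dad4", "Ask1", "Hsk3", "Spc19", "Spc34")


def classify_species(detected_subunits):
    """Label a species by the microtubule-tracking complexes it carries.

    Ska-C counts as present with at least one detected subunit,
    Dam1-C with at least three, following the definition in the text.
    """
    has_ska = sum(s in detected_subunits for s in SKA_SUBUNITS) >= 1
    has_dam1 = sum(s in detected_subunits for s in DAM1_SUBUNITS) >= 3
    if has_ska and has_dam1:
        return "both"
    if has_ska:
        return "Ska-C"
    if has_dam1:
        return "Dam1-C"
    return "neither"
```

Applying this rule to every species in the database and counting the labels would reproduce the kind of summary given in the text (species with either complex versus species with both).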

Ska-C and Dam1-C are distributed in a wide and scattered, yet inverse manner. Such distributions are rare in eukaryotes, but, as also indicated by our genome-wide screen, they are not unique: translation elongation factors eEF-1α and EFL form another example [275]. Such distributions form a challenge for evolutionary reconstruction, because the reported low incidence of horizontal gene transfer (HGT) in eukaryotes argues for timing the origin of genes in the last common ancestor of species carrying that gene [53, 54]. Accordingly, Ska-C and Dam1-C would be inferred to have both been present in the LECA and in many other ancestral lineages, in contrast with what is observed in the majority of extant species. Subsequently, either Dam1-C or Ska-C was lost in most eukaryotic lineages. As such, this scenario would provide a unique case of parallel, reciprocal loss of non-homologous complexes. In another evolutionary scenario, one of the complexes was invented more recently than LECA and displaced the ancient complex, and subsequently spread to other clades of the eukaryotic tree of life via HGT, which would make this a unique case of eukaryote-to-eukaryote HGT and parallel gene displacement.

Figure 1. Presences and absences of the analogous Ska and Dam1 complexes. A. Illustration of Dam1-C and Ska-C at the kinetochore-microtubule interface. Analogy between the complexes consists of their function in tracking dynamic, depolymerizing microtubules, their regulation by Aurora kinases (Aurora B in human, Ipl1 in budding yeast) and their interaction with Ndc80 via its internal loop region. Ndc80 is part of the four-subunit complex Ndc80-C. B. Presences and absences (‘phylogenetic profiles’) of the Ska-C and Dam1-C subunits and of Ndc80 across eukaryotes. The eukaryotic supergroups are color-coded according to the legend. C. Pairwise correlations between the phylogenetic profiles of all subunits.
[Figure graphic not reproduced. Subunits shown in panels B and C: Spc19, Dad1, Dad2, Dad4, Ask1, Dam1, Duo1, Hsk3, Dad3, Spc34, Ska2, Ska3, Ska1, Ndc80. Supergroup legend: Opisthokonta, Amoebozoa, Excavata, SAR, Archaeplastida. Panel C color scale: Pearson correlation coefficient (r), from -0.2 to 1.]

Ancient gene duplications contributed to the origin of Ska-C and Dam1-C

To shed light on the origins of Ska-C and Dam1-C, we searched for distant homologous protein families for each subunit in the Pfam database [34] using sensitive profile-profile comparisons. The Dam1-C subunits hit some prokaryotic families (Supplementary Table S2, Supplementary Text) but no homologous prokaryotic complex was identified. Strikingly, all Ska-C and three Dam1-C subunits hit another subunit of the same complex, indicating that these subunits are homologs. More significant intra-complex hits were found after adding the query profiles (constructed from the multiple sequence alignments of the orthologous groups) to the Pfam database (see Materials and Methods). This latter search suggested that within Ska-C, all three subunits are homologous to one another, and that within Dam1-C, two sets of homologous subunits (Duo1-Dad2 and Dad1-Dad4-Ask1) exist (Figure 2, Supplementary Figure S1, Supplementary Table S3). Although Ska2 and Ska3 hit each other insignificantly (E-value = 13 or 22, dependent on the query profile), their common ancestry is implied by the transitive nature of homology.

This intra-complex homology reveals that gene duplications contributed to the invention of Dam1-C and Ska-C. Dam1-C and Ska-C share duplication as a mode of invention with other protein complexes [276]. One mechanism explaining this phenomenon is that if a homodimer-forming protein duplicates, the interaction interface might be conserved, hence a heterodimer arises [276]. Such a scenario could apply to Ska-C because the interaction interfaces (the subunits’ N-terminal coiled-coils [6]) overlap with the homologous regions in at least Ska1 and Ska2. Since Ska3 also interacts with the other subunits via an N-terminal coiled-coil, we asked if the N-termini of Ska1 and Ska2 are homologous to that of Ska3. In support of this, the Ska3 profile hits Ska1 sequences in their N-terminus. Hence we hypothesize that the Ska-C subunits are homologous along their full lengths (Figure 2A, striped area indicates hit with human Ska1, Supplementary Figure S1). For Dam1-C, the interaction interfaces are less well specified, because no crystal structure is available (Figure 2B,C) [19]. Similar to Ska-C, the homologous regions overlapped with (predicted) coiled-coil regions, suggesting this structure is an ancient and important feature of both complexes (see Supplementary Text).

Figure 2. Homologous regions among Ska-C and Dam1-C subunits. Regions of homology, coiled-coils, microtubule binding (‘MT-binding’) and pairwise interactions in the homologous clusters Ska1-Ska2-Ska3 (A), Duo1-Dad2 (B) and Dad1-Dad4-Ask1 (C). For the Ska-C subunits, the coiled-coil and interaction regions are based on the structure of this complex in H. sapiens [6]; for the Dam1-C subunits the coiled-coil regions are based on predictions and the interaction sites were derived from published cross-linking/mass spectrometry analyses in S. cerevisiae [19]. Although not found by profile-profile searches, the red striped region in (A) is proposed to be homologous to the Ska1/2 coiled-coil region based on structural similarities.

For the three homologous clusters, we estimated gene trees in an attempt to elucidate the evolutionary histories of Dam1-C and Ska-C. We generated multiple sequence alignments of the combined orthologs of a homology cluster (Ska1-Ska2-Ska3, Duo1-Dad2, Dad1-Dad4-Ask1, Supplementary Figure S1) to build the trees. The trees clearly separate the orthologous groups containing the Dam1-C/Ska-C subunits (Figure 3). Hence, the gene phylogenies confirm the phylogenetic profiles of the subunits (Figure 1B). In these phylogenies, the duplication node that unites the different orthologous groups indicates the origin of the subunits. Since all orthologous groups contain sequences from a wide range of species, and no pre-duplication sequences seem to exist, the duplications preceded the propagation of the complexes. In both the Ska-C and the Dam1-C trees many sequences have positions incongruent with the species phylogeny, which could indicate HGTs (Figure 3, Supplementary Figure S2). However, since nodes uniting sequences from unrelated lineages have low support values and since the topologies of the subunit clusters differ within a complex, these gene trees do not provide sufficient evidence for HGT of either Ska-C or Dam1-C. Apparently, the protein sequences of these subunits contained too little information to uncover their evolution. This lack of information is likely caused by the sequences diverging rapidly, and for Dam1-C subunits also by their short lengths. To increase the information content, we also made gene phylogenies from the concatenated alignments of Dam1-C and Ska-C subunits, assuming that for each complex the subunits evolved as a single evolutionary unit (Supplementary Figure S3, Materials and Methods, Supplementary Text). These trees were better supported and more congruent with the tree of life. They did not, however, allow for the identification of the transmission mechanism because there is no information to decide where these trees should be rooted.

Figure 3. Gene trees of Ska-C and Dam1-C homologous subunits. Maximum-likelihood gene trees of the combined orthologs of Ska1, Ska2 and Ska3 (A), Duo1 and Dad2 (B) and Dad1, Dad4 and Ask1 (C). Triangles denote a collection of branches from the same eukaryotic supergroup.
[Figure graphic not reproduced. Legend: tree scale 1; eukaryotic supergroups Opisthokonta, Amoebozoa, Excavata, SAR, Archaeplastida; T. trahens / G. theta / internal branches; bootstrap support ≥ 70 and ≥ 50.]

Comparing two evolutionary scenarios: ‘both in LECA’ versus ‘HGT of Dam1-C’

In an attempt to explain the inverse presences of Ska-C and Dam1-C in eukaryotes, we compared two evolutionary scenarios to assess which is more parsimonious. One scenario poses that LECA contained both Dam1-C and Ska-C and that no HGT events occurred, while another poses that one of the complexes was invented after LECA and spread to other eukaryotic clades by HGT. We do not consider a ‘both novel’ scenario, because we assume that LECA had a microtubule-tracking complex to enable microtubule-based chromosome segregation. The first scenario (‘both in LECA’) involves both Ska-C and Dam1-C being invented in the lineage leading to LECA, partially via the duplications reported above (Figure 4A).

In the second scenario (‘HGT of Dam1-C’) we favor Dam1-C being invented post-LECA rather than Ska-C, since Dam1-C is present in fewer species (47 vs. 35 in a database enriched for Dam1-C-containing species, 47 vs. 27 in a ‘backbone’ database representing eukaryotic diversity; see Materials and Methods) and in fewer supergroups (5 vs. 3) compared to Ska-C. For this ‘HGT of Dam1-C’ scenario we specifically propose that Dam1-C was invented in a fungal ancestor, because this complex is most ubiquitous in fungi, and that it subsequently was horizontally transferred towards SAR, Ichthyosporea, the lineage of Capsaspora owczarzaki and Rhodophyta (Figure 4B). Dam1-C in Guillardia theta might be derived from this species’ secondary endosymbiont, a red alga [277]. Please note that we here assume that all Dam1-C subunits were transferred together, as a single event, which we discuss in more detail below. Of course, when allowing for HGT many alternative scenarios can be envisioned (e.g. transfer of Dam1-C from the SAR group to the Fungi, or HGT of Ska-C, or combinations thereof), but for reasons of feasibility we here only examined one.

We compared this ‘HGT of Dam1-C’ scenario to the ‘both in LECA’ scenario. In the latter scenario, 26% of the ancestors (the internal nodes in the species tree in Figure 1B) would have had both Dam1-C and Ska-C, compared to 7% of current-day species. In the ‘HGT of Dam1-C’ scenario, only 14% of the ancestors would have had both complexes. We thus conclude that this scenario is more parsimonious in relation to the observed inverse presences of the complexes. In addition, while both scenarios entail 25 losses of Ska-C, ‘both in LECA’ infers 13 losses of Dam1-C whereas ‘HGT of Dam1-C’ infers only 4. Indeed, other HGT scenarios would likewise reduce the number of ancestral co-occurrences and losses relative to ‘both in LECA’.

Figure 4. Two evolutionary scenarios for Ska-C and Dam1-C. Pruned species phylogeny of the eukaryotes in Figure 1. Depicted are the ‘both in LECA’ (A) and ‘HGT of Dam1-C’ (B) scenarios, including invention (solid arrows), HGT (dashed arrows), conservation and loss of Ska-C and Dam1-C. Recent, lineage-specific losses are not shown.

The ‘HGT of Dam1-C’ scenario is also more likely than the ‘both in LECA’ scenario when considering their respective implications for the complexes’ functions. For Ska-C and Dam1-C to have co-existed in LECA and for long periods thereafter, their functions should most likely have been non-redundant. One complex might have had a non-kinetochore function or the complexes fulfilled the same kinetochore function in different life cycle stages. Subsequently, during post-LECA evolution, their functions became redundant in multiple independent lineages. One of the complexes would have recurrently taken over all ancestral functions previously performed by the distinct complexes, due to which the other complex was lost. In some lineages, the dominant complex would become Ska-C, while in others it would be Dam1-C. In other words, Dam1-C and Ska-C evolved towards each other functionally, and this convergent evolution should have occurred in a parallel fashion in most eukaryotic lineages, with the exception of those encompassing species that still contain both complexes. Moreover, this scenario suggests that Dam1-C and/or Ska-C has a secondary, yet unknown “moonlighting” function.

In our ‘HGT of Dam1-C’ scenario, Ska-C had a single microtubule-tracking function in LECA. Dam1-C, functionally analogous to Ska-C, was invented in an early fungal ancestor, so Ska-C was lost due to redundancy. Likewise, after Dam1-C was horizontally transferred to other eukaryotic lineages, it displaced Ska-C in these lineages, with the exception of some species that still contain both complexes. In those lineages, the complexes might have differentiated their functions recently, or one of the complexes might actually be in the process of being displaced at present.

In summary, while both scenarios present unique and complex evolutionary trajectories, we think the ‘both in LECA’ scenario is less likely given that it requires functions of two different complexes to converge, and to do so in an alternating (either Dam1-C or Ska-C ‘took over’) and independent manner in different lineages. Applying a similar reasoning, eukaryote-to-eukaryote HGT was proposed in the case of the two inversely present translation elongation factors EFL and eEF-1α [275].

Potential mechanisms and drivers of Dam1-C HGT

Various mechanisms for eukaryotic HGT have been proposed, for example direct transformation, transfer via viral vectors or transposable elements, or transfer via endosymbionts [278]. The latter might have played a role in HGT of Dam1-C to G. theta, as it contains a plastid derived from a secondary endosymbiosis of a red alga [277]. As for other Dam1-C-containing species: many have a ‘fungi-like’, osmotrophic lifestyle, viz. oomycetes and Hyphochytrium catenoides, the Labyrinthulomycota, Plasmodiophora brassicae (Rhizaria) and the Ichthyosporea (Figure 1, Figure 4). Osmotrophy might facilitate HGT, and moreover, some of these species form hyphae or other filamentous structures, which may fuse (anastomosis) and thereby might mediate HGT [279]. Moreover, a shared lifestyle makes the donor and recipient species more likely to occupy similar niches and hence to physically co-localize. Interestingly, HGTs from fungi to oomycetes have been reported, and these occurred after oomycetes acquired osmotrophy and hyphae formation [140]. Regardless of the exact mechanism, if HGT of Dam1-C (or Ska-C) occurred, it likely occurred to all subunit-encoding genes simultaneously: a single HGT event minimizes the number of HGT events and increases the probability that the genes are retained in the recipient species. A single HGT could have been accommodated by endosymbiosis or by genomic clustering of the subunits. In fungi, genomic clusters of functionally related genes exist, for example in secondary metabolism pathways, and some of these indeed have been horizontally transferred (reviewed in [279]).

What could have driven displacement of Ska-C by Dam1-C? Although they are considered analogous in their kinetochore function, they might differ slightly in their mechanisms of action (e.g. microtubule-tracking features, kinetochore localization, regulation). Such differences may have caused a preference for Dam1-C over Ska-C and vice versa in certain lineages. Maybe the common osmotrophic lifestyle shared by various Dam1-C containing species not only facilitated HGT, but also favored certain mechanistic alterations to the mitotic machinery. Studying mitosis in such species might yield a common theme that helps to explain the striking patterns of occurrence of Dam1-C versus Ska-C in eukaryotic species.

Materials and Methods

Compiling the proteome database

For studying the presences and absences of subunits of Dam1-C, Ska-C and of Ndc80 across the eukaryotic tree of life, we compiled a backbone database containing the protein sequences of 94 eukaryotic species. These species were selected in order to represent eukaryotic diversity. In order to avoid adding proteomes that are relatively incomplete (lacking many genes that were erroneously not predicted), which could lead to false absences in our ortholog detection, we assessed the completeness of candidate proteomes by the percentage of core KOGs present (248 core eukaryotic orthologous groups [263]). If multiple annotations of the genome of a given species were available, we chose the annotation containing the highest number of KOGs. This also applies to situations in which multiple strains of a given species are sequenced. After initial searches for orthologs in the UniProtKB database [280], this proteome database was supplemented with seven other species’ proteomes putatively having orthologs of Dam1-C subunits, in order to facilitate phylogenetic analyses (and later with H. catenoides homologs of the proteins of interest, for which we did not include the full predicted proteome; see below). The versions and sources of the selected proteomes can be found in Supplementary Table S4.
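The annotation choice described above can be sketched as follows. This is a minimal illustration, assuming that each candidate annotation has already been reduced to the set of core KOG identifiers detected in it; the identifiers, annotation names and function names are hypothetical, and the toy core set has four members instead of 248.

```python
def completeness(detected_kogs, core_kogs):
    """Percentage of core KOGs present in one annotation."""
    return 100.0 * len(set(detected_kogs) & set(core_kogs)) / len(core_kogs)

def pick_annotation(annotations, core_kogs):
    """Return (name, percentage) of the annotation covering most core KOGs.

    Ties are broken alphabetically by name, an arbitrary choice made here.
    """
    scored = sorted((-completeness(kogs, core_kogs), name)
                    for name, kogs in annotations.items())
    best_negative_score, best_name = scored[0]
    return best_name, -best_negative_score

# Hypothetical toy input.
core = {"KOG0001", "KOG0002", "KOG0003", "KOG0004"}
candidates = {
    "annotation_v1": {"KOG0001", "KOG0002"},
    "annotation_v2": {"KOG0001", "KOG0002", "KOG0003", "KOG9999"},
}
best = pick_annotation(candidates, core)
```

KOGs outside the core set (here `KOG9999`) do not contribute to the score, so a larger but less complete annotation is not favored.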

Ortholog detection

To find orthologs of Dam1-C subunits, Ska-C subunits and Ndc80, we started homology searches with protein sequences from S. cerevisiae (Dam1-C subunits) and H. sapiens (Ska-C subunits, Ndc80) using blastp online [264] with the non-redundant protein sequences (nr) as a database. We aligned the resulting sequences with MAFFT [265] (version v7.149b, option linsi, used for all other multiple sequence alignments in this study), and constructed a profile HMM. This HMM was used to initially check our local database for homologs, and it was submitted to jackhmmer online [165] versus UniProtKB [280]. Based on these results, interesting putative Dam1-C-containing species were added to our local database. Moreover, interesting hits from novel taxa, such as early-branching fungi and non-fungal lineages for Dam1-C subunits or plants for Ska-C subunits, were selected to serve as query sequences for reciprocal homology searches, using either jackhmmer or psi-blast. The combined results of these homology searches were aligned to generate another profile HMM, which was used to create the initial set of orthologous sequences in the local proteome database. This HMM was required to converge on this initial set of orthologous sequences: when an HMM profile is made from the obtained initial set, this second HMM should hit the sequences it was constructed from. This set was expanded by blastp searches versus the predicted genome of H. catenoides (to which we were kindly provided access by Thomas Richards, University of Exeter), using an oomycete query sequence. After addition of homologs not present in the predicted proteome (but present on the DNA; see ‘Gene prediction of putative homologs’), the HMM derived from this orthologous set was again used to search the local database, thereby confirming convergence of the orthologous set. Moreover, for proteins for which we already observed that non-orthologous sequences were hit (e.g. Ska3 sequences by Ska1; these proteins correspond to the homologous clusters in Figure 2), indicating paralogy, we confirmed the orthologous groups by generating gene trees of the multiple sequence alignments of the combined orthologous sequences. The alignments were trimmed using trimAl [266] with variable gt settings. RAxML was used to build the maximum-likelihood gene tree [281] (version 8.0.20, automatic model detection with GAMMA model of rate heterogeneity, rapid bootstrap analysis of 100 replicates; settings used throughout this study). Sequences of the orthologous groups can be found in Supplementary Files S1-14, in which newly predicted genes are labeled ‘_p’.

Gene prediction of putative homologs

To avoid false negatives due to improper gene prediction, we scanned the translated DNA sequences of the genomes with spurious absences. These spurious absences were selected based on the presence of the complex of interest (Ska-C: ≥1 subunits, Dam1-C: ≥3 subunits), except for Ndc80, for which we checked all absences. In these cases, the profile HMM of the orthologous set was used to search against the translated DNA sequences. If a hit was found in the DNA sequence, this hit was verified by searching with the hit region in the nr database using blastp. After confirmation, the corresponding gene was predicted by selecting the region (-5000 bp, +5000 bp) neighboring the hit and submitting this region to the AUGUSTUS web interface [282] (multiple runs with various trained species, both strands, alternative transcripts: middle). In a few cases, no gene was predicted in the hit region, and we added the translated hit region to the orthologous group. In other cases the protein sequence of the predicted gene was added. This approach returned 24 additional homologs of Dam1-C subunits, one Ska-C subunit homolog and one Ndc80 homolog.
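Selecting the region around a hit can be sketched as a small coordinate calculation. The sketch assumes 1-based inclusive coordinates and clamps the window to the contig boundaries; both conventions, as well as the function name, are assumptions made here for illustration and are not stated in the text.

```python
def flanking_region(hit_start, hit_end, contig_length, flank=5000):
    """Return (start, end) of the hit extended by `flank` bp on both sides.

    Coordinates are 1-based and inclusive (assumed). Hits reported on the
    reverse strand (start > end) are normalized first, and the window is
    clamped so that it never runs off the contig (assumed behavior).
    """
    if hit_start > hit_end:
        hit_start, hit_end = hit_end, hit_start
    if hit_start < 1 or hit_end > contig_length:
        raise ValueError("hit lies outside the contig")
    start = max(1, hit_start - flank)
    end = min(contig_length, hit_end + flank)
    return start, end

# Hypothetical coordinates.
window = flanking_region(12000, 12600, 100000)
```

The extracted window, rather than the whole contig, would then be the input for gene prediction.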

Calculating correlations between phylogenetic profiles

For the Dam1-C subunits, Ska-C subunits and Ndc80 we derived a phylogenetic profile (presences and absences) across our set of 102 eukaryotic genomes (genomes in Supplementary Table S4 + H. catenoides). For each protein, this results in a string containing a “1” if it is present in a particular species (either single- or multi-copy), and a “0” if it is absent. For each possible pair of proteins, we measured to what extent the profiles correlate using the Pearson correlation coefficient [4]. The correlation coefficients were converted into distances (d = 1-r) and the proteins were clustered based on their phylogenetic profiles using average linkage.
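The calculation described above can be sketched in plain Python. The profiles below are invented eight-species toy vectors, not the real 102-species profiles, and the function names are hypothetical; the clustering is a straightforward average-linkage implementation written for clarity rather than speed.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long 0/1 profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined for a profile that is all 0 or all 1")
    return cov / sqrt(var_x * var_y)

def average_linkage(profiles):
    """Cluster proteins on d = 1 - r with average linkage.

    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    names = sorted(profiles)
    dist = {frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

    def between(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        d, i, j = min((between(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical toy profiles over eight species (1 = present, 0 = absent).
toy_profiles = {
    "Ska1": [1, 1, 1, 0, 0, 0, 1, 0],
    "Ska2": [1, 1, 1, 0, 0, 0, 0, 0],
    "Dam1": [0, 0, 0, 1, 1, 1, 0, 0],
    "Duo1": [0, 0, 0, 1, 1, 0, 0, 0],
}
merges = average_linkage(toy_profiles)
```

In this toy example the two Ska-C subunits and the two Dam1-C subunits each merge first, and the two pairs are joined last at a distance above 1, which corresponds to a negative correlation between the complexes.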

Detecting distant homologs using profile-profile searches

In order to detect distantly related homologs of the Dam1-C and Ska-C subunits, HMM-HMM searches were performed using PRC [21]. As input, the profile HMMs of the Dam1-C and Ska-C subunit orthologous groups in our local database were used, derived from the trimmed (gt 0.1) multiple sequence alignments. The search database consisted of Pfam version 29.0 [34]. Standard options for PRC were used, except for the maximum E-value (set to 100). For inferring homology between subunits of the same or the alternative complex, the search database was enriched with the query HMMs. We considered two subunits to be homologous if 1) they are each other’s best hit (or if there are no intervening hits except for within the same complex) and 2) the hit has an E-value < 10. Although the second criterion is usually considered to be too inclusive, hence yielding false positives, because of the first criterion and because of the apparent rapid sequence evolution of the subunits, we think it is appropriate here.
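The two homology criteria can be expressed as a short sketch. It assumes that the search output has been parsed into a hit list per query; this input format, the function names and most E-values below are hypothetical (only the insignificant Ska2/Ska3 values of 13 and 22 are taken from the text). Requiring the E-value cutoff in both search directions is an interpretation made here.

```python
def passes(query, target, hits, complex_of, max_evalue=10.0):
    """One search direction: `target` is the best hit of `query`, ignoring
    better-scoring hits to subunits of the query's own complex, and the hit
    has an E-value below `max_evalue`."""
    ranked = sorted(hits.get(query, []), key=lambda hit: hit[1])
    for name, evalue in ranked:
        if name == target:
            return evalue < max_evalue
        if complex_of.get(name) != complex_of.get(query):
            return False  # a family outside the complex ranks higher
    return False  # target was not hit at all

def homologous(a, b, hits, complex_of, max_evalue=10.0):
    """Both criteria, applied reciprocally (assumed)."""
    return (passes(a, b, hits, complex_of, max_evalue)
            and passes(b, a, hits, complex_of, max_evalue))

complex_of = {"Ska1": "Ska-C", "Ska2": "Ska-C", "Ska3": "Ska-C",
              "Dad1": "Dam1-C", "Dad4": "Dam1-C"}
# Hypothetical hit lists: query -> [(hit name, E-value), ...]
hits = {
    "Ska1": [("Ska2", 0.5), ("Ska3", 2.0), ("hypothetical_family", 40.0)],
    "Ska2": [("Ska1", 0.8), ("Ska3", 13.0)],
    "Ska3": [("Ska1", 3.0), ("Ska2", 22.0)],
    "Dad1": [("hypothetical_family", 1.0), ("Dad4", 2.0)],
    "Dad4": [("Dad1", 2.5)],
}
```

With these toy values, Ska1 is called homologous to both Ska2 and Ska3, whereas Ska2 and Ska3 fail the E-value criterion directly; their homology then follows only by transitivity, as argued in the main text.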

The homologous regions in Figure 2 represent the hit regions within the respective profiles. Additional data were projected onto the illustrations of the protein HMMs. The microtubule-interacting regions were based on studies in human [6] and budding yeast [19]. The coiled-coil regions were based on structural information of the human Ska-C [6] and on predictions for the budding yeast sequences using Pcoils [283] (input is alignment of orthologous sequences, settings: apply weighting, MTIDK matrix, probability > 0.5, window size 28). The interacting residues were based on the complex structure of the human Ska-C [6] and on cross-linking residues in Dam1-C [19].

Phylogenetic analyses

The identification of homology between various subunits of Dam1-C and Ska-C allowed for the construction of multiple sequence alignments of all homologous sequences consisting of multiple orthologous groups. For the well-supported homologous clusters Ska1-Ska2-Ska3, Duo1-Dad2 and Dad1-Dad4-Ask1, we aligned the sequences of the combined orthologous groups per cluster, and trimmed these alignments using trimAl [266] (gt 0.7, 0.7, 0.3, respectively), keeping only the homologous regions. From these regions, gene phylogenies were inferred. In addition, multiple sequence alignments were derived for each orthologous group separately, selecting only sequences from species having a certain complex (Ska-C: ≥1 subunits, Dam1-C: ≥3 subunits). If a species had multiple copies of a given orthologous group, one was randomly chosen, given that these are all recent duplicates and showed little divergence. The resulting alignments were concatenated, resulting in a single sequence per Dam1-C- or Ska-C-containing species. For Dam1-C, the Spc19 subunit was excluded because of its limited phylogenetic profile. The concatenated alignments were trimmed (gt 0.3 for Ska-C, gt 0.5 for Dam1-C) and the complex phylogenies were made. The resulting topologies of the maximum-likelihood phylogenies were tested for the significance of their likelihoods compared to the species phylogeny, a pruned version of Figure 1, using the SH-test as recommended [284] and as implemented in IQ-TREE [285].
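The concatenation step can be sketched as follows. The sketch assumes that each per-subunit alignment is available as a mapping from species to its aligned copies; filling a missing subunit with gap characters is an assumption made here, since the text does not state how missing subunits were handled. Sequences, species labels and the function name are hypothetical.

```python
import random

def concatenate(alignments, species, seed=0):
    """Concatenate per-subunit alignments into one sequence per species.

    alignments: subunit -> {species: [aligned copy, ...]}, with all copies of
    one subunit having the same aligned length. If a species has several
    copies, one is chosen at random; if it has none, the block is filled
    with gaps (assumed handling).
    """
    rng = random.Random(seed)
    parts = {sp: [] for sp in species}
    for subunit in sorted(alignments):
        per_species = alignments[subunit]
        widths = {len(seq) for copies in per_species.values() for seq in copies}
        if len(widths) != 1:
            raise ValueError(f"{subunit}: aligned sequences differ in length")
        width = widths.pop()
        for sp in species:
            copies = per_species.get(sp, [])
            parts[sp].append(rng.choice(copies) if copies else "-" * width)
    return {sp: "".join(chunks) for sp, chunks in parts.items()}

# Hypothetical toy alignments.
toy_alignments = {
    "Ska1": {"sp1": ["MKT-"], "sp2": ["MRTA", "MRTS"]},  # sp2 has two copies
    "Ska2": {"sp1": ["LL"]},                             # sp2 lacks Ska2
}
supermatrix = concatenate(toy_alignments, ["sp1", "sp2"])
```

The resulting supermatrix has one row per species and could then be trimmed and used for tree inference as described above.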

Inferring ancestral states

For Dam1-C and Ska-C, we inferred the evolutionary histories along the species phylogeny in Figure 1 by applying Dollo parsimony, which allows for a single invention only. As input, the phylogenetic profiles of the full complexes (Ska-C: ≥1 subunits, Dam1-C: ≥3 subunits) in current-day species were taken. All internal nodes were labeled by their inferred status (having or lacking Dam1-C and Ska-C). From these labels, the co-occurrence of the complexes in the internal nodes was calculated. This procedure was repeated for the alternative scenario, in which internal nodes were labeled in a parsimonious manner except for six instances of Dam1-C invention, which indicate the proposed HGTs.
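A minimal sketch of single-gain Dollo labeling and the co-occurrence calculation is given below. It assumes a rooted tree written as nested tuples with leaf names as strings; the six-leaf tree, the leaf states and the function names are hypothetical. Under Dollo parsimony the single gain is placed at the last common ancestor of all leaves carrying the complex, and an internal node within that clade is labeled present if at least one of its descendant leaves carries the complex.

```python
def leaves(node):
    """Leaf names under a node; a leaf is a string, an internal node a tuple."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node))

def dollo_states(tree, present_leaves):
    """Label every internal node (keyed by its leaf set) as present/absent."""
    present = set(present_leaves) & leaves(tree)
    states = {}

    def walk(node, inside_gain_clade):
        if isinstance(node, str):
            return
        below = leaves(node)
        # The gain node is the smallest clade containing all present leaves.
        if (not inside_gain_clade and len(present) > 1 and present <= below
                and not any(present <= leaves(child) for child in node)):
            inside_gain_clade = True
        states[frozenset(below)] = inside_gain_clade and bool(present & below)
        for child in node:
            walk(child, inside_gain_clade)

    walk(tree, False)
    return states

def cooccurrence(tree, ska_leaves, dam1_leaves):
    """Fraction of internal nodes inferred to have both complexes."""
    ska = dollo_states(tree, ska_leaves)
    dam1 = dollo_states(tree, dam1_leaves)
    both = sum(1 for node in ska if ska[node] and dam1[node])
    return both / len(ska)

# Hypothetical six-species tree and leaf states.
toy_tree = ((("a", "b"), ("c", "d")), ("e", "f"))
fraction = cooccurrence(toy_tree, {"a", "b", "e"}, {"b", "c"})
```

In the toy tree, two of the five internal nodes are inferred to carry both complexes. A scenario with several independent gains, such as the ‘HGT of Dam1-C’ scenario, would require labeling each gained clade separately, which this sketch does not do.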

Acknowledgments

The authors thank Leny van Wijk and John van Dam for providing the eukaryotic proteome set. We thank Thomas Richards for granting permission to use the genome data of H. catenoides and Eelco Tromer for extensive discussions. This work was supported by the UMC Utrecht and is part of the VICI research programme with project number 016.160.638, which is (partly) financed by the Netherlands Organisation for Scientific Research (NWO).

Author contributions

JH defined orthologs and performed phylogenetic analyses. BS and GK conceived and managed the project. JH, BS and GK wrote the manuscript.

Supplementary Text

Uncertainties in the orthologous groups and rapid evolution of Dam1-C and Ska-C subunits

Although we found homologs of Dam1-C and Ska-C subunits in a wider range of species than previous studies did, we cannot exclude the possibility that some of the absences in Figure 1 are ‘false negatives’. Finding homologs for these families is difficult due to their rapid sequence evolution (Dam1-C and Ska-C subunits) and their short sequences (Dam1-C subunits only). This rapid sequence evolution is illustrated, for example, by the low sequence identity between the mouse and human orthologs, which is only 63% for Ska3. Why these microtubule-tracking complexes seem to allow for a high degree of divergence is unclear, although this has also been observed for other kinetochore proteins [212, 286, 287]. False negatives in our dataset might furthermore be caused by incomplete genome assemblies and by incomplete gene prediction. We circumvented the latter by predicting genes ourselves in case of spurious absences (see Materials and Methods).

Besides potential false negatives, there might also be false positives in our dataset. The (predicted) coiled-coil regions are obvious candidates for erroneous homology detection (because they might evolve convergently, see below). However, we think this is unlikely to have substantially affected our Dam1-C and Ska-C subunit orthologous groups, because the resulting phylogenetic profiles are highly similar within a complex. If the accuracy of our homology searching were low, this would have resulted in much more diverse and phylogenetically less restricted phylogenetic profiles.

The incidence of strongly anticorrelating phylogenetic profiles in a genome-wide screen

If two genes occur across species in an inverse fashion, where a species often has one of the genes but hardly any species has both, this suggests functional redundancy of the proteins they encode [132]. We observed an apparently strong anticorrelation between the subunits of Ska-C and the subunits of Dam1-C, and asked how unique such anticorrelations are across a larger set of genes. We therefore generated orthologous groups including all genes present in our local eukaryotic genome database using PANTHER [272]. To be able to compare the results to Ska-C and Dam1-C, we only considered genes that, based on their occurrence across the eukaryotic tree of life, were inferred to have been present in the last eukaryotic common ancestor (LECA). More specifically, this requires that they were traced back to LECA based on Dollo parsimony and that they were present in at least three out of five eukaryotic supergroups. For each possible gene pair, we calculated the Pearson correlation coefficient. We found that 1.6% of these pairs have a correlation that is at least as negative as that of the average Ska-C – Dam1-C pair (6.8% for the least negative Ska-C – Dam1-C pair, 0.5% for the most negative Ska-C – Dam1-C pair). We conclude that Ska-C and Dam1-C anticorrelate strongly, also relative to other gene pairs. The other inferred LECA gene pairs that strongly anticorrelate present an interesting database of candidate analogous gene pairs.
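As an illustration of this screen, the sketch below computes Pearson correlation coefficients between presence/absence profiles and the fraction of gene pairs that anticorrelate at least as strongly as a reference value. The function names, the toy profiles and the reference value of -0.5 are hypothetical and serve only to show the calculation; they are not taken from our dataset.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List


def pearson(x: List[int], y: List[int]) -> float:
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def fraction_at_least_as_negative(profiles: Dict[str, List[int]], reference: float) -> float:
    """Fraction of all gene pairs with a correlation <= `reference`."""
    pairs = list(combinations(profiles, 2))
    hits = sum(pearson(profiles[a], profiles[b]) <= reference for a, b in pairs)
    return hits / len(pairs)


# Toy profiles over four species (1 = present, 0 = absent); hypothetical data.
toy_profiles = {
    "geneA": [1, 1, 0, 0],
    "geneB": [0, 0, 1, 1],
    "geneC": [1, 1, 1, 0],
    "geneD": [1, 0, 1, 0],
}
r_ab = pearson(toy_profiles["geneA"], toy_profiles["geneB"])
fraction = fraction_at_least_as_negative(toy_profiles, reference=-0.5)
```

In the toy data, geneA and geneB are perfectly anticorrelated, and two of the six possible pairs have a correlation of -0.5 or lower.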

Potential homologous relationships besides the intracomplex homologous clusters and ambiguity of coiled-coils

Ancient gene duplications played an important role in the evolution of Dam1-C and Ska-C, as revealed by our profile-profile searches. The duplications do not, however, fully explain the origin of the complexes. Possibly, the microtubule-binding C-terminus of Dam1 was derived from a bacterial helix-turn-helix protein, as supported by the highly significant hit between profiles of these proteins (E-value = 2.2e-05, Supplementary Table S2). When relaxing the strict E-value, best-hit and reciprocality criteria, we observe additional links within Dam1-C, for example between Dad2 and Ask1. Possibly, this hints at a larger homologous cluster comprising the two Dam1-C clusters (Figure 2B,C). Moreover, loosening the criteria also revealed ambiguous hits between Dam1-C and Ska-C subunits. For example, the N-terminus of Dam1 was hit insignificantly by Ska1 (Supplementary Table S3), but also by the EAP30/Vps36 protein family, which functions in transcription and in Golgi-to-endosome trafficking. We therefore cannot exclude that Dam1-C and Ska-C are distant homologs, but given the predicted coiled-coil structure of the N-terminus of Dam1 (positions 50 – 97 in the S. cerevisiae sequence), these insignificant hits between Dam1 and Ska1 could also have resulted from convergent sequence evolution [187, 188]. Although coiled-coils are prone to false positive homology prediction due to the possibility of convergent sequence evolution, we think that the intra-complex homology is trustworthy because A) the subunits score highest among each other (reciprocal best hits) and B) complexes often consist of homologous proteins [276]. Functionally, coiled-coil regions could serve to mediate intra-complex protein interactions, as established for Ska-C [6], or they might facilitate interactions with other kinetochore complexes and/or motor proteins.
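The reciprocal best hit criterion referred to above can be made explicit with a short sketch. The function name, the E-value cutoff of 1e-3 and the toy E-values are assumptions made for illustration only; they do not reproduce the values in Supplementary Table S3.

```python
from typing import Dict, FrozenSet, Set


def reciprocal_best_hits(
    evalues: Dict[str, Dict[str, float]], max_evalue: float
) -> Set[FrozenSet[str]]:
    """Return pairs of profiles that are each other's best significant hit.

    `evalues[query][target]` is the E-value of `target` when searching with `query`.
    """
    best: Dict[str, str] = {}
    for query, hits in evalues.items():
        # Ignore self-hits and hits that do not pass the (assumed) cutoff.
        significant = {t: e for t, e in hits.items() if t != query and e <= max_evalue}
        if significant:
            best[query] = min(significant, key=significant.get)
    return {
        frozenset((query, target))
        for query, target in best.items()
        if best.get(target) == query
    }


# Toy profile-versus-profile results (hypothetical E-values).
toy_evalues = {
    "P1": {"P2": 1e-8, "P3": 1e-5},
    "P2": {"P1": 1e-8, "P3": 13.0},
    "P3": {"P1": 3e-5, "P2": 22.0},
}
rbh = reciprocal_best_hits(toy_evalues, max_evalue=1e-3)
```

Here P3 hits P1 significantly, but P1 scores better against P2, so only the pair P1 and P2 qualifies. Relaxing the reciprocality requirement would retain the link between P3 and P1 as well, which corresponds to the additional, less certain links discussed above.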

Increasing sensitivity of phylogenetic inferences by making full complex phylogenetic trees

In addition to forming a challenge for homology detection, rapid sequence evolution (or low conservation) also affects the reliability of the multiple sequence alignments and possibly the adequacy of the evolutionary models that are used to construct gene phylogenies. These uncertainties are reflected by the low bootstrap supports of these gene phylogenies (Figure 3, Supplementary Figure S2), which make them difficult to interpret in the light of the evolutionary histories of the complexes. Hence, from these phylogenies we could not confidently infer HGT. In an attempt to solve this problem, we increased the information content by using the combined evolutionary signal of all subunit sequences for Dam1-C and Ska-C (see Materials and Methods) and constructing a phylogeny for each of the two complexes. More specifically, we concatenated the multiple sequence alignments of the subunits of each complex per species and inferred phylogenies from these. This approach was applied previously, albeit with a phylogenetically limited species set [274]. Notably, it assumes that the individual subunits of a complex have the same evolutionary history. Based on the strongly similar phylogenetic profiles (Figure 1), we consider this a legitimate assumption. Although the resulting phylogenies cannot be rooted, they might expose strong affiliations between complexes from unrelated species and thereby indicate HGT. The two resulting phylogenies are more in line with the species phylogeny (Supplementary Figure S3), as illustrated, for example, by R. irregularis now clustering with other fungal Ska-C sequences with intermediate bootstrap support (68). Moreover, Dam1-C sequences from the SAR supergroup form a monophyletic cluster in this gene tree of the full complex, while in the subunit trees their sequences were often polyphyletic.
We tested the likelihood of these gene phylogenies compared to the (pruned) species tree. The gene phylogenies made from the concatenated alignments show a significantly higher likelihood than the (pruned) species trees (SH-test as recommended by [284], p = 0.00 for Ska-C, p = 0.05 for Dam1-C). However, we do not think this rejects the possibility that both Ska-C and Dam1-C propagated through the eukaryotic tree of life via vertical transmission only, and not via any HGT event. Due to rapid sequence evolution, the underlying multiple sequence alignments might still contain uncertainties, leading to low bootstrap support for nodes uniting sequences from different eukaryotic supergroups. Nevertheless, the relatively well-supported clusters of eukaryotic supergroups indicate that if HGT had occurred, it should be timed in early eukaryotic evolution, before the radiation of these lineages. The complex trees do not support a scenario in which multiple parallel HGTs in recent ancestors are responsible for the phylogenetic profiles.

Supplementary Material

Supplementary Table S4 and Supplementary Files S1-14 can be found online: http://bioinformatics.bio.uu.nl/jolien/thesis/chapter4_unique_phylogenetic_distributions_ska_dam1_complexes/

Table S1. Pairwise Pearson correlation coefficients based on phylogenetic profiles

         Ska1   Ska2   Ska3   Dam1    Duo   Dad1   Dad2   Dad3   Dad4   Ask1   Hsk3  Spc19  Spc34  Ndc80
Ska1     1.00   0.76   0.81  -0.31  -0.33  -0.28  -0.38  -0.22  -0.32  -0.31  -0.38  -0.25  -0.33   0.26
Ska2     0.76   1.00   0.72  -0.31  -0.33  -0.33  -0.34  -0.27  -0.32  -0.31  -0.38  -0.25  -0.28   0.26
Ska3     0.81   0.72   1.00  -0.32  -0.29  -0.26  -0.37  -0.25  -0.34  -0.27  -0.29  -0.19  -0.23   0.21
Dam1    -0.31  -0.31  -0.32   1.00   0.76   0.75   0.82   0.82   0.91   0.91   0.71   0.55   0.78   0.14
Duo     -0.33  -0.33  -0.29   0.76   1.00   0.80   0.83   0.82   0.74   0.76   0.83   0.66   0.81   0.08
Dad1    -0.28  -0.33  -0.26   0.75   0.80   1.00   0.68   0.78   0.64   0.75   0.75   0.56   0.72   0.11
Dad2    -0.38  -0.34  -0.37   0.82   0.83   0.68   1.00   0.75   0.86   0.82   0.73   0.54   0.75   0.12
Dad3    -0.22  -0.27  -0.25   0.82   0.82   0.78   0.75   1.00   0.86   0.77   0.82   0.57   0.90   0.11
Dad4    -0.32  -0.32  -0.34   0.91   0.74   0.64   0.86   0.86   1.00   0.82   0.74   0.51   0.81   0.13
Ask1    -0.31  -0.31  -0.27   0.91   0.76   0.75   0.82   0.77   0.82   1.00   0.71   0.55   0.73   0.14
Hsk3    -0.38  -0.38  -0.29   0.71   0.83   0.75   0.73   0.82   0.74   0.71   1.00   0.73   0.86   0.08
Spc19   -0.25  -0.25  -0.19   0.55   0.66   0.56   0.54   0.57   0.51   0.55   0.73   1.00   0.60   0.02
Spc34   -0.33  -0.28  -0.23   0.78   0.81   0.72   0.75   0.90   0.81   0.73   0.86   0.60   1.00   0.10
Ndc80    0.26   0.26   0.21   0.14   0.08   0.11   0.12   0.11   0.13   0.14   0.08   0.02   0.10   1.00
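The Ska-C versus Dam1-C block of Table S1 can be summarized as in the following sketch, which extracts the average, the most negative and the least negative inter-complex correlation. We assume here that such summary values correspond to the reference values used for the genome-wide comparison in the Supplementary Text; the variable names are chosen for illustration.

```python
from statistics import mean

# Columns of Table S1 that belong to Dam1-C, in table order.
dam1_subunits = ["Dam1", "Duo", "Dad1", "Dad2", "Dad3", "Dad4", "Ask1", "Hsk3", "Spc19", "Spc34"]

# Correlations of each Ska-C subunit with the Dam1-C subunits, copied from Table S1.
ska_vs_dam1 = {
    "Ska1": [-0.31, -0.33, -0.28, -0.38, -0.22, -0.32, -0.31, -0.38, -0.25, -0.33],
    "Ska2": [-0.31, -0.33, -0.33, -0.34, -0.27, -0.32, -0.31, -0.38, -0.25, -0.28],
    "Ska3": [-0.32, -0.29, -0.26, -0.37, -0.25, -0.34, -0.27, -0.29, -0.19, -0.23],
}

# One entry per Ska-C / Dam1-C subunit pair.
pairs = {
    (ska, dam): r
    for ska, row in ska_vs_dam1.items()
    for dam, r in zip(dam1_subunits, row)
}

average_r = mean(pairs.values())
most_negative = min(pairs.values())   # strongest anticorrelation
least_negative = max(pairs.values())  # weakest anticorrelation
```

All 30 inter-complex correlations in the table are negative, ranging from -0.38 to -0.19.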

Table S2. Results of PRC searches comparing HMMs of Ska-C/Dam1-C subunits (queries) to Pfam 29.0

Query  Best non-self hit  E-value  Species             Additional information
Ska1   DUF3161            0.015    Eukaryotes          Ska2
Ska2   DUF1395            2.6e-18  Eukaryotes          Ska1
Ska3   DUF1395            0.00014  Eukaryotes          Ska1
Dam1   HTH-IclR           2.2e-05  Bacteria + Archaea  IclR helix-turn-helix domain
Duo    Pox_A3L            0.34     Viruses             Limited number of sequences
Dad1   DASH_Dad4          0.0052   Eukaryotes          Dad4
Dad2   DASH_Duo1          0.068    Eukaryotes          Duo
Dad3   DUF2360            0.18     Eukaryotes          WASH complex CCDC53
Dad4   DASH_Dad1          0.00089  Eukaryotes          Dad1
Ask1   SRA1               1.3      Eukaryotes          Steroid receptor RNA activator
Hsk3   ABC_sub_bind       4.2      Bacteria            ABC transporter substrate binder
Spc19  HTH_9              0.67     Eukaryotes          RNA polymerase III subunit helix-turn-helix domain
Spc34  Tn7_Tnp_TnsA_C     0.87     Bacteria            TnsA endonuclease C-terminal

Table S3. Pairwise results of PRC searches comparing HMMs of Ska-C/Dam1-C subunits (queries) to Pfam 29.0 + Ska-C/Dam1-C subunit HMMs.

Ska1 Ska2 Ska3 Dam1 Duo Dad1 Dad2 Dad3 Dad4 Ask1 Hsk3 Spc19 Spc34

Ska1 5.6e-08 1.5e-05 40 82

Ska2 1.3e-081 13 35

Ska3 2.9e-05 22 23 392

Dam1 22 43 36 21 11

Duo 6.6

Dad1 3.8e-05 0.18 21

Dad2 0.15 51 1.4 10

Dad3 25

Dad4 49 23 3.3e-05 46 0.3

Ask1 21 0.52 3.6 0.9

Hsk3 10 92 15 20 90

Spc19 48

Spc34 36 50

1 Bold: No higher-scoring other (not Dam1-C/Ska-C subunit) hits.
2 Grey: Hit is a Pfam domain instead of a manually curated profile.

[Figure S1: multiple sequence alignments, panels A-C, with Conservation, Quality and Consensus tracks. Legend: eukaryotic supergroup (Opisthokonta, Amoebozoa, Excavata, SAR, Archaeplastida).]

Figure S1. Multiple sequence alignments from homologous clusters. Related to Figure 2, Figure 3. Multiple sequence alignments from a subset of the sequences in the homologous clusters of Ska1-Ska2-Ska3 (A), Duo1-Dad2 (B) and Dad1-Dad4-Ask1 (C). The multiple sequence alignments were the ones used for inferring the gene phylogenies in Figure 3 and Supplementary Figure S2. From these alignments, the sequences of one species from each eukaryotic supergroup were selected for this figure, indicated by the four-letter code in the identifier of the sequence (for corresponding species, see Supplementary Table S4). For Opisthokonta, both a metazoan and a fungal species were selected.

[Figure S2: gene trees, panels A-C. Leaves are labeled with protein sequence identifiers; clusters are labeled Ska1, Ska2 and Ska3 (A), Duo1 and Dad2 (B) and Dad1, Dad4 and Ask1 (C). Legend: eukaryotic supergroup (Opisthokonta, Amoebozoa, Excavata, SAR, Archaeplastida; T. trahens / G. theta / internal branches); bootstrap support ≥ 70; bootstrap support ≥ 50.]

Figure S2. Gene trees of Ska-C and Dam1-C homologous subunits. Related to Figure 3. Maximum-likelihood gene trees of the combined orthologs of Ska1, Ska2 and Ska3 (A), Duo1 and Dad2 (B) and Dad1, Dad4 and Ask1 (C). The leaves contain the identifiers of the protein sequences (Supplementary Files S1-14). The first, four-letter part of the identifier corresponds to the abbreviation of the species in the eukaryotic proteome database (Supplementary Table S4).

[Figure S3: trees of the concatenated alignments, panels A (tree scale: 1) and B (tree scale: 0.1). Leaves are labeled with species names. Legend: eukaryotic supergroup (Opisthokonta, Amoebozoa, Excavata, SAR, Archaeplastida; T. trahens / G. theta / internal branches); bootstrap support ≥ 70; bootstrap support ≥ 50.]

Figure S3. Protein complex gene trees of Ska-C and Dam1-C. Related to Figure 3. Maximum-likelihood trees of the concatenated alignments of the Ska-C (A) and Dam1-C (B) subunits (excluding Spc19, because of its phylogenetically limited distribution). See Supplementary Text.


5 Mosaic origin of the eukaryotic kinetochore

Jolien JE van Hooff*, Eelco Tromer*, Berend Snel# and Geert JPL Kops#

* joint first authors # joint senior authors

Manuscript in preparation

Abstract

Chromosome segregation in eukaryotes is mediated by kinetochores, large multiprotein structures that assemble onto centromeres and bind to spindle microtubules. Like other eukaryotic characteristics, the kinetochore arose after eukaryotes diverged from their prokaryotic ancestors, raising the question of its origins. Using phylogenetic trees, profile-versus-profile similarity searches and structural information, we here show that the kinetochore of the last eukaryotic common ancestor (LECA) consisted of 52 proteins that are of mosaic origin. We find that various kinetochore proteins are homologous to proteins involved in other eukaryotic cellular systems, such as ubiquitination, chromatin regulation and intraflagellar transport. Other kinetochore proteins, like subunits of the Mis12 complex, have discernible homology only to other kinetochore proteins, suggesting they arose de novo before LECA and contribute solely to the kinetochore. Many kinetochore proteins are homologous to each other, which often resulted from intra-kinetochore gene duplications before LECA. We propose that there was no single ‘kinetochore template’, but that the primordial kinetochore recruited novel proteins as well as proteins from various (pre-)eukaryotic systems, after which some of these duplicated to give rise to the complex kinetochore of LECA.

Key words: kinetochore, mitosis, LECA, eukaryogenesis, duplication

Introduction

During cell division, eukaryotes divide their duplicated chromosomes over both daughter cells by means of a microtubule-based apparatus called the spindle. Central to this process are kinetochores: large multi-protein structures that are built upon centromeric DNA and that connect chromosomes to microtubules. Although species vary hugely in how exactly they coordinate and execute chromosome segregation [88, 89, 209, 288], all eukaryotes use a spindle, and therefore the last eukaryotic common ancestor (LECA, Figure 1A) likely used one as well. Consequently, LECA’s chromosomes, which most likely were linear, probably contained a centromere and assembled a kinetochore. While the centromeric DNA sequences of current-day eukaryotes are strikingly different between species and too diverse to reconstruct LECA’s centromeric DNA [94], their proteomes did allow for the inference of its kinetochore. In previous work, we found that this LECA kinetochore was a complex structure, consisting of at least 49 different proteins [14].

The LECA kinetochore was not directly derived from the prokaryotic ancestors of eukaryotes, because prokaryotes seem to have chromosome (and plasmid) segregation machineries that are highly different from the eukaryotic spindle [25-27] (Figure 1A). Like many other typical eukaryotic cellular systems, the complex LECA kinetochore must thus have originated after the first eukaryotic common ancestor (FECA) diverged from prokaryotes, in a process called ‘eukaryogenesis’. Between FECA and LECA, the pre-eukaryotic lineage evolved from a relatively simple and small prokaryotic cell to a complex, organelle-bearing cell that was organized in a fundamentally different manner. Which evolutionary events underlay eukaryogenesis is a major question in evolutionary biology [289], to which answers are offered by investigations into specific eukaryotic systems [72]. Studies on, for example, the spliceosome, the intracellular membrane system and the nuclear pore revealed that (repurposed) prokaryotic genes played a role in their origination, as did novel, eukaryote-specific genes and gene duplications, albeit in varying degrees and in different manners [73, 290, 291].

In this study, we address the question of how the kinetochore originated. Leveraging the power of improved sensitive sequence searches, novel structural insights and detailed phylogenetic analyses, we set out to trace the evolutionary origins of each protein of the LECA kinetochore. Based on our findings, we propose that the LECA kinetochore is of mosaic origin: it recruited proteins involved in various other core eukaryotic processes as well as completely novel proteins. After recruitment, many of these proteins duplicated to give rise to the complex LECA kinetochore.

Results

The LECA kinetochore

To study how the LECA kinetochore originated, we first determined its protein content (Figure 1B, Supplementary Table 1). For each protein present in current-day human and yeast kinetochores, we asked: A) was it encoded in the genome of LECA, based on its presence in current-day eukaryotes? and B) if it was in LECA’s genome, did it likely function in the kinetochore? We inferred a protein to have been present in the LECA genome if it is found in both Opimoda and Diphoda, whose divergence likely represents the root of the eukaryotic tree of life [31] (Supplementary Figure 4, Supplementary Text). We performed a similar analysis in our previous work [14], but here we added Nkp1, Nkp2 and Csm1, which we also infer to have been part of the LECA kinetochore (for an in-depth discussion on Nkp1 and Nkp2 in the LECA kinetochore, see Supplementary Text and Figure 4A). Moreover, the in-depth investigations that we performed here allowed us to confirm our previous conclusion that most of the CCAN (Constitutive Centromere Associated Network) proteins (the so-called ‘Cenp’ proteins) were part of the LECA kinetochore (Supplementary Text). Altogether, we propose that the LECA kinetochore consisted of 52 proteins (Figure 1B).
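The presence criterion described above can be illustrated with a minimal Python sketch. It assumes that an ortholog found on both sides of the Opimoda-Diphoda split is taken as evidence for presence in LECA; the species sets and the function name `inferred_in_leca` are illustrative assumptions, not the species set or pipeline used in this chapter, and the additional phylogenetic and functional checks mentioned in the text are not modelled.

```python
# Illustrative (hypothetical) species lists; the actual species set used in
# the chapter is not reproduced here.
OPIMODA = {"Homo sapiens", "Saccharomyces cerevisiae", "Dictyostelium discoideum"}
DIPHODA = {"Arabidopsis thaliana", "Plasmodium falciparum", "Naegleria gruberi"}


def inferred_in_leca(species_with_ortholog):
    """Return True if orthologs occur on both sides of the assumed root.

    This encodes only the presence rule (found in both Opimoda and Diphoda);
    it does not decide whether the protein functioned in the kinetochore.
    """
    found = set(species_with_ortholog)
    return bool(found & OPIMODA) and bool(found & DIPHODA)


# Present in both groups: consistent with presence in LECA.
print(inferred_in_leca(["Homo sapiens", "Arabidopsis thaliana"]))
# Present in Opimoda only: not inferred for LECA under this rule.
print(inferred_in_leca(["Homo sapiens", "Saccharomyces cerevisiae"]))
```

A rule of this kind is deliberately conservative about lineage-specific inventions, but it cannot by itself distinguish vertical inheritance from horizontal transfer between the two groups.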

[Figure 1 graphic. Panel A: tree of Bacteria, Archaea (including Asgard) and eukaryotes descending from LUCA, with FECA, the mitochondrial endosymbiont (α-Proteobacteria) and LECA marked, alongside examples of prokaryotic segregation systems (e.g. the ParA, ParB and parS components in Caulobacter crescentus). Panel B: model of the LECA kinetochore (outer kinetochore, inner kinetochore, inner centromere and centromeric DNA), with a legend that distinguishes common domains from kinetochore-specific domains and marks the duplication hypothesis.]

Figure 1. The eukaryotic kinetochore and mitotic machinery originated between FECA and LECA. A. How did the eukaryotic kinetochore originate and evolve between FECA and LECA? Eukaryotes (blue) originate from Archaea (green), likely closely related to the Asgard superphylum [9]. This Asgard-related lineage incorporated an alphaproteobacterium via endosymbiosis, and the latter gave rise to the eukaryotic mitochondrion. As far as currently characterized, Archaea and Bacteria (red) do not separate their duplicated chromosome(s) via a system similar to the mitotic spindle [25-27]. For example, Caulobacter crescentus (and likely various other Bacteria) operates the parABS partitioning system, in which parS sites, located near the origin, are recognized by the ParB protein. ParB stimulates the ATPase activity of ParA, which in turn pulls or pushes the chromosomes apart [26]. Due to these differences between prokaryotic and eukaryotic chromosome segregation, the mitotic spindle, including the kinetochore and its proteins, likely originated between the first eukaryotic common ancestor (FECA) and the last eukaryotic common ancestor (LECA). LUCA: last universal common ancestor. B. Model of the eukaryotic kinetochore. The kinetochore of LECA consisted of 52 proteins that are either made up of domains also found in other eukaryotic proteins not involved in the kinetochore (‘common domains’), or of domains that are unique to the kinetochore (‘kinetochore-specific domains’). Proteins were inferred to have been part of the LECA kinetochore based on their presence in current-day species, additional phylogenetic inference and in some cases functional information (Supplementary Table 1). The dashed line indicates the hypothesized single duplication of kinetochore proteins (Discussion). KT: kinetochore.

Identifying ancient homologs of kinetochore proteins

In order to elucidate the ancient, pre-LECA homologs (either eukaryotic or prokaryotic) of LECA kinetochore proteins, we applied sensitive profile-versus-profile homology searches (Figure 2B, Supplementary Table 2), followed by constructing phylogenetic trees (Figure 2A, Supplementary Figures 1A, 3) or, if available, interpreting published phylogenetic trees. If literature and/or structural studies provided additional information on ancient relationships, we also included these as evidence for a homologous relationship of a kinetochore protein (Figure 2C). For each LECA kinetochore protein, we examined which proteins comprise its closest homologs before LECA (Table 1). These proteins were classified as eukaryotic or prokaryotic, and as kinetochore or non-kinetochore protein, implying different evolutionary histories (Data and Methods). In order to allow different domains in a single protein to have different evolutionary histories, we primarily searched for homologs on the domain level, and represent these as a single ‘domain’ in Table 1 if their evolutionary history was shared.

On the domain level, we inferred the closest homologs of kinetochore proteins from gene phylogenies (15/55, 27%), profile-versus-profile searches (6/55, 11%), structural information (6/55, 11%, Table 1), or a combination of these (13/55, 24%). In total, we identified the closest homolog for 40 domains. For the remaining 15, we could not do so, either because they likely have homologs but we cannot resolve which is closest (6/55, 11%), or because they might have no homologs at all (9/55, 16%).
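As a consistency check on the tallies above, the following short Python sketch recomputes the totals and rounded percentages from the counts given in the text. The dictionary keys and the helper `percent` are naming choices made here for illustration only.

```python
TOTAL_DOMAINS = 55

# Counts as stated in the text: domains with an identified closest homolog,
# grouped by the type of evidence.
resolved = {
    "gene phylogeny": 15,
    "profile-versus-profile search": 6,
    "structural information": 6,
    "combination": 13,
}

# Domains for which no closest homolog could be identified.
unresolved = {
    "homologs likely, closest one unresolved": 6,
    "possibly no homologs": 9,
}


def percent(count, total=TOTAL_DOMAINS):
    """Percentage of all domains, rounded to the nearest integer."""
    return round(100 * count / total)


n_resolved = sum(resolved.values())
n_unresolved = sum(unresolved.values())

for label, count in {**resolved, **unresolved}.items():
    print(f"{label}: {count}/{TOTAL_DOMAINS} ({percent(count)}%)")
print(f"closest homolog identified: {n_resolved}; not identified: {n_unresolved}")
```

The two groups sum to 40 and 15 domains respectively, which together account for all 55 domains.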

Table 1. Ancient homologs of kinetochore domains and their functions (see also Figure 1B). Note that if multiple domains have a shared evolutionary history, we regard them as a single unit in this table (Kinase-POLO box, NRH-Sec39). Some domains were recruited to the kinetochore before they duplicated to give rise to multiple kinetochore proteins. Those initial kinetochore entities are the ‘ancestral kinetochore units’. If a protein does not have closely related homologs in the kinetochore, the protein itself was the ancestral unit that got involved in the kinetochore. For all relationships, we indicate which type of evidence we have for it (examples in Figure 2). A: phylogenetic tree, B: hit in profile-profile search, C: structure and/or literature.

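Domain | LECA protein | Ancestral kinetochore unit | Closest pre-LECA homolog (kinetochore) | Closest pre-LECA homolog(s) (non-kinetochore) | Other homologous proteins
Kinase-POLO box | Plk | Plk | - | Centrosome (Plk4) A | -
Kinase-POLO box | Aurora | Aurora | - | Centrosome (Plk-Plk4) A | -
Kinase | MadBub | MadBub | - | Uncharacterized A | -
Kinase | Mps1 | Mps1 | - | Chromatin assembly, DNA damage (TLK1) A | -
TPR | MadBub | anc_KT_TPR | Mps1 B | Spliceosome (SYF1) B | -
TPR | Mps1 | anc_KT_TPR | MadBub B | Spliceosome (SYF1) B | -
Histone | CenpA | CenpA | - | Nucleosome (H3) A | -
Histone | CenpS | anc_KT_histone_1 | CenpT A | unclear | TBP-associated factors (TFIID, SAGA), chromatin structure (H2B, H3, H4) A
Histone | CenpT | anc_KT_histone_1 | CenpS A | unclear | TBP-associated factors (TFIID, SAGA), chromatin structure (H2B, H3, H4) A
Histone | CenpW | anc_KT_histone_2 | CenpX A | unclear | DNA repair (DPOE), transcription regulation (CCAAT-binding complex, NC2 complex) and chromatin structure (H2A) A
Histone | CenpX | anc_KT_histone_2 | CenpW A | unclear | DNA repair (DPOE), transcription regulation (CCAAT-binding complex, NC2 complex) and chromatin structure (H2A) A
WD40 | Cdc20 | Cdc20 | - | Cell cycle progression (Cdh1) A | -
WD40 | Bub3 | Bub3 | - | Nuclear mRNA transport (Rae1) A | -
WD40 | Rod | Rod | - | Tethering of Golgi-derived vesicles at ER (NAG) A | Other coatomers/adaptors BC
HORMA | Mad2 | Mad2 | - | DNA replication/repair (Rev7) A | Autophagy (Atg13, Atg101), meiosis (HORMAD) A
HORMA | p31 | p31 | - | unclear | DNA replication/repair (Rev7), autophagy (Atg13, Atg101), meiosis (HORMAD) A
RWD (1x) | Spc24 | anc_KT_RWD | duplication order unresolved AB | Ubiquitin-like conjugases (E2) and RWD-like dimerization domains (i.e. Gcn2) ABC | DNA repair (FancL) AC and transcription (Med15) A
RWD (1x) | Spc25 | anc_KT_RWD | duplication order unresolved AB | Ubiquitin-like conjugases (E2) and RWD-like dimerization domains (i.e. Gcn2) ABC | DNA repair (FancL) AC and transcription (Med15) A
RWD (1x) | Mad1 | anc_KT_RWD | duplication order unresolved AB | Ubiquitin-like conjugases (E2) and RWD-like dimerization domains (i.e. Gcn2) ABC | DNA repair (FancL) AC and transcription (Med15) A
RWD (1x) | Csm1 | anc_KT_RWD | duplication order unresolved AB | Ubiquitin-like conjugases (E2) and RWD-like dimerization domains (i.e. Gcn2) ABC | DNA repair (FancL) AC and transcription (Med15) A
RWD (2x) | Zwint-1 | anc_KT_RWD | duplication order unresolved AB | Ubiquitin-like conjugases (E2) and RWD-like dimerization domains (i.e. Gcn2) ABC | DNA repair (FancL) AC and transcription (Med15) A
RWD (2x) | Knl1 | anc_KT_RWD | duplication order unresolved AB | Ubiquitin-like conjugases (E2) and RWD-like dimerization domains (i.e. Gcn2) ABC | DNA repair (FancL) AC and transcription (Med15) A
RWD (2x) | CenpO | anc_KT_RWD | duplication order unresolved AB | Ubiquitin-like conjugases (E2) and RWD-like dimerization domains (i.e. Gcn2) ABC | DNA repair (FancL) AC and transcription (Med15) A
RWD (2x) | CenpP | anc_KT_RWD | duplication order unresolved AB | Ubiquitin-like conjugases (E2) and RWD-like dimerization domains (i.e. Gcn2) ABC | DNA repair (FancL) AC and transcription (Med15) A
CH | Ndc80 | anc_KT_NN-CH | Nuf2 C | Microtubule-associated protein in meiosis (FAM98), intraflagellar transport-ciliogenesis (CLUAP1) B | -
CH | Nuf2 | anc_KT_NN-CH | Ndc80 C | Microtubule-associated protein in meiosis (FAM98), intraflagellar transport-ciliogenesis (CLUAP1) B | -
TBP-like | CenpN | anc_KT_TBP-like | CenpL C | unclear | Transcription regulation (TBP), dsRNA interaction (Ribonuclease III), binding of cargo motifs (AP-2 beta, Coatomer subunit gamma) C
TBP-like | CenpL | anc_KT_TBP-like | CenpN C | unclear | Transcription regulation (TBP), dsRNA interaction (Ribonuclease III), binding of cargo motifs (AP-2 beta, Coatomer subunit gamma) C
Mis12/Nkp-like | Mis12 | anc_KT_Mis12/Nkp-like | Nkp1 B | - | -
Mis12/Nkp-like | Nkp1 | anc_KT_Mis12/Nkp-like | Mis12 B | - | -
Mis12/Nkp-like | Dsn1 | anc_KT_Mis12/Nkp-like | Nsl1 C | - | -
Mis12/Nkp-like | Nsl1 | anc_KT_Mis12/Nkp-like | Dsn1 C | - | -
Mis12/Nkp-like | Nnf1 | anc_KT_Mis12/Nkp-like | Nkp2 B | - | -
Mis12/Nkp-like | Nkp2 | anc_KT_Mis12/Nkp-like | Nnf1 B | - | -
Ska-like | Ska1 | anc_KT_Ska | duplication order unresolved* A | - | -
Ska-like | Ska2 | anc_KT_Ska | duplication order unresolved* A | - | -
Ska-like | Ska3 | anc_KT_Ska | duplication order unresolved* A | - | -
AAA+ ATPase | TRIP13 | TRIP13 | - | HORMA domain regulation (Prokaryotic TRIP13) A | -
NRH-Sec39 | Rod | Rod | - | Tethering of Golgi-derived vesicles at ER (NAG) A | -
Vps51 | ZW10 | ZW10 | - | Intra-Golgi transport (COG5) A | Vesicle tethering (COG/GARP/Exocyst subunits) A
BIR | Survivin** | Survivin | - | - | -
Cupin | CenpC | CenpC | - | Plant hormone regulation (Auxin-binding-protein, no pre-LECA function) A [350] | -
HEAT | CenpI | CenpI | - | Nuclear protein import (Importin subunit beta-1), transcription regulation (BTAF1), protein phosphatase 2A subunit (PR65) C | -
GTPase | CenpM | CenpM | - | Nucleotide binding (RabL1), signal transduction at cell membrane (Rem2), ER-to-Golgi transport (Rab1A) B | -
kinesin | CenpE | CenpE | - | Retrograde transport (KIF1C) B | -
zinc finger | BugZ | BugZ | - | Transcription regulation (ZNF879) B | -
- | Zwilch | Zwilch | - | - | -
- | Incenp | Incenp | - | - | -
- | Borealin | Borealin | - | - | -
- | Sgo | Sgo | - | - | -
- | Cep57 | Cep57 | - | - | -
- | CenpH | CenpH | - | - | -
- | CenpK | CenpK | - | - | -
- | CenpQ | CenpQ | - | - | -
- | CenpU | CenpU | - | - | -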

*The phylogeny of Ska1, Ska2 and Ska3 [14] cannot be rooted, therefore it is unknown which are the closest outparalogs.
**The BIR domain is involved in multiple processes in animals, but the kinetochore (inner centromere) function might be the ancestral one, because this is also reported in budding and fission yeast.

Evolutionary histories of kinetochore proteins

Below we reconstruct the evolutionary history of LECA kinetochore proteins per protein domain, including their identified affiliations to other eukaryotic cellular processes, their prokaryotic homologs, and their ancient duplications within the kinetochore.

RWD

The RWD (RING-WD40-DEAD)-like domains in kinetochore proteins (KT_RWD) are highly diverged members of the superfamily of E2 ubiquitin-like conjugases (UBC) [292-294] (Figure 3, Supplementary Table 3). Since bona fide catalytic UBCs were found in both Bacteria and Archaea, ubiquitin-like modification was likely the ancestral function of this fold in FECA [9, 294-296]. Both catalytic [297] and non-catalytic [294] families of the UBC superfamily expanded massively between FECA and LECA. Functionally, the non-catalytic UBC-like proteins comprise three major groups (Figure 3C): (1) ubiquitin-related vesicle trafficking proteins like UEV1/TSG101 [298] and AKTIP [299], (2) a group of E2/E3-related canonical RWD proteins (RWD) involved in DNA/RNA-related processes (e.g. Gcn2 [300] and FancL [301]), and (3) eight LECA kinetochore proteins that form hetero- or homodimers, with either a single RWD configuration (Spc24-Spc25, Mad1-Mad1, Csm1-Csm1) or a double RWD configuration (CenpO-CenpP and Knl1-Zwint-1; Zwint-1 is also an RWD protein, see Supplementary Text). To determine whether KT_RWD domains evolved from a single or from multiple UBC-like ancestral protein(s), and to assess the order of RWD domain duplications, we aligned archaeal, bacterial and eukaryotic UBC-like proteins and performed a phylogenetic analysis (Data and Methods, Supplementary Text, Supplementary Figure 1E, 3). We found that RWD and KT_RWD have a shared origin: their monophyly is strongly supported (bootstrap support: 96/100), and separates them from archaeal and eukaryotic UBCs (bootstrap support: 77/100). Similar to our observations in profile-versus-profile searches (Supplementary Table 2), most KT_RWD domains form a single monophyletic group (bootstrap support: 93/100). However, because of the highly divergent nature of RWD and KT_RWD domains and the low statistical support within the KT_RWD clade, we were not able to reconstruct the exact order in which the KT_RWD proteins arose. Moreover, our phylogenetic analysis suggested that Med15 and FancL might also be closely related to KT_RWD proteins, possibly even closer than KT_RWD proteins are to one another. Altogether, our analyses reveal that kinetochore RWD proteins are the result of an expansion of non-catalytic E2 ubiquitin-like conjugases during eukaryogenesis, and indicate a shared origin of the kinetochore with systems operating RWD domains, involved in DNA repair (FancL), translation (Gcn2) and transcription regulation (Med15) (Figure 3C).
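The notion of monophyly used above can be made concrete with a small Python sketch. It represents a rooted tree as nested tuples and tests whether a set of leaves forms exactly one clade. The toy topology and the labels (`UBC_1`, `RWD_1`, `KT_RWD_1`, and so on) are invented for illustration and are not the tree inferred in this chapter; bootstrap support is not modelled.

```python
def leaves(tree):
    """Return the set of leaf labels below a node.

    A node is either a leaf (a string) or a tuple of child nodes.
    """
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some clade of the rooted tree contains exactly the leaves in `group`."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Toy rooted topology (hypothetical): catalytic UBCs as outgroup, then a clade
# that unites canonical RWD and kinetochore RWD sequences.
toy_tree = (
    ("UBC_1", "UBC_2"),
    (("RWD_1", "RWD_2"), ("KT_RWD_1", "KT_RWD_2", "KT_RWD_3")),
)

print(is_monophyletic(toy_tree, {"KT_RWD_1", "KT_RWD_2", "KT_RWD_3"}))
print(is_monophyletic(toy_tree, {"UBC_1", "RWD_1"}))
```

In practice, such a test is applied to a tree with support values, and a clade is only interpreted as monophyletic when the support for the corresponding branch is sufficiently high.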

HORMA-TRIP13

Previously, the HORMA domain was suggested to be a eukaryotic invention [75], but a more recent study also reported its presence in phylogenetically diverse Bacteria [39]. In eukaryotes, HORMA proteins operate in the kinetochore (Mad2, p31comet), in autophagy (Atg13, Atg101) [302], in DNA repair (Rev7) and in meiosis (HORMAD). p31comet shares a regulator with HORMAD: TRIP13, an AAA+ ATPase. Burroughs et al. also reported a TRIP13-like protein in HORMA-containing Bacteria, even in the same operon [39]. This strongly suggests that TRIP13 regulates HORMA in these Bacteria as well. We also found the HORMA-TRIP13-like operon in a few Haloarchaea (Figure 4B). The eukaryotic HORMA domain proteins are monophyletic relative to the prokaryotic ones (Supplementary Figure 1F). The AAA+ ATPase phylogeny indicates that eukaryotic TRIP13 sequences are more closely related to the prokaryotic TRIP13-like sequences than to all other AAA+ ATPases (Supplementary Figure 1G). Therefore, these prokaryotic sequences can be considered actual ‘TRIP13’ proteins. How did eukaryotes acquire HORMA and TRIP13? Based on the phylogenies, we propose three scenarios. First, the module was invented between FECA and LECA, was horizontally transferred to a prokaryote before LECA, and was subsequently transferred across prokaryotic lineages. Second, eukaryotes derived it by vertical descent from archaeal ancestors (Figure 1A), after which it spread to Bacteria via horizontal gene transfer and was lost massively in archaeal clades. Third, eukaryotes derived it by horizontal transfer from Bacteria, e.g. via endosymbiotic transfer from the alphaproteobacterial endosymbiont (Figure 1A). Subsequently, it was transferred among Bacteria and to Haloarchaea. The latter horizontal gene transfer is conceivable, since bacterial-haloarchaeal gene transfers are frequent [303, 304]. We consider the third scenario most likely, since we do not know of any examples of horizontal gene transfer from the pre-eukaryotic lineage to prokaryotes, arguing against the first scenario, and because the HORMA-TRIP13 operon is more frequently found in Bacteria than in Archaea, arguing against the second. Because bacterial HORMA-TRIP13 are found in operons with genes involved in nucleotide signalling [39], the module possibly originally functioned in this process between FECA and LECA, after which HORMA duplicated and neofunctionalized. Thereby, HORMA (and TRIP13) got repurposed for eukaryote-specific processes, such as meiosis and the kinetochore.

[Figure 2 graphic. Panel A: phylogenetic tree of ZW10 and its homologs. Panel B: profile-versus-profile hits of the TPR domains of Mps1 and MadBub (other profiles shown: 5jjx_A, 2uy1_A, 5gmk_v, PF08424.9, PF15297.5). Panel C: TBP-like fold in hsCenpN [6EQT_A] and scCenpL [4JE3_A], compared with TATA-box binding protein (DNA interaction), the Integrator complex subunits IntS9 and IntS11 (dimerization), RNase H3 (substrate binding) and Coatomer subunit gamma 1 (cargo interaction).]

Figure 2. Identifying homologs of kinetochore proteins by using phylogenetic trees, profile-versus-profile searches and structural similarities. A. Phylogenetic tree: ZW10. Phylogenetic tree of ZW10 and its homologs of the COG, GARP and exocyst complexes, which all function in the tethering of vesicles and membranes (see Supplementary Figure 1A for the uncollapsed tree). For ZW10 and various other kinetochore proteins, we used trees to determine the other protein (blue) that is most closely related to the kinetochore protein (green). If this protein is a eukaryotic protein that represents a paralog of the kinetochore protein, it is the ‘closest outparalog’ of this kinetochore protein, which in the case of ZW10 is COG5. B. Profile-versus-profile search hit: TPR domains of MadBub and Mps1. Sequence similarity hits based on profile-versus-profile searches of the TPR domains of MadBub and Mps1. The arrows indicate which profiles (either from the kinetochore proteins or from common databases, see Supplementary Text) are hit by which other profile. The best hit of the TPR domain of MadBub is the TPR domain of Mps1. In addition, the TPR domain of Mps1 also hits profiles of various other proteins in PDB [22] (identifiers with ‘_’) and Pfam (‘PF’) [34]. C. Structural similarity: CenpN, CenpL and TBP-like proteins. Structural representation of various protein families that contain a fold that is similar to the pseudo-symmetric TATA-box binding protein (TBP), which could not be detected through sensitive sequence searches [34]. The TBP-like fold consists of an elaborate set of curved β-sheets that form an interaction surface for its substrates such as DNA, RNA and various protein motifs, but also constitute a possible dimer interface, resulting in an even larger extended β-sheet configuration. The cartoon representation in yellow (helices) and red (sheets) shows the location and presence of a TBP-like domain in the proteins depicted here; the grey ribbon representation indicates the non-homologous parts of the proteins. The function of the TBP-like domain in the different protein families is indicated between brackets. For the kinetochore proteins CenpN and CenpL, the PDB accession is shown.

Histones

Similar to eukaryotes, Archaea utilize histones to organize DNA [305], and therefore the ancestral function of histones likely was chromatin formation [268]. From FECA to LECA, histones duplicated and subfunctionalized many times, and the products form histone-histone dimers, either homodimers or heterodimers (Figure 4C). Histones are found in the core histone complex (H3-H4-H2A-H2B [268]), transcription regulation (TFIID (TAFs [306]), SAGA (SUPTs [307]), CCAAT-binding complex (CBF [308]) and Negative Cofactor 2 (NC2 [309])), DNA damage repair (DNA polymerase ε, DPOE [310]), the Fanconi anemia pathway [311, 312] and the kinetochore (CenpA [313], CenpSXTW [314]). The order in which all these pathways and complexes arose is not clear, but it is generally assumed that non-nucleosomal histones originated from duplications of the canonical, nucleosomal histones (H2A-H2B-H3-H4). Indeed, CenpA, the H3 variant that is specifically incorporated into centromeric DNA, originated from an ancient duplication that gave rise to H3 and CenpA [14, 268]. Although the canonical histones are among the most conserved eukaryotic proteins, other histone proteins diverged extensively, which, in combination with their short lengths, previously hampered their evolutionary reconstruction. We here adopted an approach similar to the one used for the phylogenetic reconstruction of the RWD domain (see Data and Methods, Supplementary Text, and Supplementary Figure 1I). Our phylogenetic analyses revealed that the kinetochore histones arose from a co-duplication of CenpS-CenpT (bootstrap support: 99/100) and CenpX-CenpW (bootstrap support: 77/100). We found CenpS-CenpT to be phylogenetically affiliated to H2B/H3/H4/TFIID/SAGA-related histones, and CenpX-CenpW clustered with H2A/CBF/NC2/DPOE/TAF11-related histones (Figure 4C). These affiliations suggest that the origin of the kinetochore is interlinked with the emergence of the eukaryotic chromatin environment, including a highly intricate transcription and DNA repair machinery.

[Figure 3 graphic. Panel A: positions of the eight RWD-containing kinetochore proteins (single RWD, 4x; double RWD, 4x). Panel B: secondary structure of E2 and RWD-like proteins (YPxxxP motif, catalytic cysteine, 3-5 β-sheets). Panel C: the UBC superfamily in pre-LECA evolution, with example members of the UBC, RWD and KT_RWD families and their molecular functions.]

Figure 3. RWD domains in kinetochore proteins are highly divergent non-catalytic members of the E2 ubiquitin conjugases (UBC) family that emerged as part of the rapid expansion of this protein family during the FECA-to-LECA transition. A. Overview of the position of the 8 kinetochore proteins harbouring RWD domains. All KT_RWD proteins follow the same structural topology: a long coiled-coil region at the N-terminus and either a single (light blue) or double (darker blue) RWD domain at the C-terminus. The green dashed line indicates an internal duplication of 3 KT_RWD heterodimers that possibly gave rise to the LECA kinetochore complexity. B. Secondary structure of E2 and RWD-like proteins of the UBC superfamily, which is characterized by a ‘β-meander’ of 3-5 β-sheets, enclosed by α-helices at both termini, and a ‘YPxxxP’ motif that often resides in between the third and the fourth β-sheet. Furthermore, a catalytic cysteine residue plays a role in the E2 ubiquitin conjugation step; this residue is lost in both RWD-like descendants (KT_RWD and RWD). C. The UBC superfamily in pre-LECA evolution. The UBC superfamily diversified into three evolutionarily distinct eukaryotic protein families through an expansion of an ancestral prokaryotic E2 enzyme during the FECA-to-LECA transition: classic eukaryotic E2 ubiquitin ligase-related proteins (UBC) that function in ubiquitin-like modification (e.g. sumoylation) and interactions (UEV1 is a ubiquitin-binding protein), non-catalytic RWD protein families (RWD) that operate as a dimerization domain to facilitate various E2/E3 ubiquitin-like ligation reactions (FancL-Ube2T and Rwdd3/Rsume-Ubc9), and RWD-like kinetochore proteins (KT_RWD) that are dimers and form part of the kinetochore superstructure (Spc24-Spc25, Knl1-Zwint-1 and CenpO-CenpP) and play a role in microtubule attachment regulation (Mad1 and Csm1). Per protein family, the structure of various members is depicted to show the overall similarity of the secondary structure. If present, the YPxxxP motif and the catalytic cysteine residues are represented in the ‘sticks’ configuration, in yellow and cyan, respectively. The RWD and KT_RWD families also contain proteins that have a tandem RWD configuration (‘2’, darker blue), while most UBC-like members appear in a singular form (‘1’, light blue). Per protein family, a known molecular function is indicated between brackets. An annotated phylogenetic tree of our evolutionary analysis of the RWD/UBC family can be found in Supplementary Figure 1E and Supplementary Figure 3.

Calponin Homology

In the kinetochore, the CH (calponin homology) domain connects to microtubules, as does the CH domain of e.g. EB1-3. However, the ancestral function of this domain, which to our knowledge has not been found in prokaryotes, is not known. In current-day species, CH domains operate in many different processes, including binding of actin and F-actin, and in various cellular signalling pathways [315]. The kinetochore CH proteins seem to be part of a highly divergent clade of CH proteins [316], which also includes proteins involved in intraflagellar transport, in ciliogenesis, in the centrosome and possibly in RNA transport [317, 318]. It has been suggested that this CH subfamily is specialized towards binding microtubules, implying that the kinetochore function reflects the ancestral function [316].

[Figure 4 graphic. Panel A: profile-versus-profile alignment of Nkp2 (A. castellanii) and Nnf1 (H. sapiens) with secondary-structure predictions, and the arrangement of the Mis12 and Nkp complex subunits, with relationships supported by structure, HHsearch or PRC. Panel B: phylogenetic trees of HORMA domain proteins and TRIP13-like AAA+ ATPases, with prokaryotic operon contexts (Burroughs et al. 2015). Panel C: scenario for the evolution of histones in eukaryotes.]

Figure 4. Origins of the subunits of the Mis12 and Nkp complexes, of the kinetochore HORMA-TRIP13 module, and of kinetochore histone folds. A. Subunits of the Nkp and Mis12 complexes are homologous and harbour the Mis12/Nkp-like domain. Based on profile-versus-profile hits (with HHsearch [13] and PRC [21]) and structural information [23, 24], we established that all subunits of the Nkp and Mis12 complexes must be homologous to each other (Supplementary Text). Only one example of the profile-versus-profile hits is shown. Since subunits of the Mis12 complex are present across eukaryotes [14], we infer that the Nkp complex subunits Nkp1 and Nkp2 were also in LECA, as they were likely derived from duplications before LECA. Nkp2 and Nnf1 are each other’s best hit in profile-versus-profile searches, so possibly these proteins result from a duplication that occurred relatively late, just before LECA. B. Phylogenetic trees of HORMA domain proteins and TRIP13-like AAA+ ATPases. These phylogenetic trees suggest that both the HORMA domain and TRIP13 were derived from prokaryotes. Possibly, in early eukaryotic evolution (just after FECA, Figure 1A), an ancient HORMA-TRIP13 operon was acquired (either through vertical transmission from Archaea or through horizontal gene transfer from Archaea or Bacteria, such as the alphaproteobacterial endosymbiont, see Figure 1A). The HORMA domain duplicated between FECA and LECA to give rise to six HORMA domain proteins in LECA, of which at least HORMAD and p31comet interact with TRIP13. Although we do not have direct biochemical evidence for HORMA-TRIP13 interaction in prokaryotes, their co-occurrence in a single operon strongly suggests that they also interact in these species, and hence that this interaction might be ancient. Moreover, they are in an operon with proteins that are involved in nucleotide signaling, suggesting HORMA and TRIP13 are also affiliated to this process [39]. The uncollapsed trees can be found in Supplementary Figure 1F,G. C. Scenario for the evolution of histones in eukaryotes. A cartoon of our reconstruction of the evolution of the histone fold in eukaryotes, with a specific emphasis on the history of the kinetochore-related histone proteins CenpA, -S, -T, -W and -X (Supplementary Figure 1I). On the left, an overview of the kinetochore position of centromere/kinetochore-related histone-like proteins (green). On the right, an illustration of the proposed evolutionary scenario. A histone of archaeal descent duplicated and subfunctionalized many times, giving rise to two phylogenetic archetypes of histone families: (1) H2B/H3/H4/TFIID/SAGA-related and (2) H2A/CBF/NC2/DPOE/TAF11-related. CenpA, CenpS and CenpT descended from the first group, while CenpX and CenpW are more similar to the latter. CenpA is the outparalog of the canonical H3. The CenpSX and CenpTW dimers are a result of a co-duplication, and therefore CenpS and CenpT, and CenpX and CenpW, likely are each other’s closest outparalog. A fusion of a TBP-associated factor 11 (TAF11)-like histone and a TAF13-like histone gave rise to the SAGA complex member SUPT3, which is the only double-histone protein known in eukaryotes. An annotated phylogenetic tree of our evolutionary analysis of the histone family can be found in Supplementary Figure 1I.

TBP-like

It was previously observed that CenpL and CenpN harbour a fold similar to that of TATA-box binding protein-related proteins [319-321]. We found that they share a common ancestor with a diverse group of proteins similar to the pseudo-symmetric DNA-binding domain of the TATA-box binding protein (TBP) [322] (Figure 2C). These proteins function in nucleotide interaction (Crl [323], Med20, RNase H3 [324] and AMPK-like kinases [325]), dimerization (various metabolic enzymes like spermidine synthase [326]), transcription regulation (IntS9-IntS11 [327]) and in motif-mediated cargo recognition (Coatomer gamma [328] and Adaptor beta [329]). Their structures are significantly similar (Figure 2C, Supplementary Table 3, Supplementary File 152), but their sequences are not, as their homology was not detected by any profile-versus-profile search (Supplementary Table 2). TBP as well as various enzymes with a TBP-like fold (RNases and DNA glycosylases [322]) are present in Archaea [160], suggesting eukaryotes acquired these proteins via vertical descent (Figure 1A). CenpN and CenpL, and other eukaryotic proteins with a TBP-like domain (Figure 2C), likely originated from duplications during the FECA-to-LECA transition. The structural similarities between CenpL, CenpN and other TBP-like domain proteins do not indicate which are most similar, and hence most closely related. Nevertheless, given that they form a heterodimer, we propose that CenpL and CenpN are each other’s closest outparalog, and that other TBP-like domain proteins are more distantly related.

Mis12/Nkp-like

Through profile-versus-profile searches we discovered previously hidden homology within the kinetochore: subunits of the Nkp complex are homologous to subunits of the Mis12 complex. Combined with structural information on Mis12 complex subunits, we inferred that all subunits of these two complexes are homologous (Figure 4A, Supplementary Text). Sequence similarities suggest that Nnf1 and Nkp2 are each other’s closest outparalog, as are Mis12 and Nkp1; hence these pairs may represent the most recent duplications. Possibly, the Mis12 complex originated first, via intra-complex duplications, and the Nkp complex resulted from co-duplication of the Nnf1 and Mis12 proteins. We did not find any homologs for these Mis12/Nkp-like proteins outside of the kinetochore.
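The reasoning that two proteins which are each other’s best hit likely stem from a relatively recent duplication can be sketched as a reciprocal-best-hit test. The Python example below is a minimal illustration: the scores are invented placeholders, not values from Supplementary Table 2, and the function names are assumptions made here; it does not call HHsearch or PRC.

```python
def best_hit(query, scores):
    """Return the highest-scoring target for `query`, or None if it has no hits."""
    hits = scores.get(query, {})
    return max(hits, key=hits.get) if hits else None


def reciprocal_best_hits(scores):
    """Return the set of unordered pairs that are each other's best hit."""
    pairs = set()
    for query in scores:
        target = best_hit(query, scores)
        if target is not None and best_hit(target, scores) == query:
            pairs.add(frozenset((query, target)))
    return pairs


# Hypothetical profile-versus-profile scores (higher is better); invented
# numbers that only mimic the pattern described in the text.
toy_scores = {
    "Nnf1": {"Nkp2": 95.0, "Mis12": 40.0},
    "Nkp2": {"Nnf1": 93.0, "Nkp1": 35.0},
    "Mis12": {"Nkp1": 88.0, "Nnf1": 42.0},
    "Nkp1": {"Mis12": 90.0, "Nkp2": 30.0},
    "Dsn1": {"Mis12": 20.0},
}

for pair in sorted(sorted(p) for p in reciprocal_best_hits(toy_scores)):
    print(" <-> ".join(pair))
```

A reciprocal best hit is suggestive rather than conclusive: unequal rates of sequence divergence can make a more distant paralog score higher than the true closest one, which is why such hits are combined with structural and phylogenetic evidence where available.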

Novelties in the kinetochore

Like the Mis12/Nkp-like proteins, various other proteins and protein domains, such as Ska, seem unique to the kinetochore (Table 1). While these domains might have been invented between FECA and LECA and only serve roles in the kinetochore, we cannot exclude that they do have homologous prokaryotic or eukaryotic sequences after all, which we were simply not able to detect. Since for all of these domains it was already challenging to establish that the kinetochore proteins are homologous, it would not be surprising if non-kinetochore homologs do exist. The same applies to those kinetochore proteins for which we could not find any homologs at all, such as CenpH, CenpK, CenpQ, CenpU and Cep57.

Common domains: kinases, TPR, tethering factors, WD40

The kinases at the kinetochore are not closely related to one another, except for Plk and Aurora (Table 1, Supplementary Figure 1D). The closest relative of Mps1 might be TLK (tousled-like kinase), although the support for their affinity is not very strong (bootstrap support: 36/100). The closest relative of MadBub is an uncharacterized group of proteins. While the kinase domains of Mps1 and MadBub are not each other’s closest outparalog, profile-versus-profile hits indicate that their TPR domains are (Figure 2B, Supplementary Table 2). This would imply that the Mps1 and MadBub TPR domains joined with a kinase domain independently, not in a single event. Both Mps1 and MadBub seem to diverge relatively deeply in the tree of eukaryotic kinases, without clear relationships to larger kinase clusters. The closest relative of Plk is Plk4, hence the ancestral function of the pre-duplication gene might have been related to the centrosome. Aurora resulted from a duplication prior to the Plk-Plk4 divergence, so Plk and Aurora were probably recruited to the kinetochore independently, after diverging from their last common ancestor protein. Moreover, the POLO boxes of Plk and Plk4 joined the kinase domain after the divergence from Aurora, but before Plk and Plk4 diverged from each other, since their POLO boxes clearly are each other’s closest outparalog (Supplementary Table 2). Plk and Aurora seem related to the AGC clade of kinases, which are involved in a wide variety of cellular processes [330].

ZW10 is part of a large protein family containing subunits of the COG, GARP, Exocyst and NRZ complexes, which have a role in tethering transport vesicles to various (organellar) membranes [331-333] (Figure 2A). Its closest outparalog is COG5, which is involved in intra-Golgi transport (Supplementary Figure 1A). Next to its role in the kinetochore, ZW10 participates in the ER-associated NRZ complex. Interestingly, its kinetochore interactor Rod is most closely related to NAG, another component of the NRZ complex (Supplementary Figure 1H). Hence, the ancestor of Rod and NAG might have already interacted with ZW10, possibly both at the kinetochore and at the ER. After the duplication gave rise to Rod and NAG, these proteins might have subfunctionalized to the kinetochore and ER, respectively, still using their interaction with ZW10. Alternatively, the kinetochore function of ZW10 was novel and only arose after Rod neofunctionalized towards it.

The relatives of the WD40 and TPR kinetochore proteins are highly diverse, and their repetitive nature makes it hard to resolve their (deep) evolutionary origins. Cdc20, a WD40 repeat protein, is most closely related to Cdh1 (Supplementary Figure 1B), which partially fulfils a similar role in activating the APC/C. Very likely, their ancestor had a similar role. Bub3’s closest outparalog is Rae1 (Supplementary Figure 1C), a protein involved in transporting mRNAs out of the nucleus. For both Cdh1 and Rae1, we can neither suggest nor exclude that their ancestors were already part of the kinetochore. When it comes to the deep origins of the WD40 repeat, it is not yet known whether its origin predates the eukaryotic lineage, or whether it was invented between FECA and LECA. While this repeat is clearly present in current-day prokaryotic proteins [334], these prokaryotes may have recently received it from eukaryotes via horizontal gene transfer. Similarly, the TPR domain is found in prokaryotes and it was suggested to have been present in the prokaryotic ancestors of eukaryotes, although this is not yet established [335].

Mosaic origin of the LECA kinetochore

Most of the LECA kinetochore consisted of domains common to eukaryotic proteins (40/55, 73%), but it also contained some with no detectable homology outside of the kinetochore (15/55, 27%, Table 1, Figure 1B). Of the proteins with common domains, only one (TRIP13) was directly derived from its prokaryotic ancestors. All others have paralogs in eukaryotes that are more closely related than (if existing) prokaryotic homologs. These paralogs function in, among others, the centrosome, nucleosome and vesicular transport (Table 1, last two columns). Among these functions, DNA regulation- and repair-related functions recur. Of the 14 closest non-kinetochore homologs (either closest outparalogs or more distantly related paralogs) that we could identify, 4 are involved in these processes: TLK1, H3, Rev7 and FancL (Table 1, fifth column). We also observe such processes frequently in more distantly related homologs, such as DPOE and nucleosome histones (Table 1, last column). Among functions of more distantly related homologs, we also recurrently observe transcription regulation, carried out by proteins such as TBP-associated factors, CCAAT-binding complex proteins, BTAF1 and zinc finger proteins. All in all, most LECA kinetochore proteins are part of families that are common in eukaryotes. These families typically expanded between FECA and LECA and diversified into different eukaryotic cellular processes, including the kinetochore. Hence, these kinetochore proteins are related to many different processes, indicating a mosaic origin.

In spite of this mosaic origin, many kinetochore proteins arose from intra-kinetochore gene duplications. This is illustrated by the high number of kinetochore proteins with a closest homolog within the kinetochore: of the 40 domains with an identified closest homolog, 27 (68%) have theirs within the kinetochore, while 12 (30%) have a closest outparalog involved in another eukaryotic cellular process, such as the ones discussed above, and only 1 has a prokaryotic closest homolog (Table 1). We inferred that the 55 domains result from 36 ancestral kinetochore units (‘anc_KT’ units), implying that intra-kinetochore gene duplications expanded the kinetochore by a factor of about 1.5. We observed relatively few domain fusions among LECA KT proteins: the only two unique kinetochore domain fusions appear to be represented by Mps1 and MadBub, whose TPR domains independently joined their kinase domains. Other fusions, such as those giving rise to Plk (kinase and POLO box) and Rod (WD40, NRH and Sec39), occurred before their divergence from their non-kinetochore outparalog (Plk4 in case of Plk, NAG in case of Rod), so these fusions were not kinetochore-specific.
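The proportions quoted in the two paragraphs above follow directly from the domain counts. As a minimal sketch (the variable names are ours and purely illustrative; only the counts come from the text), the arithmetic can be checked as follows:

```python
# Counts as reported in the text (summarising Table 1); names are illustrative.
total_domains = 55            # LECA kinetochore domains
common_domains = 40           # domains shared with other eukaryotic proteins
unique_domains = 15           # no detectable homology outside the kinetochore
closest_in_kinetochore = 27   # closest homolog is another kinetochore protein
closest_other_process = 12    # closest outparalog acts in another process
closest_prokaryotic = 1       # closest homolog is prokaryotic
ancestral_units = 36          # inferred 'anc_KT' units


def percent(part, whole):
    """Percentage rounded half-up to the nearest integer."""
    return int(100 * part / whole + 0.5)


# The categories should partition their parent sets.
assert common_domains + unique_domains == total_domains
assert (closest_in_kinetochore + closest_other_process
        + closest_prokaryotic) == common_domains

shares = {
    "common": percent(common_domains, total_domains),
    "unique": percent(unique_domains, total_domains),
    "intra-kinetochore": percent(closest_in_kinetochore, common_domains),
    "other process": percent(closest_other_process, common_domains),
}
# Expansion of the kinetochore through intra-kinetochore duplications.
expansion_factor = total_domains / ancestral_units

print(shares, round(expansion_factor, 2))
```

The expansion factor is simply the number of LECA domains divided by the number of inferred ancestral units.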

Discussion

We have here shown that the kinetochore largely consists of paralogous proteins that are affiliated to a variety of other eukaryotic cellular processes. Gene duplications played a key role in eukaryogenesis [76], as they contributed to the expansion of various other core eukaryotic structures, organelles and processes, such as the spliceosome [290], the intraflagellar transport complex [336], COPII [337] and the nuclear pore [73]. However, the role of duplications in the origin of the kinetochore is different from their role in membrane-specifying complexes, in which paralogs are mainly shared between the different organelles, rather than within them [338]. In tethering complexes, duplications generated proteins both within and between complexes [331]. The more ancient, pre-duplication origins of the kinetochore proteins seem diverse. Some proteins have prokaryotic homologs, while others do not. For the former, although certain biochemical functions (e.g. HORMA-TRIP13 interaction, histone-DNA interaction carried out by CenpA) may be conserved, most of these prokaryote-derived kinetochore proteins no longer perform a cellular function similar to that of their prokaryotic ancestors. The kinetochore therefore followed a different evolutionary trajectory between FECA and LECA than e.g. NADH:ubiquinone oxidoreductase (Complex I) [339]. The latter complex was directly derived from the alphaproteobacterium that became the mitochondrion (Figure 1), and it expanded between FECA and LECA by incorporating other proteins of different origins. The kinetochore also differs from, for example, the ubiquitination system, which seems largely derived from Archaea via vertical descent, and subsequently expanded via duplication [297]. Similarly, the membrane-trafficking system largely has archaeal roots [340]. Other eukaryotic systems, such as the spliceosome and the nuclear pore [73, 290], do not have functional prokaryotic ancestry, but prokaryotic sequences did contribute substantially to their core, in contrast to the kinetochore. However, the nuclear pore does resemble the kinetochore in having a mosaic origin.

The intra-kinetochore duplications suggest an evolutionary trajectory by which the kinetochore has expanded, which runs from homodimers to heterodimers via gene duplication [276]. A primordial kinetochore might have been composed of complexes that consisted of multimers of single ancestral proteins (ultimately the ancestral kinetochore units in Table 1). After these proteins duplicated, the resulting paralogs maintained the capacity to interact, resulting in a heteromer. For example, the Ndc80 complex might have consisted of a tetramer of two copies of an ancient CH protein, and two copies of an ancient RWD protein. According to this model, the proteins with shared domains within complexes should be most closely related to one another. This paradigm might hold for the Ska subunits, the CH-domain proteins, TBP-like proteins and the RWD proteins except for Csm1 and Mad1. The four subunits of the Mis12 complex might have resulted from a single ancestral protein, such that a homotetramer became a heterotetramer. However, the Nkp complex likely originated in a different manner, namely by co-duplication of the ancestral proteins giving rise to Mis12 and Nkp1, and to Nnf1 and Nkp2, in effect a duplication of a heterodimer (see Results, ‘Mis12/Nkp-like’). The CenpTSWX complex may have originated from two ancestral histone fold proteins (see Table 1: anc_KT_histone_1 and anc_KT_histone_2) that comprised a tetramer consisting of two homodimers. After these two ancestral proteins co-duplicated, the paralogs conserved their interactions and thus the heterotetramer CenpTSWX arose. If we carefully inspect the kinetochore architecture, we observe that many paralogous proteins are positioned along the inner-outer kinetochore axis (Figure 1B, dashed line). We speculate that not too long before LECA, the genes encoding the proteins along this axis duplicated concomitantly, possibly during a whole-genome duplication, giving rise to the complex LECA kinetochore architecture.

The LECA kinetochore contains protein domains that are unique to the kinetochore and therefore, by definition, unique to eukaryotes (15%), although we cannot exclude the existence of remote homologs of these proteins. New and more diverse genomes may allow for the detection of such distant homologs in the future. Kinetochore proteins that do share domains with other eukaryotic systems, such as the RWD and TPR, seem relatively strongly diverged in the kinetochore. For example, the TPR domains of Mps1 and MadBub seem more derived than those of the Anaphase Promoting Complex/Cyclosome (APC/C), which we also studied. This suggests that, after these domains got involved in the kinetochore, their sequences evolved more rapidly, and continued to do so after LECA [14]. Such an evolutionary acceleration may also have occurred in the ‘de novo’ proteins in the LECA kinetochore, causing homology detection to fail. How could an acceleration in sequence evolution at the kinetochore be explained? Possibly, it is related to the dynamic nature of the centromere. The sequences of centromeric DNA are poorly conserved across species [93]. Centromere organisation also varies greatly: in some species the centromere spans the full length of the chromosome (e.g. in Caenorhabditis elegans), while in others it is a sequence-specific, confined ‘point’ centromere (e.g. in Saccharomyces cerevisiae) [341].

In addition to finding the origins of kinetochore proteins, tracing the order in which these proteins or domains got involved in the kinetochore would be highly interesting. Was an early kinetochore perhaps composed only of the centromere- and microtubule-binding proteins, and was the CCAN (the ‘Cenp’ proteins), which serves as their bridge, added later? And would the centromere-binding proteins have evolved before or after the microtubule-binding ones? Relative timings of such additions could potentially shed light on the evolution of eukaryotic chromosome segregation. Although little is known about the evolution of the eukaryotic segregation machinery, it must be associated with the evolution of linear chromosomes, the evolution of the nucleus and of the eukaryotic cytoskeleton, including centrosomes. Garg & Martin argued that, because eukaryotic, microtubule-based chromosome segregation requires plenty of ATP, this type of chromosome segregation likely only became possible after acquisition of the mitochondrion (Figure 1A) [69].

Because currently no eukaryotes or ‘proto’-eukaryotes are known to segregate chromosomes in a pre-LECA manner, it remains hard to unravel which series of events gave rise to the spindle apparatus, the centromere and the kinetochore. On the prokaryotic side, metagenomics studies recently identified Asgard Archaea [9, 10], the closest archaeal relatives of eukaryotes currently known (Figure 1A). Maybe even more closely related prokaryotes will be discovered in the near future. Such species potentially contain features homologous or even orthologous to kinetochore components or eukaryotic chromosome segregation. New genomic sequences aided in reconstructing the evolution of the ubiquitin system [297] and the membrane trafficking system [340]. Similarly, such newly identified species may enhance our understanding of the pre-LECA evolution of the eukaryotic kinetochore and the chromosome segregation machinery.

Data and Methods

Profile-versus-profile searches

To find distant homologs of kinetochore proteins, we constructed HMM profiles and used genome-wide databases to apply profile-versus-profile searches. For each of the proteins included in our previous analysis [14], and for some more proteins we studied more recently (Nkp1, Nkp2, Csm1, Lsr4, Mam1, Hrr25), we aligned the sets of orthologous sequences (MAFFT, v.7.149b, [265], ‘einsi’ or ‘linsi’) and used these to construct HMM profiles (www.hmmer.org, version HMMER 3.1b1) and hhm-formatted profiles. For each protein, we made such a profile from the full-length alignment. In addition, if a given protein has well-annotated domains, we made separate profiles of these domains. While Zwint-1 has two RWD domains, we only used the first as a separate domain profile, because the second is very poorly conserved across species (Supplementary Figure 2). All HMM3 profiles can be found in Supplementary Files 1-147. We applied two different search strategies, using different tools and different search databases. For the first, we searched with full-length profiles of the kinetochore proteins against a database compiled of profiles from PANTHER11.1 [168] and the kinetochore protein profiles themselves. For this search, we made use of PRC (version 1.5.6) [21]. For the second strategy, we searched with domain profiles if available for a given protein, and otherwise with full-length profiles. We downloaded scop70 (March 1, 2016), pdb70 (September 14, 2016) and PfamA (31.0) profile databases from the HH-suite repository (ftp://toolkit.genzentrum.lmu.de/HH-suite/databases/hhsuite_dbs, downloaded on July 15, 2017) and combined these profiles with the kinetochore domain/full-length profiles we also used as queries. We searched using HHsearch (version 2.0.15, Soeding 2005) [13]. For each of the search strategies, we identified which of the database profiles correspond to the kinetochore domains/proteins. Using this information, we parsed the search results, identifying ‘best hits’ and ‘bidirectional best hits’ for each kinetochore domain/protein profile. For our network analysis, we used the results from the second (HHsearch-based) search strategy, applying an E-value cut-off of 1 or 10. We visualised the network in Cytoscape (version 3.5.1) [342]; the network can be found in Supplementary File 148. In this network, we also added the available cellular component GO terms to the hits using SIFTS [343]. In addition, we used information from the first (PRC-based) search strategy to trace the distant homology between subunits of the Mis12 and Nkp complexes (see also Supplementary Text). The results of both search strategies can be found in Supplementary Table 2.
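The parsing step that identifies ‘best hits’ and ‘bidirectional best hits’ can be illustrated with a minimal sketch. It assumes the search output has already been reduced to (query, target, E-value) triples; the function names and the E-values in the example are hypothetical and do not reproduce the actual HHsearch or PRC output.

```python
def best_hits(hits, max_evalue=1.0):
    """Map each query profile to its lowest E-value hit, ignoring self-hits.

    hits: iterable of (query, target, evalue) triples.
    """
    best = {}
    for query, target, evalue in hits:
        if query == target or evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (target, evalue)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(hits, max_evalue=1.0):
    """Return the pairs of profiles that are each other's best hit."""
    best = best_hits(hits, max_evalue)
    return {
        tuple(sorted((query, target)))
        for query, target in best.items()
        if best.get(target) == query
    }


# Hypothetical E-values, for illustration only.
example_hits = [
    ("Nnf1", "Nkp2", 0.5),
    ("Nkp2", "Nnf1", 0.8),
    ("Nnf1", "Mis12", 5.0),
    ("Mis12", "Nkp1", 12.0),  # above the cut-off of 10, so ignored
]
print(bidirectional_best_hits(example_hits, max_evalue=10.0))
```

With a cut-off of 10, only the Nnf1-Nkp2 pair in this toy input qualifies as a bidirectional best hit; the cut-off is a parameter so that both thresholds mentioned in the text (1 and 10) can be applied.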

Phylogenetic trees

For inference of the phylogenetic trees presented in this manuscript, we used a variety of methods. We collected homologs by searching with our tailor-made and Pfam HMM profiles (see ‘Profile-versus-profile searches’) against our local proteome database [14]. The first four letters of the eukaryotic sequences represent the species, of which the full names can be found in Supplementary Table 4. For the phylogenies of ZW10 and related tethering factors, HORMA, histones, RWD and TRIP13, we used a subset of the species in this database (e.g. species present in Supplementary Figure 4). The phylogeny of all eukaryotic kinases was based on sequence-based subsampling [344]. For the prokaryotic sequences in the HORMA, TRIP13, RWD/E2 and histone phylogenies, we performed phmmer/jackhmmer online and collected sequence hits from UniProt, in addition to the prokaryotic sequences reported by Burroughs et al. [39]. Multiple sequence alignments were inferred using MAFFT (v.7.149b, ‘einsi’ or ‘linsi’) [265], and trimmed with trimAl (1.2rev59, various options) [266]. Due to the high degree of sequence divergence for the UBC and histone fold-containing proteins, we constructed a super alignment of trusted trimmed orthologous group alignments using the function ‘merge’ of MAFFT (ginsi, unalignlevel 0.6). We manually scrutinized the resulting multiple sequence alignments (Supplementary Files 149 and 150) for clear misalignments based on structure-based alignments of available histone domains (see Supplementary Files 151 and 153). Trees were made using RAxML (version 8.0.20, automatic substitution model selection, GAMMA model of rate heterogeneity, rapid bootstrap analysis of 100 replicates) [281] or IQ-TREE (version 1.6.3, extended model selection, ultrafast bootstrap (1000) and SH-like approximate likelihood ratio test) [285]. The algorithm and substitution model used for each individual tree are reported in Supplementary Figure 1. Trees were visualised and annotated using FigTree [345].

Structural similarity and secondary structure prediction

To identify potential homologs based on structural similarity with LECA kinetochore proteins, we searched both the literature and databases such as PFAM (http://pfam.xfam.org [34]), ECOD (http://prodata.swmed.edu/ecod/ [346]), RCSB Protein Data Bank (http://rcsb.org [22]) and CATH (http://www.cathdb.info/ [347]). Structures were visualized and processed using the Python-based software package Pymol version 2.1.1 [348]. Structural alignments were either performed using ‘cealign’ and ‘super’, or directly downloaded from the aforementioned databases and/or the DALI webserver [349]. The information on the structures and various hyperlinks to databases that we consulted can be found in Supplementary Table 3. Pymol session files, containing most of the structures used for the comparison of RWD/UBC-like proteins, TBP-like domains and histones, are made available (Supplementary Files 151-153). Secondary structure predictions for Zwint-1 were performed using the JPRED webserver [350], embedded in the alignment package Jalview [351].

Classifications and interpretations of homologous protein families

We classified closest homologs as either eukaryotic or prokaryotic. A closest eukaryotic homolog is a paralog resulting from a gene duplication before LECA [76]. Eukaryotic homologs resulting from pre-LECA gene duplications are called ‘outparalogs’ [122] (see e.g. Figure 2A), as opposed to ‘inparalogs’, which are paralogs resulting from post-LECA duplications. We distinguished kinetochore outparalogs from outparalogs involved in other eukaryotic cellular processes. If a protein has a kinetochore protein as its closest outparalog, their ancestral, pre-duplication protein was likely already part of the primordial kinetochore. If two or more kinetochore proteins are most closely related to one another, they are thus inferred to descend from a single ancestral kinetochore protein, which we refer to as ‘anc_KT’ for ‘ancestral kinetochore unit’ (Table 1). The ‘anc_KT’ is the protein that got involved in the kinetochore, and then duplicated to give rise to the paralogous kinetochore proteins. This ancestral kinetochore protein might also have had a closest homolog (either eukaryotic or prokaryotic) outside of the kinetochore, which we also identified. If a LECA kinetochore protein has no closest outparalog in the kinetochore, the protein itself forms the ancestral kinetochore unit.
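The grouping rule described above, in which kinetochore proteins that are most closely related to one another are collapsed into one ancestral unit and all other proteins form a unit by themselves, amounts to finding connected components. The sketch below is only an illustration under that reading: the function name is ours, the input is a toy subset of proteins, and the actual assignments in Table 1 were made manually from trees, profile searches and structures.

```python
def ancestral_units(proteins, closest_kt_pairs):
    """Group kinetochore proteins into 'anc_KT'-style units.

    proteins: all LECA kinetochore proteins under consideration.
    closest_kt_pairs: pairs that are each other's closest outparalog
        within the kinetochore. Proteins without such a pair form a
        unit of their own.
    """
    parent = {protein: protein for protein in proteins}

    def find(protein):
        # Follow parent links to the representative, compressing the path.
        while parent[protein] != protein:
            parent[protein] = parent[parent[protein]]
            protein = parent[protein]
        return protein

    for a, b in closest_kt_pairs:
        parent[find(a)] = find(b)

    units = {}
    for protein in proteins:
        units.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(unit) for unit in units.values())


# Toy input: a small subset of proteins, chosen for illustration only.
toy_proteins = ["CenpL", "CenpN", "Mis12", "Nkp1", "ZW10"]
toy_pairs = [("CenpL", "CenpN"), ("Mis12", "Nkp1")]
print(ancestral_units(toy_proteins, toy_pairs))
```

In this toy input ZW10 forms its own unit, because its closest outparalog (COG5) lies outside the kinetochore.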

Author contributions

JH and ET performed the research. JH, ET, BS and GK conceived the project and wrote the manuscript.

Acknowledgments

We thank Leny van Wijk for providing the phylogenetic tree of eukaryotic kinases and for helping to construct the eukaryotic proteome database, for which we also thank John van Dam. We thank the members of the Kops and Snel labs for helpful discussions on the research.

Supplementary Text

Detecting kinetochore homologs using different resources

To complete our picture of the origin of kinetochore proteins, we made use of four different sources of information: phylogenetic trees, searches among HMM profiles and structural information. These different information types result in different qualifications of relationships between pairs of (suspected) homologs (Table 1). In general, the profile-profile searches are in agreement with the relationships observed from the phylogenies (Supplementary Table 2). However, we noted that various kinetochore proteins, such as RWD proteins and Cep57, hit coiled-coil proteins. In general, the hits we identified as likely coiled-coil were ignored, because coiled-coil similarity might not be indicative of homology, since it could also evolve convergently [188].

Proteins in LECA kinetochore & alternative rooting

To determine which proteins were present in the LECA kinetochore (Figure 1B), we first inferred for each protein whether it was likely encoded in the genome of LECA. In principle, we did so based on Dollo parsimony, which states that a protein can only be invented once, hence the origin of the protein dates back to the last common ancestor of all species that have it. In applying this approach, we assume that the divergence between Opimoda and Diphoda represents the root of the eukaryotes (Supplementary Figure 4) [31]. While we are well aware of the controversies about the position of the eukaryotic root [353], we think that alternative rootings would not alter our model of the LECA kinetochore. If, for example, the root actually lies between (a subset of) Excavata and all other eukaryotes, as proposed by [33], the presence of kinetochore proteins in this lineage would also support their presence in LECA.

This is the case for the CCAN subunits (‘Cenp’ proteins, Nkp1 and Nkp2, Figure 1B), because one of the Excavata species (Trichomonas vaginalis) contains CenpX and CenpS [354]. These proteins likely result from duplications, and their outparalogs are CenpW and CenpT, respectively. Hence, the common ancestor of the Excavata and the other eukaryotes (LECA) likely had these four CCAN components. Given that the CCAN subunits strongly co-evolve, LECA likely had the complete CCAN, also under this alternative root. The Dam1 complex would have been inferred to have been present in LECA based on Dollo parsimony, but we think it is very likely that its genes were invented later in evolution and got horizontally transferred among distantly related eukaryotic species [355]. Nkp1 and Nkp2 would not have been found to have been present in LECA under Dollo parsimony. However, we argue that, because they are homologous to subunits of the Mis12 complex (Figure 4A) and these subunits are present across the eukaryotic tree of life, Nkp1 and Nkp2 likely resulted from ancient duplications before LECA, giving rise to Nkp1 and Mis12, and to Nkp2 and Nnf1. Nkp1 and Nkp2 were lost in major eukaryotic lineages quickly after LECA, because we do not observe them in Diphoda lineages (SAR, Archaeplastida and Excavata, Supplementary Figure 4). Moreover, Nkp1 and Nkp2 are part of the CCAN, which, as we pointed out, strongly co-evolves, which also holds for Nkp1 and Nkp2 [354]. If a protein was likely encoded by LECA, we in principle expect it to be part of the LECA kinetochore. We nevertheless exclude such a protein from the LECA kinetochore if it depends on a non-LECA protein for the kinetochore function (Hrr25), or if it seems more likely to be involved in another process, as indicated by characterizations in multiple species (Skp1). The complete list of kinetochore proteins studied and considerations for in/excluding them as part of the LECA kinetochore can be found in Supplementary Table 1.
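The default Dollo parsimony criterion used in this section, with the root placed between Opimoda and Diphoda, reduces to a simple presence test. The sketch below is a minimal illustration under that assumption; the function name and the four-letter species codes are hypothetical, and the manual exceptions discussed above (Dam1, Nkp1 and Nkp2, Hrr25, Skp1) are deliberately not encoded.

```python
def inferred_in_leca(species_with_protein, supergroup_of):
    """Dollo parsimony with the root between Opimoda and Diphoda.

    A protein is inferred to have been encoded by LECA if at least one
    ortholog is found on each side of the assumed root.
    """
    sides = {supergroup_of[species] for species in species_with_protein}
    return {"Opimoda", "Diphoda"} <= sides


# Hypothetical species codes mapped to the two sides of the root.
supergroup_of = {
    "HSAP": "Opimoda",   # Homo sapiens
    "SCER": "Opimoda",   # Saccharomyces cerevisiae
    "ATHA": "Diphoda",   # Arabidopsis thaliana (Archaeplastida)
    "TVAG": "Diphoda",   # Trichomonas vaginalis (Excavata)
}

print(inferred_in_leca({"HSAP", "TVAG"}, supergroup_of))  # both sides present
print(inferred_in_leca({"HSAP", "SCER"}, supergroup_of))  # Opimoda only
```

Under an alternative root, only the mapping from species to the two sides of the root would change; the test itself stays the same.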

Domain annotation of Mis12/Nkp-like

In this study, we present the subunits of the Mis12 complex (Mis12, Nnf1, Dsn1, Nsl1) and of the Nkp complex (Nkp1, Nkp2) to be all homologous to one another, having a domain we coin “Mis12/Nkp-like” (Figure 1, 4A). We inferred their homology using different sources. First, and most convincingly, Nnf1 and Nkp2 appear homologous by being each other’s bidirectional best hit in the profile-profile output (Supplementary Table 2, HHsearch Evalue 10). The Nkp1 and Mis12 full-length profiles hit each other as best hits with PRC (Supplementary Table 2, PRC Evalue 10). In the same search, the Nkp1 profile hit the Nnf1 profile, and the Nsl1 profile hit the Mis12 profile. Moreover, the profile of Nkp1 hits that of Mis12 in HHpred online [356], albeit at a very high E-value: 180. The structures of the Mis12 subunits were already shown to be similar [23, 24], therefore we propose their homology as well. If the Mis12 subunits are homologous to one another, and two Mis12 complex subunits are homologous to the two Nkp subunits, all six of these proteins are homologous. Moreover, since we have no indications for other (prokaryotic or eukaryotic) homologs, we infer that the Mis12/Nkp-like domain was invented before the last eukaryotic common ancestor and gave rise to these six kinetochore proteins via gene duplications.

Double RWD domain in Zwint-1 orthologs

In our sensitive profile-versus-profile analysis, various kinetochore RWD proteins hit each other, as well as other RWD-like and E2 proteins, indicating that their sequences were sufficiently similar to confirm their homology, with the notable exception of the RWD domains of CenpP (full-length profiles are hit, see Supplementary Table 2). Interestingly, Zwint-1, the only KMN (Knl1-Mis12-Ndc80) network subunit for which a structure has not yet been determined [242], was hit by various RWD profiles. Indeed, upon further inspection of the predicted secondary structure of Zwint-1 orthologs, we found they follow a classic tandem RWD topology (Supplementary Figure 2), similar to its direct interaction partner Knl1 and the CenpO-CenpP dimer. Since all kinetochore RWD proteins form dimers through RWD-RWD interactions and since the main interactor of Zwint-1 is the double RWD protein Knl1, we predict that Zwint-1 is a bona fide RWD kinetochore protein.

RWD-like/UBC evolution

Due to the highly divergent sequence evolution of kinetochore RWD-like proteins (KT_RWD) and other non-catalytic UBCs (RWD and for instance UEV/TSG101), we could only construct a short alignment (90 positions, minimal 30% column occupancy) for the whole UBC family. Because of the limited number of phylogenetically informative positions, some of the parameters for the maximum likelihood methods may be overfitted and the resulting phylogenetic tree is therefore likely subject to artefacts such as long-branch attraction (LBA, see for instance ‘divergent UBC’ and ‘bacteria_E’), resulting in the misplacement of divergent branches and distortion of the overall tree topology. Nonetheless, overall the UBC superfamily evolved into two distinct groups in eukaryotes: E2 ubiquitin conjugases (UBC, support=77) and two non-catalytic UBC-like groups (RWD and RWD_KT; support=96). The inconsistent placement of the second RWD of Knl1 and Zwint-1 (Knl1_2 and Zwint-1_2) precluded a solid conclusion for a single origin of KT_RWD. Whether this means that Knl1_2 and Zwint-1_2 were independently acquired compared to CenpO_2 and CenpP_2, or signifies a shared and more complex origin of KT_RWD and RWD, is unclear. Likely the origin of KT_RWD is closely related to FancL, the only other double RWD protein that is currently found in eukaryotic genomes, and Med15, a mediator complex subunit for which we here uncovered the presence of an RWD-like domain in the C-terminus. Whether the RWD-like proteins are of archaeal or bacterial descent is not clear from our analyses. Many different classes of modification systems that operate UBC-like folds are also present in Bacteria (bact_UBC_A-D) and even a non-catalytic domain is found in some lineages (bact_UBC_E) [294]. Various non-catalytic UBC proteins, however, can also be found in the Asgard Archaea (closest to the archaeal ancestor of eukaryotes, Figure 1A) [9], which are phylogenetically affiliated with both RWD and UBC (Supplementary Figure 1E). Therefore, an RWD-like protein could have already been present in the archaeal ancestor of eukaryotes. Given that UBCs of archaeal descent extensively radiated in eukaryotes [297], we think it is likely that RWD and KT_RWD were also part of this radiation and are thus of archaeal descent.
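The column occupancy criterion mentioned for the UBC alignment (columns retained only if at least 30% of the sequences have a residue) can be sketched as follows. This is an illustrative stand-in, not the trimAl procedure that was actually applied; the function name, the gap characters and the toy alignment are assumptions.

```python
def filter_columns(alignment, min_occupancy=0.3, gap_chars="-."):
    """Keep alignment columns in which enough sequences have a residue.

    alignment: dict of name -> aligned sequence (all of equal length).
    min_occupancy: minimal fraction of non-gap characters per column.
    """
    sequences = list(alignment.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")

    keep = [
        i for i in range(length)
        if sum(seq[i] not in gap_chars for seq in sequences) / len(sequences)
        >= min_occupancy
    ]
    return {
        name: "".join(seq[i] for i in keep)
        for name, seq in alignment.items()
    }


# Toy alignment: the third column is all gaps and is removed.
toy_alignment = {"a": "MK-L", "b": "M--L", "c": "M--I", "d": "MR-V"}
print(filter_columns(toy_alignment))
```

A lower threshold retains more columns at the cost of more gaps per column, which matters when, as here, only a short alignment of a highly divergent family is available.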

Reconstruction of histone fold evolution Although the canonical histones are amongst the most highly conserved eukaryotic proteins at the amino acid level, most other histones in eukaryotes are highly divergent, including the TFIID-related TBP-associated factors (TAF), SAGA-related proteins (SUPT), 132 Chapter 5 Mosaic origin of the eukaryotic kinetochore 133

CCAAT-binding complex/nuclear transcription factor (NFY), Negative coregulator 2 (NC2), subunits of the DNA polymerase epsilon (DPOE), Chromatin Accessibility Complex (CHRAC) and the kinetochore histones (CenpA, CenpS, CenpT, CenpX, CenpW) (see Supplementary Figure 1I). To produce an informative alignment, we fi rst made separate alignments of slowly evolving orthologs of LECA histone proteins, which we subsequently aligned (corrected based on structural alignments, see ‘Data and Methods’). In addition, we added archaeal and bacterial histone-like sequences, which we acquired through jackhmmer runs against archaeal and bacterial UniProt databases (see ‘Data and Methods’), using known archaeal histones such as HMf [305], the reported Asgard histone-like proteins [9] and the ‘DUF1931’ protein family as queries. Due to the limited amount of positions (69) and highly divergent nature of the histone family, we could not resolve histone evolution as different algorithms (RAxML and IQTREE) and various models gave inconsistent results: (1) a number of orthologous groups (H2B, TAF3/8/SUPT7, TAF12) appeared at different positions in the tree, (2) the duplication order within for instance the TAF clade was often different with various low bootstrap support values, (3) the position of bacterial and archaeal taxa varied, and (4) the exact placement of CenpX and CenpW relative to each other changed. We here present one of the trees in which CenpX and CenpW are each other’s closest outparalog (Supplementary Figure 1I). In Archaea and Bacteria, histones are found that are affl iated to a number of different eukaryotic histone groups. The presence of a high number of bacterial histone-like proteins surprised us. Although our analyses did not give a consistent result, it is likely that histone-like proteins in Bacteria were acquired through horizontal gene transfers from either archaeal or eukaryotic lineages. 
In general, we observe that CenpX and CenpW cluster together with one of two major histone groups: (1) TAF11, H2A, NFY, NC2, DPOE and CHRAC1, while CenpS, CenpT and CenpA are more similar to (2) H3, H4 and all the other TAFs and SUPTs (not TAF11). The duplication of CenpA and H3 is not always supported and in a number of cases CenpA branches from within the H3 clade. The duplication of CenpT and CenpS is overall well supported (bootstrap support: 87-99). The position of CenpX and CenpW varied. The various trees suggested a closest outparalog for CenpX, i.e. TAF11, H2A, NC2A and a bacterial clade, and for CenpW, i.e. TAF12 or NC2A/NFY. Since many histones dimerize, our analyses could reveal ancient (co-)duplications. Apart from the duplication of the kinetochore histones, the NFYA-NFYB, NC2A-NC2B and DPOE3-DPOE4/CHRAC1 histones seem to originate from an internal duplication as well. The order of duplication for the TFIID and SAGA complex TAF/SUPTs could not be easily reconciled with the known dimer pairs (see Supplementary Figure 1I: TAF3/8-TAF10, TAF4-TAF12, TAF6-TAF9, TAF11-TAF13, SUPT7-SUPT3, SUPT3-TAF10 and ADA1-TAF12).

Supplementary Material

Supplementary Figure 1A-I (phylogenetic trees of kinetochore proteins and related homologs), Supplementary Tables 2-4 and Supplementary Files 1-153 can be found online: bioinformatics.bio.uu.nl/jolien/thesis/chapter5_mosaic_origin_eukaryotic_kinetochore/

Supplementary Table 1. Kinetochore proteins from human and/or yeast and their inferred presence in the LECA kinetochore (Figure 1B). For each protein we determined the orthologs across eukaryotic species [14] and determined whether it was encoded by the LECA genome (a ‘LECA protein’), based on Dollo parsimony (present in both Opimoda and Diphoda, Supplementary Figure 4), or based on the inference of a pre-LECA duplication that gave rise to this protein. In addition, we assessed how likely this protein was part of the LECA kinetochore (‘LECA KT protein’).

| Protein | Model species (h: human, y: budding yeast) | Kinetochore complex | Known domains | Supergroup presence (Opis, Amoe, Exca, SAR, Arch) | LECA protein (yes/no, reason) | LECA KT protein (yes/no, reason) |
|---|---|---|---|---|---|---|
| Mad1 | h, y | Mad1-Mad2 | RWD [357] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| Mad2 | h, y | Mad1-Mad2, MCC | HORMA [358] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| Bub3 | h, y | MCC | WD40 [359] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| Cdc20 | h, y | MCC | WD40 [360] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| MadBub | h, y | MCC | TPR, kinase [361] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| Mps1 | h, y | | TPR, kinase [198, 362] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| p31 | h | Mad2-Mad2, MCC | HORMA [363] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| TRIP13 | h | Mad2-Mad2, MCC | AAA+ ATPase [20, 364, 365] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| Knl1 | h, y | Knl1-Zwint-1 | RWD (2x) [242] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| Zwint-1 | h, y | Knl1-Zwint-1 | RWD (2x) [242, 354] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| Dsn1 | h, y | Mis12-C | Mis12/Nkp-like [23] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| Nsl1 | h, y | Mis12-C | Mis12/Nkp-like [23] | Opis, Amoe, SAR, Arch | yes: parsimony | yes |
| Nnf1 | h, y | Mis12-C | Mis12/Nkp-like [23, 24] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| Mis12 | h, y | Mis12-C | Mis12/Nkp-like [23, 24] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| CEP57 | h | | | Opis, Exca, SAR | yes: parsimony | yes |
| ARHGEF17 | h | | RhoGEF [366] | Opis | no: parsimony | no |
| Ndc80 | h, y | Ndc80-C | CH [316, 367] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| Spc24 | h, y | Ndc80-C | RWD [368] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| Spc25 | h, y | Ndc80-C | RWD [368] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| Nuf2 | h, y | Ndc80-C | CH [316, 367] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| Ska1 | h | Ska-C | Ska-like [355] | Opis, Amoe, SAR, Arch | yes: parsimony | yes |
| Ska2 | h | Ska-C | Ska-like [355] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| Ska3 | h | Ska-C | Ska-like [355] | Opis, Amoe, SAR, Arch | yes: parsimony | yes |
| Dam1 | y | Dam1-C | | Opis, SAR, Arch | no: HGT [355] | no |
| Duo1 | y | Dam1-C | Duo2-Dad2-like [355] | Opis, SAR, Arch | no: HGT [355] | no |
| Dad2 | y | Dam1-C | Duo2-Dad2-like [355] | Opis, SAR, Arch | no: HGT [355] | no |
| Dad1 | y | Dam1-C | Dad1/4-Ask1-like [355] | Opis, Amoe, SAR, Arch | no: HGT [355] | no |
| Dad3 | y | Dam1-C | | Opis, SAR, Arch | no: HGT [355] | no |
| Dad4 | y | Dam1-C | Dad1/4-Ask1-like [355] | Opis, SAR, Arch | no: HGT [355] | no |
| Hsk3 | y | Dam1-C | | Opis, SAR | no: HGT [355] | no |
| Ask1 | y | Dam1-C | Dad1/4-Ask1-like [355] | Opis, SAR, Arch | no: HGT [355] | no |
| Spc19 | y | Dam1-C | | Opis | no: parsimony | no |
| Spc34 | y | Dam1-C | | Opis, SAR, Arch | no: HGT [355] | no |
| SKAP | h | SKAP-Astrin | | Opis | no: parsimony | no |
| Astrin | h | SKAP-Astrin | | Opis | no: parsimony | no |
| Spindly | h | RZZS | | Opis | no: parsimony | no |
| Rod | h | RZZS | NRH, Sec39, WD40 [369] | Opis, Exca, SAR | yes: parsimony | yes, but without Spindly, it is unclear how it was recruited to the KT |
| Zwilch | h | RZZS | | Opis, SAR | yes: parsimony | yes, but without Spindly, it is unclear how it was recruited to the KT |
| ZW10 | h | RZZS | Vps51 [370] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes, but without Spindly, it is unclear how it was recruited to the KT |
| Aurora | h, y | CPC | Kinase [371] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| Incenp | h, y | CPC | | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| Survivin | h, y | CPC | Baculoviral IAP repeat (BIR) [372] | Opis, Exca, SAR | yes: parsimony | yes |
| Borealin | h | CPC | | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| Sgo | h, y | CPC | | Opis, Amoe, SAR, Arch | yes: parsimony | yes |
| BugZ | h | | Zinc finger | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| Plk | h, y | | Kinase, polo box [373, 374] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| CenpA | h, y | Histone | Histone [375] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony and phylogeny (van Hooff 2017) | yes |
| CenpB | h | CCAN | TC5-DDE (TE) [376] | Opis | no: parsimony | no |
| CenpC | h, y | CCAN | Cupin, pyrin [23, 319, 377] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| CenpF | h | | | Opis | no: parsimony | no |
| CenpE | h | | Kinesin [378] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | yes |
| CenpH | h, y | CCAN-HIKM | | Opis, Amoe, SAR | yes: parsimony | yes |
| CenpI | h, y | CCAN-HIKM | HEAT-like repeats [379] | Opis, Amoe, SAR | yes: parsimony | yes |
| CenpK | h, y | CCAN-HIKM | | Opis, Amoe, SAR | yes: parsimony | yes |
| CenpM | h | CCAN-HIKM | GTPase [379] | Opis, Amoe, SAR | yes: parsimony + pre-LECA duplicate | yes |
| CenpL | h, y | CCAN-LN | TBP-like [319] | Opis, Amoe, SAR | yes: parsimony | yes |
| CenpN | h, y | CCAN-LN | TBP-like, pyrin [319] | Opis, Amoe, SAR | yes: parsimony | yes |
| CenpO | h, y | CCAN-OPQRU | RWD (2x) [292] | Opis, Amoe, SAR, Arch | yes: parsimony + pre-LECA duplicate | yes |
| CenpP | h, y | CCAN-OPQRU | RWD (2x) [292] | Opis, Amoe, SAR | yes: parsimony + pre-LECA duplicate | yes |
| CenpQ | h, y | CCAN-OPQRU | | Opis, Amoe, SAR | yes: parsimony | yes |
| CenpR | h | CCAN-OPQRU | | Opis | no: parsimony | no |
| CenpU | h, y | CCAN-OPQRU | | Opis, Amoe, SAR | yes: parsimony | yes |
| Nkp1 | y | CCAN-Nkp | Mis12/Nkp-like | Opis | yes: predicted pre-LECA duplicate | yes |
| Nkp2 | y | CCAN-Nkp | Mis12/Nkp-like | Opis, Amoe | yes: predicted pre-LECA duplicate | yes |
| CenpT | h, y | CCAN-TSWX | Histone [314, 380] | Opis, Amoe, SAR | yes: parsimony + pre-LECA duplicate | yes |
| CenpW | h, y | CCAN-TSWX | Histone [314, 380] | Opis, Amoe, SAR | yes: parsimony + pre-LECA duplicate | yes |
| CenpS | h, y | CCAN-TSWX | Histone [314] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony + pre-LECA duplicate | yes |
| CenpX | h, y | CCAN-TSWX | Histone [314] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony + pre-LECA duplicate | yes |
| Ndc10 | y | CBF3 | GCR1_C, Crypton F-like [381] | Opis | no: parsimony | no |
| Ctf13 | y | CBF3 | F-box [382] | Opis | no: parsimony | no |
| Cep3 | y | CBF3 | Zinc finger, HEAT repeat [383] | Opis | no: parsimony | no |
| Skp1 | y | CBF3 | | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | no: likely operated in SCF ubiquitin ligase complex and was recruited to the KT with the CBF3 complex |
| Csm1 | y | Monopolin | RWD [384] | Opis, Amoe, SAR, Arch | yes: parsimony | yes |
| Lsr4 | y | Monopolin | | Opis | no: parsimony | no |
| Mam1 | y | Monopolin | | Opis | no: parsimony | no |
| Hrr25 | y | Monopolin | Kinase [385] | Opis, Amoe, Exca, SAR, Arch | yes: parsimony | no: likely KT function since the origin of Lsr4, Mam1 (Saccharomycetales) |
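The parsimony call in the 'LECA protein' column can be illustrated with a minimal sketch. It assumes, reading from Supplementary Figure 4, that Opisthokonta and Amoebozoa lie on the Opimoda side of the root and that Excavata, SAR and Archaeplastida lie on the Diphoda side; the function name and the dictionary layout are hypothetical and not part of the original analysis.

```python
# Supergroup abbreviations as used in Supplementary Table 1.
# Assumption: this split of supergroups over the two sides of the root
# is read off Supplementary Figure 4.
OPIMODA = {"Opis", "Amoe"}
DIPHODA = {"Exca", "SAR", "Arch"}


def leca_by_parsimony(supergroups):
    """Return True if a protein occurs on both sides of the Opimoda-Diphoda root."""
    present = set(supergroups)
    return bool(present & OPIMODA) and bool(present & DIPHODA)


# Presence profiles copied from Supplementary Table 1.
profiles = {
    "Mad1": ["Opis", "Amoe", "Exca", "SAR", "Arch"],
    "CEP57": ["Opis", "Exca", "SAR"],
    "Spc19": ["Opis"],
    "Nkp2": ["Opis", "Amoe"],
    "Dam1": ["Opis", "SAR", "Arch"],
}

calls = {name: leca_by_parsimony(groups) for name, groups in profiles.items()}
for name, call in calls.items():
    print(name, "yes" if call else "no")
```

The rule alone does not reproduce every entry in the table. Dam1 passes the presence test but is listed as 'no' because of inferred horizontal gene transfer [355], and Nkp2 fails it but is listed as 'yes' because of a predicted pre-LECA duplication.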

Supplementary Figure 1. Phylogenetic trees of domains present in kinetochore proteins. Kinetochore proteins are indicated in green, closest homologs (e.g. eukaryotic outparalogs) are indicated in blue. Details on tree inference methods can be found in Supplementary Text, Data and Methods. The sequences were obtained from our local proteome database. The first four letters of the protein name indicate the species as listed in Supplementary Table 4. A. Vps51 tethering complex subunits (ZW10). B. WD40 (Cdc20). C. WD40 (Bub3). D. Kinases (MadBub, Mps1, Aurora, Plk). E. RWD (Zwint-1, Knl1, CenpO, CenpP, Spc24, Spc25, Mad1, Csm1). F. HORMA (Mad2, p31comet). G. AAA+ ATPase (TRIP13). H. NRH-Sec39 (Rod). I. Histones (CenpA, CenpT, CenpS, CenpX, CenpW); support is shown for internal nodes that correspond to the origin of new orthologous groups up to the LECA level. Supplementary Figure 1 can be found online: http://bioinformatics.bio.uu.nl/jolien/thesis/chapter5_mosaic_origin_eukaryotic_kinetochore/

[Supplementary Figure 2, image not reproduced. Panel A: predicted secondary structure (α-helical, β-sheet, coiled-coil) of human Zwint-1, budding yeast Kre28 and fission yeast Sos7. Panel B: condensed alignment of Zwint-1 orthologs, with supergroups (Opisthokonta, Amoebozoa, Excavata, SAR, Archaeplastida, unknown), the RWD 1 and RWD 2 regions of the predicted zebrafish sequence, and clades marked 'Loss RWD 2' or 'Loss RWD 1 + 2'.]

Supplementary Figure 2. Recurrent loss of RWD domains during the evolution of Zwint-1. A. Secondary structure prediction of three Zwint-1 orthologs, Zwint-1 (human), Kre28 (budding yeast) and Sos7 (fission yeast), reveals a highly divergent C-terminal region. B. Overview of a multiple sequence alignment of 83 Zwint-1 orthologs that was based on the manually predicted Zwint-like sequence in zebrafish. The colors of the alignment are based on the classic Clustal colouring scheme and the alignment is condensed so that the letters of the residues are no longer visible. On the left, the small colored blocks indicate the supergroup to which each species belongs. The red and blue clades indicate which of the orthologs lost the RWD-2 domain or both the RWD-1 and RWD-2 domains, respectively.

[Supplementary Figure 3, image not reproduced. Panel A: annotated phylogenetic tree of the UBC/RWD family, showing RWD, KT_RWD, eukaryotic UBC, archaeal and bacterial clades with ultra-fast bootstrap support (0-100). In-figure notes on the KT_RWD clade: (1) possible single origin of all KT_RWD; (2) no clear order of duplications; (3) related to FANCL and RWD; (4) KNL1_2 and ZWINT-1_2 are highly divergent with YPxxxP motif; (5) the bacteria_E clade seems to be misplaced. Panel B: scenario for the origin of the RWD, UBC and KT_RWD groups between FECA and LECA.]

Supplementary Figure 3. UBC family evolution. A. Fully annotated phylogenetic tree of the UBC family (based on Supplementary Figure 1E). The curly brackets denote other domains present and general function for RWD and KT_RWD proteins, while for UBC-like proteins they denote potential substrates. See Supplementary Text, section ‘RWD-like/UBC evolution’ for discussion. B. Scenario of the evolution of the UBC family: the RWD-like and UBC proteins in eukaryotes are likely descended from a prokaryotic ubiquitin-like modification system that was streamlined by lineages that are phylogenetically affiliated with the archaeal ancestor of eukaryotes. Subsequent duplications and sub/neofunctionalization gave rise to three groups of which the latter are likely highly related: (1) RWD (E2/E3-associated proteins), (2) UBC (bona fide E2 ubiquitin conjugases) and (3) RWD-like kinetochore proteins (KT_RWD), which are likely the closest outparalog to the double RWD protein FancL. The numbers in light/dark blue correspond to the RWD configuration present in each UBC/RWD group (single versus double).

[Supplementary Figure 4, image not reproduced. Eukaryotic species tree rooted at LECA between Opimoda and Diphoda. Supergroup labels: Opisthokonta, Amoebozoa, Excavata*, SAR, Archaeplastida. Tip labels: Metazoa, Capsaspora owczarzaki, Fungi, Nuclearia sp., Acanthamoeba castellanii, Acytostelium subglobosum, Naegleria gruberi, Euglena gracilis, okamuranus, Aplanochytrium kerguelense, Bigelowiella natans, Embryophyta, Klebsormidium flaccidum, Guillardia theta.]

Supplementary Figure 4. Phylogeny of eukaryotes. Small version of eukaryotic species tree with Opimoda-Diphoda root and eukaryotic supergroups. This topology was used to infer whether a protein was likely present in LECA (Supplementary Table 1, Figure 1B).

6. Timing large-scale duplications during eukaryogenesis suggests relatively recent origins of eukaryote-specific proteins

Jolien JE van Hooff, Anne van Vlimmeren, Julian Vosseberg, Marina Marcet-Houben, Geert JPL Kops, Toni Gabaldón and Berend Snel

Manuscript in preparation

Abstract

Compared to prokaryotes, eukaryotes harbor a large intracellular complexity and unique features such as a nucleus, meiotic sex and genes with spliceosomal introns. This difference implies a major evolutionary transition (‘eukaryogenesis’) that required an increase in genomic complexity. The genome of the pre-eukaryotic lineage expanded through different processes, including the incorporation of genes from the mitochondrial endosymbiont, horizontal gene transfer, gene genesis and duplication. Previous work demonstrated that gene duplications nearly doubled the genome, thereby shaping and fine-tuning many eukaryotic cellular features. Recently, a phylogenomics study shed further light on eukaryogenesis by timing the endosymbiosis that gave rise to mitochondria. Using their method, we here examined which genes duplicated, and when they duplicated. We found that eukaryote-specific genes duplicated most, compared to genes from archaeal, bacterial or alphaproteobacterial (i.e. mitochondrial) descent. Moreover, these duplicated more recently than the entry of mitochondrial genes, indicating that eukaryogenesis continued after mitochondrial endosymbiosis. The paralogous pairs that we studied appear to evolve asymmetrically, with one paralog evolving more rapidly after the duplication than the other, suggesting it acquired a new function. Together, we here provide a framework for examining how and when gene duplications contributed to cellular complexity during eukaryogenesis.

Introduction

Compared to prokaryotes, eukaryotic cells are tremendously complex. Not only are they on average 1000-fold larger, they also accommodate intricate intracellular features that prokaryotes do not have: a nucleus and other membrane-based organelles such as the Golgi apparatus and the endoplasmic reticulum (ER), an actin-tubulin-based cytoskeleton, and mitochondria. Moreover, the genome of a typical eukaryote contains approximately four times as many genes [68]. Understanding the evolution of a complex, eukaryotic cell from simple prokaryotic ancestors is one of the major goals of evolutionary biology. It is now widely accepted that all current-day eukaryotes are descendants of a single ancestor (the last eukaryotic common ancestor, LECA), which already had all key eukaryotic features [386]. Moreover, the origin of the mitochondrion is well established: it originated from the endosymbiosis of an Alphaproteobacteria-related bacterium into an Archaea-related host (Figure 1). The lineage that diverged from Archaea to give rise to eukaryotes is called the first eukaryotic common ancestor (FECA). Recently, major advances have been made in identifying which prokaryotes were involved in this endosymbiosis. The Archaea-related host was discovered to be related to the newly identified Asgard superphylum [9], and the endosymbiont was likely an early-branching lineage within Alphaproteobacteria, or maybe even a sister lineage of this clade [64].

[Figure 1, image not reproduced. Schematic of the transition from Archaea and α-Proteobacteria via FECA to LECA. Cellular evolution: Golgi, endosomes, cytoskeleton, nucleus, telomeres, ER, mitochondrion, RNA processing. Genome evolution: bacterial HGT, α-proteobacterial acquisition, eukaryotic invention, and duplication of genes with archaeal, bacterial, α-proteobacterial or eukaryotic origin.]

Figure 1. Cellular and genome evolution during eukaryogenesis. Eukaryotes evolved from their prokaryotic ancestors through divergence of an archaeal lineage. Directly after this divergence, we term this lineage the first eukaryotic common ancestor (FECA). During the evolution of the eukaryotes, an alphaproteobacterial-related species was incorporated through endosymbiosis, and this endosymbiont gave rise to mitochondria. In addition to mitochondria, various other eukaryotic features originated, such as the nucleus, the intracellular membrane system and linear chromosomes with telomeres. The evolution of these features is underpinned by the evolution of the genome. In addition to genes derived by vertical descent from Archaea and genes from the alphaproteobacterial endosymbiont, the pre-eukaryotic lineage acquired genes from a range of bacterial donors, possibly via ‘regular’ horizontal gene transfer or via additional endosymbionts. Furthermore, genes arose de novo during eukaryogenesis. Genes from all these different origins duplicated, giving rise to the large and complex genome of the last eukaryotic common ancestor (LECA), the lineage from which all current-day eukaryotes descended.

The origination of eukaryotes, referred to as ‘eukaryogenesis’, is tightly coupled to, and can be studied by, the evolution of the genome between FECA and LECA (Figure 1). In addition to the endosymbiont, genes were acquired from various other prokaryotic (mainly bacterial) lineages via ‘regular’ horizontal gene transfer. Such non-alphaproteobacterial genes also promoted eukaryogenesis, for example by providing the building blocks of the nuclear pore [73]. Moreover, novel folds arose during eukaryogenesis, in genes that have no homolog in prokaryotes at all [74]. Genes from different origins underwent loss, domain fusion and, importantly, gene duplication, in a process referred to as a Biological Big Bang [387]. Gene duplications significantly enlarged the pre-eukaryotic genome [8] and enabled the evolution of eukaryotic features. Gene families such as Rab GTPases, Ras-like GTPases, kinases and kinesins greatly expanded via gene duplications, which enabled eukaryotes to, for instance, employ an elaborate intracellular signaling network, vesicular trafficking and cytoskeleton [146, 175, 388].

Eukaryogenesis will be better understood by determining the relative order in which eukaryotic features and the eukaryotic gene complement emerged, as this order might provide clues about causality. Unraveling this order is however very complicated, because there are no surviving ‘intermediate’ life forms that reveal which features came first. Recently, Pittis & Gabaldón developed a phylogenomics approach to unravel the order of events [65]. Based on branch lengths in individual gene trees, they estimated the timing of mitochondrial endosymbiosis, which is a topic of intense debate [389]. Their estimates indicated that the majority of non-alphaproteobacterial genes entered the host before the alphaproteobacterial genes did, and not afterwards. These timings suggest that the mitochondrion entered the pre-eukaryotic lineage relatively late in eukaryogenesis, compared at least to these other bacterial genes and the eukaryotic features these gave rise to. Thereby, this study supported a ‘mito-late’ scenario, which is supported by the argument that only a complex host cell could have been able to engulf the symbiont [390]. However, an opposite argument is that endosymbiosis drove the evolution of other typical eukaryotic features (‘mito-early’), for example by supplying surplus energy to the host cell that facilitated a larger genome and cell size and the evolution of intracellular structures [68].

Given their scale and biological significance, gene duplications during eukaryogenesis are likely to yield valuable insights into the complexity of the host at the time of mitochondrial endosymbiosis. The branch length analysis developed by Pittis & Gabaldón provided us with a means to study these gene duplications. Moreover, with this study we complemented previous estimates of gene duplications between FECA and LECA with a phylogenomics approach [76]. With this, we here endeavor to determine the scale of gene originations and gene duplications during eukaryogenesis. We examined which genes, i.e. of which origin, tend to duplicate most. Finally, we examined the relative timing of these gene duplications, as compared to the acquisition of mitochondria and genes of other prokaryotic origin. We observe that gene origination and duplication indeed contributed substantially to the expansion of the pre-eukaryotic genome, albeit to a smaller degree compared to previous estimates. Novel, eukaryote-specific genes duplicated most during eukaryogenesis, while bacterial genes duplicated least. Duplications of genes from archaeal or bacterial origin occurred relatively early, compared to the entry of alphaproteobacterial genes, while duplications of eukaryote-specific genes occurred late. Based on these observations, we propose that the host of the mitochondrial endosymbiont already supported some complexity, but that much evolutionary innovation also occurred afterwards.

Results

A phylogenomics approach to estimate the origins and timing of FECA-to-LECA gene duplications

For investigating how and when gene duplications contributed to eukaryogenesis we adopted the phylogenomics approach and initially also the dataset from Pittis & Gabaldón [65]. However, in a pilot study we observed that their dataset contained far fewer ‘FECA-to-LECA’ duplications (59 duplications, contributing to 854 genes in LECA, unpublished data) than previously suggested to have occurred during eukaryogenesis. Therefore we made use of the KOG-to-COG gene family clusters from Makarova et al. [76]. This dataset contains eukaryotic orthologous groups (KOGs) mapped to homologous prokaryotic orthologous groups (COGs). In many cases, multiple KOGs were mapped to a single COG, which often reflects a FECA-to-LECA duplication. Furthermore, KOGs were clustered if they are homologous to each other but lack a homologous COG. We made use of the original database from Makarova et al., as well as of an update that we developed ourselves, in which we projected the KOG-to-COG clusters onto an expanded and more phylogenetically balanced species set (see Materials & Methods). The updated database revealed more genes in LECA as well as more FECA-to-LECA gene duplications. Therefore the updated data set was chosen as our primary source.

For each gene cluster, we inferred a phylogenetic tree. From each tree, we deduced: 1) how many single genes in LECA (‘LECA genes’) it contains, 2) how many duplications occurred before LECA, but after divergence from the prokaryotic homologs (‘FECA-to-LECA duplications’), 3) which prokaryotic sequences, from which species, are most closely related to the eukaryotic genes, which indicates the prokaryotic origin of this eukaryotic gene (‘the prokaryotic sister clade’) (Figure 2A). The prokaryotic sister clades were divided into three groups: archaeal, bacterial (i.e. non-alphaproteobacterial) and alphaproteobacterial. For clusters with only eukaryotic genes, we only assessed the number of LECA genes and FECA-to-LECA gene duplications (Figure 2B). Moreover, we estimated the relative timing of the gene duplications and of the divergence of eukaryotic genes from their prokaryotic relatives by measuring branch lengths (discussed below).

Measuring additions to the FECA-to-LECA genome by Archaea, Bacteria, Alphaproteobacteria, eukaryotic inventions and duplications

Based on our phylogenomic inference, we estimate the LECA genome to have contained at least 5534 genes, which is higher than the ~4000 approximated by others [76, 203]. Note that this estimate is likely an underestimate, because we did not include eukaryotic genes with unclear sister clades (‘cellular organisms’ as sister clade, see Figure 5A, Table 1). The large majority of the 5534 LECA genes clustered with a sister clade (Figure 2A) whose common ancestor mapped to Bacteria or to lower level taxa within the bacteria (excluding Alphaproteobacteria): 3128 genes, comprising 56% of the LECA genome. The next largest category of prokaryote-derived genes comprised LECA genes with an archaeal affiliation (862, 16%), while only a small proportion had an alphaproteobacterial one (197 genes, 4%, see Figure 3A, dark-colored bars). Previous studies also reported this order of LECA gene contributors [65, 391]. After bacterial-derived LECA genes, eukaryote-specific LECA genes constituted the largest part of the LECA genome: 1347 (24%) LECA genes appeared to have no prokaryotic homolog, which is a smaller fraction than previously estimated (40%, [203]). Taking into account the subsequent gene duplications, these 1347 genes were derived from 884 genes that were invented de novo between FECA and LECA.

[Figure 2, image not reproduced. Panels A and B: example gene trees marking LECA genes, prokaryotic sister genes, FECA-to-LECA duplications and eukaryotic orthologous groups; the prokaryotic sister clade is archaeal, bacterial or alphaproteobacterial. Raw stem length (rsl), raw duplication length (rdl) and eukaryotic branch length (ebl) are indicated; duplication length = median(rdl/ebl), stem length = median(rsl/ebl).]

Figure 2. Estimation of the timing of gene acquisitions and duplications with branch lengths. A. In gene trees containing prokaryote sequences, monophyletic eukaryotic clades of sequences were identified. For each eukaryotic clade in the tree, we determined whether it was likely present in LECA, i.e. whether it fulfilled the criterion of containing sequences from Unikonta and Bikonta. If so, this group of eukaryotic sequences was considered to have been present in LECA and therefore subject to our subsequent analysis. Based on the sequences in the sister clade, we determined its likely prokaryotic ancestry, which could fall into three categories: Archaea, Bacteria and Alphaproteobacteria. Only if the sister clade was made up completely of alphaproteobacterial sequences was it grouped into this category; otherwise it was placed in the general bacterial category. Within the eukaryotic group, we inferred whether there had been duplications between FECA and LECA, by, for each internal node starting from the first eukaryote-only node, assessing whether its daughters A) both fulfill the LECA criterion, and B) share at least two species (see Materials and Methods). If so, this node was annotated as a duplication node; otherwise it was considered a regular ‘LECA’ gene and we stopped traversing this branch of the tree. After annotating the internal nodes, the timing of FECA-to-LECA duplications was measured by estimating the branch length between a LECA node and the duplication node (red arrows), for each LECA gene that is a daughter of that duplication. This value was normalized by taking the median eukaryotic branch length (ebl, black arrows) of the sequences that comprise this LECA gene (from the LECA gene node to the leaves). The median of normalized estimations was taken as the duplication length. For example, the median of three estimations was taken to reflect the duplication length of the duplication indicated by **. In an equivalent manner, the timing of the divergence from the prokaryotic sister clade was measured by taking the median of the stem length, i.e. the distance between the LECA gene nodes and the node that unites the eukaryotic sequences with the prokaryotic sister sequences (green arrow), normalized by the ebl (black arrow). B. For gene trees containing no prokaryote sequences we determined whether it was likely present in LECA based on the LECA criterion. If so, the tree was analyzed. For each internal node, we inferred whether it could have been a FECA-to-LECA duplication based on the same criteria as for trees containing prokaryotic sequences. After annotating duplication nodes and LECA genes, the tree was rooted on the midpoint of the longest distance between LECA gene nodes (see Materials and Methods). For each duplication node, the duplication length was estimated using the same approach as for the trees that had prokaryotic sequences.
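The duplication criteria and the normalized duplication length described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this chapter: the species codes, their assignment to Unikonta or Bikonta, the function names and all branch lengths are hypothetical toy values.

```python
from statistics import median

# Hypothetical species codes; which species count as Unikonta or
# Bikonta here is an assumption made only for this illustration.
UNIKONTA = {"Hsap", "Scer", "Ddis"}
BIKONTA = {"Atha", "Ngru", "Tthe"}


def fulfils_leca_criterion(species):
    """LECA criterion: the clade holds sequences from Unikonta and Bikonta."""
    species = set(species)
    return bool(species & UNIKONTA) and bool(species & BIKONTA)


def is_duplication_node(left_species, right_species):
    """Both daughters fulfil the LECA criterion and share at least two species."""
    shared = set(left_species) & set(right_species)
    return (fulfils_leca_criterion(left_species)
            and fulfils_leca_criterion(right_species)
            and len(shared) >= 2)


def normalized_length(raw_lengths, leaf_distances):
    """Median over LECA genes of raw length divided by that gene's ebl.

    raw_lengths: raw distance from each LECA gene node to the duplication
        node (rdl) or to the node shared with the sister clade (rsl).
    leaf_distances: distances from each LECA gene node to its leaves; their
        median is the eukaryotic branch length (ebl) of that gene.
    """
    ratios = []
    for gene, raw in raw_lengths.items():
        ebl = median(leaf_distances[gene])
        ratios.append(raw / ebl)
    return median(ratios)


# Toy duplication node with two daughter clades.
duplication = is_duplication_node(["Hsap", "Scer", "Atha"],
                                  ["Hsap", "Atha", "Ngru"])

# Toy duplication with three downstream LECA genes.
raw_duplication_lengths = {"geneA": 0.30, "geneB": 0.45, "geneC": 0.20}
leaf_distances = {
    "geneA": [0.5, 0.6, 0.7],
    "geneB": [0.9, 0.9, 1.1, 1.3],
    "geneC": [0.4, 0.5],
}
duplication_length = normalized_length(raw_duplication_lengths, leaf_distances)
print(duplication, round(duplication_length, 3))
```

The same `normalized_length` function gives a stem length when it is passed raw stem lengths (rsl) instead of raw duplication lengths. Dividing by the ebl of each gene is what makes lengths comparable between fast- and slow-evolving gene families.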

Out of the 5534 LECA genes, 1914 (35%) resulted from at least one FECA-to-LECA duplication, that is, these LECA genes have a parent that is a FECA-to-LECA duplication node (Figure 2, Figure 3A, light-colored bars). Eukaryote-specific LECA genes were more often derived from gene duplications than those of other origins (62%, p<0.01), whereas LECA genes with bacterial (23%, p<0.01) and alphaproteobacterial (24%, p<0.01) affiliations duplicated less often. This distribution of duplications differed strongly from what was observed previously: Makarova et al. reported a (strong) underrepresentation of FECA-to-LECA duplications for eukaryote-specific genes [76]. In total, 1199 duplications (Figure 3B, total number of duplication lengths) expanded the genome from 4335 to 5534 genes. Note that these 4335 genes were not all present in the FECA lineage directly after divergence from the Archaea: some entered through transfer or invention, as illustrated in Figure 1. Duplications multiplied the pre-LECA genome by a factor of 1.28. A previous estimate reported a ‘paralogy quotient’ of 1.92 [76], although this number also includes some homologous LECA genes that did not result from duplication. Those ‘pseudoparalogs’ were derived from different prokaryotic sources, such as archaeal via vertical descent and alphaproteobacterial via endosymbiosis. Our phylogenomics pipeline allows us to detect such cases as two separate monophyletic eukaryotic clades with different prokaryotic sister clades within the same tree (see Discussion). Even considering these cases, it seems most probable that we detected fewer gene duplications than this previous study.

Many eukaryote-specific genes duplicated late in eukaryogenesis

In order to examine the relative order of gene acquisitions and gene duplications between FECA and LECA, we measured the branch lengths of divergence from the sister clade (stem length, sl) and of duplication nodes (duplication length, dl).
Like Pittis & Gabaldón [65], we corrected for gene-specific differences in the rate of sequence evolution by dividing the raw stem or duplication length by the median eukaryotic branch length (ebl), i.e. the median of the branch lengths from the LECA gene node to each of the eukaryotic gene leaves (Figure 2). We observed that the stem lengths of archaeal-derived genes were significantly larger than those of bacterial (p<0.01) and alphaproteobacterial (p<0.01) descent. This confirms both a previous stem length analysis and the commonly accepted endosymbiosis of an alphaproteobacterium by an archaeal-related lineage that occurred between FECA and LECA [65], even when accounting for gene duplications in the trees. We did not observe a significant difference between the stem lengths of bacterial- and alphaproteobacterial-derived genes (p=0.09), in contrast to Pittis & Gabaldón, although in our study, too, the latter appeared slightly smaller.

[Figure 3. Panel A (y-axis: contributions to LECA genome): LECA genes per sister clade category, as total and as duplication-derived subset: archaeal 862 and 315 (0.36), bacterial 3128 and 713 (0.23), α-proteobacterial 197 and 48 (0.24), eukaryotic 1347 and 838 (0.62). Panel B (y-axis: stem/duplication length): box plots of stem lengths (archaeal n=653, bacterial n=2642, α-proteobacterial n=156) and duplication lengths (archaeal n=209, bacterial n=486, α-proteobacterial n=41, eukaryotic n=463).]

Figure 3. LECA’s genes, their origins and the timing of their pre-LECA entry and duplication. A. Dark-colored bars present the numbers of genes inferred to have been present in LECA. We categorized these genes based on their prokaryotic affiliation (archaeal, bacterial or alphaproteobacterial), if present. If not, they were considered eukaryote-specific. The light-colored bars indicate how many of these genes were derived from a duplication. For example, a LECA gene with an archaeal affiliation was considered duplication-derived if, after divergence from the archaeal sister clade, the gene duplicated at least once in the FECA-to-LECA lineage and this LECA gene is a result of that duplication (as in Figure 2A). The numbers below the sister clade categories present the fraction of LECA genes that were duplication-derived. According to Fisher’s exact test, all pairwise comparisons of these fractions are significantly different (p<0.05), except for bacterial versus alphaproteobacterial. B. Stem and duplication lengths for the different sister clade categories, calculated as in Figure 2. The numbers below the categories indicate how many lengths are depicted in each box plot. A two-sided Mann–Whitney U-test indicates that the following pairs of categories are not significantly different (p>0.05): archaeal stem lengths – alphaproteobacterial duplication lengths, bacterial stem lengths – archaeal duplication lengths, bacterial stem lengths – bacterial duplication lengths, and archaeal duplication lengths – bacterial duplication lengths.
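The normalization by the median eukaryotic branch length and the median-based length estimates (Figure 2) amount to a few lines of arithmetic. The sketch below is a minimal illustration and not the pipeline used in this study: the function names and the toy branch lengths are hypothetical, and it assumes that the raw branch lengths have already been read from a rooted gene tree.

```python
from statistics import median


def normalized_length(raw_length, leaf_distances):
    """Divide a raw stem or duplication length by the median eukaryotic
    branch length (ebl) of one LECA gene, i.e. the median distance from
    the LECA gene node to its leaves."""
    ebl = median(leaf_distances)
    if ebl <= 0:
        raise ValueError("median eukaryotic branch length must be positive")
    return raw_length / ebl


def summarized_length(estimates):
    """Median of the normalized estimates, one per LECA gene below the
    duplication node (duplication length) or below the node shared with
    the prokaryotic sister clade (stem length)."""
    return median(estimates)


# Hypothetical toy data: three LECA genes below one duplication node.
# Each entry: (raw distance from the LECA gene node to the duplication node,
#              distances from the LECA gene node to its leaves).
leca_genes = [
    (0.25, [0.25, 0.50, 0.75]),  # ebl = 0.50
    (0.75, [0.25, 0.50, 1.00]),  # ebl = 0.50
    (0.50, [0.25, 0.50, 1.00]),  # ebl = 0.50
]

estimates = [normalized_length(raw, leaves) for raw, leaves in leca_genes]
dl = summarized_length(estimates)
print(estimates, dl)
```

Using the median rather than the mean keeps a single long or short branch from dominating the estimate, which matters when only two or three LECA genes descend from a node.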

The duplication lengths of both archaeal- and bacterial-derived genes were larger than the stem lengths of the alphaproteobacterial genes (p=0.02, p=0.01, respectively). This suggests that a large fraction of FECA-to-LECA duplications and possibly their functional differentiation occurred before mitochondrial endosymbiosis. Moreover, the duplication lengths of archaeal-derived genes were much smaller than their stem lengths (p<0.01), which was not the case for bacterial-derived genes. Hence, after divergence of the archaeal lineage, it might have taken some time before the archaeal-derived genes started to duplicate, while bacterial-derived genes mostly duplicated quickly after they entered.

The duplication lengths of alphaproteobacterial-derived duplicated genes, which comprised only a few data points, behaved aberrantly. They were significantly longer than all other categories except for the stem lengths of archaeal-derived genes (p=0.11 for the comparison to archaeal stem lengths, p<0.01 for all other categories). In fact, they were longer than the alphaproteobacterial stem lengths. This is conceptually and technically impossible: in our tree analysis, a gene cannot have duplicated before it entered the FECA-to-LECA lineage, as we did not take into account more ancient duplications shared with prokaryotes. Upon closer inspection, the majority (28/41) of these alphaproteobacterial-descended duplicates turned out to belong to a single gene cluster, composed of SET domain proteins. Possibly, some of the genes within this cluster had either unrealistically long raw duplication lengths before LECA or very short branch lengths after LECA (Figure 2). Given the small number of duplication lengths in the alphaproteobacterial category, which came to a large extent from a single gene family, we will not consider these further here.

[Figure 4. Panel A: two scatter plots of duplication length estimates, with axes daughter 1 and daughter 2. Left: random assignment of daughter 1 and 2 (r = 0.173). Right: daughter 2 assigned by largest estimate (r = 0.563). Panel B: cartoon of a gene tree showing a prokaryotic sister clade (archaeal, bacterial or alphaproteobacterial), a FECA-to-LECA duplication and functionally diverged eukaryotic orthologous groups (LECA genes) annotated with functions A, B and C; function C is marked as having evolved via function B.]

Among eukaryote-specific genes, duplication lengths were small. In fact, these were significantly smaller than any of the other categories of stem and duplication lengths (p<0.01 for all). This suggests that after the endosymbiosis, the genome expanded to a large extent through duplications of eukaryote-specific genes. Possibly, many of these genes not only duplicated, but also arose relatively recently. Because these genes had no outgroup, the oldest age we could estimate for them from branch lengths is that of their oldest duplication nodes.

Dissimilar duplication length estimates hint at asymmetric evolution after duplication and possibly neofunctionalization

As Figure 2 shows, for each duplication we had at least two estimates of its duplication length, which allowed us to assess their consistency. For the data in Figure 3B, we used the median to get a final estimate for a single duplication. For assessing their consistency, however, we used the estimates on both sides of the duplication node, referred to as ‘daughter 1’ and ‘daughter 2’. We observed that the duplication lengths as estimated by daughter 1 and daughter 2 correlated very poorly (Figure 4A, left panel, r=0.17). Only a small number of duplications had similar estimates for both daughters (on the diagonal in Figure 4A, both panels). This could reflect that after duplication the two daughters, the newly arisen paralogs, evolved asymmetrically, such that one had a higher rate of sequence divergence than the other. Previous studies reported that some paralogous pairs diverge asymmetrically, whereby one paralog accelerates its evolutionary rate relative to the other. The other might either maintain the evolutionary rate of the pre-duplication gene or accelerate to a smaller degree [392-396]. However, if pairs evolve asymmetrically, they might only do so for a short time interval [396], so we wonder whether this process would yield estimates as divergent as those we observed here.
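How the two estimates of a duplication are labeled affects the correlation coefficient, which can be explored with simulated data. The sketch below is a hedged illustration only: the estimates are hypothetical random numbers drawn independently between 0 and 3, not data from this study, and `pearson_r` is a plain textbook implementation rather than the regression routine used for Figure 4A.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(42)

# Hypothetical estimates: for each duplication node, two independent
# duplication length estimates (one per daughter).
pairs = [(rng.uniform(0, 3), rng.uniform(0, 3)) for _ in range(5000)]

# Assignment 1: daughters labeled arbitrarily (as drawn).
r_random = pearson_r([a for a, _ in pairs], [b for _, b in pairs])

# Assignment 2: shortest estimate is daughter 1, largest is daughter 2.
ordered = [(min(p), max(p)) for p in pairs]
r_sorted = pearson_r([a for a, _ in ordered], [b for _, b in ordered])

print(round(r_random, 3), round(r_sorted, 3))
```

In this toy setting, ordering each pair raises r markedly even though the two estimates are independent, which is worth keeping in mind when comparing r-values between an arbitrary and an ordered assignment of daughters.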

Figure 4. Two duplication length estimates for a single FECA-to-LECA duplication. A. Left panel: Duplication lengths (Figure 3B) were based on at least two estimates (Figure 2), namely from the different LECA genes that are daughters of this duplication node. This scatter plot shows the estimates from both sides of the duplication node. If one of the daughter nodes of a duplication node itself had multiple estimates, as a result of a successive duplication (Figure 2), we took the median value for that daughter. Linear regression analysis was performed to infer the correlation coefficient (r-value). The expected consistency of the two estimates is indicated by the diagonal (blue line). Right panel: same data and analysis as in the left plot, but here the shortest estimate is defined as daughter 1 and the largest estimate as daughter 2, representing the possible asymmetry in sequence evolution after duplication. Duplication lengths > 3 were not taken into account. B. Cartoon illustrating how neofunctionalization might be associated with a longer duplication length: the gene that acquired function C has a longer duplication length than its paralog, which maintained function B. This cartoon also shows how functional annotation of LECA genes could inform which functions evolved from which other functions and, by applying branch length analysis, when they did.


Figure 5. Smallest taxonomic classification of sister clade. A. Bar chart presenting all prokaryotic sister clades used for branch length analysis in Figure 3B, and also sister clades of mixed origin (‘cellular organisms’) that were not used for stem length analysis. For this chart, the sister clades were assigned at the lowest possible taxonomic level that united the sequences within that clade. B. Taxonomic rank of the sister clades in (A). The included ranks from highest to lowest level: superkingdom – phylum – subphylum – class – subclass – order – suborder – family – tribe – genus – species. ‘No rank’ indicates that the rank of the taxon is not classified, which is the case for e.g. ‘cellular organisms’.
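Assigning a sister clade to the lowest taxonomic level that unites its sequences is a longest-common-prefix computation on lineages. The following is a minimal sketch under stated assumptions: lineages are given as lists of taxon names ordered from root to tip, the example lineages are simplified and hypothetical, and `lowest_common_taxon` is an illustrative helper, not a function from the pipeline or from any taxonomy library.

```python
def lowest_common_taxon(lineages):
    """Return the most specific taxon shared by all lineages, or None.

    Each lineage is a list of taxon names ordered from root to tip.
    """
    common = None
    # zip stops at the shortest lineage, so only shared depths are compared.
    for names_at_depth in zip(*lineages):
        if all(name == names_at_depth[0] for name in names_at_depth):
            common = names_at_depth[0]
        else:
            break
    return common


# Simplified, hypothetical lineages for sequences in a sister clade.
rickettsiales = ["cellular organisms", "Bacteria", "Proteobacteria",
                 "Alphaproteobacteria", "Rickettsiales"]
rhizobiales = ["cellular organisms", "Bacteria", "Proteobacteria",
               "Alphaproteobacteria", "Rhizobiales"]
enterobacterales = ["cellular organisms", "Bacteria", "Proteobacteria",
                    "Gammaproteobacteria", "Enterobacterales"]
asgard = ["cellular organisms", "Archaea", "Asgard group"]

print(lowest_common_taxon([rickettsiales, rhizobiales]))
print(lowest_common_taxon([rickettsiales, enterobacterales]))
print(lowest_common_taxon([rickettsiales, asgard]))
```

A single divergent sequence is enough to push the assignment up to a broad taxon such as ‘Bacteria’ or ‘cellular organisms’, which is one way in which broad sister clade labels can arise.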

The ebl-corrected estimates (Figure 2, 4A) did not correlate better than the raw estimates (Supplementary Figure 1A). This suggests that the faster-evolving copy did not maintain its increased rate after LECA: had it done so, one would expect the ebl correction to make the duplication lengths converge. In fact, the raw duplication lengths (r=0.36) were more similar than the regular, corrected ones (r=0.17). This could signify that the median eukaryotic branch lengths were dissimilar as well, but in the reverse direction. The ebls indeed were quite dissimilar when plotting those from sister LECA genes stemming from the most terminal FECA-to-LECA duplications (Supplementary Figure 1B). Nevertheless, given that the raw duplication estimates varied strongly, many duplicates might still have evolved asymmetrically after the duplication. Possibly, the fast-evolving paralog acquired a new function (‘neofunctionalization’), of which its rapid sequence evolution is an indication (Figure 4B, function C evolved from function B after a gene duplication). Although highly different duplication lengths might complicate estimating the timing of a gene duplication, they may indicate the acquisition of new functions and hence might provide useful information for functionally understanding eukaryogenesis.

Conclusion & Discussion

Stem and duplication lengths invite reconsideration of ‘mito-early’ and ‘mito-late’

In this study, we found that LECA contained at least 5534 genes, which were derived from Archaea, Alphaproteobacteria and other Bacteria (Figure 1) and from de novo origination, and which multiplied through subsequent duplications. For these genes, we determined their evolutionary history from the moment they entered the pre-eukaryotic lineage, either via vertical descent or by (endosymbiotic) horizontal gene transfer. These histories enhance our understanding of eukaryogenesis, because: (1) archaeal and (possibly) bacterial gene acquisitions and their subsequent duplications predate alphaproteobacterial contributions, which might indicate that eukaryogenesis had already started before the entry of the alphaproteobacterium that gave rise to the mitochondrion, and (2) eukaryote-specific genes duplicated en masse, and likely after the entry of the alphaproteobacterium, suggesting that after this endosymbiosis, eukaryogenesis continued and maybe even accelerated. Hence, a ‘mito-intermediate’ scenario would possibly fit these data better than either ‘mito-early’ or ‘mito-late’. In such a scenario some, but not all, eukaryotic features had already emerged before the mitochondria [397]. Another model into which these two observations may fit is that some key events in eukaryogenesis, including the emergence of mitochondria, occurred concomitantly. A recent model of eukaryogenesis, coined ‘inside-out’, proposes that mitochondria are not the result of a host engulfing an alphaproteobacterium (‘outside-in’), but of an archaeal-related cell with extracellular membrane protrusions that expanded more and more around the alphaproteobacterium [71]. Those outgrowths facilitated the interaction between both prokaryotic cells by increasing their contact area. Under this inside-out model, the ER and nucleus originated during, and as a result of, these extensions. This model would be interesting to investigate phylogenomically, since it predicts that many genomic expansions have a timing similar to that of the entry of alphaproteobacterial genes.

Our data showed that eukaryote-specific genes duplicated most, followed by archaeal-derived genes, followed by bacterial- and alphaproteobacterial-derived genes (Figure 3A). Given that most duplicates of eukaryote-specific origin were relatively young (Figure 3B), these differences in duplication ratios do not seem to be caused by some genes having had more time to duplicate than others. If the FECA-to-LECA lineage had had a relatively constant, equal rate of duplication, one would have expected the archaeal-derived genes to have duplicated most often. More likely, these differences in ‘duplication tendencies’ reflect functional differences between genes. Some functional categories have been shown to be substantially more prone to gene duplication, or in fact to retention of the gene duplicates, than others. For example, between FECA and LECA, genes with a role in protein fate determination duplicated more often than genes with metabolic functions [76]. Indeed, genes of different origins also tend to be functionally biased [398]. Genes of archaeal origin often operate in information processing, whereas those of alphaproteobacterial origin are strongly enriched in metabolic functions [72]. To test whether the uneven distribution of duplications among origins was confounded by the functional enrichment of these origins, it would be worthwhile to project functional categories onto the gene clusters. With such a projection, one can assess whether functional category is a good predictor of the duplication tendency, also across different gene origins.
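The differences in duplication-derived fractions between origins (Figure 3A) were assessed with Fisher’s exact test on 2×2 tables. A minimal, standard-library sketch of such a test is given below. It is an illustration under stated assumptions and not the code used in this study: the function names are hypothetical, the two-sided p-value is defined as the summed probability of all tables that are at most as probable as the observed one, and the counts are the totals and duplication-derived subsets shown in Figure 3A.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def log_prob(x):
        # Hypergeometric log-probability of x in the top-left cell.
        return (log_comb(row1, x) + log_comb(row2, col1 - x)
                - log_comb(total, col1))

    observed = log_prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        exp(log_prob(x))
        for x in range(lowest, highest + 1)
        if log_prob(x) <= observed + 1e-7  # tolerance for ties
    )
    return min(p_value, 1.0)


# (duplication-derived LECA genes, all LECA genes) per origin, Figure 3A.
counts = {
    "archaeal": (315, 862),
    "bacterial": (713, 3128),
    "alphaproteobacterial": (48, 197),
    "eukaryotic": (838, 1347),
}


def compare(origin_x, origin_y):
    """p-value for a difference in duplication-derived fraction."""
    dup_x, all_x = counts[origin_x]
    dup_y, all_y = counts[origin_y]
    return fisher_exact_two_sided(dup_x, all_x - dup_x, dup_y, all_y - dup_y)


p_arch_bact = compare("archaeal", "bacterial")
p_bact_alpha = compare("bacterial", "alphaproteobacterial")
print(p_arch_bact, p_bact_alpha)
```

Working with log-probabilities avoids overflow of the very large binomial coefficients that arise with thousands of genes per category.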

We observed that multiple duplication length estimates of a single duplication node differed strongly. These differences did not diminish when using the normalized duplication lengths instead of the raw ones (compare Figure 4A to Supplementary Figure 1A). Possibly, the ebl-based normalization did not result in more similar estimates because the evolutionary dynamics before and after LECA were substantially different, including the rate of sequence divergence. Gene families may follow a pattern of episodic, rapid sequence evolution before LECA followed by slower, conservative evolution after LECA in the different eukaryotic lineages. Such a pattern would fit into a ‘Big Bang’ hypothesis for eukaryogenesis [387], and may underlie the many functional innovations from FECA to LECA, compared to more recent, post-LECA evolution.

Complications in assessing the ancestries of LECA genes

Many eukaryotic genes had a sister clade that falls in the category ‘Bacteria’, which includes all sister clades that contain bacterial sequences and are not confined to alphaproteobacterial ones. Hence, this is a very broad, not very informative category, which prevented us from pinpointing exactly from which lineage these genes were derived. Looking specifically at the sister clade identity at the lowest possible taxonomic level, we observed that the majority still belonged to ‘Bacteria’ (Figure 5A). In fact, we observed that some sister clades were not even confined to a single domain of life (see Figure 5A: ‘cellular organisms’ and Table 1: ‘multidomain sister clade’). The ‘Bacteria’ sister clades thus encompassed a wide variety of bacterial sequences, belonging to different bacterial phyla. Similarly, the majority of archaeal sister clades were annotated as ‘Archaea’, hence comprising sequences from divergent archaeal lineages. In general, most sister clades had a high taxonomic level (Figure 5B). Such broad sister clades obviously do not indicate that these genes were obtained by eukaryotes before the divergence of the prokaryotic lineages in these broad sister clades. After all, FECA is much younger than the common ancestor of Bacteria or Archaea, as it is likely even younger than the common ancestors of major bacterial and archaeal (super)phyla and classes. Rather, these broad sister clades likely reflect technical and biological issues. First, single gene trees may carry too little signal to accurately recover the correct topology from the sequences. As a result, a eukaryotic clade that in reality has Asgard Archaea as its sister clade may end up next to a group of prokaryotic sequences containing, in addition to these Asgard sequences, sequences from various other archaeal taxa. Second, after a prokaryote “donated” a gene to the pre-LECA lineage, the gene may have been horizontally transferred among diverse prokaryotic lineages [190]. For example, the eukaryotic genes that had ‘Proteobacteria’ as a sister clade (Figure 5A) might actually be of alphaproteobacterial, endosymbiotic origin (Figure 1), and may subsequently have been transferred among Proteobacteria. This may also explain why we found a low number of alphaproteobacterial-derived genes compared to, for example, a study by Gabaldón & Huynen, who estimated that the alphaproteobacterial endosymbiont contributed at least 630 genes to eukaryotes [399]. In addition to transfer after a gene entered the pre-LECA lineage, a gene may also have been transferred among prokaryotes before it did so, which may likewise obscure the origin of that eukaryotic gene [53, 190]. Given the importance of horizontal gene transfer among prokaryotes [400], it likely also affected genes that ended up in eukaryotes. Moreover, the prokaryotic donor lineage may subsequently have lost the donated gene, which also complicates its identification, as well as its stem length estimation (Figure 2). Altogether, the exact origins of eukaryotic genes are difficult to unravel, even with phylogenomic approaches.

Although we found that gene duplication contributed substantially to the LECA genome, its contribution was smaller than we expected based on previous work [76]. After all, not only genome-wide but also small-scale studies demonstrated that particular eukaryotic cellular systems contain many FECA-to-LECA paralogs, such as the spliceosome [290], the intraflagellar transport complex [336], COPII [73], the nuclear pore [74] and the ubiquitination system [297]. Possibly, our set of phylogenetic trees reveals that many paralogs are in fact pseudoparalogs: homologous genes in eukaryotic genomes that did not result from gene duplications between FECA and LECA, but from multiple prokaryotic origins, including for example various ribosomal proteins. Alternatively, our trees may reveal that some paralogs are true paralogs, but more ancient than FECA, e.g. because they are shared with certain Archaea that were not included before. Both cases would be reflected by multiple monophyletic clades of eukaryotic sequences in the trees, which is indeed what we observed: 3451 monophyletic eukaryotic clades (which may contain multiple LECA genes) were found in 1046 gene trees, suggesting that these clusters on average contained about three eukaryotic clades. While some of these might reflect ancient duplications that eukaryotes share with (certain) prokaryotes, others may comprise pseudoparalogs. Moreover, in some instances the eukaryotic sequences might be polyphyletic due to prokaryotic sequences that, as a result of horizontal gene transfer or contamination, were positioned within the eukaryotic sequences. Although eukaryote-to-prokaryote gene transfer seems less frequent than prokaryote-to-eukaryote transfer, multiple cases have been reported [401-403]. Contamination has also been observed, especially in certain genomes [404]. We tried to remove prokaryotic sequences likely reflecting eukaryote-to-prokaryote horizontal gene transfer or contamination (see Materials & Methods), but our filtering criteria likely allowed some to remain in the trees. Such prokaryotic sequences in otherwise eukaryotic clades may have disguised FECA-to-LECA duplications. Moreover, as mentioned above, eukaryotic sequences may have been split erroneously in the gene tree simply due to erroneous estimation of the gene tree topology. Finally, rapidly evolving homologous genes may have escaped homology detection, so that they did not end up in the same tree. In our investigations into the origins of the kinetochore, we observed that many of the genes involved in the kinetochore are FECA-to-LECA paralogs, but were not easily identified as such, because their sequences diverged extensively [405]. All in all, we think that the phylogenomics dataset that we analyzed here underestimated the number of FECA-to-LECA gene duplications.

Phylogenomics provides a framework to functionally study eukaryogenesis

Possibly, adding more genomes to such a phylogenomics analysis may help to more reliably infer the origins and duplication history of eukaryotic genes. Genomes that may help include those that actually represent a descendant of the donor species, those able to uncover transfers among eukaryotes or from eukaryotes to prokaryotes, and those that fix erroneous tree topologies. To illustrate the first based on our own data: by adding the recently published Asgard genome bins [9], we found 218 eukaryotic clusters with Asgard sequences constituting their sister clade. If the Asgard sequences had not been part of the phylogenetic analysis, the sister clade might have contained archaeal lineages that are less closely related to eukaryotes, which would not only lead to erroneous inference of the gene’s origin, but would also increase the stem length. The Asgard lineages were shown to contain genes previously considered eukaryote-specific [74], which may support the hypothesis that certain ‘eukaryotic’ features already evolved in the common ancestor of eukaryotes and Asgard Archaea. Possibly, some of the FECA-to-LECA duplications reported in previous work were, according to our analysis, in fact more ancient because they were shared with certain Asgard lineages, as explained above. Such duplications would be interesting to study, specifically if the genes involved are hypothesized to contribute to eukaryotic complexity. New genomes from even more closely related prokaryotes might aid in further specifying the prokaryotic origins of eukaryotic genes, and may fine-tune the relative order of gene acquisitions. In addition to better or more data, the interpretation of gene trees may be improved. By looking at individual gene trees, we observed that our parsing method (see Materials & Methods), and potentially also phylogeny-sensitive automatic reconciliations, often reconstructed the evolutionary history of a gene differently than manual reconciliation. By applying automatic analyses that more adequately take into account the dynamics of genome evolution and the features of the data [406], our inferences may improve.

As discussed above, functional annotation of genes may help to assess whether genes with certain functions duplicated more often during eukaryogenesis than others. Functional annotation may also be able to uncover two other (potentially) important aspects of eukaryogenesis. By functionally annotating not only a whole gene tree, but also individual LECA genes in the tree, as illustrated by Figure 4B, one could learn in which order different functions evolved during eukaryogenesis. By associating stem and duplication lengths with such functions, the order could also be examined across different trees. Furthermore, such functional annotations could uncover the relative importance of sub- and neofunctionalization after gene duplication during eukaryogenesis. In various lineages, both processes are recognized to contribute to the retention of different duplicated genes. One may argue that, due to the immense functional innovation between FECA and LECA (Figure 1), neofunctionalization was relatively important compared to subfunctionalization. If functional annotation of LECA genes enables one to discriminate between neo- and subfunctionalization, it furthermore allows for assessing whether asymmetric duplication lengths correlate with neofunctionalization. If such neofunctionalizations after duplications are confirmed, one could argue that the paralog with the new function likely underwent accelerated evolution. On the other hand, the gene copy that preserved the original function might have evolved at a more constant pace. If so, this would argue for using the branch lengths of this ‘ancestor-like’ paralog only, which is mostly the shortest (Figure 4A).

Materials & Methods

Constructing a novel database of KOG-to-COG gene clusters

Clusters of homologous sequences were created based on the KOG-to-COG mappings established by Makarova et al. [76]. To assign sequences to the KOG-to-COG clusters, we applied a specific method for each of the following three species categories: the eukaryotes, the prokaryotes excluding Asgard Archaea, and the Asgard Archaea. For eukaryotes, we collected protein sequences from 10 species present in 5 eukaryotic supergroups: Naegleria gruberi [203] and Euglena gracilis (Excavata), Cladosiphon okamuranus [407] and Bigelowiella natans [408] (Stramenopila-Alveolata-Rhizaria, a.k.a. ‘SAR’), Guillardia theta [408] and Klebsormidium flaccidum [3] (Archaeplastida), Acanthamoeba castellanii [409] and Acytostelium subglobosum [410] (Amoebozoa), and Capsaspora owczarzaki [61] and Nuclearia sp. [411] (Opisthokonta). We derived HMM profiles for KOGs from EggNOG 4.5 [204]. The original KOG-to-COG clusters also contained ‘TWOGs’, candidate orthologous groups. For each TWOG we found the best-matching ‘ENOG’ (‘Unsupervised Cluster of Orthologous Group’) provided by EggNOG. We combined the HMM profiles of these ENOGs with the KOG HMM profiles and created a profile database. We ran hmmscan [165] to assign protein sequences from the eukaryotic species to KOGs/ENOGs. For prokaryotes, except for Asgard Archaea, we downloaded all sequences from EggNOG, including their memberships of COGs. For the COGs present in the KOG-to-COG clusters, we obtained all their member sequences. For the Asgard archaeal genome bins [9], we collected the predicted protein sequences. We downloaded HMM profiles of all COGs from EggNOG and assigned the Asgard protein sequences to COGs using hmmscan. Subsequently, for all KOGs (and ENOGs, which we collectively refer to as KOGs) and COGs, we reduced the number of sequences with kClust, using a score per column of 3.53 [412]. We subsequently merged homologous sequences from eukaryotes, prokaryotes and Asgard Archaea according to the KOG-to-COG mapping, resulting in novel KOG-to-COG clusters. We applied the phylogenetic analysis described below to the original KOG-to-COG dataset [76], as well as to this novel one. Results from the subsequent phylogenetic analyses of both datasets are presented in Table 1.

Building phylogenetic trees and selection of candidate LECA genes

For each KOG-to-COG cluster, we generated phylogenetic trees using the in-house pipeline also used previously [65]. The sequences were aligned using MAFFT, version 6.861b, option --auto [413], and subsequently trimmed using trimAl, version 1.4, with a gap threshold of 0.1 [266]. From these alignments, we constructed phylogenetic trees using FastTree, version 2.1.8, with ‘WAG’ as evolutionary model [414]. Using ETE3 [415], we parsed each tree. First, all sequences were annotated with the taxonomic classification of the species using NCBI Taxonomy. We examined whether the tree contained prokaryotic sequences that probably reflect recent horizontal gene transfers and that might interfere with our analysis. We searched for small clusters of archaeal or bacterial sequences that were nested within eukaryotic sequences and that either belonged to a single genus only, or consisted of a single sequence. Those were removed from the trees. Subsequently, we determined whether the tree contained sequences from prokaryotes and eukaryotes, or only from eukaryotes. For both tree categories, we determined whether the eukaryotic sequences they contained were derived from a gene present in LECA, and not from a more recent gene, such as a gene found only in animals. We did this based on our LECA gene criterion, which states that in order to have been present in LECA, a gene must be found in both Unikonta (Opisthokonta, Amoebozoa) and Bikonta (Archaeplastida, Excavata, SAR), the split between which we consider a likely root of the eukaryotic tree of life [31]. Only trees of which the eukaryotic sequences met this criterion were further analyzed.

Table 1. Analyses of gene trees from two different datasets. KOG-to-COG original contains gene clusters from Makarova et al., using the original species set [76]. KOG-to-COG update uses the same clusters projected onto all prokaryotic genomes from EggNOG [204], Asgard genome bins [9] and 10 eukaryotic genomes from 5 different eukaryotic supergroups: Capsaspora owczarzaki and Parvularia atlantis (Opisthokonta), Acytostelium subglobosum and Acanthamoeba castellanii (Amoebozoa), Guillardia theta and Klebsormidium flaccidum (Archaeplastida), Bigelowiella natans and Cladosiphon okamuranus (SAR), and Euglena gracilis and Naegleria gruberi (Excavata). For gene trees that have prokaryotic sequences, ‘FECA’ gene counts reflect the number of monophyletic eukaryotic groups that fulfill the LECA criterion of having both a Unikont and a Bikont sequence. For eukaryote-only trees, each tree that satisfies this LECA criterion counts as a single ‘FECA’ gene. In this study (Figure 3) we did not take into account ‘FECA’ genes with a sister clade that contains sequences from multiple domains of life (Archaea, Bacteria and Eukaryota).

                                          KOG-to-COG original    KOG-to-COG update
no. of analysed trees                     1814                   1963
no. of calculated stem lengths            1477                   3864
no. of calculated duplication lengths     583                    1426
no. of LECA genes                         2768                   5534
‘FECA’ genes: Archaea                     339                    653
‘FECA’ genes: Bacteria – other            899                    2642
‘FECA’ genes: Alpha-proteobacteria        129                    156
‘FECA’ genes: Eukarya                     851                    884
‘FECA’ genes: multidomain sister          110                    413
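The LECA criterion used in the tree analysis (a gene must be found in both Unikonta and Bikonta) can be illustrated with a minimal sketch. This is not the in-house pipeline: the function name and the input format (one supergroup label per eukaryotic sequence) are assumptions made for the example; only the supergroup memberships are taken from the text.

```python
# Supergroup membership as stated in the text. The names below are
# hypothetical and only serve this illustration.
UNIKONTA = {"Opisthokonta", "Amoebozoa"}
BIKONTA = {"Archaeplastida", "Excavata", "SAR"}


def meets_leca_criterion(supergroups):
    """Return True if the supergroups span both sides of the assumed root."""
    present = set(supergroups)
    return bool(present & UNIKONTA) and bool(present & BIKONTA)


# One label per eukaryotic sequence in a clade (the supergroup of its species).
print(meets_leca_criterion(["Opisthokonta", "SAR", "SAR"]))   # True
print(meets_leca_criterion(["Opisthokonta", "Amoebozoa"]))    # False
```

Under this reading, a clade containing only Unikont (or only Bikont) sequences is treated as a more recent gene and is not analyzed further.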

Identifying LECA genes, gene duplications and sister clades in eukaryote-prokaryote gene trees

The gene trees containing eukaryote and prokaryote sequences were first rooted on a random prokaryotic sequence. Monophyletic clusters of eukaryotic sequences were identified. For each cluster, we assessed whether it fulfilled the LECA criterion. If so, the node uniting this cluster was annotated as a ‘FECA gene’ and further analyzed. The tree was rerooted on the prokaryotic sequence that had the largest distance to this eukaryotic clade. The FECA gene node was visited and the identity of its sister clade was determined and categorized as ‘Archaea’, ‘Bacteria’ or ‘Alphaproteobacteria’. Note that we also found FECA genes with sister clades that contain sequences from different categories, as well as from eukaryotes (see Figure 5a: ‘cellular organisms’ and Table 1: ‘FECA genes: multidomain sister clade’), but these were not taken into account. In addition to identifying its sister clade, for each FECA gene we assessed whether it contained FECA-to-LECA gene duplications and, as a result, multiple LECA genes. Starting from the FECA gene node, we checked for each node in this eukaryotic clade whether its two daughters both fulfilled the LECA criterion, and whether they fulfilled the species overlap criterion of two, which means that the daughters need to share at least two species. If so, this node was annotated as a duplication node, and the same test was applied to both of its daughter nodes, until no new duplications were found. Then, the daughters of duplication nodes were annotated as LECA genes. If no duplications were found, this FECA gene node counted both as a ‘FECA gene’ and as a ‘LECA gene’. For each FECA gene node in the tree, we inferred stem and duplication lengths (Figure 2A). Note that if no duplications were found for a given FECA gene node, only the stem length was calculated, which was then based on a single estimate (from a single LECA gene).
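The duplication test described above can be sketched as a top-down recursion over a rooted eukaryotic clade. This is one possible, simplified reading of the procedure and not the in-house pipeline itself: the `Node` class, the function names and the toy tree are hypothetical, and branch lengths (needed for stem and duplication lengths) are left out.

```python
from dataclasses import dataclass, field

UNIKONTA = {"Opisthokonta", "Amoebozoa"}
BIKONTA = {"Archaeplastida", "Excavata", "SAR"}


@dataclass
class Node:
    """Minimal rooted tree node (hypothetical, for illustration only)."""
    children: list = field(default_factory=list)
    species: str = ""      # set on leaves only
    supergroup: str = ""   # set on leaves only


def leaves(node):
    """Collect all leaves below (or equal to) a node."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def is_leca(node):
    """LECA criterion: the clade holds both a Unikont and a Bikont sequence."""
    groups = {leaf.supergroup for leaf in leaves(node)}
    return bool(groups & UNIKONTA) and bool(groups & BIKONTA)


def shared_species(a, b):
    """Species represented in both clades."""
    return {l.species for l in leaves(a)} & {l.species for l in leaves(b)}


def find_duplications(node, min_overlap=2):
    """Return (duplication_nodes, leca_gene_nodes) for a FECA gene node."""
    if len(node.children) == 2:
        left, right = node.children
        if (is_leca(left) and is_leca(right)
                and len(shared_species(left, right)) >= min_overlap):
            dups, genes = [node], []
            for child in (left, right):
                child_dups, child_genes = find_duplications(child, min_overlap)
                dups += child_dups
                genes += child_genes
            return dups, genes
    # No duplication at this node: it counts as a single LECA gene.
    return [], [node]


def leaf(species, supergroup):
    return Node(species=species, supergroup=supergroup)


def paralog_clade():
    return Node(children=[leaf("Capsaspora owczarzaki", "Opisthokonta"),
                          leaf("Naegleria gruberi", "Excavata")])


feca = Node(children=[paralog_clade(), paralog_clade()])
dups, genes = find_duplications(feca)
print(len(dups), len(genes))  # 1 2
```

In the toy tree, both daughters of the FECA gene node contain a Unikont and a Bikont and share two species, so the node is called a duplication and each daughter counts as one LECA gene.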

Identifying LECA genes and gene duplications in eukaryote-only gene trees

Since eukaryote-only trees contained no candidate outgroup sequences that could provide a root for the tree, we rooted these trees based on FECA-to-LECA duplication nodes, if present. Treating the tree in an unrooted fashion, we visited each internal node in the tree, and checked if it could reflect such a duplication. In order to be a duplication node, we required that each of its three daughters fulfilled our LECA criterion (presence in both Unikonta and Bikonta), and that each pair of daughters fulfilled the species overlap criterion. If a node met these criteria, it was annotated as a duplication node, and its daughters were annotated as LECA gene nodes. After tree traversal, we collected the LECA gene nodes and rooted the tree at the midpoint of all distances between pairs of LECA genes in the tree. The root itself was also annotated as a duplication node. Trees in which no duplication nodes were identified were considered to contain a single LECA gene, and were neither rooted nor further analyzed. However, such a tree did count both as a ‘LECA gene’ and as a single ‘FECA’ gene in Figure 3A. If a tree did contain gene duplications, we inferred the duplication length for each of these duplication nodes (Figure 2B).
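The three-daughter test for unrooted, eukaryote-only trees can be sketched as follows. The sketch assumes that the three daughters of an internal node are already available as lists of (species, supergroup) pairs and that the species overlap threshold is two, as for the rooted trees; the function names are hypothetical, and the subsequent midpoint rooting is not shown.

```python
from itertools import combinations

UNIKONTA = {"Opisthokonta", "Amoebozoa"}
BIKONTA = {"Archaeplastida", "Excavata", "SAR"}


def is_leca(clade):
    """LECA criterion for a clade given as (species, supergroup) pairs."""
    groups = {supergroup for _, supergroup in clade}
    return bool(groups & UNIKONTA) and bool(groups & BIKONTA)


def is_unrooted_duplication(daughters, min_overlap=2):
    """Check an internal node of an unrooted tree via its three daughters."""
    if len(daughters) != 3:
        return False
    # Each of the three daughters must fulfil the LECA criterion ...
    if not all(is_leca(d) for d in daughters):
        return False
    # ... and each pair of daughters must share enough species.
    for a, b in combinations(daughters, 2):
        shared = {sp for sp, _ in a} & {sp for sp, _ in b}
        if len(shared) < min_overlap:
            return False
    return True


clade = [("Capsaspora owczarzaki", "Opisthokonta"),
         ("Naegleria gruberi", "Excavata")]
unikont_only = [("Capsaspora owczarzaki", "Opisthokonta"),
                ("Acanthamoeba castellanii", "Amoebozoa")]

print(is_unrooted_duplication([clade, clade, clade]))         # True
print(is_unrooted_duplication([clade, clade, unikont_only]))  # False
```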

Author contributions

JJEH performed the research. AV performed the pilot study preceding the research. JV and MMH aided in the development of the tree analysis pipeline. JJEH, BS and TG conceived the project. JJEH and BS wrote the manuscript. GK aided in writing the manuscript.

Acknowledgments

We kindly thank Kira Makarova and Eugene Koonin for sharing their KOG-to-COG protein clusters with us. We also thank Rob Field for providing protein predictions from transcriptome sequences of Euglena gracilis.

Supplementary Material

[Supplementary Figure 1, panels. A: raw duplication lengths of daughter 1 versus daughter 2, with random assignment of daughters 1 and 2 (r = 0.357) and with daughter 2 assigned the largest estimate (r = 0.67). B: median eukaryotic branch length (ebl) of daughter 1 versus daughter 2, with random assignment of daughters 1 and 2 (r = 0.378).]

Supplementary Figure 1. Two raw duplication length estimates of a single FECA-to-LECA duplication and differences between median eukaryotic branch lengths of paralogous pairs. A. Similar to Figure 4A, but here the raw duplication lengths (Figure 2) are depicted. Raw duplication lengths > 5 were not taken into account. B. For the most terminal duplication nodes (duplication nodes that have no daughters that are duplication nodes), the ebls (Figure 2) of the daughter LECA genes, which are pairs of paralogs, were compared. Ebls > 5 were not taken into account.

7. Discussion

In this thesis, I delineated the evolution of the kinetochore by reconstructing its origins before LECA and its divergence after LECA. The kinetochore plays a vital role in eukaryotic cell division: it connects the chromosomes to spindle microtubules and thereby ensures that the chromosomes are moved towards the future daughter cells. Because we aimed to construct the best, most detailed picture of the evolution of the kinetochore, we largely studied the evolution of the kinetochore proteins ‘by hand’, or, perhaps more appropriately, ‘by eye’ (Chapters 3, 4 and 5). As I argued in Chapter 2, such a manual analysis enabled me to cope with protein-specific difficulties, of which I encountered many examples while studying kinetochore proteins. On the other hand, quantitative insights into the generic patterns of genome evolution require automated, large-scale analysis, as I applied in Chapter 6. Indeed, this enabled us to analyze how, between the first and the last eukaryotic common ancestor, the genome of the pre-eukaryotic lineage complexified due to horizontal and endosymbiotic gene transfers, invention of novel genes and large-scale gene duplications. In this discussion, I will reflect on my analyses, observations and conclusions and I will provide an outlook on the field of cellular evolution of eukaryotes. I explore if, and how, the kinetochore serves as a model system for genome evolutionary dynamics before and after LECA. I discuss how comparative genomics could guide cell biology of non-model organisms. In turn, broader cellular experimentation could aid in reconstructing the evolution of specific cellular entities such as the kinetochore.

Kinetochore as a model for evolution of eukaryotic cellular processes?

The genome sequencing revolution has put on the table questions about how genomes evolve, and what drives their evolution. Often, models of genome evolution, such as those on the evolutionary fate of gene duplicates, are developed, supported or adjusted by case studies. Famous cases showed that gene duplication triggered the evolution of C4 photosynthesis [416], of flowers [417] and of the adaptive immune system [418]. Could the evolution of the kinetochore likewise provide a case example that can contribute to our understanding of the evolution of eukaryotic cellular processes? Here, I attempt to extract to what extent kinetochore evolution confirms common evolutionary trajectories, how it follows uncommon trajectories, and which broader questions it poses.

Inflation-streamlining in the kinetochore

Comparative genomics as well as modeling studies have revealed that genomes often evolve according to ‘inflation-streamlining’, in which relatively complex ancestors radiate into many daughter lineages with reduced genomes (Chapter 1). The kinetochore may evolve in line with such inflation-streamlining dynamics. In Chapters 3 and 5, I reconstructed the kinetochore of the last eukaryotic common ancestor (LECA) and demonstrated that it was highly complex, i.e. contained many proteins. This complexity resulted from recruitment of proteins to the kinetochore and subsequent duplications (Chapter 5), which can be considered ‘inflation’. Subsequently, the kinetochore was reduced through many losses in more recent eukaryotic lineages. Notably, 83% of the examined current-day eukaryotes encode fewer kinetochore components than LECA. This percentage is partially biased by the data, because we based our inferences only on the characterized kinetochores of human and yeast. In fact, other kinetochores very likely contain other proteins, which we did not account for. It may also be biased by the way we look at the data: because we define orthology on the level of eukaryotes, recent duplicates only count as a single kinetochore protein, while both paralogs may be involved in the kinetochore and hence contribute to increased complexity in current-day lineages. Nevertheless, even if we account for such duplications (by considering duplicates as individual kinetochore proteins), still 66% of the eukaryotes have smaller kinetochores than LECA. Hence, it seems reasonable to conclude that after LECA, many lineages reduced, or streamlined, their kinetochores. Do these findings indeed indicate that the kinetochore evolved in an inflation-streamlining-like manner? Often, the ancestral genome is inflated by gene families that expanded through gene and genome duplications. Indeed, many of the kinetochore components are paralogous to one another (Chapter 5), so the kinetochore likely inflated through expansion, for example of RWD and histone domains. On the other hand, kinetochore streamlining does not seem to occur primarily via loss of the paralogs that were created during this expansion, as the typical inflation-streamlining hypothesis posits.
Rather, the kinetochore streamlining seems to have proceeded mostly via loss of functional modules, such as the CCAN (Chapter 3). The CCAN of LECA was composed of 17 proteins (Chapter 5). It was lost long ago (quickly after LECA, e.g. in the archaeplastid, alveolate and excavate ancestors) as well as more recently (e.g. in the dikaryan fungus Cryptococcus neoformans). Since the complete complex was lost, it seems unlikely that ancient paralogous proteins are still able to fulfill the ancient, pre-expansion function. For example, it seems unlikely that other RWD proteins of the kinetochore, such as Knl1, are able to take over the CCAN function of CenpP or CenpO in lineages that lost the latter proteins: such a substitution would not make much sense since these lineages do not have a CCAN at all. Since streamlining hence mostly occurs through loss of kinetochore modules, we can conclude that already before LECA, various ancient paralogs sub/neofunctionalized towards these different modules. In LECA, the paralogs thus likely were no longer redundant. This contrasts with inflation-streamlining models that suggest streamlining via loss of redundant proteins [419], mostly when applied to whole-genome duplications as a means of inflation [60]. Hence, although the kinetochore superficially seems to follow inflation-streamlining dynamics, it does not entirely evolve as this theory predicts.

Constructive neutral evolution in the kinetochore

Another model of genome evolution, coined constructive neutral evolution, explains how (genomic) complexity is retained, while the phenotype, at least initially, is not significantly altered (Chapter 1). Recent kinetochore evolution might indeed encompass examples of constructive neutral evolution. The LECA kinetochore only contained a single copy of MadBub, a protein composed of a TPR domain and a kinase domain. We observed that many current-day lineages have two copies of this protein, which apparently resulted from multiple gene duplications in distinct lineages. These copies often subfunctionalized into one with a functional KEN box, and one with a functional kinase domain [143, 144]. Based on the observation that this trajectory was followed by multiple lineages independently, one might think that it alters the phenotype in an adaptive fashion, i.e. that it provides a selective advantage. However, evolutionary experiments showed that the ancestral protein, which contains both the KEN box and the kinase domain, confers a fitness indistinguishable from that of the paralogs [420]. Thus, although the genomes and the kinetochores became more complex, their phenotype might be similar. The MadBub protein seems to evolve as predicted by the duplication-degeneration-complementation (DDC) model. According to this model, pairs of paralogous proteins (‘duplication’) lose ancestral functions (‘degeneration’) in a differential manner, due to which they both become essential (‘complementation’) [421]. The DDC model thus presents a specific mode of constructive neutral evolution [422]. The kinetochore possibly encompasses some other examples of constructive neutral evolution, such as expansions that do not include duplication of an ancient kinetochore protein. Recently, BugZ and ARHGEF17 have been identified as kinetochore components in human cell lines [223, 224, 423].
These proteins were both proposed to function in the spindle assembly checkpoint (SAC, Chapter 1). Currently no evidence exists that orthologs of these proteins have roles in the kinetochores of other species, such as budding yeast. In fact, budding yeast does not even have a 1-to-1 ortholog of ARHGEF17. Combined with the fact that these proteins correlate poorly with other SAC proteins across eukaryotic lineages (Chapter 3), we propose that they became involved in the SAC recently. We hypothesize that, for example, the SAC protein Bub3 only recently has become dependent on BugZ, while the SAC output of Bub3 may have stayed the same. On longer evolutionary timescales, a protein such as BugZ may allow for adaptive evolution, for example by fine-tuning Bub3’s output. Indeed, although not driving fixation, the increased complexity may allow for adaptive traits to arise, to be ‘co-opted’ as such [424]. Possibly, constructive neutral evolution also played a role in the pre-LECA evolution of the kinetochore. The subsequent loss of various kinetochore modules, such as the CCAN, may suggest that some of these are indeed ‘neutral’. However, since these modules can apparently be lost, we cannot speak of ‘irremediable complexity’, which is often used as a synonym for constructive neutral evolution [57]: the complexity surely appears remediable. While in current-day lineages that have the CCAN, its proteins might be essential, rapid loss of the complete module may under certain, yet unidentified circumstances be allowed for.

Analogous kinetochore proteins may require new theories

Some patterns in kinetochore evolution do not align well with either inflation-streamlining or constructive neutral evolution. One of these is non-homologous displacement: since LECA, the Ska complex (Ska-C) was recurrently displaced by the Dam1 complex (Dam1-C). As opposed to the processes discussed above, this displacement results in neither a kinetochore expansion nor a reduction. Non-homologous gene displacement is not unprecedented: it has been observed before, but mostly for enzymatic proteins rather than for structural ones [425, 426]. This may have biological significance, or may be due to investigator bias; it might be more straightforward to determine if two enzymes are functionally equivalent or not. The example of Ska-C and Dam1-C poses the question how frequently structural proteins, or even protein complexes, are being displaced. How might this question be answered? A common way to detect analogous proteins is through phylogenetic profiling (Chapter 1): anticorrelating or complementary phylogenetic profiles of two proteins might indicate that these proteins perform the same job in different organisms [132, 426, 427], which is exactly what Ska-C and Dam1-C show. Possibly, large-scale phylogenetic profiling across all sorts of protein functions could identify other non-enzymatic analogous pairs. In a genome-wide screen, we measured that of all protein pairs, 2% anticorrelate more strongly than Ska-C and Dam1-C subunits (Chapter 4). It seems reasonable to assume that at least some of these protein pairs are non-enzymatic. It should be noted that many of these pairs may not be anticorrelating in the rare way Ska-C and Dam1-C do. While the phylogenetic profiles of Ska-C and Dam1-C both span the eukaryotic tree of life (probably due to HGT of Dam1-C), other analogous pairs of proteins will more likely follow a pattern in which one or both is/are restricted to a single clade.
For example, one protein might be present in all eukaryotic lineages, except for one in which the other protein is present. Such patterns are also observed for various analogous enzyme pairs, which are either present in Bacteria or in eukaryotes [425]. Moreover, the case of Ska-C and Dam1-C demonstrates that it is difficult, yet crucial, to understand how lineages transit between two analogous proteins or even protein complexes. How could Dam1-C originate and be maintained if Ska-C was still present, carrying out its function as normal? Was there any selective advantage that the primordial Dam1-C gave, or did it maybe originate after loss of Ska-C? And if there was actual displacement of Ska-C by Dam1-C in lineages the latter complex got transferred to, what selective advantage did it give them? Were these somehow pre-adapted to Dam1-C? Thus, the displacement of Ska-C by Dam1-C also poses ultimate questions: why, how, and under what conditions does an analogous protein arise and get integrated into an organism’s protein network?
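The notion of anticorrelating phylogenetic profiles can be made concrete with a small sketch. It assumes that a profile is a presence/absence vector over a fixed list of species and that similarity is measured with the Pearson correlation coefficient; the measure is chosen here for illustration only, and the two toy profiles are invented and do not represent Ska-C or Dam1-C data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long numeric profiles."""
    if len(x) != len(y) or not x:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / (sd_x * sd_y)


# Invented presence (1) / absence (0) profiles across six hypothetical species.
profile_a = [1, 1, 0, 0, 1, 0]
profile_b = [0, 0, 1, 1, 0, 1]  # present exactly where profile_a is absent

print(round(pearson(profile_a, profile_b), 3))  # -1.0
```

A strongly negative coefficient marks complementary profiles, which are candidates for analogous proteins. Whether such a pair indeed reflects displacement still has to be judged from the distribution of both proteins across the species tree.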

Alternative dynamics as revealed by co-evolutionary analysis

The example of Ska-C and Dam1-C also shows that co-evolutionary analysis aids in determining what evolutionary phenomenon we are facing: if we had inspected the Ska-C or Dam1-C phylogenetic profile in isolation, we would not have inferred non-homologous gene displacement. There are some additional lessons we can learn from detecting co-evolution through phylogenetic profiles. It taught us that the kinetochore functions and evolves in a modular fashion. These complexes, or modules, are differently sized (3-13 in Chapter 3, Figure 1). Apparently this co-evolution is also detectable for larger complexes. Hence, both in large and in small complexes the subunits might depend strongly on one another. Large complexes may equally well get lost rapidly, as particularly the CCAN shows. Moreover, we have observed that some proteins, such as TRIP13 and ZW10, co-evolve with multiple other ones, while these other proteins do not co-evolve with each other (Chapter 3). This might indicate that the ancestral protein had a dual role (discussed in Chapter 3). In case of such ‘three-way’ co-evolution, co-evolution with one protein may also be restricted to a specific clade. This may indicate that a protein gained another function in this specific clade. Maybe phylogenetic profiling could therefore also be used to detect a functional switch, and thus aid in predicting whether ‘the ortholog conjecture’ applies to a given pair of orthologs or not. As discussed above, studying the evolution not just of single proteins, but of a protein network such as the kinetochore, may enable us to detect candidate cases of constructive neutral evolution: through this analysis we can detect which proteins were likely added to such a network after the network’s invention, possibly indicating constructive neutral evolution. However, proving that such an addition is neutral will be epistemologically challenging.
Since the number of sequenced eukaryotic genomes is increasing, the resolution to detect co-evolution increases as well, which allows for detecting also the more fine-grained signals of co-evolution (Chapter 3). For example, it may be possible to uncover the order in which displacement or subunit loss occurs. For this reason, I foresee that using and improving co-evolution detection methods will remain highly valuable.

Predictions on non-model kinetochores

As often mentioned, our evolutionary analysis of the kinetochore greatly suffers from a lack of information on non-model organisms’ kinetochores. Nevertheless, our predictions on non-model organisms’ kinetochores might be useful as a means to guide their cellular biology. In fact, we are making use of our computational predictions to study ciliate, apicomplexan and dinoflagellate kinetochores. Based on our mapping of yeast and human kinetochore components onto other species, we could pinpoint which species’ kinetochores are most unusual and thus prioritize which to study experimentally. More systematically, one could collapse those lineages with (seemingly) similar kinetochores, and for each of the resulting clusters pick one species that serves as a model. Whereas within the Archaeplastida one model species might suffice, more are needed from the SAR supergroup (Chapter 3, Figure 1). Moreover, our mapping could be used to detect analogous kinetochores or kinetochore complexes, as discussed above. Detection of co-evolution, perhaps of pairs of complexes rather than of pairs of proteins, may uncover analogs of, for example, the CCAN. Indeed, detection of co-evolution on the level of evolutionary modules has proven successful [134]. Although considered essential, the kinetochore itself might actually be displaced: the ‘unconventional’ kinetochore of kinetoplastids (Excavata) was reported to be non-homologous [116]. While the kinetoplastid kinetochore likely would not be predicted based on anti-correlating phylogenetic profiles in our current proteome dataset, this might become possible in the future if the number of genomes in the database used is sufficiently large. Of course, such predictions need to be validated experimentally.

With respect to the kinetoplastid kinetochore, I would like to point out that it might not be completely unrelated to other kinetochores after all. Recently, D’Archivio & Wickstead found that it might contain an Ndc80/Nuf2-like protein [113]. We observed that two of the reported unconventional kinetoplastid kinetochore proteins (KKTs) appear orthologous to regular kinetochore proteins. The sequence of KKT14 might be most similar (unpublished data), and therefore orthologous, to MadBub. KKT15 is positioned next to the orthologous groups of Bub3 and Rae1 in the phylogeny of many WD40 domain-containing proteins (Chapter 5). Since these kinetoplastids (Leishmania major, Trypanosoma brucei) have another sequence within the Rae1 orthologous group, this suggests that KKT15 belongs to the Bub3 orthologous group, although it strictly does not fall into its cluster.

While experimental characterization of non-model kinetochores will help to further reconstruct the evolution of the kinetochore and its evolutionary dynamics, it will also aid in testing predictions, including the ortholog conjecture itself (Chapter 1). Furthermore, studying the kinetochores of other eukaryotes may answer the question to what extent the kinetochore is indeed a vital structure. Maybe very minimal kinetochores suffice in some species, e.g. a microtubule-binding protein that is directly connected to the chromosomal DNA. Maybe in some species a kinetochore is not required at all, for example because chromosome segregation is executed without a mitotic spindle, or because (equal) chromosome segregation is not required. To exemplify the first: some ciliates separate their specialized macronuclei without centromeres and a mitotic spindle, a process that is called ‘amitosis’ [428]. Furthermore, these macronuclei are highly polyploid, due to which chromosome segregation may not need to be strictly regulated. After all, due to this polyploidy their offspring nuclei are unlikely to be devoid of any genes. Following a similar line of reasoning, some highly polyploid archaeal species have been suggested to not deploy a genome segregation system at all [25].

By examining the kinetochores and chromosome segregation systems of more diverse eukaryotes, a better model of these features in LECA can be sketched. Hopefully, this will enable us to answer questions such as whether LECA had open or closed mitosis, or an intermediate form (Chapter 1), and which mitosis type thus evolved from which in more recent eukaryotic lineages. In order to assess the evolutionary flexibility of the kinetochore, and possibly to correlate kinetochore diversity with phenotypic features, it might be most fruitful to study closely related lineages of which the (predicted) kinetochores differ strongly. Also in choosing these, our mapping of kinetochore proteins might be useful.

More data from extant species make more complex ancestors

While our interpretation that LECA had a complex kinetochore aligns with the ‘inflation-streamlining’ hypothesis, it also seems inherent to the way we interpret the data. We may be overestimating the complexity of the LECA kinetochore and of the LECA genome in general. Our inference of the LECA kinetochore relies on two assumptions. First, we generally assume that proteins that function in the kinetochore in current-day lineages also already did so in LECA. Second, we generally assume that proteins that are present on both sides of the hypothesized eukaryotic root were present in LECA’s genome, even if they are absent from many current-day species. Due to these assumptions, more data, either genomic or experimental, tend to complexify the LECA kinetochore. To me, this seems an undesirable trend, because it would mean that kinetochore evolution in recent eukaryotes is dominated by simplification, while I would assume that simplification and complexification would be more balanced. It might be partially solved by closely assessing if these assumptions apply to a given protein.

The first assumption relates to the ortholog conjecture (Chapter 1). Assuming that the (single) function we see in current-day species is also the ancestral one can be disputed, for example because many proteins have been shown to operate in multiple processes, such as so-called ‘moonlighting proteins’ [429] and promiscuous proteins with intrinsically disordered regions [430]. Moreover, proteins may evolve new functions, particularly paralogs subject to neofunctionalization [431]. How can we identify proteins that may not have the ancestral function, in order to exclude them from the reconstruction of a particular process or machinery in the ancestor, such as the kinetochore? As mentioned, one could evaluate whether the protein co-evolves with other components of this process or machinery [222]. Moreover, if the protein relies on a protein motif to execute the function of interest, while this motif is only found in a few closely related species, this might be another reason to exclude this protein from the reconstruction. Finally, information on alternative function(s) in (multiple) other species may of course point to another ancestral function.

The second assumption is associated with the inflation-streamlining hypothesis of genome evolution. Not only the last common ancestor of eukaryotes has been observed to be complex, but also the ancestors of for example the Archaea and the bacterial phylum of Rickettsia [56]. Although these ancestral genomes probably were complex indeed [432], their complexity might be overestimated to some degree. This overestimation might occur if HGT has been underestimated in the reconstruction methods, resulting in what Doolittle et al. designated a ‘genome of Eden’ [433]. Particularly LECA reconstructions might face this risk, given that most of these do not account for any transfer at all. As mentioned, HGT is considered relatively rare in eukaryotes [53], but its frequency is a topic of intense debate [197, 434-436]. Studies in eukaryotes often reported HGT in particular lineages, such as fungi, and bacterial species frequently seem to be the donor [54, 437, 438]. Although proving HGT can be challenging, for example due to poorly supported gene phylogenies (see e.g. Chapter 4), observation-based estimates of HGT in eukaryotes would be highly valuable. Maybe such estimates need to be lineage- and/or process-specific, as not all lineages and not all functional categories are subject to HGT equally frequently.

Small- and large-scale studies complement each other in illuminating eukaryogenesis

Eukaryotes possess many features prokaryotes lack (Chapter 1), and, as a result, these features are often subject to studies that attempt to reconstruct their origins, as we did for the kinetochore (Chapter 5). These studies yielded important insights into the prokaryotic origins and the evolutionary dynamics of many machineries, pathways and structures [72]. Collectively, these studies revealed that 1) proteins operating in eukaryote-specific systems can be related to proteins from diverse Bacteria and Archaea, or likely are eukaryote-specific, 2) within a given system, proteins may come from different domains of life, leading to ‘chimeric’ systems, 3) many of these proteins were derived from FECA-to-LECA duplications, whose products either formed protein complexes within a particular system, or contributed to different systems. In case of the latter, the subunits of an ancient protein complex may even have co-duplicated, due to which one can actually speak of ‘paralogous systems’. These observations bring forward questions about the general patterns of genome evolution during eukaryogenesis. Which prokaryotes contributed to what degree to the genome of LECA? Can we uncover the underlying events, such as HGT from a specific prokaryote clade into the pre-eukaryotic lineage? How did functions evolve, such that proteins acquired from different sources became capable of cooperating with each other in completely novel, primordial eukaryotic systems? What is the dominant way in which FECA-to-LECA duplications contributed to eukaryotic systems: within or across systems? Answering such questions requires large-scale comparative genomics, and indeed, such studies have been carried out [65, 76, 391]. In general, these studies confirm what has been observed in small-scale studies of specific eukaryotic systems. Most convincingly they confirm the mixed origins of the eukaryotic genome, and the relatively small contribution of Archaea to the eukaryotic genome, although this latter observation might partially result from the relatively small number of sequenced archaeal genomes. These studies also confirm, and agree on, the functional biases of the genes from different prokaryotic origins: typically, information-processing proteins come from Archaea and metabolic proteins from Bacteria. In particular, they all reveal, as we also did in Chapter 6, that many genes have origins related to Bacteria other than Alphaproteobacteria. It is however not always possible to pinpoint which bacterial lineage actually was the donor. As we discussed in Chapter 6, finding the donor might be hampered by extensive, old and ongoing HGT among Bacteria, or even between Bacteria and Archaea.

Phylogenomics studies are expected to confer certain advantages over non-phylogenetic inferences. They are assumed to best represent the actual origins and successive duplications of genes, and they can be used for branch length analysis and timing. However, these phylogenomics studies often do not seem to cover the ancestries of all genes present in LECA. LECA has been estimated to have encoded ~4000 proteins [76, 203], of which previous such studies analyzed only a small subset [65, 391] (because they do not explicitly cover FECA-to-LECA duplications, the exact number is unclear). In part, this gap is simply caused by the absence of eukaryote-only proteins in these studies, but in part it is likely also due to the elimination of families that were unsuitable for the analysis, e.g. due to low statistical support for essential nodes. Unclear phylogenies are often expected to improve when more data are added. As long as the protein families remain computationally feasible, new genome data may help. More and new prokaryotic genomes could help to specify the origins of eukaryotic genes with higher certainty. However, as we observed, adding more proteins also increases the chance that interpreting the tree in an automated fashion becomes more difficult, for example because a prokaryotic sequence falls into the eukaryotic cluster and thereby breaks up the eukaryotic monophyly (Chapter 6). Possibly, informed subsampling of the genomes, such as selecting genomes that evolve slowly and are not often involved in HGT, could help to improve the trees. Many trees that are difficult to analyze automatically might nevertheless be understood when analyzed visually, ‘by hand’. For this reason, small-scale studies will continue to enhance our understanding of eukaryogenesis (Chapter 2). An important challenge will be how to incorporate these into large-scale examinations, such that they not just anecdotally illustrate eukaryogenesis, but also contribute to a global eukaryogenesis model.
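The problem of a prokaryotic sequence breaking up the eukaryotic monophyly can be made concrete with a small sketch. It is only an illustration and not the procedure used in Chapter 6. It assumes a rooted tree written as nested tuples and a hypothetical leaf-naming convention in which a prefix marks the domain (`E|` eukaryote, `B|` bacterium, `A|` archaeon). The function names are likewise invented for this example.

```python
# Illustrative sketch: detect prokaryotic sequences nested within the
# eukaryotic cluster of a rooted gene tree.
# Assumptions (not from the thesis): trees are nested tuples, and leaf
# labels carry a domain prefix such as "E|", "B|" or "A|".

def leaves(tree):
    """Return all leaf labels below a node."""
    if isinstance(tree, str):
        return [tree]
    result = []
    for child in tree:
        result.extend(leaves(child))
    return result


def is_eukaryote(label):
    """True if the (hypothetical) label prefix marks a eukaryotic sequence."""
    return label.startswith("E|")


def count_eukaryotes(tree):
    """Number of eukaryotic leaves below a node."""
    return sum(1 for label in leaves(tree) if is_eukaryote(label))


def smallest_clade_with_all_eukaryotes(tree):
    """Return the smallest subtree that contains every eukaryotic leaf."""
    total = count_eukaryotes(tree)
    node = tree
    while not isinstance(node, str):
        # Descend while a single child still holds all eukaryotic leaves.
        holders = [c for c in node if count_eukaryotes(c) == total]
        if not holders:
            break
        node = holders[0]
    return node


def intruders(tree):
    """List prokaryotic leaves that break up the eukaryotic monophyly."""
    if count_eukaryotes(tree) == 0:
        return []  # no eukaryotic sequences, so nothing to break up
    clade = smallest_clade_with_all_eukaryotes(tree)
    return sorted(label for label in leaves(clade) if not is_eukaryote(label))


# A tree with a monophyletic eukaryotic cluster ...
clean = ((("E|Homo", "E|Arabidopsis"), "E|Naegleria"), ("A|Loki", "B|Ecoli"))
# ... and one in which a bacterial sequence falls inside that cluster.
broken = ((("E|Homo", "B|Rickettsia"), "E|Naegleria"), ("A|Loki", "B|Ecoli"))

print(intruders(clean))
print(intruders(broken))
```

An automated pipeline could flag trees with a non-empty intruder list for visual inspection rather than discarding them outright. Because the check depends on the rooting, an unrooted tree would first need to be rooted or evaluated per bipartition.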


References

1. Knoll, A.H. et al. (2006) Eukaryotic organisms in Proterozoic oceans. Philosophical Transactions of the Royal Society B: Biological Sciences 361 (1470), 1023-1038. 2. Eme, L. et al. (2014) On the Age of Eukaryotes: Evaluating Evidence from Fossils and Molecular Clocks. Cold Spring Harbor Perspectives in Biology 6 (8). 3. Hori, K. et al. (2014) Klebsormidium flaccidum genome reveals primary factors for plant terrestrial adaptation. Nat Commun 5, 3978. 4. Wu, J. et al. (2003) Identification of functional links between genes using phylogenetic profiles. Bioinformatics 19 (12), 1524-30. 5. Glazko, G.V. and Mushegian, A.R. (2004) Detection of evolutionarily stable fragments of cellular pathways by hierarchical clustering of phyletic patterns. Genome Biol 5 (5), R32. 6. Jeyaprakash, A.A. et al. (2012) Structural and functional organization of the Ska complex, a key component of the kinetochore-microtubule interface. Mol Cell 46 (3), 274-86. 7. Lake, J.A. et al. (1984) Eocytes: a new ribosome structure indicates a kingdom with a close relationship to eukaryotes. Proceedings of the National Academy of Sciences of the United States of America 81 (12), 3786-3790. 8. Rivera, M.C. and Lake, J.A. (1992) Evidence That Eukaryotes and Eocyte Prokaryotes Are Immediate Relatives. Science 257 (5066), 74-76. 9. Zaremba-Niedzwiedzka, K. et al. (2017) Asgard archaea illuminate the origin of eukaryotic cellular complexity. Nature 541 (7637), 353-358. 10. Spang, A. et al. (2015) Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature 521, 173. 11. Vader, G. (2015) Pch2TRIP13: controlling cell division through regulation of HORMA domains. Chromosoma 124 (3), 333-339. 12. Maaten, L.v.d. and Hinton, G. (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), 2579-2605. 13. Söding, J. (2005) Protein homology detection by HMM–HMM comparison. Bioinformatics 21 (7), 951-960. 14. van Hooff, J.J. et al. 
(2017) Evolutionary dynamics of the kinetochore network in eukaryotes as revealed by comparative genomics. EMBO reports. 15. Ncbi Resource Coordinators (2015) Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 43 (D1), D6-D17. 16. Williams, T.A. et al. (2013) An archaeal origin of eukaryotes supports only two primary domains of life. Nature 504, 231. 17. Sagan, L. (1967) On the origin of mitosing cells. Journal of Theoretical Biology 14 (3), 225-274. 18. Brooks, C.F. et al. (2011) Toxoplasma gondii sequesters centromeres to a specific nuclear region throughout the cell cycle. Proceedings of the National Academy of Sciences 108 (9), 3767-3772. 19. Zelter, A. et al. (2015) The molecular architecture of the Dam1 kinetochore complex is defined by cross-linking based structural modelling. Nature Communications 6, 8673. 20. Ye, Q. et al. (2015) TRIP13 is a protein-remodeling AAA+ ATPase that catalyzes MAD2 conformation switching. Elife 4, e07367. 21. Madera, M. (2008) Profile Comparer: a program for scoring and aligning profile hidden Markov models. Bioinformatics 24 (22), 2630-2631. 22. Berman, H.M. et al. (2000) The Protein Data Bank. Nucleic Acids Research 28 (1), 235-242. 23. Petrovic, A. et al. (2016) Structure of the MIS12 Complex and Molecular Basis of Its Interaction with CENP-C at Human Kinetochores. Cell 167 (4), 1028-1040.e15. 24. Zhou, X. et al. (2017) Phosphorylation of CENP-C by Aurora B facilitates kinetochore attachment error correction in mitosis. Proceedings of the National Academy of Sciences 114 (50), E10667-E10676. 25. Barillà, D. (2016) Driving Apart and Segregating Genomes in Archaea. Trends in Microbiology 24 (12), 957-967. 26. Badrinarayanan, A. et al. (2015) Bacterial Chromosome Organization and Segregation. Annual Review of Cell and Developmental Biology 31 (1), 171-199. 27. Lindås, A.-C. and Bernander, R. (2013) The cell cycle of archaea. Nature Reviews Microbiology 11, 627. 28. Kensche, P.R. et al. 
(2008) Practical and theoretical advances in predicting the function of a protein by its phylogenetic distribution. J R Soc Interface 5 (19), 151-70. 29. Burki, F. et al. (2007) Phylogenomics Reshuffles the Eukaryotic Supergroups. PLOS ONE 2 (8), e790. 30. Sullivan, K.F. et al. (1994) Human CENP-A contains a histone H3 related histone fold domain that is required for targeting to the centromere. The Journal of Cell Biology 127 (3), 581-592. 31. Derelle, R. et al. (2015) Bacterial proteins pinpoint a single eukaryotic root. Proceedings of the National Academy of Sciences 112 (7), E693-E699. 32. Simpson, A. et al. (2017) Protist Diversity and Eukaryote Phylogeny. Handbook of the Protists: 1-21. 33. He, D. et al. (2014) An Alternative Root for the Eukaryote Tree of Life. Current Biology 24 (4), 465-470.

34. Finn, R.D. et al. (2014) Pfam: the protein families database. Nucleic Acids Research 42 (D1), D222-D230. 35. Chatr-aryamontri, A. et al. (2017) The BioGRID interaction database: 2017 update. Nucleic Acids Research 45 (D1), D369-D379. 36. Vilella, A.J. et al. (2009) EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Research 19 (2), 327-335. 37. Meluh, P.B. et al. (1998) Cse4p Is a Component of the Core Centromere of Saccharomyces cerevisiae. Cell 94 (5), 607-613. 38. Blower, M.D. and Karpen, G.H. (2001) The role of Drosophila CID in kinetochore formation, cell-cycle progression and heterochromatin interactions. Nature cell biology 3 (8), 730-739. 39. Burroughs, A.M. et al. (2015) Comparative genomic analyses reveal a vast, novel network of nucleotide-centric systems in biological conflicts, immunity and signaling. Nucleic Acids Research 43 (22), 10633-10654. 40. Buchwitz, B.J. et al. (1999) Cell division: A histone-H3-like protein in C. elegans. Nature 401 (6753), 547-548. 41. Takahashi, K. et al. (2000) Requirement of Mis6 Centromere Connector for Localizing a CENP-A-Like Protein in Fission Yeast. Science 288 (5474), 2215-2219. 42. Woese, C.R. and Fox, G.E. (1977) Phylogenetic structure of the prokaryotic domain: The primary kingdoms. Proceedings of the National Academy of Sciences 74 (11), 5088-5090. 43. Dubin, M. et al. (2010) Dynamics of a novel centromeric histone variant CenH3 reveals the evolutionary ancestral timing of centromere biogenesis. Nucleic Acids Research 38 (21), 7526-7537. 44. Talbert, P.B. et al. (2002) Centromeric localization and adaptive evolution of an Arabidopsis histone H3 variant. The Plant Cell 14 (5), 1053-1066. 45. Cervantes, M.D. et al. (2006) The CNA1 Histone of the Ciliate Tetrahymena thermophila Is Essential for Chromosome Segregation in the Germline Micronucleus. Molecular Biology of the Cell 17 (1), 485-497. 46. Hoeijmakers, W.A. et al. 
(2012) Plasmodium falciparum centromeres display a unique epigenetic makeup and cluster prior to and during schizogony. Cellular microbiology 14 (9), 1391-1401. 47. Keeling, P.J. (2010) The endosymbiotic origin, diversification and fate of plastids. Philosophical Transactions of the Royal Society B: Biological Sciences 365 (1541), 729. 48. Dawson, S.C. et al. (2007) The cenH3 histone variant defines centromeres in Giardia intestinalis. Chromosoma 116 (2), 175-184. 49. Janouškovec, J. et al. (2017) A New Lineage of Eukaryotes Illuminates Early Mitochondrial Genome Reduction. Current Biology 27 (23), 3717-3724.e5. 50. Brown, M.W. et al. (2018) Phylogenomics Places Orphan Protistan Lineages in a Novel Eukaryotic Super-Group. Genome Biology and Evolution 10 (2), 427-433. 51. Zubáčová, Z. et al. (2012) Histone H3 Variants in Trichomonas vaginalis. Eukaryotic Cell 11 (5), 654-661. 52. Ciccarelli, F.D. et al. (2006) Toward Automatic Reconstruction of a Highly Resolved Tree of Life. Science 311 (5765), 1283. 53. Ku, C. et al. (2015) Endosymbiotic origin and differential loss of eukaryotic genes. Nature 524 (7566), 427-432. 54. Keeling, P.J. and Palmer, J.D. (2008) Horizontal gene transfer in eukaryotic evolution. Nat Rev Genet 9. 55. Lukeš, J. et al. (2011) How a neutral evolutionary ratchet can build cellular complexity. IUBMB Life 63 (7), 528-537. 56. Wolf, Y.I. and Koonin, E.V. (2013) Genome reduction as the dominant mode of evolution. BioEssays 35 (9), 829-837. 57. Gray, M.W. et al. (2010) Irremediable Complexity? Science 330 (6006), 920-921. 58. Cuypers, T.D. and Hogeweg, P. (2012) Virtual Genomes in Flux: An Interplay of Neutrality and Adaptability Explains Genome Expansion and Streamlining. Genome Biology and Evolution 4 (3), 212-229. 59. Sémon, M. and Wolfe, K.H. (2007) Consequences of genome duplication. Current Opinion in Genetics & Development 17 (6), 505-512. 60. Van de Peer, Y. et al. (2009) The evolutionary significance of ancient genome duplications. 
Nature Reviews Genetics 10, 725. 61. Suga, H. et al. (2013) The Capsaspora genome reveals a complex unicellular prehistory of animals. Nature Communications 4, 2325. 62. Stanier, R. et al. (1963) The microbial world, 2nd ed., Prentice Hall, Englewood Cliffs. 63. Booth, A. and Doolittle, W.F. (2015) Eukaryogenesis, how special really? Proceedings of the National Academy of Sciences 112 (33), 10278-10285. 64. Martijn, J. et al. (2018) Deep mitochondrial origin outside the sampled alphaproteobacteria. Nature 557 (7703), 101-105. 65. Pittis, A.A. and Gabaldón, T. (2016) Late acquisition of mitochondria by a host with chimaeric prokaryotic ancestry. Nature 531 (7592), 101-104. 66. Degli Esposti, M. (2016) Late Mitochondrial Acquisition, Really? Genome Biology and Evolution 8 (6), 2031-2035. 67. Martin, W.F. et al. (2017) Late Mitochondrial Origin Is an Artifact. Genome Biology and Evolution 9 (2), 373-379. 68. Lane, N. and Martin, W. (2010) The energetics of genome complexity. Nature 467 (7318), 929-934. 69. Garg, S.G. and Martin, W.F. (2016) Mitochondria, the Cell Cycle, and the Origin of Sex via a Syncytial Eukaryote Common Ancestor. Genome Biology and Evolution 8 (6), 1950-1970. 70. Koonin, E.V. (2016) Viruses and mobile elements as drivers of evolutionary transitions. Philosophical Transactions of the Royal Society B: Biological Sciences 371 (1701). 71. Baum, D.A. and Baum, B. (2014) An inside-out origin for the eukaryotic cell. BMC biology 12 (1), 76.

72. Koonin, E.V. (2010) The origin and early evolution of eukaryotes in the light of phylogenomics. Genome Biology 11 (5), 209. 73. Mans, B. et al. (2004) Comparative genomics, evolution and origins of the nuclear envelope and nuclear pore complex. Cell cycle 3 (12), 1625-1650. 74. Hartman, H. and Fedorov, A. (2002) The origin of the eukaryotic cell: a genomic investigation. Proceedings of the National Academy of Sciences 99 (3), 1420-1425. 75. Aravind, L. et al. (2006) Comparative genomics and structural biology of the molecular innovations of eukaryotes. Current Opinion in Structural Biology 16 (3), 409-419. 76. Makarova, K.S. et al. (2005) Ancestral paralogs and pseudoparalogs and their role in the emergence of the eukaryotic cell. Nucleic Acids Research 33 (14), 4626-4638. 77. Hajduk, I.V. et al. (2016) Connecting the dots of the bacterial cell cycle: Coordinating chromosome replication and segregation with cell division. Seminars in Cell & Developmental Biology 53, 2-9. 78. Morgan, D.O. (1997) CYCLIN-DEPENDENT KINASES: Engines, Clocks, and Microprocessors. Annual Review of Cell and Developmental Biology 13 (1), 261-291. 79. Santaguida, S. and Amon, A. (2015) Short- and long-term effects of chromosome mis-segregation and aneuploidy. Nature Reviews Molecular Cell Biology 16, 473. 80. Balasubramanian, M.K. et al. (2004) Comparative Analysis of Cytokinesis in Budding Yeast, Fission Yeast and Animal Cells. Current Biology 14 (18), R806-R818. 81. Musacchio, A. and Salmon, E.D. (2007) The spindle-assembly checkpoint in space and time. Nature Reviews Molecular Cell Biology 8, 379. 82. Bhaud, Y. et al. (2000) Morphology and behaviour of dinoflagellate chromosomes during the cell cycle and mitosis. J Cell Sci 113 (7), 1231-1239. 83. De Souza, C.P.C. et al. (2004) Partial Nuclear Pore Complex Disassembly during Closed Mitosis in Aspergillus nidulans. Current Biology 14 (22), 1973-1984. 84. McEwen, B.F. et al. 
(2001) CENP-E Is Essential for Reliable Bioriented Spindle Attachment, but Chromosome Alignment Can Be Achieved via Redundant Mechanisms in Mammalian Cells. Molecular Biology of the Cell 12 (9), 2776-2789. 85. Moens, P.B. (1976) Spindle and kinetochore morphology of Dictyostelium discoideum. The Journal of Cell Biology 68 (1), 113. 86. Peterson, J.B. and Ris, H. (1976) Electron-microscopic study of the spindle and chromosome movement in the yeast Saccharomyces cerevisiae. Journal of Cell Science 22 (2), 219. 87. De Souza, C.P. and Osmani, S.A. (2009) Double duty for nuclear proteins – the price of more open forms of mitosis. Trends in Genetics 25 (12), 545-554. 88. Drechsler, H. and McAinsh, A.D. (2012) Exotic mitotic mechanisms. Open Biology 2 (12). 89. Sazer, S. et al. (2014) Deciphering the Evolutionary History of Open and Closed Mitosis. Current Biology 24 (22), R1099-R1103. 90. Verdaasdonk, J.S. and Bloom, K. (2011) Centromeres: unique chromatin structures that drive chromosome segregation. Nature Reviews Molecular Cell Biology 12, 320. 91. Steiner, F.A. and Henikoff, S. (2015) Diversity in the organization of centromeric chromatin. Current Opinion in Genetics & Development 31, 28-35. 92. McKinley, K.L. and Cheeseman, I.M. (2015) The molecular basis for centromere identity and function. Nature Reviews Molecular Cell Biology 17, 16. 93. Melters, D.P. et al. (2013) Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution. Genome Biology 14 (1), R10. 94. Henikoff, S. et al. (2001) The centromere paradox: stable inheritance with rapidly evolving DNA. Science 293. 95. Malik, H.S. and Henikoff, S. (2002) Conflict begets complexity: the evolution of centromeres. Curr Opin Genet Dev 12. 96. Plohl, M. et al. (2014) Centromere identity from the DNA point of view. Chromosoma 123 (4), 313-325. 97. Hyman, A.A. and Sorger, P.K. (1995) Structure and Function of Kinetochores in Budding Yeast. 
Annual Review of Cell and Developmental Biology 11 (1), 471-495. 98. Cuacos, M. et al. (2015) Atypical centromeres in plants—what they can tell us. Frontiers in Plant Science 6 (913). 99. Marques, A. and Pedrosa-Harand, A. (2016) Holocentromere identity: from the typical mitotic linear structure to the great plasticity of meiotic holocentromeres. Chromosoma 125 (4), 669-681. 100. Steiner, F.A. and Henikoff, S. (2014) Holocentromeres are dispersed point centromeres localized at transcription factor hotspots. Elife 3. 101. Drinnenberg, I.A. et al. (2014) Recurrent loss of CenH3 is associated with independent transitions to holocentricity in insects. Elife 3. 102. Musacchio, A. and Desai, A. (2017) A Molecular View of Kinetochore Assembly and Function. Biology 6 (1), 5. 103. Carroll, C.W. et al. (2010) Dual recognition of CENP-A nucleosomes is required for centromere assembly. The Journal of Cell Biology 189 (7), 1143. 104. Przewloka, Marcin R. et al. (2011) CENP-C Is a Structural Platform for Kinetochore Assembly. Current Biology 21 (5), 399-405. 105. Screpanti, E. et al. (2011) Direct binding of Cenp-C to the Mis12 complex joins the inner and outer kinetochore.

Current Biology 21 (5), 391-398. 106. Hornung, P. et al. (2014) A cooperative mechanism drives budding yeast kinetochore assembly downstream of CENP-A. J Cell Biol 206 (4), 509-524. 107. Kang, Y.H. et al. (2011) Mammalian polo-like kinase 1-dependent regulation of the PBIP1-CENP-Q complex at kinetochores. Journal of Biological Chemistry 286 (22), 19744-19757. 108. Tirupataiah, S. et al. (2014) Yeast Nkp2 is required for accurate chromosome segregation and interacts with several components of the central kinetochore. Molecular Biology Reports 41 (2), 787-797. 109. Rago, F. et al. (2015) Distinct Organization and Regulation of the Outer Kinetochore KMN Network Downstream of CENP-C and CENP-T. Current Biology 25 (5), 671-677. 110. Kitagawa, M. and Lee, S.H. (2015) The chromosomal passenger complex (CPC) as a key orchestrator of orderly mitotic exit and cytokinesis. Frontiers in cell and developmental biology 3, 14. 111. Hindriksen, S. et al. (2017) The ins and outs of Aurora B inner centromere localization. Frontiers in cell and developmental biology 5. 112. Tanaka, T.U. (2010) Kinetochore–microtubule interactions: steps towards bi‐orientation. The EMBO Journal 29 (24), 4070. 113. D’Archivio, S. and Wickstead, B. (2017) Trypanosome outer kinetochore proteins suggest conservation of chromosome segregation machinery across eukaryotes. Journal of Cell Biology 216 (2), 379-391. 114. Chan, Y.W. et al. (2012) Aurora B controls kinetochore-microtubule attachments by inhibiting Ska complex-KMN network interaction. J Cell Biol 196 (5), 563-71. 115. Kalantzaki, M. et al. (2015) Kinetochore–microtubule error correction is driven by differentially regulated interaction modes. Nature Cell Biology 17, 421. 116. Akiyoshi, B. and Gull, K. (2014) Discovery of Unconventional Kinetochores in Kinetoplastids. Cell 156 (6), 1247-1258. 117. Sibbald, S.J. and Archibald, J.M. (2017) More protist genomes needed. Nature Ecology & Evolution 1, 0145. 118. Ellegren, H. 
(2014) Genome sequencing and population genomics in non-model organisms. Trends in Ecology & Evolution 29 (1), 51-63. 119. Lewin, H.A. et al. (2018) Earth BioGenome Project: Sequencing life for the future of life. Proceedings of the National Academy of Sciences 115 (17), 4325-4333. 120. Bork, P. et al. (1998) Predicting function: from genes to genomes and back. Journal of Molecular Biology 283 (4), 707-725. 121. Haft, D.H. (2015) Using comparative genomics to drive new discoveries in microbiology. Current Opinion in Microbiology 23, 189-196. 122. Koonin, E.V. (2005) Orthologs, Paralogs, and Evolutionary Genomics. Annual Review of Genetics 39 (1), 309-338. 123. Ohno, S. (1970) Evolution by gene duplication. Springer, New York. 124. Hahn, M.W. (2009) Distinguishing among evolutionary models for the maintenance of gene duplicates. Journal of Heredity 100 (5), 605-617. 125. Gabaldón, T. and Koonin, E.V. (2013) Functional and evolutionary implications of gene orthology. Nature Reviews Genetics 14, 360. 126. Nehrt, N.L. et al. (2011) Testing the Ortholog Conjecture with Comparative Functional Genomic Data from Mammals. PLOS Computational Biology 7 (6), e1002073. 127. Studer, R.A. and Robinson-Rechavi, M. (2009) How confident can we be that orthologs are similar, but paralogs differ? Trends in Genetics 25 (5), 210-216. 128. Mushegian, A.R. and Koonin, E.V. (1996) A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proceedings of the National Academy of Sciences 93 (19), 10268. 129. Forterre, P. (2002) Displacement of cellular proteins by functional analogues from plasmids or viruses could explain puzzling phylogenies of many DNA informational proteins. Molecular Microbiology 33 (3), 457-465. 130. Pellegrini, M. (2012) Using Phylogenetic Profiles to Predict Functional Relationships. In Bacterial Molecular Networks: Methods and Protocols (van Helden, J. et al. eds), pp. 167-177, Springer New York. 131. 
Carvalho-Santos, Z. et al. (2011) Tracing the origins of centrioles, cilia, and flagella. The Journal of Cell Biology 194 (2), 165-175. 132. Morett, E. et al. (2003) Systematic discovery of analogous enzymes in thiamin biosynthesis. Nat Biotechnol 21 (7), 790-5. 133. Pazos, F. and Valencia, A. (2008) Protein co‐evolution, co‐adaptation and interactions. The EMBO Journal 27 (20), 2648. 134. Li, Y. et al. (2014) Expansion of Biological Pathways Based on Evolutionary Inference. Cell 158 (1), 213-225. 135. Li, B. et al. (2000) Identification of Human Rap1: Implications for Telomere Evolution. Cell 101 (5), 471-483. 136. Schalm, S.S. and Blenis, J. (2002) Identification of a Conserved Motif Required for mTOR Signaling. Current Biology 12 (8), 632-639. 137. Pellegrini, M. et al. (1999) Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proceedings of the National Academy of Sciences 96 (8), 4285. 138. Marks, D.S. et al. (2011) Protein 3D Structure Computed from Evolutionary Sequence Variation. PLOS ONE 6 (12), e28766. 139. Falck, J. et al. (2005) Conserved modes of recruitment of ATM, ATR and DNA-PKcs to sites of DNA damage.

Nature 434, 605. 140. Richards, T.A. et al. (2011) Horizontal gene transfer facilitated the evolution of plant parasitic mechanisms in the oomycetes. Proceedings of the National Academy of Sciences 108 (37), 15258-15263. 141. Anderson, D.P. et al. (2016) Evolution of an ancient protein function involved in organized multicellularity in animals. eLife 5, e10147. 142. Mykytyn, K. et al. (2004) Bardet–Biedl syndrome type 4 (BBS4)-null mice implicate Bbs4 in flagella formation but not global cilia assembly. Proceedings of the National Academy of Sciences of the United States of America 101 (23), 8664. 143. Suijkerbuijk, S.J. et al. (2012) The vertebrate mitotic checkpoint protein BUBR1 is an unusual pseudokinase. Dev Cell 22 (6), 1321-9. 144. Tromer, E. et al. (2016) Phylogenomics-guided discovery of a novel conserved cassette of short linear motifs in BubR1 essential for the spindle checkpoint. Open Biology 6 (12), 160315. 145. Atherton, J. et al. (2017) A structural model for microtubule minus-end recognition and protection by CAMSAP proteins. Nature Structural & Molecular Biology 24, 931. 146. Wickstead, B. et al. (2010) Patterns of kinesin evolution reveal a complex ancestral eukaryote with a multifunctional cytoskeleton. BMC Evolutionary Biology 10 (1), 110. 147. Talbert, P.B. et al. (2012) A unified phylogeny-based nomenclature for histone variants. Epigenetics & Chromatin 5 (1), 7. 148. Iyer, L.M. et al. (2004) Evolutionary history and higher order classification of AAA+ ATPases. Journal of Structural Biology 146 (1), 11-31. 149. Foth, B.J. et al. (2006) New insights into myosin evolution and classification. Proceedings of the National Academy of Sciences of the United States of America 103 (10), 3681. 150. Richards, T.A. and Cavalier-Smith, T. (2005) Myosin domain evolution and the primary divergence of eukaryotes. Nature 436, 1113. 151. Elias, M. et al. (2012) Sculpting the endomembrane system in deep time: high resolution phylogenetics of Rab GTPases. 
Journal of Cell Science 125 (10), 2500-2508. 152. Lindås, A.-C. et al. (2008) A unique cell division machinery in the Archaea. Proceedings of the National Academy of Sciences 105 (48), 18942. 153. Yutin, N. and Koonin, E.V. (2012) Archaeal origin of tubulin. Biology Direct 7 (1), 10. 154. Leonard, C.J. et al. (1998) Novel Families of Putative Protein Kinases in Bacteria and Archaea: Evolution of the “Eukaryotic” Protein Kinase Superfamily. Genome Research 8 (10), 1038-1047. 155. Holland, P.W.H. (1999) Gene duplication: Past, present and future. Seminars in Cell & Developmental Biology 10 (5), 541-547. 156. Basu, M.K. et al. (2008) Evolution of protein domain promiscuity in eukaryotes. Genome Research 18 (3), 449-461. 157. Tromer, E. et al. (2016) Phylogenomics-guided discovery of a novel conserved cassette of short linear motifs in bubr1 essential for the spindle checkpoint. Open Biology 6 (12). 158. Tekaia, F. (2016) Inferring Orthologs: Open Questions and Perspectives. Genomics Insights 9, GEI.S37925. 159. Bork, P. and Koonin, E.V. (1998) Predicting functions from protein sequences—where are the bottlenecks? Nature Genetics 18, 313. 160. Koster, Maria J.E. et al. (2015) Genesis of Chromatin and Transcription Dynamics in the Origin of Species. Cell 161 (4), 724-736. 161. Peviani, A. et al. (2016) The phylogeny of C/S1 bZIP transcription factors reveals a shared algal ancestry and the pre-angiosperm translational regulation of S1 transcripts. Scientific Reports 6, 30444. 162. Sanders, A.A.W.M. et al. (2015) KIAA0556 is a novel ciliary basal body component mutated in Joubert syndrome. Genome Biology 16 (1), 293. 163. Lynch, M. and Conery, J.S. (2000) The Evolutionary Fate and Consequences of Duplicate Genes. Science 290 (5494), 1151. 164. Maddison, D. and Schulz, K.-S. (2009) The tree of life web project. 2007. URL: http://tolweb.org. 165. Finn, R.D. et al. (2015) HMMER web server: 2015 update. Nucleic Acids Research 43 (W1), W30-W38. 166. Li, W. et al. 
(2015) The EMBL-EBI bioinformatics web and programmatic tools framework. Nucleic Acids Research 43 (W1), W580-W584. 167. Sonnhammer, E.L.L. and Östlund, G. (2015) InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic. Nucleic Acids Research 43 (D1), D234-D239. 168. Mi, H. et al. (2017) PANTHER version 11: expanded annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements. Nucleic Acids Research 45 (D1), D183-D189. 169. Zerbino, D.R. et al. (2018) Ensembl 2018. Nucleic Acids Research 46 (D1), D754-D761. 170. Altschul, S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25 (17), 3389-3402. 171. Altschul, S. (2011) The statistics of sequence similarity scores. World Wide Web electronic publication unknown. 172. Koonin, E.V. et al. (2000) Protein fold recognition using sequence profiles and its application in structural genomics. Advances in protein chemistry 54, 245-275. 173. Burmester, T. and Hankeln, T. (2014) Function and evolution of vertebrate globins. Acta Physiologica 211 (3), 501-514. 174. Findeisen, P. et al. (2014) Six Subgroups and Extensive Recent Duplications Characterize the Evolution of the Eukaryotic Tubulin Protein Family. Genome Biology and Evolution 6 (9), 2274-2288. 175. Jékely, G. (2003) Small GTPases and the evolution of the eukaryotic cell. BioEssays 25 (11), 1129-1138. 176. Reconciling Phylogenetic Trees. In Evolution after Gene Duplication. 177. Kamneva, O.K. and Ward, N.L. (2014) Chapter 9 - Reconciliation Approaches to Determining HGT, Duplications, and Losses in Gene Trees. In Methods in Microbiology (Goodfellow, M. et al. eds), pp. 183-199, Academic Press. 178. Nakhleh, L. (2013) Computational approaches to species phylogeny inference and gene tree reconciliation. Trends in Ecology and Evolution 28 (12), 719-728. 179. Kummerfeld, S.K. and Teichmann, S.A. (2005) Relative rates of gene fusion and fission in multi-domain proteins. Trends in Genetics 21 (1), 25-30. 180. Snel, B. et al. (2000) Genome evolution: gene fusion versus gene fission. Trends in Genetics 16 (1), 9-11. 181. Ekman, D. et al. (2007) Quantification of the Elevated Rate of Domain Rearrangements in Metazoa. Journal of Molecular Biology 372 (5), 1337-1348. 182. van Dam, T.J.P. et al. (2009) Phylogeny of the CDC25 homology domain reveals rapid differentiation of Ras pathways between early animals and fungi. Cellular Signalling 21 (11), 1579-1585. 183. De Lau, W.B. et al. (2012) The R-spondin protein family. Genome biology 13 (3), 242. 184. Fokkens, L. et al. (2010) Enrichment of homologs in insignificant BLAST hits by co-complex network alignment. BMC Bioinformatics 11 (1), 86. 185. Lambacher, N.J. et al. (2015) TMEM107 recruits ciliopathy proteins to subdomains of the ciliary transition zone and causes Joubert syndrome. Nature Cell Biology 18, 122. 186. Barker, A.R. et al. (2014) Bioinformatic analysis of ciliary transition zone proteins reveals insights into the evolution of ciliopathy networks. BMC Genomics 15 (1), 531. 187. Surkont, J. 
and Pereira-Leal, J.B. (2015) Evolutionary Patterns in Coiled-Coils. Genome Biology and Evolution 7 (2), 545-556. 188. Mistry, J. et al. (2013) Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Research 41 (12), e121. 189. Rackham, O.J.L. et al. (2010) The Evolution and Structure Prediction of Coiled Coils across All Genomes. Journal of Molecular Biology 403 (3), 480-493. 190. Ku, C. et al. (2015) Endosymbiotic gene transfer from prokaryotic pangenomes: Inherited chimerism in eukaryotes. Proceedings of the National Academy of Sciences. 191. Som, A. (2015) Causes, consequences and solutions of phylogenetic incongruence. Briefings in Bioinformatics 16 (3), 536-548. 192. Philippe, H. et al. (2011) Resolving Difficult Phylogenetic Questions: Why More Sequences Are Not Enough. PLOS Biology 9 (3), e1000602. 193. Whelan, S. et al. (2018) PREQUAL: detecting non-homologous characters in sets of unaligned homologous sequences. Bioinformatics, bty448-bty448. 194. Berke, L. and Snel, B. (2015) The plant Polycomb repressive complex 1 (PRC1) existed in the ancestor of seed plants and has a complex duplication history. BMC Evolutionary Biology 15 (1), 44. 195. Deutekom, E.S., van Dam, T.J.P. and Snel, B. (2018) Measuring the impact of gene prediction on gene loss estimates in eukaryotes by quantifying falsely inferred absences. submitted. 196. Koutsovoulos, G. et al. (2016) No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini. Proceedings of the National Academy of Sciences 113 (18), 5053. 197. Leger, M.M. et al. (2018) Demystifying Eukaryote Lateral Gene Transfer (Response to Martin 2017 DOI: 10.1002/bies.201700115). BioEssays 40 (5), 1700242. 198. Lee, S. et al. (2012) Characterization of Spindle Checkpoint Kinase Mps1 Reveals Domain with Functional and Structural Similarities to Tetratricopeptide Repeat Motifs of Bub1 and BubR1 Checkpoint Kinases. 
Journal of Biological Chemistry 287 (8), 5988-6001. 199. Nijenhuis, W. et al. (2013) A TPR domain–containing N-terminal module of MPS1 is required for its kinetochore localization by Aurora B. The Journal of Cell Biology 201 (2), 217. 200. Eme, L. et al. (2017) Archaea and the origin of eukaryotes. Nature Reviews Microbiology 15, 711. 201. del Campo, J. et al. (2014) The others: our biased perspective of eukaryotic genomes. Trends in Ecology & Evolution 29 (5), 252-259. 202. Cheng, S. et al. (2018) 10KP: A phylodiverse genome sequencing plan. GigaScience 7 (3), giy013-giy013. 203. Fritz-Laylin, L.K. et al. (2010) The genome of Naegleria gruberi illuminates early eukaryotic versatility. Cell 140 (5), 631-42. 204. Huerta-Cepas, J. et al. (2016) eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Research 44 (D1), D286-D293. 205. Piel, W. et al., TreeBASE v. 2: A Database of Phylogenetic Knowledge, e-BioSphere 2009, London, 2009. 206. Boettcher, B. and Barral, Y. (2013) The cell biology of open and closed mitosis. Nucleus 4 (3), 160-165. 207. Cheeseman, I.M. (2014) The kinetochore. Cold Spring Harbor perspectives in biology 6 (7), a015826. 208. Santaguida, S. and Musacchio, A. (2009) The life and miracles of kinetochores. The EMBO Journal 28 (17), 2511-2531.

209. De Souza, C.P.C. and Osmani, S.A. (2007) Mitosis, Not Just Open or Closed. Eukaryotic Cell 6 (9), 1521-1527. 210. van Hooff, J. et al. (2017) Unique phylogenetic distributions of the Ska and Dam1 complexes support functional analogy and suggest multiple parallel displacements of Ska by Dam1. Genome biology and evolution. 211. Drinnenberg, I.A. et al. (2016) Evolutionary Turnover of Kinetochore Proteins: A Ship of Theseus? Trends in Cell Biology 26 (7), 498-510. 212. Tromer, E. et al. (2015) Widespread Recurrent Patterns of Rapid Repeat Evolution in the Kinetochore Scaffold KNL1. Genome Biology and Evolution 7 (8), 2383-2393. 213. Vleugel, M. et al. (2012) Evolution and function of the mitotic checkpoint. Dev Cell 23 (2), 239-50. 214. Meraldi, P. et al. (2006) Phylogenetic and structural analysis of centromeric DNA and kinetochore proteins. Genome Biology 7 (3), R23. 215. Eme, L. et al. (2011) The phylogenomic analysis of the anaphase promoting complex and its targets points to complex and modern-like control of the cell cycle in the last common ancestor of eukaryotes. BMC Evol Biol 11, 265. 216. Gutiérrez-Caballero, C. et al. (2012) Shugoshins: from protectors of cohesion to versatile adaptors at the centromere. Trends in Genetics 28 (7), 351-360. 217. Zamariola, L. et al. (2013) SGO1 but not SGO2 is required for maintenance of centromere cohesion in Arabidopsis thaliana meiosis. Plant Reproduction 26 (3), 197-208. 218. Wang, X. and Dai, W. (2005) Shugoshin, a guardian for sister chromatid segregation. Experimental Cell Research 310 (1), 1-9. 219. Lampert, F. et al. (2010) The Dam1 complex confers microtubule plus end-tracking activity to the Ndc80 kinetochore complex. J Cell Biol 189 (4), 641-9. 220. Schmidt, Jens C. et al. (2012) The Kinetochore-Bound Ska1 Complex Tracks Depolymerizing Microtubules and Binds to Curved Protofilaments. Developmental Cell 23 (5), 968-980. 221. Fokkens, L. and Snel, B. 
(2009) Cohesive versus flexible evolution of functional modules in eukaryotes. PLoS Comput Biol 5 (1), e1000276. 222. Schneider, A. et al. (2013) Shared Protein Complex Subunits Contribute to Explaining Disrupted Co-occurrence. PLoS Comput Biol 9 (7), e1003124. 223. Toledo, C.M. et al. (2014) BuGZ is required for Bub3 stability, Bub1 kinetochore function, and chromosome alignment. Dev Cell 28 (3), 282-94. 224. Jiang, H. et al. (2014) A microtubule-associated zinc finger protein, BuGZ, regulates mitotic chromosome alignment by ensuring Bub3 stability and kinetochore targeting. Dev Cell 28 (3), 268-81. 225. Wan, Y. et al. (2015) Splicing function of mitotic regulators links R-loop–mediated DNA damage to tumor cell killing. The Journal of Cell Biology 209 (2), 235-246. 226. Jiang, H. et al. (2015) Phase Transition of Spindle-Associated Protein Regulate Spindle Apparatus Assembly. Cell 163 (1), 108-122. 227. Hirose, H. et al. (2004) Implication of ZW10 in membrane trafficking between the endoplasmic reticulum and Golgi. EMBO J 23 (6), 1267-78. 228. Chan, Y.W. et al. (2009) Mitotic control of kinetochore-associated dynein and spindle orientation by human Spindly. The Journal of Cell Biology 185 (5), 859-874. 229. Griffis, E.R. et al. (2007) Spindly, a novel protein essential for silencing the spindle assembly checkpoint, recruits dynein to the kinetochore. The Journal of Cell Biology 177 (6), 1005-1015. 230. Yamamoto, T.G. et al. (2008) SPDL-1 functions as a kinetochore receptor for MDF-1 in Caenorhabditis elegans. The Journal of Cell Biology 183 (2), 187-194. 231. Nagpal, H. and Fukagawa, T. (2016) Kinetochore assembly and function through the cell cycle. Chromosoma 125 (4), 645-659. 232. Gascoigne, Karen E. et al. (2011) Induced Ectopic Kinetochore Assembly Bypasses the Requirement for CENP-A Nucleosomes. Cell 145 (3), 410-422. 233. Singh, T.R. et al. (2010) MHF1-MHF2, a Histone-Fold-Containing Protein Complex, Participates in the Fanconi Anemia Pathway via FANCM. 
Molecular Cell 37 (6), 879-886. 234. Yan, Z. et al. (2010) A histone-fold complex and FANCM form a conserved DNA-remodeling complex to maintain genome stability. Mol Cell 37 (6), 865-78. 235. Westhorpe, F.G. and Straight, A.F. (2013) Functions of the centromere and kinetochore in chromosome segregation. Current Opinion in Cell Biology 25 (3), 334-340. 236. Wang, H. et al. (2004) Human Zwint-1 specifies localization of Zeste White 10 to kinetochores and is essential for mitotic checkpoint signaling. J Biol Chem 279 (52), 54590-8. 237. Kops, G.J. et al. (2005) ZW10 links mitotic checkpoint signaling to the structural kinetochore. J Cell Biol 169 (1), 49-60. 238. Pagliuca, C. et al. (2009) Roles for the conserved spc105p/kre28p complex in kinetochore-microtubule binding and the spindle assembly checkpoint. PLoS One 4 (10), e7640. 239. Jakopec, V. et al. (2012) Sos7, an essential component of the conserved Schizosaccharomyces pombe Ndc80-MIND-Spc7 complex, identifies a new family of fungal kinetochore proteins. Mol Cell Biol 32 (16), 3308-20. 240. Famulski, J.K. et al. (2008) Stable hZW10 kinetochore residency, mediated by hZwint-1 interaction, is essential for the mitotic checkpoint. J Cell Biol 180 (3), 507-20. 241. Zhang, G. et al. (2015) Distinct domains in Bub1 localize RZZ and BubR1 to kinetochores to regulate the checkpoint.

Nat Commun 6, 7162. 242. Petrovic, A. et al. (2014) Modular Assembly of RWD Domains on the Mis12 Complex Underlies Outer Kinetochore Organization. Molecular Cell 53 (4), 591-605. 243. Eytan, E. et al. (2014) Disassembly of mitotic checkpoint complexes by the joint action of the AAA-ATPase TRIP13 and p31(comet). Proc Natl Acad Sci U S A 111 (33), 12019-24. 244. Wang, K. et al. (2014) Thyroid Hormone Receptor Interacting Protein 13 (TRIP13) AAA-ATPase Is a Novel Mitotic Checkpoint-silencing Protein. Journal of Biological Chemistry 289 (34), 23928-23937. 245. Nelson, C.R. et al. (2015) TRIP13PCH-2 promotes Mad2 localization to unattached kinetochores in the spindle checkpoint response. Journal of Cell Biology 211 (3), 503-516. 246. Hara, K. et al. (2010) Crystal Structure of Human REV7 in Complex with a Human REV3 Fragment and Structural Implication of the Interaction between DNA Polymerase ζ and REV1. Journal of Biological Chemistry 285 (16), 12299-12307. 247. Suzuki, H. et al. (2015) Structure of the Atg101-Atg13 complex reveals essential roles of Atg101 in autophagy initiation. Nat Struct Mol Biol 22 (7), 572-580. 248. Jao, C.C. et al. (2013) A HORMA domain in Atg13 mediates PI 3-kinase recruitment in autophagy. Proceedings of the National Academy of Sciences 110 (14), 5486-5491. 249. Loïodice, I. et al. (2004) The Entire Nup107-160 Complex, Including Three New Members, Is Targeted as One Entity to Kinetochores in Mitosis. Molecular Biology of the Cell 15 (7), 3333-3344. 250. Orjalo, A.V. et al. (2006) The Nup107-160 Nucleoporin Complex Is Required for Correct Bipolar Spindle Assembly. Molecular Biology of the Cell 17 (9), 3806-3818. 251. Drinnenberg, I.A. et al. (2014) Recurrent loss of CenH3 is associated with independent transitions to holocentricity in insects. Elife 3, e03676. 252. Sudakin, V. et al. (2001) Checkpoint inhibition of the APC/C in HeLa cells is mediated by a complex of BUBR1, BUB3, CDC20, and MAD2. The Journal of Cell Biology 154 (5), 925-936.
253. De Antoni, A. et al. (2005) The Mad1/Mad2 Complex as a Template for Mad2 Activation in the Spindle Assembly Checkpoint. Current Biology 15 (3), 214-225. 254. Luo, X. et al. (2002) The Mad2 spindle checkpoint protein undergoes similar major conformational changes upon binding to either Mad1 or Cdc20. Mol Cell 9 (1), 59-71. 255. Sironi, L. et al. (2002) Crystal structure of the tetrameric Mad1-Mad2 core complex: implications of a ‘safety belt’ binding mechanism for the spindle checkpoint. EMBO J 21 (10), 2496-506. 256. Ding, D. et al. (2012) Functional interaction between the Arabidopsis orthologs of spindle assembly checkpoint proteins MAD1 and MAD2 and the nucleoporin NUA. Plant Molecular Biology 79 (3), 203-216. 257. Cheeseman, I.M. et al. (2006) The Conserved KMN Network Constitutes the Core Microtubule-Binding Site of the Kinetochore. Cell 127 (5), 983-997. 258. Schittenhelm, R. et al. (2007) Spatial organization of a ubiquitous eukaryotic kinetochore protein network in Drosophila chromosomes. Chromosoma 116 (4), 385-402. 259. Przewloka, M.R. et al. (2007) Molecular analysis of core kinetochore composition and assembly in Drosophila melanogaster. PLoS One 2 (5), e478. 260. Williams, B. et al. (2007) Mitch – a rapidly evolving component of the Ndc80 kinetochore complex required for correct chromosome segregation in Drosophila. Journal of Cell Science 120 (20), 3522-3533. 261. Cheeseman, I.M. et al. (2004) A conserved protein network controls assembly of the outer kinetochore and its ability to sustain tension. Genes & development 18 (18), 2255-2268. 262. Nakajima, Y. et al. (2009) Nbl1p: A Borealin/Dasra/CSC-1-like Protein Essential for Aurora/Ipl1 Complex Function and Integrity in Saccharomyces cerevisiae. Molecular Biology of the Cell 20 (6), 1772-1784. 263. Parra, G. et al. (2009) Assessing the gene space in draft genomes. Nucleic Acids Research 37 (1), 289-297. 264. Johnson, M. et al. (2008) NCBI BLAST: a better web interface. 
Nucleic Acids Research 36 (suppl 2), W5-W9. 265. Katoh, K. et al. (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30 (14), 3059-66. 266. Capella-Gutiérrez, S. et al. (2009) trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25 (15), 1972-1973. 267. Stamatakis, A. (2006) RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22 (21), 2688-2690. 268. Malik, H.S. and Henikoff, S. (2003) Phylogenomics of the nucleosome. Nat Struct Mol Biol 10 (11), 882-891. 269. Letunic, I. and Bork, P. (2016) Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Research 44 (W1), W242-W245. 270. Boutet, E. et al. (2016) UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: how to use the entry view. Plant Bioinformatics: Methods and Protocols, 23-54. 271. Bailey, T.L. et al. (2009) MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 37 (Web Server issue), W202-8. 272. Mi, H. et al. (2013) PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Research 41 (D1), D377-D386. 273. Cunningham, F. et al. (2015) Ensembl 2015. Nucleic Acids Research 43 (D1), D662-D669.

274. Cipriano, M.J. (2013) An analysis of kinetochore proteins in a wide range of eukaryotes and the kinetochore of Giardia lamblia, UNIVERSITY OF CALIFORNIA, DAVIS. 275. Keeling, P.J. and Inagaki, Y. (2004) A class of eukaryotic GTPase with a punctate distribution suggesting multiple functional replacements of translation elongation factor 1α. Proceedings of the National Academy of Sciences of the United States of America 101 (43), 15380-15385. 276. Pereira-Leal, J.B. et al. (2007) Evolution of protein complexes by duplication of homomeric interactions. Genome Biology 8 (4), 1-12. 277. Douglas, E.S. and Penny, L.S. (1999) The Plastid Genome of the Cryptophyte Alga, Guillardia theta: Complete Sequence and Conserved Synteny Groups Confirm Its Common Ancestry with Red Algae. Journal of Molecular Evolution 48 (2), 236-244. 278. Schönknecht, G. et al. (2014) Horizontal gene acquisitions by eukaryotes as drivers of adaptive evolution. Bioessays 36 (1), 9-20. 279. Soanes, D. and Richards, T.A. (2014) Horizontal Gene Transfer in Eukaryotic Plant Pathogens. Annual Review of Phytopathology 52 (1), 583-614. 280. Boutet, E. et al. (2016) UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View. In Plant Bioinformatics: Methods and Protocols (Edwards, D. ed), pp. 23-54, Springer New York. 281. Stamatakis, A. (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30 (9), 1312-1313. 282. Stanke, M. and Morgenstern, B. (2005) AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Research 33 (suppl 2), W465-W467. 283. Gruber, M. et al. (2006) Comparative analysis of coiled-coil prediction methods. Journal of Structural Biology 155 (2), 140-145. 284. Goldman, N. et al. (2000) Likelihood-Based Tests of Topologies in Phylogenetics. Systematic Biology 49 (4), 652-670. 285. Nguyen, L.-T. et al.
(2015) IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies. Molecular Biology and Evolution 32 (1), 268-274. 286. Westermann, S. and Schleiffer, A. (2013) Family matters: structural and functional conservation of centromere-associated proteins from yeast to humans. Trends in Cell Biology 23 (6), 260-269. 287. Talbert, P.B. et al. (2004) Adaptive evolution of centromere proteins in plants and animals. Journal of Biology 3 (4), 1-17. 288. Makarova, M. and Oliferenko, S. (2016) Mixing and matching nuclear envelope remodeling and spindle assembly strategies in the evolution of mitosis. Current Opinion in Cell Biology 41, 43-50. 289. Dacks, J.B. et al. (2016) The changing view of eukaryogenesis–fossils, cells, lineages and how they all come together. J Cell Sci 129 (20), 3695-3703. 290. Vosseberg, J. and Snel, B. (2017) Domestication of self-splicing introns during eukaryogenesis: the rise of the complex spliceosomal machinery. Biology Direct 12 (1), 30. 291. Field, M.C. and Dacks, J.B. (2009) First and last ancestors: reconstructing evolution of the endomembrane system with ESCRTs, vesicle coat proteins, and nuclear pore complexes. Current Opinion in Cell Biology 21 (1), 4-13. 292. Schmitzberger, F. and Harrison, S.C. (2012) RWD domain: a recurring module in kinetochore architecture shown by a Ctf19–Mcm21 complex structure. EMBO reports 13 (3), 216-222. 293. Doerks, T. et al. (2002) Systematic Identification of Novel Protein Domain Families Associated with Nuclear Functions. Genome Research 12 (1), 47-56. 294. Burroughs, A.M. et al. (2008) Anatomy of the E2 ligase fold: implications for enzymology and evolution of ubiquitin/Ub-like protein conjugation. J Struct Biol 162 (2), 205-18. 295. Nunoura, T. et al. (2011) Insights into the evolution of Archaea and eukaryotic protein modifier systems revealed by the genome of a novel archaeal group. Nucleic Acids Res 39 (8), 3204-23. 296. Hennell James, R. et al.
(2017) Functional reconstruction of a eukaryotic-like E1/E2/(RING) E3 ubiquitylation cascade from an uncultured archaeon. Nat Commun 8 (1), 1120. 297. Grau-Bove, X. et al. (2015) The eukaryotic ancestor had a complex ubiquitin signaling system of archaeal origin. Mol Biol Evol 32 (3), 726-39. 298. Sundquist, W.I. et al. (2004) Ubiquitin Recognition by the Human TSG101 Protein. Molecular Cell 13 (6), 783-789. 299. Xu, L. et al. (2008) An FTS/Hook/p107(FHIP) complex interacts with and promotes endosomal clustering by the homotypic vacuolar protein sorting complex. Mol Biol Cell 19 (12), 5059-71. 300. Nameki, N. et al. (2004) Solution structure of the RWD domain of the mouse GCN2 protein. Protein Sci 13 (8), 2089-100. 301. Hodson, C. et al. (2011) Structural analysis of human FANCL, the E3 ligase in the Fanconi anemia pathway. J Biol Chem 286 (37), 32628-37. 302. Rosenberg, S.C. and Corbett, K.D. (2015) The multifaceted roles of the HORMA domain in cellular signaling. The Journal of Cell Biology 211 (4), 745. 303. Nelson-Sathi, S. et al. (2012) Acquisition of 1,000 eubacterial genes physiologically transformed a methanogen at the origin of Haloarchaea. Proceedings of the National Academy of Sciences 109 (50), 20537. 304. Becker, E.A. et al. (2014) Phylogenetically Driven Sequencing of Extremely Halophilic Archaea Reveals Strategies

for Static and Dynamic Osmo-response. PLOS Genetics 10 (11), e1004784. 305. Mattiroli, F. et al. (2017) Structure of histone-based chromatin in Archaea. Science 357 (6351), 609. 306. Bieniossek, C. et al. (2013) The architecture of human general transcription factor TFIID core complex. Nature 493 (7434), 699-702. 307. Helmlinger, D. and Tora, L. (2017) Sharing the SAGA. Trends Biochem Sci 42 (11), 850-861. 308. Nardini, M. et al. (2013) Sequence-specific transcription factor NF-Y displays histone-like DNA binding and H2B-like ubiquitination. Cell 152 (1-2), 132-43. 309. Kamada, K. et al. (2001) Crystal Structure of Negative Cofactor 2 Recognizing the TBP-DNA Transcription Complex. Cell 106 (1), 71-81. 310. Pursell, Z.F. and Kunkel, T.A. (2008) DNA Polymerase ε: A Polymerase of Unusual Size (and Complexity). In Progress in Nucleic Acid Research and Molecular Biology (Conn, P.M. ed), pp. 101-145, Academic Press. 311. Zhao, Q. et al. (2014) The MHF complex senses branched DNA by binding a pair of crossover DNA duplexes. Nat Commun 5, 2987. 312. Tao, Y. et al. (2012) The structure of the FANCM-MHF complex reveals physical features for functional assembly. Nat Commun 3, 782. 313. Chen, C.C. and Mellone, B.G. (2016) Chromatin assembly: Journey to the CENter of the chromosome. J Cell Biol 214 (1), 13-24. 314. Nishino, T. et al. (2012) CENP-T-W-S-X Forms a Unique Centromeric Chromatin Structure with a Histone-like Fold. Cell 148 (3), 487-501. 315. Gimona, M. et al. (2001) Functional plasticity of CH domains. FEBS Letters 513 (1), 98-106. 316. Schou, K.B. et al. (2014) A divergent calponin homology (NN–CH) domain defines a novel family: implications for evolution of ciliary IFT complex B proteins. Bioinformatics 30 (7), 899-902. 317. Pasek, R.C. et al. (2012) Mammalian Clusterin associated protein 1 is an evolutionarily conserved protein required for ciliogenesis. Cilia 1 (1), 20. 318. Pérez-González, A. et al.
(2014) hCLE/C14orf166 Associates with DDX1-HSPC117-FAM98B in a Novel Transcription-Dependent Shuttling RNA-Transporting Complex. PLOS ONE 9 (3), e90957. 319. Pentakota, S. et al., Decoding the centromeric nucleosome through CENP-N, eLife, 2017. 320. Chittori, S. et al. (2018) Structural mechanisms of centromeric nucleosome recognition by the kinetochore protein CENP-N. Science 359 (6373), 339-343. 321. Hinshaw, S.M. and Harrison, S.C. (2013) An Iml3-Chl4 heterodimer links the core centromere to factors required for accurate chromosome segregation. Cell Rep 5 (1), 29-36. 322. Brindefalk, B. et al. (2013) Evolutionary history of the TBP-domain superfamily. Nucleic Acids Res 41 (5), 2832-45. 323. Cavaliere, P. et al. (2014) Structural and functional features of Crl proteins and identification of conserved surface residues required for interaction with the RpoS/sigmaS subunit of RNA polymerase. Biochem J 463 (2), 215-24. 324. Miyashita, S. et al. (2011) Identification of the substrate binding site in the N-terminal TBP-like domain of RNase H3. FEBS Lett 585 (14), 2313-7. 325. Townley, R. and Shapiro, L. (2007) Crystal structures of the adenylate sensor from fission yeast AMP-activated protein kinase. Science 315 (5819), 1726-9. 326. Wu, H. et al. (2008) Crystal Structure of Human Spermine Synthase: IMPLICATIONS OF SUBSTRATE BINDING AND CATALYTIC MECHANISM. Journal of Biological Chemistry 283 (23), 16135-16146. 327. Wu, Y. et al. (2017) Molecular basis for the interaction between Integrator subunits IntS9 and IntS11 and its functional importance. Proc Natl Acad Sci U S A 114 (17), 4394-4399. 328. Dodonova, S.O. et al. (2015) VESICULAR TRANSPORT. A structure of the COPI coat and the role of coat proteins in membrane vesicle assembly. Science 349 (6244), 195-8. 329. Owen, D.J. et al. (2000) The structure and function of the beta 2-adaptin appendage domain. EMBO J 19 (16), 4216-27. 330. Pearce, L.R. et al. (2010) The nuts and bolts of AGC protein kinases.
Nature Reviews Molecular Cell Biology 11, 9. 331. Koumandou, V.L. et al. (2007) Control systems for membrane fusion in the ancestral eukaryote; evolution of tethering complexes and SM proteins. BMC Evolutionary Biology 7 (1), 29. 332. Hong, W. and Lev, S. (2014) Tethering the assembly of SNARE complexes. Trends in Cell Biology 24 (1), 35-43. 333. Schröter, S. et al. (2016) Coat/tether interactions—exception or rule? Frontiers in cell and developmental biology 4, 44. 334. Hu, X.-J. et al. (2017) Prokaryotic and Highly-Repetitive WD40 Proteins: A Systematic Study. Scientific Reports 7 (1), 10585. 335. Schlegel, T. et al. (2007) The Tetratricopeptide Repeats of Receptors Involved in Protein Translocation across Membranes. Molecular Biology and Evolution 24 (12), 2763-2774. 336. van Dam, T.J.P. et al. (2013) Evolution of modular intraflagellar transport from a coatomer-like progenitor. Proceedings of the National Academy of Sciences. 337. Schlacht, A. and Dacks, J.B. (2015) Unexpected Ancient Paralogs and an Evolutionary Model for the COPII Coat Complex. Genome Biology and Evolution 7 (4), 1098-1109. 338. Mast, F.D. et al. (2014) Evolutionary mechanisms for establishing eukaryotic cellular complexity. Trends in Cell Biology 24 (7), 435-442. 339. Gabaldón, T. et al. (2005) Tracing the Evolution of a Large Protein Complex in the Eukaryotes, NADH:Ubiquinone

Oxidoreductase (Complex I). Journal of Molecular Biology 348 (4), 857-870. 340. Klinger, C.M. et al. (2016) Tracing the Archaeal Origins of Eukaryotic Membrane-Trafficking System Building Blocks. Molecular Biology and Evolution 33 (6), 1528-1541. 341. Melters, D.P. et al. (2012) Holocentric chromosomes: convergent evolution, meiotic adaptations, and genomic analysis. Chromosome Research 20 (5), 579-593. 342. Shannon, P. et al. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome research 13 (11), 2498-2504. 343. Velankar, S. et al. (2012) SIFTS: structure integration with function, taxonomy and sequences resource. Nucleic acids research 41 (D1), D483-D489. 344. van Wijk, L.M. and Snel, B. (2018) Phylogenomics reveals ancestral kinase relations, repertoire, and fate in present-day eukaryotes. In preparation. 345. Rambaut, A. (2012) FigTree v1.4. Molecular evolution, phylogenetics and epidemiology. Edinburgh, UK: University of Edinburgh, Institute of Evolutionary Biology. 346. Cheng, H. et al. (2014) ECOD: An Evolutionary Classification of Protein Domains. PLOS Computational Biology 10 (12), e1003926. 347. Dawson, N.L. et al. (2017) CATH: an expanded resource to predict protein function through structure and sequence. Nucleic Acids Research 45 (D1), D289-D295. 348. Schrödinger, L., PyMOL The PyMOL Molecular Graphics System, Version, 2010. 349. Holm, L. and Sander, C. (1995) Dali: a network tool for protein structure comparison. Trends in Biochemical Sciences 20 (11), 478-480. 350. Drozdetskiy, A. et al. (2015) JPred4: a protein secondary structure prediction server. Nucleic Acids Res 43 (W1), W389-94. 351. Waterhouse, A.M. et al. (2009) Jalview Version 2—a multiple sequence alignment editor and analysis workbench. Bioinformatics 25 (9), 1189-1191. 352. Dunwell, J.M. et al. (2001) Evolution of functional diversity in the cupin superfamily. Trends in Biochemical Sciences 26 (12), 740-746. 353. Williams, Tom A.
(2014) Evolution: Rooting the Eukaryotic Tree of Life. Current Biology 24 (4), R151-R152. 354. Tromer, E., Evolution of the Kinetochore Network in Eukaryotes, Utrecht University, 2017. 355. van Hooff, J.J.E. et al. (2017) Unique Phylogenetic Distributions of the Ska and Dam1 Complexes Support Functional Analogy and Suggest Multiple Parallel Displacements of Ska by Dam1. Genome Biology and Evolution 9 (5), 1295-1303. 356. Zimmermann, L. et al. (2018) A Completely Reimplemented MPI Bioinformatics Toolkit with a New HHpred Server at its Core. Journal of Molecular Biology 430 (15), 2237-2243. 357. Kim, S. et al. (2012) Structure of human Mad1 C-terminal domain reveals its involvement in kinetochore targeting. Proceedings of the National Academy of Sciences 109 (17), 6549-6554. 358. Aravind, L. and Koonin, E.V. (1998) The HORMA domain: a common structural denominator in mitotic checkpoints, chromosome synapsis and DNA repair. Trends in Biochemical Sciences 23 (8), 284-286. 359. Wilson, D.K. et al. (2005) The 1.1-Å Structure of the Spindle Checkpoint Protein Bub3p Reveals Functional Regions. Journal of Biological Chemistry 280 (14), 13944-13951. 360. Tian, W. et al. (2012) Structural analysis of human Cdc20 supports multisite degron recognition by APC/C. Proceedings of the National Academy of Sciences 109 (45), 18419. 361. Bolanos-Garcia, V.M. and Blundell, T.L. (2011) BUB1 and BUBR1: multifaceted kinases of the cell cycle. Trends in Biochemical Sciences 36 (3), 141-150. 362. Thebault, P. et al. (2012) Structural and functional insights into the role of the N-terminal Mps1 TPR domain in the SAC (spindle assembly checkpoint). Biochemical Journal 448 (3), 321-328. 363. Yang, M. et al. (2007) p31comet Blocks Mad2 Activation through Structural Mimicry. Cell 131 (4), 744-755. 364. Ye, Q. et al. (2017) The AAA+ ATPase TRIP13 remodels HORMA domains through N‐terminal engagement and unfolding. The EMBO Journal 36 (16), 2419-2434. 365. Brulotte, M.L. et al. 
(2017) Mechanistic insight into TRIP13-catalyzed Mad2 structural transition and spindle checkpoint silencing. Nature Communications 8 (1), 1956. 366. Rümenapp, U. et al. (2002) A mammalian Rho-specific guanine-nucleotide exchange factor (p164-RhoGEF) without a pleckstrin homology domain. Biochemical Journal 366 (Pt 3), 721-728. 367. Ciferri, C. et al. (2008) Implications for Kinetochore-Microtubule Attachment from the Structure of an Engineered Ndc80 Complex. Cell 133 (3), 427-439. 368. Wei, R.R. et al. (2006) Structure of a Central Component of the Yeast Kinetochore: The Spc24p/Spc25p Globular Domain. Structure 14 (6), 1003-1009. 369. Çivril, F. et al. (2010) Structural Analysis of the RZZ Complex Reveals Common Ancestry with Multisubunit Vesicle Tethering Machinery. Structure 18 (5), 616-626. 370. Tripathi, A. et al. (2009) Structural characterization of Tip20p and Dsl1p, subunits of the Dsl1p vesicle tethering complex. Nature Structural & Molecular Biology 16, 114. 371. Sessa, F. et al. (2005) Mechanism of Aurora B Activation by INCENP and Inhibition by Hesperadin. Molecular Cell 18 (3), 379-391. 372. Jeyaprakash, A.A. et al. (2007) Structure of a Survivin–Borealin–INCENP Core Complex Reveals How Chromosomal

Passengers Travel Together. Cell 131 (2), 271-285. 373. Elling, R.A. et al. (2008) Structures of the wild-type and activated catalytic domains of Brachydanio rerio Polo-like kinase 1 (Plk1): changes in the active-site conformation and interactions with ligands. Acta Crystallographica Section D 64 (9), 909-918. 374. Cheng, K.Y. et al. (2003) The crystal structure of the human polo‐like kinase‐1 polo box domain and its phospho‐peptide complex. The EMBO Journal 22 (21), 5757-5768. 375. Tachiwana, H. et al. (2011) Crystal structure of the human centromeric nucleosome containing CENP-A. Nature 476, 232. 376. Kipling, D. and Warburton, P.E. (1997) Centromeres, CENP-B and Tigger too. Trends in Genetics 13 (4), 141-145. 377. Cohen, R.L. et al. (2008) Structural and Functional Dissection of Mif2p, a Conserved DNA-binding Kinetochore Protein. Molecular Biology of the Cell 19 (10), 4480-4491. 378. Garcia-Saez, I. et al. (2004) Crystal Structure of the Motor Domain of the Human Kinetochore Protein CENP-E. Journal of Molecular Biology 340 (5), 1107-1116. 379. Basilico, F. et al. (2014) The pseudo GTPase CENP-M drives human kinetochore assembly. eLife 3, e02978. 380. Hori, T. et al. (2008) CCAN Makes Multiple Contacts with Centromeric DNA to Provide Distinct Pathways to the Outer Kinetochore. Cell 135 (6), 1039-1052. 381. Cho, U.-S. and Harrison, S.C. (2011) Ndc10 is a platform for inner kinetochore assembly in budding yeast. Nature Structural & Molecular Biology 19, 48. 382. Russell, I.D. et al. (1999) The Unstable F-box Protein p58-Ctf13 Forms the Structural Core of the CBF3 Kinetochore Complex. The Journal of Cell Biology 145 (5), 933-950. 383. Bellizzi, J.J. et al. (2007) Crystal Structure of the Yeast Inner Kinetochore Subunit Cep3p. Structure 15 (11), 1422-1430. 384. Corbett, K.D. et al. (2010) The Monopolin Complex Crosslinks Kinetochore Components to Regulate Chromosome-Microtubule Attachments. Cell 142 (4), 556-567. 385. Ye, Q. et al.
(2016) Structure of the Saccharomyces cerevisiae Hrr25:Mam1 monopolin subcomplex reveals a novel kinase regulator. The EMBO Journal 35 (19), 2139-2151. 386. Koumandou, V.L. et al. (2013) Molecular paleontology and complexity in the last eukaryotic common ancestor. Critical Reviews in Biochemistry and Molecular Biology 48 (4), 373-396. 387. Koonin, E.V. (2007) The Biological Big Bang model for the major transitions in evolution. Biology Direct 2 (1), 21. 388. Rojas, A.M. et al. (2012) The Ras protein superfamily: Evolutionary tree and role of conserved amino acids. The Journal of Cell Biology 196 (2), 189. 389. Poole, A.M. and Gribaldo, S. (2014) Eukaryotic Origins: How and When Was the Mitochondrion Acquired? Cold Spring Harbor Perspectives in Biology 6 (12). 390. Poole, A. and Penny, D. (2007) Engulfed by speculation. Nature 447, 913. 391. Rochette, N.C. et al. (2014) Phylogenomic Test of the Hypotheses for the Evolutionary Origin of Eukaryotes. Molecular Biology and Evolution 31 (4), 832-845. 392. Scannell, D.R. and Wolfe, K.H. (2007) A burst of protein sequence evolution and a prolonged period of asymmetric evolution follow gene duplication in yeast. Genome Research 18 (1), 000-000. 393. Conant, G.C. and Wagner, A. (2003) Asymmetric Sequence Divergence of Duplicate Genes. Genome Research 13 (9), 2052-2058. 394. Panchin, A.Y. et al. (2010) Asymmetric and non-uniform evolution of recently duplicated human genes. Biology Direct 5 (1), 54. 395. Assis, R. and Bachtrog, D. (2013) Neofunctionalization of young duplicate genes in Drosophila. Proceedings of the National Academy of Sciences. 396. Pegueroles, C. et al. (2013) Accelerated Evolution after Gene Duplication: A Time-Dependent Process Affecting Just One Copy. Molecular Biology and Evolution 30 (8), 1830-1842. 397. Ettema, T.J.G. (2016) Mitochondria in the second act. Nature 531, 39. 398. Yutin, N. et al. (2008) The Deep Archaeal Roots of Eukaryotes. Molecular Biology and Evolution 25 (8), 1619-1630. 399. 
Gabaldón, T. and Huynen, M.A. (2003) Reconstruction of the Proto-Mitochondrial Metabolism. Science 301 (5633), 609. 400. Popa, O. and Dagan, T. (2011) Trends and barriers to lateral gene transfer in prokaryotes. Current Opinion in Microbiology 14 (5), 615-623. 401. Dunning Hotopp, J.C. (2011) Horizontal gene transfer between bacteria and animals. Trends in Genetics 27 (4), 157-163. 402. Alsmark, C. et al. (2013) Patterns of prokaryotic lateral gene transfers affecting parasitic microbial eukaryotes. Genome Biology 14 (2), R19. 403. Gonçalves, I.R. et al. (2016) Genome-wide analyses of chitin synthases identify horizontal gene transfers towards bacteria and allow a robust and unifying classification into fungi. BMC Evolutionary Biology 16 (1), 252. 404. Merchant, S. et al. (2014) Unexpected cross-species contamination in genome sequencing projects. PeerJ 2, e675. 405. van Hooff, J.J.E. et al. (2018) Mosaic origin of the eukaryotic kinetochore. in preparation. 406. Jacox, E. et al. (2016) ecceTERA: comprehensive gene tree-species tree reconciliation using parsimony. Bioinformatics 32 (13), 2056-2058.

407. Nishitsuji, K. et al. (2016) A draft genome of the brown alga, Cladosiphon okamuranus, S-strain: a platform for future studies of ‘mozuku’ biology. DNA Research 23 (6), 561-570. 408. Curtis, B.A. et al. (2012) Algal genomes reveal evolutionary mosaicism and the fate of nucleomorphs. Nature 492 (7427), 59-65. 409. Clarke, M. et al. (2013) Genome of Acanthamoeba castellanii highlights extensive lateral gene transfer and early evolution of tyrosine kinase signaling. Genome Biol 14 (2), R11. 410. Urushihara, H. et al. (2015) Comparative genome and transcriptome analyses of the social amoeba Acytostelium subglobosum that accomplishes multicellular development without germ-soma differentiation. BMC Genomics 16 (1), 80. 411. Torruella, G. et al. (2015) Phylogenomics Reveals Convergent Evolution of Lifestyles in Close Relatives of Animals and Fungi. Current Biology 25 (18), 2404-2410. 412. Hauser, M. et al. (2013) kClust: fast and sensitive clustering of large protein sequence databases. BMC Bioinformatics 14 (1), 248. 413. Katoh, K. and Toh, H. (2008) Recent developments in the MAFFT multiple sequence alignment program. Briefings in Bioinformatics 9 (4), 286-298. 414. Price, M.N. et al. (2010) FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLOS ONE 5 (3), e9490. 415. Huerta-Cepas, J. et al. (2016) ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data. Molecular Biology and Evolution 33 (6), 1635-1638. 416. Sage, R.F. (2003) The evolution of C4 photosynthesis. New Phytologist 161 (2), 341-370. 417. Irish, V.F. and Litt, A. (2005) Flower development and evolution: gene duplication, diversification and redeployment. Current Opinion in Genetics & Development 15 (4), 454-460. 418. Flajnik, M.F. and Kasahara, M. (2001) Comparative Genomics of the MHC: Glimpses into the Evolution of the Adaptive Immune System. Immunity 15 (3), 351-362. 419. Koonin, E.V. (2010) The Incredible Expanding Ancestor of Eukaryotes.
Cell 140 (5), 606-608. 420. Nguyen Ba, A.N. et al. (2017) Parallel reorganization of protein function in the spindle checkpoint pathway through evolutionary paths in the fitness landscape that appear neutral in laboratory experiments. PLOS Genetics 13 (4), e1006735. 421. Force, A. et al. (1999) Preservation of Duplicate Genes by Complementary, Degenerative Mutations. Genetics 151 (4), 1531. 422. Stoltzfus, A. (2012) Constructive neutral evolution: exploring evolutionary theory’s curious disconnect. Biology direct 7 (1), 35. 423. Isokane, M. et al. (2016) ARHGEF17 is an essential spindle assembly checkpoint factor that targets Mps1 to kinetochores. The Journal of Cell Biology 212 (6), 647. 424. Doolittle, W.F. et al. (2011) Comment on “Does constructive neutral evolution play an important role in the origin of cellular complexity?” DOI 10.1002/bies.201100010. BioEssays 33 (6), 427-429. 425. Galperin, M.Y. et al. (1998) Analogous Enzymes: Independent Inventions in Enzyme Evolution. Genome Research 8 (8), 779-790. 426. Omelchenko, M.V. et al. (2010) Non-homologous isofunctional enzymes: A systematic analysis of alternative solutions in enzyme evolution. Biology Direct 5 (1), 31. 427. Koonin, E.V. (2003) Comparative genomics, minimal gene-sets and the last universal common ancestor. Nature Reviews Microbiology 1, 127. 428. Orias, E. (1991) Evolution of Amitosis of the Ciliate Macronucleus: Gain of the Capacity to Divide. The Journal of Protozoology 38 (3), 217-221. 429. Mani, M. et al. (2015) MoonProt: a database for proteins that are known to moonlight. Nucleic Acids Research 43 (D1), D277-D282. 430. Niklas, K.J. et al. (2018) The evolutionary origins of cell type diversification and the role of intrinsically disordered proteins. Journal of Experimental Botany 69 (7), 1437-1446. 431. Siddiq, M.A. et al. (2017) Evolution of protein specificity: insights from ancestral protein reconstruction. Current Opinion in Structural Biology 47, 113-122. 432. Csűrös, M. 
and Miklós, I. (2009) Streamlining and Large Ancestral Genomes in Archaea Inferred with a Phylogenetic Birth-and-Death Model. Molecular Biology and Evolution 26 (9), 2087-2095. 433. Doolittle, W.F. et al. (2003) How big is the iceberg of which organellar genes in nuclear genomes are but the tip? Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 358 (1429), 39. 434. Martin William, F. (2017) Too Much Eukaryote LGT. BioEssays 39 (12), 1700115. 435. Roger, A.J. (2018) Reply to ‘Eukaryote lateral gene transfer is Lamarckian’. Nature Ecology & Evolution 2 (5), 755- 755. 436. Martin, W.F. (2018) Eukaryote lateral gene transfer is Lamarckian. Nature Ecology & Evolution 2 (5), 754-754. 437. Szöllősi, G.J. et al. (2015) Genome-scale phylogenetic analysis finds extensive gene transfer among fungi. Philosophical Transactions of the Royal Society B: Biological Sciences 370 (1678). 438. Husnik, F. and McCutcheon, J.P. (2017) Functional horizontal gene transfer from bacteria to eukaryotes. Nature Reviews Microbiology 16, 67. Abbreviations 189

Abbreviations

APC/C      anaphase-promoting complex/cyclosome
CCAN       constitutive centromere-associated network
CENP       centromere protein
Dam1-C     Dam1 complex
EGT        endosymbiotic gene transfer
FECA       first eukaryotic common ancestor
HGT        horizontal gene transfer
KMN        Knl1 complex - Mis12 complex - Ndc80 complex
LECA       last eukaryotic common ancestor
LUCA       last universal common ancestor
MIM        Mad2-interacting motif
MCC        mitotic checkpoint complex
Mis12-C    Mis12 complex
Ndc80-C    Ndc80 complex
NRZ        NAG-RINT1-ZW10
RZS        Rod-Zwilch-Spindly
RZZ        Rod-Zwilch-ZW10
SAC        spindle assembly checkpoint
SAR        Stramenopila-Alveolata-Rhizaria
Ska-C      Ska complex
WGD        whole-genome duplication

Summary

Eukaryotic cells, such as the cells of plants, animals and fungi, differ fundamentally from prokaryotic cells, such as those of bacteria, for example in their organization, size and reproduction. Because eukaryotes descend from prokaryotes, eukaryotes arose through a major evolutionary transition. Whereas eukaryotic cells have specialized organelles for different cellular functions, these are generally absent in prokaryotes. Eukaryotes are on average larger than prokaryotes, are equipped with complex gene regulation and frequently undergo meiosis with genetic recombination and sexual reproduction. Thanks to new molecular and analytical techniques, scientists are now mapping the evolutionary origins of eukaryotic features. Although we may not yet be able to speak of ‘proto’-eukaryotes, scientists have succeeded in identifying prokaryotes that may have eukaryote-like features. With the help of such species, in combination with refined evolutionary analysis, the emergence of eukaryotes from prokaryotes (‘eukaryogenesis’) can be understood.

In this thesis I focus on the kinetochore, a complex structure consisting of many proteins and one of the features that distinguish eukaryotes from prokaryotes. The kinetochore plays an essential role in cell division. Eukaryotic cell division follows a fixed sequence of steps, of which nuclear division, called mitosis, is an important one. During this sequence the chromosomes replicate: two copies are produced of each chromosome. The two resulting daughter chromosomes stick together. The cell builds a machine to pull these pairs apart, so that each daughter cell receives a complete set of chromosomes. This system consists of a spindle with two spindle poles on opposite sides of the cell, from which filaments, the so-called microtubules, grow. Exactly between the spindle poles, perpendicular to the microtubules, lie the duplicated chromosomes. During nuclear division the microtubules pull on the duplicated chromosomes and move one daughter chromosome of each pair towards one spindle pole. The chromosomes are connected to the microtubules via the kinetochore. A properly functioning kinetochore thus ensures that nuclear division proceeds correctly.

Although the above may give the impression that eukaryotes are homogeneous, the opposite is true. Eukaryotes are highly diverse, not only in their visible appearance but also in their features at the cellular level. This also holds for the kinetochore. In this thesis I use genetic data from divergent species to describe and understand the evolution of the kinetochore. I attempt to chart the origin of the kinetochore, which took place before the radiation of the major eukaryotic groups, that is, before the last common ancestor of all eukaryotes but after the split from prokaryotes. I also attempt to describe the divergence of the kinetochore, which took place after this last common ancestor. My approach mainly works backwards in time, starting from the variation in present-day species, moving to a model of the kinetochore of the last common ancestor of all eukaryotes, and then to the processes that led to the origin of this kinetochore.

To carry out this research I use bioinformatic analysis of genomes of divergent species. This analysis enables us to determine which species do and do not have particular genes (or proteins; these terms are largely interchangeable in this summary). In chapter 2 we work out how this comparative genomics takes shape, both conceptually and practically. The first step is to identify homologous genes, that is, genes that descend from a single ancestral gene. The second step specifies their homology: are they orthologs (split by speciation) or paralogs (split because a gene duplicated, after which a species has two copies)? Next, or actually simultaneously, one determines when the genes split and whether they belong to the same orthologous group. On that basis one can establish which species possess a gene and which do not. Chapter 2 presents these steps and shows that a thorough analysis often requires manual investigation. This is because proteins differ strongly in their evolutionary dynamics, so that no single unambiguous protocol for delineating an ‘orthologous group’ exists.
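
The final step of this workflow, turning orthologous-group assignments into a presence/absence pattern per gene, can be illustrated with a minimal Python sketch. It assumes the orthology assignments have already been made (the hard, often manual part); the function name `presence_profile`, the species labels and the group labels are hypothetical placeholders and do not come from the thesis.

```python
from collections import defaultdict

def presence_profile(assignments, species_order):
    """Map each orthologous group to a 0/1 vector over species_order."""
    present = defaultdict(set)
    for species, group in assignments:
        present[group].add(species)
    return {
        group: [1 if sp in members else 0 for sp in species_order]
        for group, members in present.items()
    }

# Hypothetical toy input: (species, orthologous group) pairs.
assignments = [
    ("species_A", "group_1"), ("species_B", "group_1"), ("species_C", "group_1"),
    ("species_A", "group_2"), ("species_C", "group_2"),
    ("species_B", "group_3"),
]
species_order = ["species_A", "species_B", "species_C"]

profiles = presence_profile(assignments, species_order)
for group in sorted(profiles):
    print(group, profiles[group])
```

Each vector is a phylogenetic profile: a 1 marks a species whose genome encodes a member of the group, a 0 marks an absence.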

By means of comparative genomics we can study differences and similarities between divergent species without the need for cellular studies. Chapter 3 describes a study that charted the diversity of the kinetochore among eukaryotes. For 90 divergent species we determined which of the 70 kinetochore proteins are encoded in their genomes. With this study we show that the kinetochore is highly diverse, more diverse than, for example, the anaphase-promoting complex, which is also important in eukaryotic cell division. This diversity results from gene loss, gene duplication and sometimes gene invention. Genes of the same kinetochore complex are often lost together, as a result of which the kinetochore evolves in a ‘modular’ fashion. This modular evolution confirms the functional dependence of genes within a module. Based on these evolutionary signals we are able to make functional predictions about the kinetochore in relatively unstudied species.
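
The idea that genes lost together share similar presence/absence patterns can be made concrete with a small Python sketch. It scores every pair of profiles with the Jaccard index, which is used here only as a simple stand-in; the similarity measure applied in chapter 3 may differ. The protein labels and profiles are invented toy values.

```python
from itertools import combinations

def jaccard(a, b):
    """Shared presences divided by species in which at least one is present."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

# Hypothetical presence/absence profiles over six species.
profiles = {
    "protein_1": [1, 1, 0, 0, 1, 1],
    "protein_2": [1, 1, 0, 0, 1, 1],
    "protein_3": [1, 0, 1, 1, 1, 0],
}

# Score every unordered pair of proteins.
scores = {
    (p, q): jaccard(profiles[p], profiles[q])
    for p, q in combinations(sorted(profiles), 2)
}
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

Pairs with a high score are candidates for belonging to the same evolutionary module; in this toy input, protein_1 and protein_2 are present and absent in exactly the same species.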

During the research described in chapter 3, two protein complexes of the kinetochore turned out to be particularly interesting, namely Ska and Dam1. Ska and Dam1 were originally identified in human and yeast, respectively, in which they seemed to play similar roles. In chapter 4 we show that both complexes occur in divergent eukaryotic species, but rarely in the same species. This pattern suggests that these complexes indeed fulfil the same role, making it redundant to have both. The pattern further suggests that one of the complexes, probably Ska, was already present in the common eukaryotic ancestor, whereas the other, Dam1, spread by horizontal gene transfer between divergent species. After a species has received Dam1 through such a gene transfer, it loses Ska, because it no longer needs Ska.
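
A pattern of two complexes that rarely occur in the same species can be summarized by counting species in each of the four presence/absence combinations. The sketch below is only an illustration of that bookkeeping; the two vectors are invented toy values over eight hypothetical species, not the distributions reported in chapter 4, and `cooccurrence_table` is a hypothetical helper name.

```python
from collections import Counter

def cooccurrence_table(profile_a, profile_b):
    """Count species by (a present, b present) state."""
    return Counter(zip(profile_a, profile_b))

# Toy presence/absence vectors over eight hypothetical species.
ska = [1, 1, 0, 0, 1, 0, 1, 0]
dam1 = [0, 0, 1, 1, 0, 0, 0, 1]

table = cooccurrence_table(ska, dam1)
both = table[(1, 1)]                       # species with both complexes
only_one = table[(1, 0)] + table[(0, 1)]   # species with exactly one
neither = table[(0, 0)]                    # species with neither
print(both, only_one, neither)
```

A near-empty "both" cell combined with a well-filled "only one" cell is the signature of mutually exclusive distributions; whether it is statistically meaningful depends on the species phylogeny, which this sketch ignores.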

Chapters 3 and 4 contain a reconstruction of the complex kinetochore of the last common eukaryotic ancestor. Chapter 5 highlights how this complex kinetochore originated and to which other cellular processes and structures it is related. For each protein of the kinetochore of the eukaryotic ancestor we studied to which other proteins it is homologous, in order to determine its origin. The proteins of this ancestral kinetochore turn out to be homologous to proteins that function in diverse eukaryotic systems, such as ubiquitination, the regulation of DNA and intracellular transport, which suggests that the kinetochore has a mosaic origin. Furthermore, many kinetochore proteins are homologous to each other, a consequence of gene duplications after which both resulting duplicates continued to play a role in the kinetochore: so-called ‘intra-kinetochore duplications’.

Chapter 6 shifts the focus from the origin of the kinetochore to the origin of the eukaryotic cell, also called ‘eukaryogenesis’. This chapter builds on a previously published study that determined the order in which eukaryotic features arose, including the energy-producing mitochondrion. This chapter adds two components to that earlier study, namely genes without prokaryotic homologs and the gene duplications that took place during eukaryogenesis, such as the ‘intra-kinetochore duplications’ mentioned earlier. We conclude that eukaryotic genes with different prokaryotic ancestries duplicated to different extents. Our study suggests that genes still duplicated after the origin of the mitochondrion, and hence that eukaryogenesis had not yet been completed at that point.

This thesis uses genomic diversity, among eukaryotes and among prokaryotes, to sketch a picture of the evolution of the eukaryotic kinetochore and to understand this evolution in the light of functional information. At the same time, evolutionary knowledge is exploited to make predictions about the functioning of the kinetochore in both well-studied and unstudied species. An even deeper understanding of kinetochore evolution and function requires broader knowledge of the kinetochore in non-model organisms. This thesis offers a number of starting points for this.

Curriculum vitae

Jolien van Hooff was born on January 10, 1988 in Breda, The Netherlands. She attended high school at the Onze Lieve Vrouwelyceum in Breda, where she obtained her diploma in pre-university education in 2006. In 2007 she started her BSc program in general biology at Utrecht University. She completed this program, including a minor in Philosophy, in 2011. She subsequently enrolled in the master's program Molecular and Cellular Life Sciences at Utrecht University. Her major internship encompassed the bioinformatic investigation of the hypothesized ancient whole-genome duplication in the common ancestor of Phytophthora, a genus of plant pathogens. This internship was supervised by dr. Michael Seidl and prof. dr. Berend Snel at the Theoretical Biology and Bioinformatics group. Her minor internship focused on the effects of juvenile infection on fertility and ageing, under the supervision of dr. Jennifer Regan, in the lab of prof. dr. Linda Partridge at University College London. In December 2013 she started her PhD in the labs of prof. dr. Berend Snel and prof. dr. Geert Kops (then UMC Utrecht, now KNAW-Hubrecht Institute), to investigate (co-)evolutionary patterns in the kinetochore. The results of her PhD projects are described in this thesis.

Publications

Atherton J, Jiang K, Stangier MM, Luo Y, Hua S, Houben K, van Hooff JJE, Joseph A-P, Scarabelli G, Grant BJ, et al: A structural model for microtubule minus-end recognition and protection by CAMSAP proteins. Nature Structural & Molecular Biology 2017, 24:931

van Hooff JJE, Tromer E, van Wijk LM, Snel B, Kops GJPL: Evolutionary dynamics of the kinetochore network in eukaryotes as revealed by comparative genomics. EMBO reports 2017.

van Hooff JJE, Snel B, Kops GJPL: Unique Phylogenetic Distributions of the Ska and Dam1 Complexes Support Functional Analogy and Suggest Multiple Parallel Displacements of Ska by Dam1. Genome Biology and Evolution 2017, 9:1295-1303

van Hooff JJE, Snel B, Seidl MF: Small Homologous Blocks in Phytophthora Genomes Do Not Point to an Ancient Whole-Genome Duplication. Genome Biology and Evolution 2014, 6:1079-1085

Acknowledgments

While we should save the best for last, in practice what we often do is ‘save the hardest for last’; at least I do. And so I ended up with ‘just’ having to finalize this thesis by acknowledging, but it’s no easy task to do justice to those that supported me.

It’s no easy task to do justice to my promotors, Berend and Geert. Berend, since our first meeting regarding my master internship, you sparked my enthusiasm about evolutionary genomics. You constantly showed what you were intrigued by. That often also included frustration, maybe even despair, about what we could not know or could not solve. But surprisingly, this never resulted in apathy. Quite the contrary; you are triggered by difficulties and always willing to take a seat behind my computer, hands on, maybe not to solve them, but at least to understand them. Maybe primarily implicitly, but you were truly involved, both in scientific and in personal life.

Geert, you are probably the person we, within our kinetochore evolution team, were all looking up to. I recall the many work discussions Berend and I had, and could not help but conclude that we wanted, or needed, your opinion. When looking for honest, valuable and decisive feedback, you were the one I contacted. And yes, although it was maybe not the answer I hoped for, I appreciate that you advised me not to step out of my PhD project for a couple of months. You seem in control, and what I appreciated most was that you combined high expectations with high trust. Furthermore, it amazes me how diverse a group you are heading, although of course I hope you will prioritize your work on cellular diversity and evolution.

After writing, this thesis was in the hands of the reading committee, which included Susanne Lens, Martijn Huynen, Guido van der Ackerveken, Liedewij Laan and Bill Wickstead. I consider all of you to be closely associated with the contents of my thesis, and therefore I was delighted to have you assess it.

Dear Snel group, have a good one!
Leny, I wish I had asked for your careful criticism more often. Dear John, thank you for your enthusiasm, helpfulness and empathy. Eva, I hope you have now found a nice lab and Julian, I am sure you’ll get a grasp on eukaryogenesis. Of course, Lidija and Alessia, thank you for welcoming me to the lab. And Alessia, you know yourself that this thesis would not have looked the way it does without your artistic mind and effort. Thanks!

Dear Kopsies, although I never shared the office with you, I did feel like I was part of your team, and that was because of you. Especially Wilma, Bas, Richard, Spiros, Timo, Ana, Carlos, Nanette, Ajit, Antoinette: I am somewhat jealous of the way you (as far as an outsider’s eye can tell) seem to live according to ‘sharing is caring’. In fact I was convinced that if I had run into trouble, you would have taken care.

Dear Binfs, I doubt that I’ll ever be part of such an eclectic collection of people again. However, I think that the common ground of (scientific) curiosity, which I deeply respect you for, provides a solid basis to unite you. Rob, Paulien, Bas, Kirsten, Rutger and Can: thank you for creating this dedicated, and a bit anarchical, atmosphere. Jan Kees, thank you for your patience with my stupid questions, but foremost, thank you for your efforts to make me run the analyses sort of smoothly.

During my PhD, I joined Toni Gabaldón’s group for a couple of months. Thanks Toni and Marina, for your collaboration in this research project, and thanks to the group members: you were most hospitable and helpful. Thanks also to Kai and Anna, for allowing me to study the evolution of your protein of interest. All labs involved in the cell cycle joined retreats: those were challenging, but brilliant.

This thesis has its roots in the pleasure I experienced during my BSc and MSc research projects. Therefore, I’d like to thank Luis Lugones, for allowing me to explore beyond the initial research plan. Jenny Regan, for making me experience what it is like to be an experimentalist; your endurance is truly amazing. Michael, you probably were exactly the supervisor my scientific interest needed in order to flourish. You gave guidance, trust and patience, for example when allowing, even encouraging me to start writing a manuscript by myself. Foremost you demonstrated how to be both a devoted and a pragmatic scientist. Thank you!

My paranymphs, Eelco and Banafsheh, thank you for accompanying me on the defense day and its preparatory phase, and of course during the PhD itself. Bana, although maybe only on an intuitive level, I think I could learn a lot from the way you combine optimism with realism, a combination many consider impossible. Moreover, you proved to be enormously supportive, smart, empathic and funny. I guess you would be a great supervisor. Eelco, I must admit that I regularly found it intimidating to collaborate with you, and yes, that is a compliment.
I know few people, if any, that can come up with so many alternative explanations, technical and biological; in fact I consider you a hypothesis machine. Your endless drive and effort contrasted strongly with my ‘quick fixes’, which often indeed were half-cocked.

Ans, Bert, thank you for taking care, for being interested in work and life. I really appreciate your presence. Gijsbert, I feel blessed to have had the opportunity to regularly talk to you. I experienced those conversations, about a wide range of subjects, to be analytically the most meaningful ones I’ve had. Moreover, you were willing to use your intelligence to give feedback on various works and considerations; these surely helped me out.

To my sisters, Danique and Annabelle: I do not see a way to thank you concomitantly due to your dissimilar roles. Danique, thank you for being so involved, for being critical and for being so tolerant of my grumpiness. I am glad, maybe even somewhat relieved, that although we behave quite differently in various ways, we share how we like to talk about these, and other, matters. Annabelle, probably we differ even more, but luckily that did not stop you from contacting me. You were always up-to-date with what is going on in my life; hopefully I have some idea about yours. And thank you for staying close to and taking care of mom and dad!

Mom and dad, Wim and Clementine, I have been pondering whether or not to thank you concomitantly. I think I should, and I think you know why: I primarily regard you as a life-running team. Needless to say, without you being so involved in my personal development, I would not have written this piece of work. You imposed no, or hardly any, limits, and instead encouraged me to do nearly anything, including some irresponsible deeds. From practical help in financing my studies to moving me to London, to emotional support when I was unconfident about everything: you stood by. Thank you.

Thomas, I do not dare to portray you, for I could only fail in doing so, so consider these words neither complete nor definitive. I admire you for being so truehearted, honest, clear and fast. For being both critical and warm, for never being indifferent. I thank you for the numerous hours you allowed me to spend with you, from hours in which we could do nothing, which were close to perfect, to hours in which we ran preparing for marathons, or just for the sake of it, and that were close to perfect too. Thank you for your courage to speak when I was on the wrong track, or to say which track to choose when I had no clue.