Origins and divergence of the eukaryotic kinetochore
Jolien van Hooff
Jolien van Hooff (2018) Origins and divergence of the eukaryotic kinetochore. PhD thesis, Utrecht University. Cover and layout by Alessia Peviani (www.photogenicgreen.nl). Printed by Ridderprint BV (www.ridderprint.nl). ISBN 978-94-6375-165-0
Oorsprong en divergentie van het eukaryote kinetochoor
(with a summary in Dutch)
Thesis submitted to obtain the degree of doctor at Utrecht University, on the authority of the rector magnificus, Prof. dr. H.R.B.M. Kummeling, in accordance with the decision of the Board for the Conferral of Doctoral Degrees, to be defended in public on Monday 10 December 2018 at 2.30 in the afternoon
by
Jolien Johanna Elisabeth van Hooff
born on 10 January 1988 in Breda
Supervisors (promotoren): Prof. dr. G.J.P.L. Kops, Prof. dr. B. Snel
Table of Contents
1. Introduction
   Introduction
   The diversity and origins of eukaryotes
   Cell division and chromosome segregation
   Comparative genomics
   Scope and outline of this thesis
2. Inferring the evolutionary history of your favorite protein: A guide for cell biologists
   Summary
   Abstract
   Introduction
   Studying the evolution of a protein: what do we mean?
   A quick (and dirty) guide to inferring the evolutionary history of a protein
   Life is more complicated
   Conclusions & Outlook
   Acknowledgments
   Author contributions
3. Evolutionary dynamics of the kinetochore network as revealed by comparative genomics
   Abstract
   Introduction
   Results
   Discussion
   Materials and Methods
   Acknowledgments
   Author contributions
   Supplementary Material
4. Unique phylogenetic distributions of the Ska and Dam1 complexes support functional analogy and suggest multiple parallel displacements of Ska by Dam1
   Abstract
   Main text
   Materials and Methods
   Acknowledgments
   Author contributions
   Supplementary Text
   Supplementary Material
5. Mosaic origin of the eukaryotic kinetochore
   Abstract
   Introduction
   Results
   Discussion
   Data and Methods
   Author contributions
   Acknowledgments
   Supplementary Text
   Supplementary Material
6. Timing large-scale duplications during eukaryogenesis suggests relatively recent origins of eukaryote-specific proteins
   Abstract
   Introduction
   Results
   Conclusion & Discussion
   Materials & Methods
   Author contributions
   Acknowledgments
   Supplementary Material
7. Discussion
   Kinetochore as a model for evolution of eukaryotic cellular processes?
   Predictions on non-model kinetochores
   More data from extant species make more complex ancestors
   Small- and large-scale studies complement each other in illuminating eukaryogenesis
References
Abbreviations
Samenvatting
Curriculum vitae
Publications
Acknowledgments
1 Introduction
Introduction
Virtually all life we can see with the naked eye is eukaryotic. Eukaryotes are a group of organisms that encompasses animals, plants and fungi. In addition to large, multicellular organisms, eukaryotes also include many unicellular organisms, such as the causal agent of malaria and the photosynthetic symbiont of corals. Compared to prokaryotes, which comprise all non-eukaryotic cellular life forms, eukaryotes are characterized by their intracellular complexity.
We can presently access the genomes of more species than ever before, both of eukaryotes and prokaryotes. Extensive analysis of these genomes has taught us that eukaryotic genomes evolve under completely different dynamics than prokaryotic genomes. As a result, these analyses have provided a framework to study the evolution of specific proteins encoded by these genomes. In return, small-scale protein evolution studies can uncover new paradigms in genome evolution.
Protein evolution studies are regularly used to get a glimpse of the evolution of the cellular processes they are involved in. In the case of eukaryotes, proteins subject to such analyses often are those that participate in the intricate cellular structures, processes and machineries that separate eukaryotes from prokaryotes. These studies aim to answer questions like: How did this cellular structure originate? How did it differentiate in current-day species? Moreover, evolutionary signatures of proteins can contribute to our understanding of their cellular or molecular function.
One of these typical eukaryotic features is the mitotic spindle. This is a complex machinery that eukaryotes use to segregate their duplicated DNA during cell division. The mitotic spindle pulls apart duplicated chromosomes towards the opposite ends of the cell, after which the cell is ready to divide. The spindle consists of hundreds of proteins. Among these are the pulling elements known as microtubules and the protein complex that links microtubules to the chromosomes. This latter complex is known as the kinetochore, and it is essential for proper cell division.
In this thesis, I describe three studies on the evolution of the kinetochore, and one on the generic origins of eukaryotic proteins. In this chapter I elaborate on the eukaryotes and describe what we currently know about their origins and evolutionary dynamics. I supply background information on the eukaryotic cell cycle and chromosome segregation, and I detail the kinetochore constituents. I outline how comparative genomics aids in understanding the function and evolution of eukaryotic structures such as the kinetochore. I supply key definitions essential to the evolutionary interrogation of proteins.
The diversity and origins of eukaryotes
Eukaryotes form a highly diverse group of organisms that all descended from a single common ancestor, the Last Eukaryotic Common Ancestor, or LECA. The Eukarya (or Eukaryota) taxon is currently considered one of the three domains of life, next to Bacteria and Archaea (Figure 1A, [42]). Eukaryotes differ greatly in their morphology, life cycles, lifestyles and habitats. They can be uni- or multicellular, autotrophic or heterotrophic, reproduce primarily sexually or clonally, and live freely, as obligate pathogens or as obligate symbionts. Nevertheless, they share a plethora of features on the cellular level, which distinguish them from prokaryotes. Eukaryotes are named after their nucleus (‘eu’ means ‘true’, ‘karyote’ means ‘kernel’), a membrane-bound compartment that encloses most of their DNA, which itself is packed into linear chromosomes. However, eukaryotic cells have many other characteristics, such as the membranous compartments of the ER, Golgi and lysosome, intracellular vesicles that serve as transporters, mitochondria, post-transcriptional processing of mRNAs, an intricate cytoskeleton consisting of different building blocks and a large cell size (on average 1000 times larger than the average prokaryote).
Eukaryotes often are taxonomically categorized into five different supergroups (Figure 1B, [29]). The supergroup of Opisthokonta consists of animals and their unicellular relatives, such as choanoflagellates, and fungi and their relatives. The supergroup of Amoebozoa consists mainly of unicellular organisms, but also encompasses aggregative multicellular ones such as slime moulds. The Archaeplastida are mostly photosynthetic, plastid-bearing organisms, like unicellular red and green algal species and multicellular land plants. These contain primary plastids that were derived by endosymbiosis of a cyanobacterium into a eukaryotic cell. The Excavata are largely unicellular and include plastid-bearing species as well as various parasites and free-living heterotrophs. The SAR supergroup is composed of Stramenopila, Alveolata and Rhizaria and consists mainly of unicellular organisms, such as ciliates, but also includes multicellular, plastid-bearing brown algae. Most non-Archaeplastids that harbor plastids acquired these secondarily, by endosymbiosis of a cell containing a primary or secondary plastid [47].
Although these supergroups provide a useful framework to study (genomic) diversity and evolution, they do not form solid units: both the supergroups themselves and their membership are revised regularly [32]. Such revisions are often required due to newly available genomes, particularly those that represent newly identified taxa [49, 50]. Moreover, how these supergroups are related to one another is under continuous debate. While the Opisthokonta together with the Amoebozoa and some other smaller taxa convincingly form a monophyletic group, whether the Excavata, Archaeplastida and SAR also do is an open question. This question is directly related to the root of the
eukaryotes. For the position of this root, broadly two models can be distinguished. In the first, LECA diverged into an Opisthokonta/Amoebozoa lineage and an Excavata/Archaeplastida/SAR lineage (the Opimoda-Diphoda model, [31]). In the second, LECA diverged into (a subset of) Excavata and all other eukaryotic lineages (the Excavata-Neozoa model, [33]). In this thesis, I will assume the mentioned five supergroups are related as proposed by the Opimoda-Diphoda model. However, the implications of my results would not be substantially different under the Excavata-Neozoa hypothesis.
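[Figure 1. Panel A: tree of life showing Bacteria, Archaea (including the Asgard group) and Eukarya. Panel B: eukaryotic tree showing the supergroups Opisthokonta, Amoebozoa, Excavata, SAR and Archaeplastida.]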
Figure 1. Phylogenetic positions of eukaryotes and eukaryotic supergroups. A. The tree of life with the three domains of life (Bacteria, Archaea and Eukarya). The star indicates the probable root of the tree of life. Eukarya evolved from within the Archaea with the Asgard group as their most closely related sister clade. The Asgard represent one of the superphyla within the Archaea [9]. B. The eukaryotic tree of life with the 5 eukaryotic supergroups. The square indicates one of the hypothetical eukaryotic roots, largely in line with the Opimoda-Diphoda model [31], in which the Amoebozoa and the Opisthokonta comprise the Opimoda, and the Excavata, Archaeplastida and SAR the Diphoda. The Excavata studied in this thesis are likely positioned as shown here, but possibly not all Excavata are: the Excavata might not be monophyletic, since some may group with the Opimoda.
Genome evolution in eukaryotes
Over the last decade, genomes of many diverse eukaryotes have been sequenced. These genomes provide a wealth of phylogenetic information that has improved the eukaryotic species phylogeny; for example, it allowed the Amoebozoa to be related to the Opisthokonta (Figure 1B) [52]. By tracing homologous sequences across genomes, we now have a much better understanding of how eukaryotic genomes evolve. Such studies, which are collectively called ‘comparative genomics’ (see below), have indicated that eukaryotic genomes evolve primarily via gene duplication, including whole-genome duplication, and gene loss [53]. Unlike prokaryotes, eukaryotes do not seem frequently involved in horizontal gene transfer (HGT), which possibly explains why duplication and
loss might play such important roles in eukaryotic genome evolution. Nevertheless, genes are being transferred horizontally towards and between eukaryotes, albeit at a lower frequency [54]. From or to which eukaryotes genes get transferred, how often they do, and how important these transfers are for eukaryotic genome evolution, is being hotly debated (see also Chapter 7).
Duplication and loss events may occur as part of a more general evolutionary process that shapes genomes. Genome evolution processes have been described in various theories, such as constructive neutral evolution [55] and inflation-streamlining [56]. The first theory seeks to explain why some genomes are more complex than others (‘constructive’), while they encode very similar functions (‘neutral’). Why would some species have six genes cooperating to accomplish a given task, while another one only has three? According to the constructive neutral evolution theory, a new gene (for example acquired through duplication) might start to function in a given process, e.g. by incorporating into a protein complex. Initially, it is dispensable for this process. However, other genes involved in this process might become mutated. Such a mutation could be tolerated in the presence of the new gene, but not in its absence. As a result, the new gene becomes essential, and the process has become more complex, although its outcome has not changed. Because the new gene cannot be lost, one also speaks of ‘irremediable complexity’ [57]. Although initially neutral, the increased complexity may subsequently form a substrate for the evolution of novel features. Furthermore, on a longer time-scale the increased complexity may not be completely irremediable: we observed, for example, that many genes that contributed to complexity were lost in more recent lineages (Chapter 3).
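The ratchet described by constructive neutral evolution can be illustrated with a toy model. The sketch below is only an illustration under simplifying assumptions: the rule in `process_works` (the process functions if the core gene is intact, or if the new gene compensates for a degenerated core gene) and all names are hypothetical and not taken from the cited theory papers.

```python
def process_works(core_gene_intact: bool, new_gene_present: bool) -> bool:
    """Toy rule (an assumption for illustration): the process works if the
    core gene is intact, or if the new gene compensates for a mutated core gene."""
    return core_gene_intact or new_gene_present


def new_gene_is_essential(core_gene_intact: bool) -> bool:
    """The new gene is essential if the process works with it but fails without it."""
    with_new = process_works(core_gene_intact, new_gene_present=True)
    without_new = process_works(core_gene_intact, new_gene_present=False)
    return with_new and not without_new


# Stage 1: the new gene has joined the process, the core gene is still intact.
# Stage 2: the core gene has acquired a mutation that is tolerated only
# because the new gene is present.
stages = [
    ("new gene added, core gene intact", True),
    ("core gene mutated", False),
]
for label, core_intact in stages:
    print(label, "-> new gene essential:", new_gene_is_essential(core_intact))
```

In both stages the process still works, so its outcome is unchanged, yet in the second stage the new gene can no longer be lost without breaking the process.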
The inflation-streamlining theory posits that at the base of many species-rich clades, such as the eukaryotes, animals and plants, the genome complexified rapidly (‘inflation’). Subsequently, while lineages diverge, they gradually lose some genes that were present in their ancestor (‘streamlining’), sometimes in a differential manner. As a result of differential loss, we are still able to infer how complex this ancestor was. What may drive this evolutionary process? Possibly, the initial genome expansion is partially adaptive, for example through an increase in gene dosage, and partially neutral, for example in case of genes that co-duplicate in the same genomic region. These neutral expansions provide building blocks for adaptations, which in turn might make these genomically expanded lineages successful [58]. Subsequently, genomes get streamlined either via neutral loss, or via positive selection for a smaller genome. Likewise, after a whole-genome duplication (the ‘inflation’), lineages seem to evolve by diversifying and losing many of the ancestral genes, sometimes reciprocally [59, 60]. The inflation-streamlining theory may capture what is often observed in large-scale phylogenomics reconstructions: many ancestors of major taxa, such as LECA, appear to have possessed surprisingly many genes, and
these genes were often lost in various more recent lineages [56]. This is what we also observe when studying a specific cellular structure such as the kinetochore (Chapter 3). However, ancestral genomes may also be overestimated, particularly if the frequency of HGT is underestimated. Of note, although constructive neutral evolution and inflation-streamlining theories predict different observations in the evolution of the gene contents of species, they are not necessarily mutually exclusive. Among eukaryotes, for example within the animal lineages, we observe the footprints of both processes [61].
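The reasoning that differential loss still lets us infer ancestral complexity can be sketched with a simple parsimony-style rule. This is a minimal sketch under stated assumptions, not the reconstruction method used in this thesis: it assumes each gene was gained only once and never transferred horizontally, so a gene found on both sides of the root is inferred to have been present in the ancestor. The species and gene names are hypothetical toy data.

```python
from typing import Dict, Set


def genes_in_clade(species_genes: Dict[str, Set[str]]) -> Set[str]:
    """Union of all genes observed in any species of a clade."""
    return set().union(*species_genes.values())


def infer_root_genes(
    clade_a: Dict[str, Set[str]], clade_b: Dict[str, Set[str]]
) -> Set[str]:
    """Genes present on both sides of the root are inferred to be ancestral.
    Assumption: single gain and no horizontal gene transfer."""
    return genes_in_clade(clade_a) & genes_in_clade(clade_b)


# Hypothetical presence data for two clades that diverged at the root.
clade_a = {"species_1": {"g1", "g2"}, "species_2": {"g3"}}
clade_b = {"species_3": {"g1", "g3"}, "species_4": {"g2", "g5"}}

ancestral = infer_root_genes(clade_a, clade_b)
print(sorted(ancestral))
```

In this toy data no extant species has more than two genes, yet three genes are inferred for the ancestor, which mirrors the pattern of a complex ancestor followed by differential loss. The same rule also shows the caveat raised above: a gene that spread across the root by HGT would wrongly be counted as ancestral.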
The origins of eukaryotes
Eukaryotic cells arose from prokaryotes approximately 2 billion years ago [1, 2]. Given the many ways in which eukaryotes differ from prokaryotes, this transition famously has been described as “the greatest single evolutionary discontinuity to be found in the present-day world” [62], although its uniqueness has been challenged [63]. The eukaryotic cell originated between the divergence of the eukaryotic lineage from its prokaryotic ancestors (referred to as the first eukaryotic common ancestor, or FECA) and the eukaryotic species that gave rise to all current-day eukaryotic life (LECA), a process called ‘eukaryogenesis’ (Chapter 5, Figure 1). To shed light on the series of events that comprised eukaryogenesis, it is essential to pinpoint from which prokaryotes the eukaryotes descended. Hence, over the last decades, many studies focused on identifying where eukaryotes are positioned in the tree of life. These have discovered that eukaryotes evolved from within the Archaea [7, 8]. Recently, the archaeal origin of the eukaryotes gained further support from metagenomics studies that identified the closest archaeal relatives of eukaryotes so far, i.e. the Asgard superphylum [9, 10]. This is not the complete story, however: it has long been recognized that mitochondria, the ATP-generating organelles of eukaryotic cells, emerged from the endosymbiosis of a bacterium [17]. Many consider this endosymbiotic event a major or even a key contribution to the evolution of the eukaryotic cell. Unsurprisingly, many efforts have been made to identify the position of this bacterium in the tree of life as well. It is widely recognized to belong to the class of Alphaproteobacteria and recently it was shown to diverge deeply within the alphaproteobacterial species tree, before the diversification of most alphaproteobacterial clades [64]. Hence, eukaryotes stem from an Asgard-related host cell that incorporated an early-branching alphaproteobacterium.
While the source of the mitochondria is more or less resolved, the origins of the other typical eukaryotic features are not. How, why and when did features like the endomembrane system, post-transcriptional processing, linear chromosomes with telomeres, meiotic sex, and the eukaryotic cytoskeleton evolve? And how is their evolution related to the mitochondrial endosymbiosis? Did this endosymbiosis trigger the autogenous evolution of all other eukaryotic features, or did the alphaproteobacterium enter the Asgard-related host when these were already present? Some of these features, or at least their
genetic building blocks, might indeed have been present already; the genomes of Asgard species contain various genes previously considered to be eukaryote-specific, including some that in eukaryotes have roles in for example the endomembrane system [9, 10]. Since we cannot yet culture Asgard cells, we cannot investigate whether these genes fulfill a similar role, i.e. assess whether Asgard cells have something like a (primordial) endomembrane system for which they use these genes. Using phylogenomics, Pittis & Gabaldón found additional evidence for a relatively late entry of the mitochondria, as the host genome might have already encoded various genes playing roles in typical eukaryotic processes [65]. While this was one of the few attempts to systematically assess how complex the host already was, their method was strongly criticized and many continue to consider it more likely that mitochondrial entry was the key event that triggered the evolution of all other typical eukaryotic features [66, 67]. Mitochondria would have provided an ATP-surplus that allowed for an increase in genome and cell size, or would have driven the evolution of the nucleus as a way to cope with the harmful effects of spliceosomal introns [68-70]. A third scenario, coined ‘inside-out’, proposes concomitant evolution of mitochondria and the nucleus [71].
The emergence of eukaryotic cellular complexity coincided with an increase in genome complexity: the genome of a typical eukaryote contains approximately four times as many genes as a prokaryotic genome [68]. Next to the genes that were already present in the common ancestor of the eukaryotes and their archaeal relatives, genes from the alphaproteobacterial endosymbiont also contributed to the eukaryotic genome. Moreover, genes from various other Bacteria likely entered the pre-eukaryotic lineage via horizontal gene transfer [72]. Such horizontally transferred genes include for example certain genes involved in the nuclear pore [73]. Although various previously ‘eukaryote-specific’ genes were detected in Archaea, other eukaryotic genes still appear to have no homologs in prokaryotes [74, 75]. These likely evolved de novo between FECA and LECA. Finally, genes (with and without prokaryotic ancestry) duplicated frequently during eukaryogenesis, thus duplications also contributed to the gene complement of eukaryotes [76]. In Chapters 5 and 6, I examine how these duplications contributed to cellular complexity of eukaryotes and what sort of genes duplicated.
Cell division and chromosome segregation
Cell cycle
Eukaryotes differ from prokaryotes in the way they regulate and execute cell division. Prokaryotes appear to replicate and segregate their genomes, and to divide their cells, in diverse manners [27, 77]. Most eukaryotes have conserved the overall layout of these processes, which is referred to as the eukaryotic cell cycle. The cell cycle is divided into
two parts: interphase, which consists of the G1, S and G2 phases, and mitosis, referred to as the M phase. During G1 and G2, cells primarily grow. During S (‘synthesis’) phase, the nuclear DNA replicates, resulting in duplicated chromosomes that consist of two sister chromatids. These are separated during the M phase, which usually ends with the actual cell division, a.k.a. cytokinesis. Transitions from one phase to another are regulated by kinases (cyclin-dependent kinases, or CDKs) [78], whose essential cofactors (cyclins) are regulated at the transcriptional and translational levels, and by protein degradation. For example, CDK1 and cyclin B collaborate to initiate mitosis. In M phase, sister chromatid separation is regulated by the anaphase-promoting complex/cyclosome (APC/C, see below).
The mitotic machinery
In S phase, sister chromatids of duplicated chromosomes are ‘glued’ to each other by cohesin, a protein complex that encircles them in a ring-like structure. In M phase, when the sister chromatids segregate, cohesin needs to be removed. In order to ensure that each daughter cell receives a complete set of chromosomes, the chromatids need to be segregated equally. Segregation is therefore a strictly regulated process, and missegregation can have detrimental effects on organismal fitness [79]. Chromosomes are segregated by the mitotic spindle (Figure 2A). This spindle consists of four elements: (1) microtubule-organizing centers (MTOCs), which are typically located at opposite ends of the cell, and form the ‘poles’ of the spindle, (2) microtubules, which emanate from these MTOCs, forming the spindle ‘fibers’, (3) kinetochores, which connect the spindle microtubules to the chromosomes, and (4) the duplicated chromosomes themselves, consisting of two sister chromatids bound by cohesin. The sister chromatids are pulled apart by spindle microtubules. In human cells, the midzone of this spindle forms the place where later the cytoplasm will divide during cytokinesis, but eukaryotes differ in how they organize cytokinesis [80]. As stated, for sister chromatid separation, cohesin needs to be removed. This removal only occurs after all chromosomes have their sister kinetochores stably attached to microtubules from opposite spindle poles, referred to as ‘bioriented attachment’. Cohesin is removed through cutting of its ring-like structure by separase. Separase is activated by the APC/C, which itself is activated only when stable, bioriented attachments have been established for all chromosomes. As a result, sister chromatids are only allowed to separate if they are connected to microtubules from the opposite spindle poles. The surveillance mechanism that monitors the kinetochore attachment status is called the spindle assembly checkpoint (SAC) [81].
The SAC directly senses the attachment status of kinetochores: SAC proteins localize to kinetochores if these are unattached, and from these kinetochores they emit a signal that inhibits the APC/C.
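The checkpoint logic described above can be summarized as a small Boolean model. This is a deliberately simplified sketch: it treats attachment as a static state per sister kinetochore and ignores kinetics, signal strength and error correction. The class and function names are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chromosome:
    # Spindle pole ("left" or "right") that each sister kinetochore is
    # attached to, or None if that kinetochore is unattached.
    sister_a: Optional[str]
    sister_b: Optional[str]


def is_bioriented(chromosome: Chromosome) -> bool:
    """Both sister kinetochores are attached, and to opposite poles."""
    attached = chromosome.sister_a is not None and chromosome.sister_b is not None
    return attached and chromosome.sister_a != chromosome.sister_b


def apc_c_active(chromosomes: List[Chromosome]) -> bool:
    """The APC/C is active only when every chromosome is bioriented."""
    return all(is_bioriented(c) for c in chromosomes)


def cohesin_cleaved(chromosomes: List[Chromosome]) -> bool:
    """Active APC/C activates separase, which cuts the cohesin ring."""
    return apc_c_active(chromosomes)


cell = [Chromosome("left", "right"), Chromosome("left", None)]
print(cohesin_cleaved(cell))  # one unattached kinetochore blocks separation

cell[1].sister_b = "right"
print(cohesin_cleaved(cell))  # all chromosomes bioriented
```

The model makes the all-or-nothing character of the checkpoint explicit: a single unattached kinetochore is enough to keep the APC/C inactive for the whole cell.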
As far as is known, all eukaryotes use microtubules to separate their nuclear chromosomes,
[Figure 2. Panel A: mitotic spindle, showing spindle microtubules, sister chromatids, kinetochores, poleward movement and the microtubule-organizing centers (MTOCs). Panel B: human kinetochore, showing the outer kinetochore, inner kinetochore, centromeric DNA and inner centromere, with proteins grouped into these complexes: Chromosomal Passenger Complex (CPC), Rod-Zwilch-ZW10-Spindly Complex (RZZS*), Constitutive Centromere-Associated Network (CCAN), Spindle Assembly Checkpoint (SAC), Mis12 Complex (Mis12-C), TRIP13-p31comet, Ndc80 Complex (Ndc80-C), KMN network, Astrin-SKAP, Knl1 Complex (Knl1-C) and Ska Complex (Ska-C).]
Figure 2. Illustrations of the eukaryotic mitotic spindle and the human kinetochore. A. The mitotic spindle in eukaryotes separates sister chromatids from duplicated chromosomes by pulling forces of spindle microtubules. Spindle microtubules depolymerize at the chromosome-bound end, due to which they shorten and pull the chromatids along towards the microtubule-organizing centers (MTOCs). B. Kinetochores form the attachment site of the microtubules to the chromosomes. The human kinetochore is shown here, and its proteins are colored according to the protein complex they belong to. The budding yeast kinetochore is largely similar. Notable differences include the RZZ (which yeast lacks), the Ska complex (which yeast also lacks, while instead it has the Dam1 complex, see Chapter 4) and the inner kinetochore Nkp complex and Csm1 protein (which yeast has, but human lacks).
which is why LECA probably did so as well. However, eukaryotes differ in the ways they organize their spindle, of which I will describe a few examples here. In metazoan lineages the mitotic MTOCs consist of centrosomes. At the beginning of M phase, the nuclear membrane disassembles (‘open mitosis’), allowing the microtubules from the centrosomes to connect to the chromosomes. In contrast, in budding yeasts the MTOCs are so-called spindle pole bodies, which are embedded in the nuclear membrane. In these species, the nuclear membrane stays intact during mitosis (‘closed mitosis’). In dinoflagellates, which belong to the Alveolata, the nucleus stays intact as well, but the spindle forms in the cytoplasm, and connects to the nuclear envelope-embedded kinetochore [82]. Next to these examples of ‘open’ and ‘closed’ mitosis, intermediate forms also exist. For example, the fungus Aspergillus nidulans has a semi-open mitosis, in which the nuclear pores disassemble but the envelope does not [83]. Moreover, in human, kinetochores connect to 15-20 microtubules [84], whereas in the amoebozoan Dictyostelium discoideum they bind 2-3 [85], and in budding yeasts only a single one [86]. The eukaryotic spindle morphologies are thus highly different [87, 88], due to which it is difficult to infer what sort of mitosis LECA had [89].
Eukaryotic chromosomes have specialized regions, called centromeres, that execute their separation [90]. The centromere is the spot where the kinetochore is assembled and where the chromosomes connect to microtubules. In eukaryotic species with a characterized centromere, the centromere is specified by its alternative nucleosome composition. A ‘regular’ nucleosome consists of DNA that is wrapped around two copies of the proteins H2A, H2B, H3, and H4. A centromeric nucleosome contains CenpA (or ‘CenH3’) instead of H3. Although the centromere is an important structure for cell division, centromeric DNA sequence and centromere size are not conserved [91, 92]. In fact, centromeric DNA evolves tremendously fast [93] and has been observed to expand and contract. This dynamic nature of the centromere is referred to as the ‘centromere paradox’ [94]. It may be explained by adaptive evolution towards a stronger centromere that binds more microtubules in species with unequal female meiosis, a process called ‘meiotic drive’, which may also affect the evolution of centromere-binding proteins such as CenpA [95] (see also Chapter 3). Across many species, however, centromeres are composed of tandem repeats [96]. The DNA sequences of these regions are generally not sufficient to define the centromere. Instead, centromeres are defined epigenetically, which may partially explain why the sequences are relatively free to evolve. Budding yeasts form a notable exception: their centromeres are defined by a short, specific DNA sequence of approximately 120 base pairs [97]. In contrast to these ‘point’ centromeres of budding yeast, most species have ‘regional centromeres’, spanning a larger region of the chromosome. Yet other species, such as nematodes, certain insects and the sedge Rhynchospora pubera, have centromeres that are dispersed along the full length of their chromosomes. These are called ‘holocentromeres’.
Holocentromeres are found to have different configurations [98, 99], varying from dispersed point centromeres [100] to CenpA-lacking centromeres [101] to multiple, interspersed satellite regions [99].
Kinetochore & the inner centromere
The kinetochore connects the sister chromatid to spindle microtubules at the site of the centromere (Figure 2B). In addition, it serves as a signaling platform for the SAC. The kinetochores of human and budding yeast have been studied thoroughly, so that their protein composition is well known [102]. The kinetochore is subdivided into the inner kinetochore, which is closely associated with the centromere, and the outer kinetochore, which captures spindle microtubules and activates SAC signaling. Below, I will describe these regions and the proteins that constitute them. I will also describe inner centromeric proteins, which contribute to the regulation of microtubule attachment. Unless mentioned otherwise, I will primarily describe the most important proteins of the kinetochore that human and yeast share, and that are likely also part of the kinetochores of other eukaryotic clades. In Chapters 3 and 4, the conservation and divergence of these kinetochore proteins across eukaryotes will be examined and discussed in more detail.
The inner kinetochore
The typical CenpA-containing nucleosomes recruit the Constitutive Centromere-Associated Network (CCAN), which is the main constituent of the inner kinetochore. The CCAN bridges the centromeric chromatin to the microtubule-binding site formed by the outer kinetochore. The CCAN consists of CenpC and four protein complexes: CenpT-W-S-X, CenpO-P-Q-U (plus CenpR in human), CenpL-N and CenpH-I-K-M. The CenpA-containing nucleosomes are recognized by CenpC and CenpN [103]. CenpC serves as a scaffold within the inner kinetochore. Furthermore, CenpC recruits, together with CenpT, the microtubule-binding KMN network of the outer kinetochore (Knl1-C, Mis12-C and Ndc80-C, see below) [104, 105]. In addition to CenpC and CenpT, the subcomplex CenpQ-U has been shown to interact with the outer kinetochore KMN network, possibly serving as an extra bridge [106]. CenpU also has a role in recruiting Plk, a protein kinase that operates in the SAC [107]. In addition, the yeast CCAN contains Nkp1 and Nkp2, which constitute the Nkp complex. Whether these proteins have a specific role in the CCAN is currently unknown [108].
The outer kinetochore & the SAC
Kinetochores bind microtubules primarily via the KMN network. This network consists of three complexes: Knl1-C (Knl1-Zwint-1), Mis12-C (Mis12-Nnf1-Dsn1-Nsl1) and Ndc80-C (Ndc80-Nuf2-Spc25-Spc24). The KMN network is recruited by CenpC via Mis12 and by CenpT via Spc24 and Spc25 [109]. Ndc80 is the main microtubule-binding protein of the kinetochore. If Ndc80 is not bound by microtubules, it activates the SAC. It does so by binding to Mps1. Mps1 is an outer kinetochore kinase that governs the SAC. If recruited to the kinetochore, Mps1 phosphorylates Knl1. Phosphorylated Knl1 recruits Bub3 and MadBub (BubR1 in human, Mad3 in yeast). Together, Bub3, MadBub and Mad2, which is recruited by Mad1, bind Cdc20. These four proteins form the Mitotic Checkpoint Complex, or MCC. The MCC effectuates the SAC. In an unbound state, Cdc20 activates the APC/C, but as part of the MCC it inhibits the APC/C. Hence, as long as no microtubule is bound, the kinetochore keeps the APC/C inactive. Consequently, the sister chromatids are not allowed to separate (see above). Once Ndc80 is attached to microtubules, it no longer binds Mps1, thus no more MCC is being produced and the APC/C becomes activated by free Cdc20. The human outer kinetochore also contains the Rod-Zwilch-ZW10 (RZZ) complex, which recruits Mad1 and Spindly. The latter protein serves to ‘clean’ the kinetochore of SAC proteins after Ndc80 has been captured by microtubules. Although Ndc80 is the primary microtubule-binding protein, microtubules also interact with Knl1, with the three-subunit Ska complex (human) and the ten-subunit Dam1 complex (budding yeast). These latter two complexes serve to stabilize the kinetochore-microtubule interactions (see also Chapter 4).
The inner centromere
Although not part of the kinetochore, proteins of the inner centromere are critical for kinetochore function. The inner centromere is the chromatin region between the sister kinetochores. The inner centromere proteins that impact kinetochore function are those that form the Chromosomal Passenger Complex (CPC, Aurora, Incenp, Borealin, Survivin) and Sgo [110, 111]. These proteins destroy erroneous microtubule-kinetochore attachments, like two sister kinetochores being attached to microtubules from the same spindle pole [112]. They do so via the kinase module of the CPC, the protein kinase Aurora B, which phosphorylates Ndc80 to diminish its affinity for microtubules. Aurora B additionally phosphorylates subunits of the Ska and Dam1 complexes [113], which, as a result, detach from the microtubules [114, 115].
In this introduction, I outlined the most important and shared components of the human and yeast kinetochores, which, compared to those of other eukaryotic species, have been studied in great detail. As shown in Chapters 3 and 4, the kinetochores of other eukaryotes differ strongly. In fact, although not identical, the human and yeast kinetochores are relatively similar. I cannot exclude that other eukaryotes contain yet other important kinetochore components, which we, given the lack of experimental data in these species, do not know of yet. The kinetochores of trypanosomes (Excavata) may contain (some) other components [116], but I consider these likely to be derived and not shared with other eukaryotes, which is why I do not describe them here (see also Chapter 7).
Comparative genomics
In the work I describe in this thesis, I used comparative and evolutionary genomics to map the diversity and ancestry of eukaryotic kinetochores, and to uncover the processes that gave rise to the genome of LECA. In Chapter 2, I argue that specific research questions require specific approaches and I delineate my perspective on how to study the evolution of a single protein. Here I provide some key definitions and explain how comparative genomics can be used to detect co-evolution, and how this in turn can be used to predict gene functions.
Over the last decades, the life sciences have become increasingly influenced by genomic data. Whereas initially most sequenced genomes were prokaryotic, because these are small and relatively simple to sequence, now more and more eukaryotic genomes are becoming available. Not only has the number of sequenced eukaryotic genomes increased, but also their diversity. The eukaryotic genome data will likely continue to increase, since sequencing no longer necessitates culturing and since new sequencing projects intend to cover a yet wider, and more representative, genomic diversity [117-119]. By analyzing genomes, we can identify genes or proteins that share common ancestry (‘homologs’), both in different species and within a single species. Homologs shed light on the evolutionary history of proteins and can be used to predict protein functions by transferring functional characterizations across homologs [120]. As such, comparative genomics has proven to be a powerful tool to uncover large-scale evolutionary events, to guide experimental biology and to construct or improve the species tree, to name a few applications [121]. Unsurprisingly, as the amount of genome information grows, so do the number and quality of tools that are used to identify and analyze homologs.
Homology, orthologs, paralogs, analogs
As mentioned, proteins that share common ancestry are called ‘homologs’. Homology does not just apply to proteins (or genes; in the remainder of this chapter these terms can be replaced with one another). It may apply to any pair of biological entities that descend from a single ancestral feature. In comparative genomics, genes/proteins are recognized to be homologous if their aligned (DNA/amino acid) sequences are sufficiently similar. Homology is a boolean trait: either two features are homologous, or they are not. Hence, saying that “proteins A and B are 60% homologous” is strictly speaking incorrect. Often, this phrase is intended to say that 60% of their DNA/amino acid positions are identical or similar. Homology is also a transitive trait, which means that if sequences (in the narrow meaning of: a sequence of nucleotides or amino acids, not necessarily a full protein or gene) A and B are homologous, and B and C are homologous, A and C are homologous as well. We discriminate various types of homology relationships between proteins: homologous proteins might be orthologs or paralogs, depending on which sort of event split them since their common ancestor (see Chapter 2, Box 2) [122]. If two proteins were derived from speciation of an ancestral species, they are orthologous. If they are derived from a gene duplication within a given ancestral species, they are paralogous. Moreover, some homologous proteins underwent a horizontal gene transfer since their last common ancestor. These are sometimes called ‘xenologs’, although logically these can also be classified as either orthologous or paralogous.
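The difference between a quantitative sequence identity and the boolean notion of homology can be made concrete with a small calculation. The following is a minimal sketch, not a method used in this thesis: the function name, the two short aligned sequences and the choice to skip alignment columns containing a gap are all assumptions made for illustration (other conventions for the denominator exist).

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over the ungapped alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0   # columns in which both sequences have a residue
    identical = 0  # compared columns in which the residues are the same
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # assumption: columns with a gap are not counted
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


# Hypothetical aligned fragments, invented for this example.
seq_a = "MKT-LLV"
seq_b = "MRTALLI"
identity = percent_identity(seq_a, seq_b)
print(f"{identity:.1f}% identical")
```

The resulting number describes how similar two aligned sequences are; whether they are homologous remains a yes-or-no inference drawn from that similarity.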
Why is it important to discriminate between orthologs and paralogs, rather than simply using ‘homologs’? To answer this question, we have to consider how we think the function of a protein evolves after either a speciation or a duplication. Ohno, in his famous book Evolution by Gene Duplication, proposed that after a gene duplicates into two paralogs, one of these paralogs is released from the evolutionary pressure to conserve the function of the ancestral, pre-duplication protein (free from ‘purifying selection’) [123]. Hence, its sequence may diverge and it may thereby acquire a new function. In other words: it may ‘neofunctionalize’. As a result, duplication paves the way for molecular and functional innovation. Since Ohno’s publication, many have observed that paralogs may not just ‘neofunctionalize’ but also ‘subfunctionalize’, if the function of the ancestral protein is divided over the duplication-derived paralogs [124]. Nevertheless, subfunctionalization still implies that paralogs often do not retain the (entire) ancestral function. In contrast, orthologs are expected to have identical, or at least biologically equivalent, functions. This model about the alternative ways in which a protein’s function evolves after either a duplication or speciation is known as ‘the ortholog conjecture’ [125]. The ortholog conjecture has been challenged by arguments and observations about an equally strong or stronger functional correlation between paralogs compared to orthologs [126, 127]. Nevertheless, this conjecture is commonly used to rationalize why the function of a characterized protein in a given species may be transferred to an uncharacterized protein in another species: this transfer is considered justified if these proteins are orthologs, not if they are paralogs.
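The rule that orthologs are separated by a speciation and paralogs by a duplication, and the way the ortholog conjecture uses that distinction, can be sketched on a toy gene tree. This is an illustrative sketch under strong assumptions: the tree is taken to be already reconciled (each internal node labelled as a speciation or a duplication), horizontal gene transfer is ignored, and the `Node` class, the function names and the example genes are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Node of a toy gene tree; leaves carry no event label."""
    name: str
    event: Optional[str] = None  # "speciation" or "duplication" for internal nodes
    children: List["Node"] = field(default_factory=list)


def path_to(root: Node, leaf_name: str) -> Optional[List[Node]]:
    """Return the nodes from the root down to the named leaf, or None if absent."""
    if not root.children:
        return [root] if root.name == leaf_name else None
    for child in root.children:
        sub_path = path_to(child, leaf_name)
        if sub_path is not None:
            return [root] + sub_path
    return None


def relationship(root: Node, leaf_a: str, leaf_b: str) -> str:
    """Classify two different leaves by the event at their last common ancestor node."""
    if leaf_a == leaf_b:
        raise ValueError("need two different sequences")
    path_a, path_b = path_to(root, leaf_a), path_to(root, leaf_b)
    if path_a is None or path_b is None:
        raise ValueError("sequence not found in the tree")
    last_common = None
    for node_a, node_b in zip(path_a, path_b):
        if node_a is not node_b:
            break
        last_common = node_a
    return "orthologs" if last_common.event == "speciation" else "paralogs"


def may_transfer_function(root: Node, characterized: str, uncharacterized: str) -> bool:
    """Ortholog conjecture as a rule of thumb: transfer annotation between orthologs only."""
    return relationship(root, characterized, uncharacterized) == "orthologs"


# Hypothetical family: an ancient duplication produced X and Y, each present in two species.
tree = Node("X/Y duplication", "duplication", [
    Node("X speciation", "speciation", [Node("human_X"), Node("yeast_X")]),
    Node("Y speciation", "speciation", [Node("human_Y"), Node("yeast_Y")]),
])

print(relationship(tree, "human_X", "yeast_X"))
print(relationship(tree, "human_X", "yeast_Y"))
print(may_transfer_function(tree, "yeast_X", "human_X"))
```

In this toy tree, human_X and yeast_X are orthologs, whereas human_X and yeast_Y are paralogs even though they come from different species. As noted above, the conjecture itself has been challenged, so the final function encodes a common working assumption rather than a guarantee.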
While orthologs may have biologically equivalent functions, functionally similar proteins may not be orthologous. In fact, they might not even be homologous: many studies have reported proteins with similar functions [128, 129], but without detectable homology. Such proteins are called ‘analogs’. Analogs might arise if one protein, for example encoded by a novel gene, takes over the function of another. As a result, the latter may become dispensable and be lost. We refer to this process as ‘non-homologous displacement’. In some cases, proteins initially thought to be analogs turned out to be homologs. Possibly, their shared ancestry was not detected because their sequences diverged extensively since their common ancestor. In these cases, their solved crystal structures may show the similarity that points to their homology.
The terms ‘homologs’, ‘orthologs’ and ‘paralogs’ are used to describe the relationship between two genes or proteins. However, comparative genomics is about studying many species, and therefore many proteins, at the same time. For example, a common comparative genomics question is: which species have a particular protein? To answer this question, we aim to establish a set of homologous sequences that comprise an orthologous group (Chapter 2, Box 2). Such an orthologous group encompasses all proteins that were derived from a single protein in a selected last common ancestor, such as LECA or the last common ancestor of all animals. Within the orthologous group, two proteins can be paralogs. This is the case if they resulted from a gene duplication that occurred in a more recent lineage than the ancestral species we chose to define the orthologous group for. These proteins are called ‘inparalogs’. Two proteins that result from a gene duplication prior to the last common ancestor of choice are called ‘outparalogs’; they are in different orthologous groups. In this thesis, I in most cases apply comparative genomics to the domain of eukaryotes. Therefore, I generally define the orthologous group as those sequences that were derived from a single protein in LECA. However, not all proteins I study date back to LECA. Some were likely invented more recently. In those cases, the orthologous group only contains sequences from species that descend from the lineage in which the protein was invented.
Phylogenetic profiling & detecting co-evolution
Comparative genomics often aims to find out which species have a given protein, that is, a member of an orthologous group, and which do not. This information can be used to predict this protein’s function based on its predicted (functional or physical) interactors. The presences and absences of a protein are captured by a so-called ‘phylogenetic profile’, which simply is a binary vector, consisting of e.g. ones and zeros, or ‘P’s (presences) and ‘A’s (absences) (Figure 3). Often species are ordered according to their position in the species tree. This phylogenetic profile is used to infer whether the protein of interest co-occurs with another protein across species. If two proteins have identical or very similar phylogenetic profiles, this suggests that they are functionally related, or maybe even interdependent. As a result, phylogenetic profiling can be used to predict functions for uncharacterized proteins in a ‘guilt-by-association’ approach [130], which has proven to be successful for many proteins [28]. Moreover, the phylogenetic profile of a protein can be compared to those of other biological entities, such as a cellular feature, a functional protein motif or a metabolic product. For example, many proteins involved in the centrosome have a phylogenetic profile that is very similar to that of the centrosome itself [131]. Conversely, this method can also be used to find out which (other) proteins are involved in a certain process. Phylogenetic profiles carry information not only if they are very similar, but also if they are very dissimilar [132]. If two proteins are both present in many species, but never co-occur, they might be functionally redundant. After all, if two proteins do exactly the same job, it does not make sense to have them both, especially if they cannot cooperate or alternate. Such proteins might be analogs.
Both similar and dissimilar phylogenetic profiles are the result of evolutionary processes. Similar profiles often indicate that proteins co-evolved, for example because if one of the proteins is lost in a given lineage, the other one cannot function any longer and is lost in this lineage as well. Dissimilar profiles might indicate displacement, which I mentioned above. Phylogenetic profiling hence also provides a means to study a gene’s (co-)evolutionary dynamics, albeit in an indirect manner. Co-evolution can also be more directly studied by comparing phylogenetic trees of proteins: if these are very similar, this points towards co-evolution [28]. In this approach, one is able to time when in evolution a protein was lost, but also when it for example duplicated, and if this coincided with the loss or duplication of another protein. Thereby, one directly observes that these proteins co-evolved, and how they did. Moreover, co-evolution may not only occur at the level of proteins, but for example also at the level of residues [133], or between a protein and a protein module [134]. Such co-evolution can equally well provide functional cues.
[Figure 3: species tree (H. sapiens, C. owczarzaki, S. cerevisiae, F. alba, A. castellanii, D. discoideum, N. gruberi, T. brucei, E. siliculosus, P. falciparum, B. natans, A. thaliana, C. reinhardtii, G. sulphuraria) annotated with gains/presences and losses of proteins A, B and C, next to the phylogenetic profiles (pp) of proteins A, B and C. Correlating pp’s: collaboration? Anticorrelating pp’s: substitution?]
Figure 3. Phylogenetic profiling of proteins reflects co-evolution and might indicate functional interactions. Hypothetical proteins A, B and C were found to be present or absent in certain lineages, as indicated by their phylogenetic profiles (right side). Proteins A and B are often present or absent together, which may indicate that they depend on each other for their function and hence collaborate in current-day species, for example in a single pathway or in a protein complex. Proteins A and C, and B and C, are never both absent or both present in a single species. This might indicate that they perform the same functions in different species as analogous proteins. The phylogenetic profiles of proteins are the results of their evolutionary trajectories, which are represented on the species tree. From this evolutionary reconstruction, one can directly observe that proteins A and B co-evolve: they often get lost on the same branch. Moreover, loss of A and B may either predate or postdate the gain of protein C.
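The comparison of phylogenetic profiles described above can be sketched numerically. The example below is a minimal illustration, assuming the Pearson correlation as similarity measure (one of several possible choices) and using invented presence/absence vectors for eight unnamed species; it does not reproduce the profiles drawn in Figure 3, and it ignores that species are not independent observations because they share ancestry.

```python
from math import sqrt
from typing import Sequence


def profile_correlation(profile_p: Sequence[int], profile_q: Sequence[int]) -> float:
    """Pearson correlation between two binary presence (1) / absence (0) profiles."""
    n = len(profile_p)
    if n != len(profile_q):
        raise ValueError("profiles must cover the same species")
    mean_p = sum(profile_p) / n
    mean_q = sum(profile_q) / n
    covariance = sum((x - mean_p) * (y - mean_q) for x, y in zip(profile_p, profile_q))
    var_p = sum((x - mean_p) ** 2 for x in profile_p)
    var_q = sum((y - mean_q) ** 2 for y in profile_q)
    if var_p == 0 or var_q == 0:
        raise ValueError("a protein present (or absent) everywhere carries no signal")
    return covariance / sqrt(var_p * var_q)


# Hypothetical profiles over the same eight species, in species-tree order.
protein_a = [1, 1, 1, 0, 0, 1, 1, 0]
protein_b = [1, 1, 1, 0, 0, 1, 0, 0]  # mostly co-occurs with A
protein_c = [0, 0, 0, 1, 1, 0, 0, 1]  # present exactly where A is absent

print(f"A vs B: {profile_correlation(protein_a, protein_b):+.2f}")
print(f"A vs C: {profile_correlation(protein_a, protein_c):+.2f}")
```

A strongly positive value (A versus B) is the pattern interpreted as possible collaboration, while a strongly negative value (A versus C) is the pattern interpreted as possible substitution by an analog. Proteins that are present or absent in all species are rejected, because their profiles cannot correlate with anything.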
Scope and outline of this thesis
In the work described in this thesis, I used comparative genomics to study the evolution of eukaryotic proteins since and before the last eukaryotic common ancestor, specifically those proteins that constitute the eukaryotic kinetochore. I mainly applied comparative genomics on a small scale, that is, by manually studying the evolution of individual proteins. In Chapter 2, I explain how I approach this type of study. Thereby, this chapter serves as a guide for other researchers who want to investigate the evolution of a particular protein of interest. Moreover, this chapter reveals how I examined the evolution of various kinetochore proteins in Chapters 3, 4, and 5. In Chapter 3, I report the overall presences and absences of characterized kinetochore proteins across eukaryotic species. I try to explain what (co-)evolutionary dynamics are responsible for these presences and absences. This inventory revealed the unique presence-absence patterns of the analogous outer kinetochore complexes Ska and Dam1. In Chapter 4, I discuss our in-depth study on the evolution of these complexes. I propose they arose from ancient gene duplications and that Dam1 spread via horizontal gene transfer and displaced Ska in the recipient lineages. In the analysis performed for Chapter 3, we furthermore noticed that in addition to the Ska and Dam1 complexes, also other kinetochore complexes contain paralogous proteins. In Chapter 5, we uncover the ancient origins of kinetochore proteins, which revealed that the kinetochore is of mosaic origin and that duplication played an important role in its expansion during eukaryogenesis. In Chapter 6, I generalize the latter topic by addressing how gene duplications during eukaryogenesis contributed to the complex genome of LECA. I studied the prokaryotic origins and the duplication histories of LECA’s genes through phylogenomics. In Chapter 7, I end this thesis with a discussion on the evolutionary phenomena I encountered, their impact on kinetochore evolution and why investigating them can be a challenge.
2 Inferring the evolutionary history of your favorite protein: A guide for cell biologists
Jolien JE van Hooff, Eelco Tromer, Teunis JP van Dam, Geert JPL Kops and Berend Snel
Manuscript submitted
Summary
Van Hooff et al. explain how to approach the evolutionary analysis of individual proteins. By outlining different evolutionary scenarios, they provide an analytical scheme for molecular and cellular biologists who aim to study the evolution of their protein of interest.
Abstract
Comparative genomics has proven a fruitful approach to acquire many functional and evolutionary insights into core cellular processes. Such studies typically yield a set of sequences orthologous to the protein of interest. However, they lack a simple protocol, because different proteins have different evolutionary dynamics, and therefore demand different approaches. For the same reason, automatic approaches to establish sets of orthologs often fall short. We here discuss this challenge from a practical (what are the observations?) and conceptual (how do these indicate what happened in evolution?) viewpoint, with the aim to guide investigators who want to analyze the evolution of their protein of interest. We argue that one should first and foremost generate a scenario for the protein’s evolutionary dynamics, because it aids in choosing the appropriate strategy. By sharing how we draft, test and update such a scenario and how it directs our investigations, we hope to illuminate how to execute molecular evolution studies and how to interpret them.
Introduction
Historically, comparative and evolutionary analysis of genes and proteins has been a versatile and useful approach in molecular biology to guide experiments that investigate the cellular role of human proteins. Typically, their evolutionary interrogation is used to answer the following questions: What is the function of characterized homologs in other species, such as budding yeast? Which residues are conserved and therefore likely functionally important? For example, the telomere function of the human protein Rap1 was predicted from the telomere function of its yeast ortholog [135], and the multiple sequence alignment of S6 kinases identified the TOS motif, which is essential for TOR signaling [136]. Evolutionary analysis can also have more advanced applications (Table 1). The function of a protein might be predicted from ‘guilt-by-association’: if a protein co-occurs nicely with another protein across species, it is likely to play a role in the same process [137]. The structure of a protein might be predicted from a multiple sequence alignment: a multiple sequence alignment might reveal which residues co-evolve, and therefore likely are in close proximity in the protein’s 3D structure [138].
Table 1. Evolutionary examination of a protein serves a variety of purposes.
| Purpose | Example studies | Key challenges | Key observations |
|---|---|---|---|
| Function prediction of human proteins | The human protein Rap1 operates at telomeres [135] | Finding (pairs of) orthologs (Box 2). | Orthologs have biologically equivalent functions. |
| Identifying conserved functional domains/motifs | S6 kinase - TOS motif [136], Nbs1/Ku80/ATRIP - ATM interaction motif [139] | Using correct sequences for the multiple sequence alignment | Functional sequence motifs are often conserved patches in unconserved regions. |
| Understanding function and evolution of biological features | CenH3 and holocentricity [101], Effector proteins, horizontal gene transfer and pathogenicity [140] | Retrieving information on biological (e.g. cellular) features | Protein loss and gain (e.g. via horizontal gene transfer) correlate to feature loss and gain. |
| Reconstructing evolution of function by ancestral sequence reconstruction | GKPID’s ability to bind Pins evolved via a single amino acid substitution [141] | Obtaining a high-quality gene tree | New versus old protein functions can be distinguished based on sequence information |
| Function prediction by co-evolution or phylogenetic profiling | BBS proteins are involved in the cilium [142], TRIP13 has conserved roles in both mitosis and meiosis [14] | Correct identification of presences as well as absences across species | Non-classical phylogenetic profiles predict analogy and bifunctionality |
| Understanding functional divergence after gene duplication | MadBub duplicates diverged into a Bub-like and Mad-like protein through reciprocal domain/motif loss [143, 144] | Allowing for loss of domains in homology searches, obtaining a high-quality gene tree | Duplicate subfunctionalization might be more common than neofunctionalization |
| Predicting gene functions or cellular make-up of non-model organisms | CAMSAP tracks the minus ends of microtubules in Trichomonas vaginalis and Tetrahymena thermophila [145] | Finding (pairs of) orthologs | Orthologs have biologically equivalent functions. |
| Phylogenetically classifying proteins that are part of a large protein family | Kinesins [146], histones [147], ATPases [148], myosins [149, 150], Rab GTPases [151] | Distinguishing different orthologous groups and identifying ancient duplications | Large-scale duplications before LECA. |
| Uncovering ancient origins of eukaryotic cellular features | Prokaryotic origins of various ‘eukaryotic signature proteins’ [10, 152-154] | Detecting remote homology | Characteristic eukaryotic features are encoded by genes shared with Archaea. |
In addition to helping to elucidate function, evolutionary analysis of a protein reveals the evolution of the processes it is involved in. This way, molecular evolutionary biologists study the origins of eukaryotic cellular complexity, just as classic evolutionary biologists study the origins of physiological innovations such as living on land, flight or warm-bloodedness. Studies that unraveled the evolutionary history of proteins have also profoundly altered biological paradigms. They revealed the large contribution of gene duplication to genome expansion and the pivotal role of promiscuous protein domains in shaping signaling networks [155, 156]. In some cases, protein evolutionary analyses yielded results that are relevant to both cellular and evolutionary biology. An example is MadBub. This protein independently duplicated at least 16 times in diverse eukaryotic lineages. Most duplicates diverged by reciprocally losing one of the two biochemical functions of the ancestral protein (‘subfunctionalization’). This observation predicted a function for one of the human duplicates, and it provided a (thus far) unique example of parallel duplication and subfunctionalization [143, 157]. In general, obtaining relevant functional and evolutionary insights requires a strong interplay between the two: projecting functional knowledge onto a protein can explain its evolutionary history, and the evolutionary history aids in understanding or predicting the functions of a protein across species.
We foresee that evolutionary analysis of single proteins will continue to pay off in the future. These analyses generally have the highest quality if performed manually, by studying proteins through in-depth inspection. Automated approaches often have certain biases that make them perform well for some proteins, but not for others [158]. Manual analysis allows one to apply a customized approach suitable to the protein of interest [159]. Cellular biologists would greatly benefit from doing such a manual analysis themselves. In fact, due to their knowledge on a protein and its functional regions, they frequently give complementary evolutionary interpretations compared to bioinformaticians. In collaboration with cellular biologists, we studied diverse eukaryotic proteins, varying from chromatin remodelers and bZIP transcription factors to components of the flagellum and
Box 1. What can happen to a protein during evolution?
Most evolutionary interrogations are focused on, or at least include, determining what happened to a protein and its coding DNA sequence during its evolution across different lineages. By ‘this protein’ we here refer to the protein sequences that constitute an orthologous group (OG), and which protein sequences are ‘this protein’ depends on the level at which the OG is defined (see Box 2).
| Event | Definition | Frequency in eukaryotes compared to prokaryotes | Eukaryotic clades with elevated frequency | Proteins with elevated frequency |
|---|---|---|---|---|
| Genesis/de novo gene origin | Birth of a coding sequence from non-coding sequence | Unknown | Parasitic lineages | Host-parasite interactions |
| Duplication | Duplication of a whole or partial coding sequence, either part of a small-scale or of a whole-genome duplication. | High | Land plants (whole-genome duplications), animals | Regulation (signal transduction, transcription factors), metabolic enzymes, host-pathogen interaction |
| Loss | Pseudogenization and/or elimination of coding sequence | Low | Parasitic lineages, obligatory symbionts | Genes with few interaction partners, lower expression, higher mutation rates |
| Horizontal Gene Transfer (HGT) | Transfer of coding sequence whereby donor and acceptor are not in a parent-offspring relationship | Low | Fungi | Plant pathogenicity effectors, metabolic enzyme clusters |
| Fusion | Joining of two coding sequences into a single ORF. | (Higher than fission) | Metazoa | Signal transduction |
| Sequence divergence | Mutations in coding sequence that alter the AA sequence | High | Intracellular parasites, Caenorhabditis elegans | Genes with fewer interaction partners, lower expression rates; host-pathogen interaction |
kinetochore [14, 160-162]. In our experience, non-bioinformaticians face various challenges, regarding technical practicalities, the knowledge of frequent genome evolutionary events and of the species tree, and, importantly, intuiting how complicated evolutionary histories are represented in a myriad of computational tools and sequence databases. If not overcome, such challenges may cause erroneous inferences such as incorrect, or incomplete, function prediction [159]. In this article we outline what we think are the most important principles for studying (or reading a study about) the evolutionary relationships and history of a protein. We argue that the most crucial yet overlooked skill is to be able to recognize, postulate and revise different evolutionary scenarios, a skill that is typically difficult to automate.
Studying the evolution of a protein: what do we mean?
Although evolutionary analysis of a protein can be applied in different ways (Table 1), it typically involves answering these two main questions: 1) When did this protein originate? 2) Which other current-day species have this protein? These questions come down to asking which events occurred during the evolutionary history of a protein, such as origination, loss and duplication. The evolutionary history also entails sequence divergence and the gain and loss of specific domains or functional protein motifs. For an overview of such evolutionary events, see Box 1. The resolved evolutionary history can for example be used to infer whether the protein is an ancient or a novel component of the cellular process it is involved in, or to predict if after a gene duplication, the protein is likely to have maintained its function. Note that, although these two main questions are seemingly straightforward, the devil is in the semantics. What do we mean with ‘origin’, or with ‘this protein’? As we illustrate in the next sections, we sometimes actually redefine the ‘origin’ and ‘this protein’ during the analysis.
It’s orthologs we (mostly) care about
Unraveling the evolutionary history of a gene entails finding the protein in other species. Note that, particularly at longer evolutionary distances, protein sequences are more informative than DNA sequences. When searching for a given protein in other species, we actually search for orthologs of this protein. Orthologs are homologous proteins that diverged due to speciation. For those not aware of the distinction between different types of homologs, we recommend reading Box 2 (‘Key definitions: Homology, orthology, paralogy, orthologous groups (OGs), inparalogs and outparalogs’). Why are we, and others, often specifically interested in the orthologs, rather than all homologs? The answer is found in the ‘ortholog conjecture’, which states that orthologs have biologically equivalent functions in different organisms, whereas paralogs have different functions due to divergence after gene duplication [122, 125]. After all, if paralogs did not diverge functionally in some way after gene duplication, they are redundant and therefore most often one copy is lost [163]. Proteins that descend from a single ancestral protein together constitute an orthologous group (OG, see also Box 2). For example, proteins may form an orthologous group if they descended from a single protein in the last eukaryotic common ancestor (LECA). In this example, the OG was defined on the level of ‘eukaryotes’, but in other cases one might choose a lower level, like ‘animals’ (single protein in the last common ancestor of all animals), or a higher level, like ‘all cellular life’ (single protein in the last universal common ancestor). For details on the OG composition and the relationships between members of the same OG and between homologous proteins of different OGs, see Box 2. For some proteins, establishing an OG requires making a gene tree (explained in ‘A quick (and dirty) guide to inferring the evolutionary history of a protein’), for other proteins the OG can be established without it. Whether required or not, a gene tree will reveal the evolutionary events that occurred to the protein of interest (Box 1), and feed into subsequent investigations.
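How the chosen OG level changes the relationship between two paralogs can be illustrated with a toy example. The sketch below is an assumption-laden simplification rather than a procedure described in this article: each duplication is summarized by the clade on whose stem lineage it occurred, the taxonomy is a small hard-coded fragment, and the names `PARENT`, `ancestors` and `paralog_type` are hypothetical.

```python
# Toy taxonomy fragment: each clade maps to the clade that directly contains it.
PARENT = {
    "Metazoa": "Opisthokonta",
    "Fungi": "Opisthokonta",
    "Opisthokonta": "Eukaryota",
    "Archaeplastida": "Eukaryota",
    "Eukaryota": "cellular life",
}
KNOWN_CLADES = set(PARENT) | set(PARENT.values())


def ancestors(clade: str) -> list:
    """Clades that strictly contain `clade`, nearest first."""
    lineage = []
    while clade in PARENT:
        clade = PARENT[clade]
        lineage.append(clade)
    return lineage


def paralog_type(duplication_stem_of: str, og_level: str) -> str:
    """Classify paralogs relative to the ancestor used to define the OG.

    `duplication_stem_of` names the clade on whose stem lineage the duplication
    happened, i.e. the duplication predates that clade's last common ancestor.
    `og_level` names the clade whose last common ancestor defines the OG.
    """
    for clade in (duplication_stem_of, og_level):
        if clade not in KNOWN_CLADES:
            raise ValueError(f"unknown clade: {clade}")
    if og_level in ancestors(duplication_stem_of):
        return "inparalogs"   # duplication after the chosen ancestor: same OG
    if duplication_stem_of == og_level or duplication_stem_of in ancestors(og_level):
        return "outparalogs"  # duplication before the chosen ancestor: different OGs
    raise ValueError("duplication lies outside the lineage of the chosen ancestor")


# The same animal-stem duplication, judged at two different OG levels.
print(paralog_type("Metazoa", "Eukaryota"))
print(paralog_type("Metazoa", "Metazoa"))
```

In this toy setting, a duplication on the animal stem yields inparalogs when the OG is defined at the level of eukaryotes (LECA), but outparalogs, and hence two separate OGs, when the OG is defined at the level of animals.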
Here we will focus on the evolutionary investigation of proteins involved in eukaryotic cell biology. Therein, we assume we search for orthologs of a given human protein in eukaryotes, and aim to infer the evolutionary events that happened since LECA (although we also touch upon what happened before). However, the same principles hold when studying e.g. a budding yeast protein, or studying the evolution of a protein since the last common ancestor of animals. We assume readers have general knowledge of the eukaryotic species tree (Box 3), sufficient to use more detailed resources like NCBI taxonomy (https://www.ncbi.nlm.nih.gov/taxonomy) [15] or the Tree of Life Web Project (http://tolweb.org/tree/) [164]. Also, we assume readers have some basic experience with computational analyses, such as similarity searches with NCBI’s BLASTP [15] or HMMER’s phmmer [165] and multiple sequence alignments, such as Clustal Omega at the EMBL-EBI web interface [166]. Such skills provide a solid base to execute the research strategies that we propose. We will outline the initial steps of the analysis, their potential results, and what these results suggest about the protein’s evolution. The latter, in turn, will guide subsequent steps in the analysis. We will elaborate on how we iterate from initial results to a final hypothesis using different approaches, which we select based on the specific challenges posed by the protein of interest.
A quick (and dirty) guide to inferring the evolutionary history of a protein
We here describe how to infer the evolutionary history of a protein from the perspective of a cellular biologist studying a eukaryotic cellular process in humans. This researcher
Box 2. Key defi nitions: Homology, orthology, paralogy, orthologous groups (OGs), inparalogs and outparalogs. Homology refers to an evolutionary relationship between two biological features. Features are homologous if they descend from a single common ancestral origin, regardless of whether they are morphological structures or molecules. Hence, proteins that do not share a common ancestor but that do share a similar function should not be referred to as ‘functional homologs’: those proteins are better named ‘analogs’. Homology is a qualitative trait, not a quantitative one, such that saying that “sequences 2 A and B are 60% homologous” does not make sense. In addition, it is a transitive trait, which means that if sequences (in the narrow meaning of: a sequence of nucleotides or amino acids, not necessarily a full-length protein or gene) A and B are homologous, and
A Homo sapiens (Metazoa)
Acanthamoeba castellanii (Amoebozoa) 1 LECA
Naegleria gruberi (Excavata) 2
Protein X Plasmodium falciparum (Alveolata)
Protein Y
Protein Z Arabidopsis thaliana (Archaeplastida) LECA: last eukaryotic common ancestor
B OG protein X homo_sapiens X X acanthamoeba_castellanii X naegleria_gruberi X arabidopsis_thaliana Xα arabidopsis_thaliana Xβ 1 homo_sapiens Yα homo_sapiens Yβ Y acanthamoeba_castellanii Yβ plasmodium_falciparum Y OG orthologous group 2 OG protein Y arabidopsis_thaliana Y speciation acanthamoeba_castellanii Z Z duplication naegleria_gruberi Z inparalogs plasmodium_falciparum Z outparalogs (not all indicated) OG protein Z arabidopsis_thaliana Z orthologs (not all indicated)
other homologs (eukaryotic or prokaryotic)
may aim to perform such an analysis in order to for example make a multiple sequence alignment based on the correct sequences, or to fi nd which other organisms have the protein of interest. A standard evolutionary analysis ideally consists of searching for 34 Chapter 2 Inferring the evolutionary history of your favorite protein: A guide for cell biologists 35
B and C are homologous, A and C are homologous as well. In molecular evolution of eukaryotes, the two most important ways in which two proteins can be related are by speciation (orthology) or by duplication (paralogy). Proteins are orthologs if they result from a species divergence, and they are paralogs if they result from a gene duplication in a given species. The terms ‘homologs’, ‘orthologs’ and ‘paralogs’ are used to describe the relationship between two genes or proteins. However, comparative genomics is often about studying many species, and therefore many proteins, at the same time. Therefore, we use the concept of the orthologous group (OG). An OG encircles the set of proteins that diverged from a single protein in the last common ancestor, such 2 as ‘the last eukaryotic common’ (LECA) or ‘the last common ancestor of all animals’. Within the orthologous group, two proteins can be paralogs if they resulted from a gene duplication that occurred in a more recent lineage than the ancestor we chose to defi ne the orthologous group for. These proteins are called ‘inparalogs’. Two proteins that result from a gene duplication prior to the last common ancestor of choice are called ‘outparalogs’; they are in different orthologous groups.
A. The evolution of hypothetical proteins X (green), Y (violet) and Z (orange) before and after LECA, projected onto the species tree. Before LECA an ancestral protein duplicated twice (‘1’ and ‘2’) to give rise to three proteins in LECA. These duplications gave rise to the protein orthologous groups (OGs) X, Y, and Z. After LECA, protein X duplicated in the lineage leading to Arabidopsis thaliana, and protein Y duplicated before the common ancestor of Acanthamoeba castellanii and Homo sapiens. Various eukaryotic lineages lost a protein, e.g. human lost Z. The bullets behind the species indicate the supergroups to which they belong (Box 3). B. Gene tree of hypothetical proteins X, Y and Z, rooted on distant homologs. Regularly, the nodes in the tree that unite sequences from the same species are inferred to be duplication nodes, the nodes in the tree that unite sequences from different species are mostly speciation nodes. However, when the protein topology differs from the species topology, ancient duplication nodes and lineage-specifi c losses may have to be inferred, although there are no sequences from the same species on both sides of these nodes [36]. Duplication nodes prior to LECA separate outparalogs, duplication nodes after LECA separate inparalogs. Proteins that belong to the same OG descent from a single protein in LECA (internal nodes ‘X’, ‘Y’ and ‘Z’). Note that within this fi gure, not all pairs of outparalogs and pairs of orthologs are indicated. For example, protein Y of A. thaliana and protein Z of A. thaliana are also outparalogs. The protein Z of A. castellanii and Plasmodium falciparum form a pair of orthologs, so do proteins Yα of H. sapiens and Y of P. falciparum. Since Yβ (H. sapiens) is also orthologous to Y of P. falciparum, human Yα and Yβ are called ‘co-orthologs’ to this P. falciparum protein.
homologous sequences, aligning these, using this multiple sequence alignment to infer a gene tree, and interpreting this tree in light of the species tree to reconstruct the evolutionary history of the protein (‘tree reconciliation’). While method sections in scientifi c 36 Chapter 2 Inferring the evolutionary history of your favorite protein: A guide for cell biologists 37
papers read like this, the analysis often requires more steps and not-so-straightforward decisions, which are not explicitly reported. In fact, most analyses will have gone through several iterations, while often only the last one is documented. In order to shed light on this process, we here discuss how we start and proceed with our analysis. In this process, we continuously take into account a scenario (a.k.a. draft hypothesis or model) for the evolution of the protein of interest. We do not provide a guide on how to use specific bioinformatics tools, but aim to provide practical stepping-stones and their conceptual underpinnings.

The draft hypothesis is based on initial observations
Inferring the evolutionary history of a protein starts with generating the observations. These basic observations provide information to draft a hypothesis, which in our case is the evolutionary scenario for the protein. Such a scenario should account for potential technical problems, because these problems might prevent us from directly deducing the ‘true’ evolutionary history of a protein from the results. We acquire the first basic observations by collecting information about the protein of interest, such as its domain architecture, functional sequence motifs, and suggested orthologs from experiments in other species or from databases with precomputed comparative genomics information, such as InParanoid [167], PANTHER [168] or Ensembl Compara [169]. Then, we perform one or a few similarity searches with the protein sequence of interest (‘the query’) using for example BLASTP or phmmer [165, 170] (Figure 1A). In order to obtain results that are feasible to interpret, we recommend selecting a subset of species, for example 20 species from diverse eukaryotic clades including at least human. In Box 4, we suggest such a species set. The hits we obtain are likely homologous to our protein if they have a low E-value (e.g. below 0.001 or 0.01, see also ‘Life is more complicated: Exploring and validating grey zone hits’) [171]. The obtained homologs will tell us (1) which organisms contain homologs, and which do not, and (2) how many homologs are present in human (and other vertebrates or animals, if included in the search set). Together, the external information and these initial results enable us to draft an initial scenario for the evolutionary history of our protein of interest. This initial scenario can broadly be categorized into four classes. After having determined which scenario applies to our protein of interest, we can decide how to infer a gene tree (what sequences to use for it) and what sort of information we expect to retrieve from this tree. Moreover, some proteins demand more complicated strategies, which we discuss in ‘Life is more complicated’.
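The two questions above (which species contain homologs, and how many homologs each species has) can be answered by tabulating the search output. The following is a minimal sketch, assuming the search was exported in the 12-column BLAST tabular format (`-outfmt 6`, with the subject identifier in column 2 and the E-value in column 11). The identifier layout `species|protein_id`, the function name and the example lines are hypothetical and only serve as illustration.

```python
from collections import defaultdict

def summarize_hits(tabular_lines, max_evalue=1e-3):
    """Group significant hits per species from BLAST tabular (outfmt 6) lines.

    Assumption (hypothetical): subject identifiers look like
    'species_name|protein_id'. Adapt the parsing to your own database.
    """
    hits_per_species = defaultdict(list)
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip empty lines and comment lines
        fields = line.split("\t")
        subject_id = fields[1]       # column 2: subject sequence identifier
        evalue = float(fields[10])   # column 11: E-value
        if evalue <= max_evalue:
            species = subject_id.split("|")[0]
            hits_per_species[species].append((subject_id, evalue))
    return dict(hits_per_species)

# Made-up example lines, not real search results.
example_lines = [
    "query\thomo_sapiens|X1\t100.0\t200\t0\t0\t1\t200\t1\t200\t1e-120\t400",
    "query\thomo_sapiens|X2\t45.0\t180\t90\t3\t5\t190\t8\t185\t2e-30\t120",
    "query\tarabidopsis_thaliana|A1\t30.0\t150\t100\t5\t10\t160\t12\t170\t0.5\t30",
]
summary = summarize_hits(example_lines)
for species, hits in sorted(summary.items()):
    print(species, len(hits))
```

Hits above the cutoff are dropped here for simplicity; in practice they may still be worth inspecting as grey zone hits.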
[Figure 1 graphic. Panel A: a query sequence (>PROTEIN_X_HUMAN) is used in a similarity search against a proteome database of Homo sapiens, Xenopus tropicalis, Drosophila melanogaster, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Arabidopsis thaliana, Plasmodium falciparum and Trichomonas vaginalis, shown with their species tree; the hits are used to make an msa and a gene tree. Panel B, Scenario 1 (easy): hits shaded from highly to weakly similar, the protein origin marked on the species tree, and a gene tree containing homo_sapiens, xenopus_tropicalis, drosophila_melanogaster, schizosaccharomyces_pombe and arabidopsis_thaliana.]

Figure 1. Outline of the initial analysis steps and their results in case of scenario 1: easy. A. After having collected basic protein information, the initial analytic step is to perform sequence similarity searches with a query sequence (here illustrated as a human sequence) against a proteome database. This database may consist of a subset of species, ideally species of which we know the phylogenetic relationships, as indicated by the species tree. The colored bullets indicate the supergroups to which the species belong, as in Box 3. Here we illustrated the proteome database as a subset of eukaryotic species. The search yields sequence similarity hits that inform about potential scenarios. After hit protein sequences are collected, a multiple sequence alignment (msa) and gene tree can be inferred. B. The easy scenario is hinted at by highly similar hits, mostly single hits in a variety of species. Likely the protein of interest is old and relatively well conserved at the sequence level. Not all species in the database might have it due to gene loss. As a result, the gene tree does not include all species. NB: The gene tree within the species tree in the left panel is an illustration of the protein’s evolutionary history. Unlike the gene tree in the right panel, the gene tree within the species tree has a root that indicates the protein’s origin and branches for lost protein sequences. The gene tree indicated here does not contain branch length values and support values, such as bootstrap percentages. Furthermore, while we here use the term ‘gene tree’, because the evolution occurs at the genetic level, it also describes the evolution of the protein as a consequence of that. Generally amino acid sequences are being used to infer this gene tree, particularly when studying evolution on longer evolutionary timescales. These remarks also apply to Figure 2A-C.

[Figure 2 graphic. Columns: search hits (shaded from highly to weakly similar) and gene tree; protein origin and protein duplication are marked. Panel A: Scenario 2, taxonomically limited? Panel B: Scenario 3, lineage-specific duplication. Panel C: Scenario 4, ancient duplication.]

Figure 2. Similarity search results and expected gene tree for three alternative scenarios. A. The taxonomically limited (?) scenario delivers few hits in the sequence similarity searches, only in species closely related to that of the query sequence. Although this might suggest a recent origin (upper left star), the protein might in fact be older (faded star below). This protein might evolve rapidly in different lineages, due to which sequence similarity is below significance and hence various orthologs are not seen in the similarity output. If no further, more advanced searches are executed, the gene tree includes only sequences from closely related species. B. The lineage-specific duplication scenario predicts highly similar, multiple hits in the query species and in species that are closely related. The gene tree aids in determining when this duplication occurred. In this case, this duplication took place in a common ancestor of vertebrates and fruit fly, after their divergence from fungi. This may have been the common ancestor of animals. C. The ancient duplication scenario predicts multiple hits across the species that are part of the database, in which one has a higher similarity to the query sequence than the other(s). In this scenario, sequences are hit from multiple OGs (Box 2) and a gene tree is needed to discriminate which sequences belong to which OG.

Scenario 1 (‘easy’): Clear initial observations that generate a simple scenario
The first scenario is straightforward, and comes with clear-cut observations. If a single search gives us highly significant single hits across a range of species, these results indicate that the protein is relatively well conserved (Figure 1B). Moreover, the protein has a clear single point of origin: since likely no ancient duplications occurred, the protein can be safely inferred to have originated in or before the last common ancestor of the species containing the homologs, which then by definition are also all orthologs. Being present in eukaryotes that are most distantly related to human, such as Plasmodium falciparum or plants (see Box 3), thus implies an origin in or before LECA. If the protein is short (e.g. < 150 amino acids) or if the sequence divergence is somewhat high (as observed in the first search) we often use a single iteration of an iterative searching tool like PSI-BLAST or jackhmmer [165, 170]. Such an iterative search may help to find some additional orthologs in species that were absent from the results of the first search. Iterative searching tools make a ‘sequence profile’ of the protein, based on the multiple sequence alignment of the hits found in the first search. They then use this profile to again search through the database. Because this profile contains much more information than a single query sequence, it is more sensitive and able to detect divergent orthologous sequences. With such a profile-based iteration we might obtain a similar result as with a single search applied to a longer or less diverged sequence, implying the same, simple evolutionary scenario. An example of a protein that adheres to this straightforward scenario is Incenp, a protein that localizes to the inner centromere, which is present as single copy in the majority of eukaryotes [14].
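To illustrate why a sequence profile carries more information than a single query, the sketch below builds a toy position-specific scoring profile from a small ungapped alignment and scores candidate sequences against it. This is a deliberately simplified illustration and not the algorithm used by PSI-BLAST or jackhmmer: the uniform background, the pseudocount, the function names and the toy sequences are all assumptions.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def build_profile(aligned_seqs, pseudocount=1.0):
    """Per-column log-odds scores versus a uniform background (gaps ignored)."""
    length = len(aligned_seqs[0])
    background = 1.0 / len(ALPHABET)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_seqs if seq[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile

def score(profile, candidate):
    """Sum of column scores for an ungapped candidate of the same length.

    The candidate may only contain the 20 standard amino acid letters.
    """
    return sum(column[aa] for column, aa in zip(profile, candidate))

# Toy alignment of hypothetical hits from a first search.
toy_alignment = ["MKLV", "MRLV", "MKIV"]
toy_profile = build_profile(toy_alignment)
print(score(toy_profile, "MRIV"))  # combination of residues seen in the hits
print(score(toy_profile, "WWWW"))  # residues never seen in the hits
```

A candidate such as MRIV, which is not identical to any single sequence in the toy alignment, still scores well because each of its residues was observed at that position in at least one hit.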
Scenario 2 (‘taxonomically limited?’): Limited taxonomic distribution, protein recently invented or homologs that are difficult to detect?
We hypothesize the second scenario if we retrieve only hits in closely related organisms, with low sequence similarity (e.g. a <75% similarity hit in mouse when searching with a human protein sequence). At first sight this might indicate a recent origin of this protein. However, it more likely reflects the evolutionary history of an older protein that diverged rapidly and therefore does not allow us to detect the more distant orthologs in the first search (Figure 2A). Proteins that are typically prone to such detection failures are those having many non-globular regions (mainly coiled-coil or intrinsically disordered). In fact, many proteins are much older than suggested by simple similarity searches such as BLASTP. Under closer bioinformatic or biochemical scrutiny, these proteins often can be shown to have orthologs in more distantly related species than observed before [172]. Truly lineage-specific proteins are the exception, especially if examining proteins involved in core cellular processes. Advanced search strategies to cope with this scenario are discussed in ‘Life is more complicated’, under ‘Exploring and validating grey zone hits’.
Scenario 3 (‘lineage-specific duplication’): Multiple human sequences are highly similar and point to lineage-specific duplication(s)
We encounter the third scenario if we search with a human query protein and find hits in human with a higher similarity than hits in more distantly related species. Such results indicate that the protein underwent recent lineage-specific duplications, for example in the ancestor of animals (Figure 2B). Although this evolutionary scenario is easy to comprehend, it does not necessarily give an easy answer to the question “when did this protein originate”. If the functional differentiation between these duplicates is minimal, they likely both still have the function of the pre-duplication protein. Hits in species that diverged before this duplication (such as a hit in the choanoflagellate Salpingoeca rosetta in the example of a duplication in the metazoan ancestor, see Box 4) are also expected to perform this ancestral function, and therefore we would also consider these hits to be the “same” protein. However, if there was functional innovation in the lineage leading to the query protein (‘neofunctionalization’) [124], it makes more sense to consider those pre-duplication hits not to be the “same protein”. For example, haemoglobin evolved after duplication of an ancestral globin in the vertebrate lineage [173]. Hence, while globins clearly exist in other animals and eukaryotes, it is not very meaningful to posit that plants contain haemoglobin, even though the plant globins are technically orthologous to it (see Box 2). In this case, we might opt to define the orthologous group not on the level of all eukaryotes, but on the level of vertebrates. To distinguish between these two alternative histories, we use functional information about the protein of interest, about the (human) paralog and about the pre-duplication orthologs to assess whether the protein of interest neofunctionalized. Furthermore, in order to pinpoint the timing of the duplication (was it indeed in the common ancestor of vertebrates, or maybe already in the ancestor of all chordates?), we need to make a gene tree.
Scenario 4 (‘ancient duplication’): Similar proteins in all eukaryotes likely are paralogs that result from an ancient duplication
In the fourth scenario, the search result contains homologs in human, but many hits in more distant eukaryotes are more similar to the query than these human sequences (Figure 2C). This suggests that one or more gene duplications before LECA resulted in two or more proteins that probably functionally diverged, and that each gave rise to a separate orthologous group (OG, see Box 2). After LECA, these proteins likely conserved their functions in different eukaryotic lineages. This scenario applies, for example, to tubulins and Rab GTPases [151, 174, 175]. In case of well-conserved folds, such as kinases, protein sequences that descended from different proteins in LECA are highly likely to be mingled in the similarity search output. In order to assign these sequences to the correct OG, and hence to establish the OG of the protein of interest, we have to make a gene tree.
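The four scenarios can be read as a rough decision rule on the hit list. The sketch below is a crude heuristic that paraphrases the descriptions above; it is an assumption-laden illustration (the function name, the use of bit scores as the similarity measure, and the notion of a user-supplied set of close relatives are ours), and it cannot replace inspecting the hits and making a gene tree.

```python
def draft_scenario(hits, query_species, close_relatives):
    """Suggest one of the four draft scenarios from a similarity search.

    hits: list of (species, bitscore) pairs, excluding the query's self-hit.
    close_relatives: set of species considered closely related to the query.
    """
    paralog_scores = [s for sp, s in hits if sp == query_species]
    distant_scores = [s for sp, s in hits
                      if sp != query_species and sp not in close_relatives]
    if not distant_scores:
        # Only hits in the query species or its close relatives.
        return "scenario 2: taxonomically limited?"
    if not paralog_scores:
        # Hits across distant species, no additional homolog in the query species.
        return "scenario 1: easy"
    if max(paralog_scores) > max(distant_scores):
        # A homolog in the query species is more similar than any distant hit.
        return "scenario 3: lineage-specific duplication"
    # Distant hits are more similar than the homolog in the query species.
    return "scenario 4: ancient duplication"

# Made-up bit scores for illustration only.
relatives = {"xenopus_tropicalis"}
print(draft_scenario([("xenopus_tropicalis", 400), ("arabidopsis_thaliana", 300)],
                     "homo_sapiens", relatives))
print(draft_scenario([("homo_sapiens", 90), ("arabidopsis_thaliana", 250)],
                     "homo_sapiens", relatives))
```

Real results are often mixtures of these patterns, so the returned label is best treated as a starting hypothesis to be tested with a gene tree.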
How to make and use gene trees under the different scenarios?
Although particularly important for the third and fourth scenario, we also make and interpret (‘reconcile’) gene trees for proteins we assign to the other scenarios. Note that while we here use the term ‘gene trees’ (this term highlights that they model the evolution of a gene, as opposed to the ‘species tree’), for longer evolutionary distances we typically use protein (amino acid) sequences to make such a gene tree. We make gene trees for two reasons. First, by making a tree we test whether the proposed scenario was actually the correct one. The lineage-specific and ancient duplication scenarios will clearly yield very different gene trees (Figure 2B,C, right-sided panels). Second, the gene tree uncovers the evolutionary history of the protein in greater detail. Importantly, when making the gene tree, we should be sure that the sequences that we input are in fact homologous to one another: only homologous protein sequences can be aligned and phylogenetically scrutinized.

Box 3. Eukaryotic phylogeny
Eukaryotic cells are generally estimated to have arisen from prokaryotes approximately 2 billion years ago [1, 2]. All eukaryotes descend from a single common ancestor, the so-called ‘last eukaryotic common ancestor’, or LECA. Eukaryotes evolved from within the Archaea [7, 8], likely from the newly identified Asgard superphylum [9, 10]. Alphaproteobacteria also had a role in eukaryotic evolution: an alphaproteobacterium became the endosymbiont of this Asgard-related archaeal lineage and would evolve into the mitochondrion [17]. Eukaryotes often are taxonomically categorized into five different supergroups [29]: Opisthokonta (animals and fungi), Amoebozoa, Archaeplastida (algae and land plants), SAR (Stramenopila, Alveolata and Rhizaria) and Excavata. The exact position of the root of eukaryotes and the definitions of the supergroups are still under debate [31-33], but these uncertainties do not compromise the ideas we put forward in this work.

[Box 3 figure: species tree with LUCA at the root, Bacteria and Archaea as the two primary domains, and LECA arising from within Archaea and giving rise to Opisthokonta, Amoebozoa, Archaeplastida, SAR and Excavata.]

Species tree showing the position of the eukaryotes in the tree of life. According to the two primary domains model, the last universal common ancestor (LUCA) diverged to give rise to Bacteria and Archaea, and the latter gave rise to the pre-eukaryotic host of the pre-mitochondrial endosymbiont [16]. Current-day eukaryotes descend from a single common ancestor, called the last eukaryotic common ancestor (LECA). After LECA, eukaryotes diverged quickly into the major lineages that we refer to as ‘supergroups’: Opisthokonta, Amoebozoa, Archaeplastida, SAR (Stramenopila-Alveolata-Rhizaria), Excavata.
LUCA: last universal common ancestor; LECA: last eukaryotic common ancestor; SAR: Stramenopila-Alveolata-Rhizaria.

Although we use the gene tree to test the predicted scenario, our predicted scenario itself determines how we execute the gene tree analysis. In the simple and taxonomically limited scenarios, we simply input all homologs found, or, if the database is very large, we select a subset that represents the diversity of the hits and the species in which they are found. In the lineage-specific duplication scenario, if we think a gene duplicated in e.g. animals, we include all sequences from organisms that diverged just before (e.g. the choanoflagellate S. rosetta) and after (e.g. the sponge Amphimedon queenslandica) the hypothesized duplication. Conversely, under the ancient duplication scenario we include two or more sequences from various more distant genomes, while ensuring that each eukaryotic supergroup is well-represented [32]. After having constructed the tree, we compare it to the initially proposed scenario in order to assess whether it was drafted correctly. Moreover, we use the tree to explain in detail what happened during the evolution of this protein (Box 1). When, and in which ancestral lineages, did it duplicate, was it lost, or was it even horizontally transferred? This process of interpreting the gene tree by comparing it to the species tree is called tree reconciliation, on which detailed guides can be found in bioinformatic textbooks [176, 177]. Tree reconciliation can be quite challenging, particularly if the gene tree contains errors (see ‘Life is more complicated: Gene trees may be incorrect’). New computational methods are being developed regularly in order to facilitate tree reconciliation [178].
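A first pass at labelling the nodes of a gene tree can be sketched with the simple ‘species overlap’ rule: a node whose two child clades share at least one species is called a duplication, otherwise a speciation. The example below is a minimal illustration on a rooted binary tree written as nested tuples; the leaf naming scheme `species|protein`, the function names and the toy tree (loosely modelled on protein Y in Box 2) are assumptions. It is not a full reconciliation: it does not infer losses or horizontal transfers, and it misses duplications that are only revealed by a conflict between gene tree and species tree topology.

```python
def species_of(leaf):
    """Leaf labels are assumed to look like 'homo_sapiens|Ya' (hypothetical)."""
    return leaf.split("|")[0]

def annotate(tree, events=None):
    """Return (species_set, events) for a rooted binary tree of nested tuples.

    Events are collected in post-order as (kind, sorted species list), where
    kind is 'duplication' if the two child clades share a species and
    'speciation' otherwise.
    """
    if events is None:
        events = []
    if isinstance(tree, str):
        return {species_of(tree)}, events
    left, right = tree
    left_species, _ = annotate(left, events)
    right_species, _ = annotate(right, events)
    kind = "duplication" if left_species & right_species else "speciation"
    events.append((kind, sorted(left_species | right_species)))
    return left_species | right_species, events

# Toy gene tree, for illustration only.
toy_tree = (
    ("homo_sapiens|Ya", ("homo_sapiens|Yb", "acanthamoeba_castellanii|Yb")),
    ("plasmodium_falciparum|Y", "arabidopsis_thaliana|Y"),
)
_, toy_events = annotate(toy_tree)
for kind, species in toy_events:
    print(kind, species)
```

In the toy tree, the node uniting the two human sequences is labelled a duplication, which places the duplication before the divergence of the two species found below that node.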
Life is more complicated
While executing the evolutionary analysis as described above, we often encounter challenges in either the data or in the analysis itself. We consider it important to be aware of these issues, in order to recognize them and apply the appropriate strategies to cope with them, or to make the appropriate disclaimers in a manuscript.
Multi-domain proteins
The protein of interest may consist of multiple domains, which complicates the analysis if each of these domains has its own evolutionary history. The latter is quite common, given that domains seem to fuse more often than they split [179, 180], and the evolution of new domain architectures has been relatively frequent in metazoa [181]. Different evolutionary histories may be hinted at by inconsistencies in the similarity search results, for example if hits from different taxa align to different regions of the query. If the domains indeed have different histories, we suggest three ways to investigate and represent the evolutionary history of the protein. First, if all domains are relevant, all deserve a thorough investigation and representation of their evolutionary history. Second, if one domain can functionally be considered the main domain and the other an accessory domain, we perform primarily a detailed analysis of the main domain. Afterwards, we may infer when in evolution the accessory domain was acquired and/or lost. This is the way others and we represented the evolution of the kinetochore protein kinase Mps1, which in human is joined to a TPR domain (Figure 4). We also approached the evolution of human RasGRP proteins this way, whereby the Cdc25 homology domain, the main domain, fused to different other domains in different lineages [182]. Third, sometimes the domain combination itself is relevant for the biological function of interest. In this case, although the composite domains each have their own possibly byzantine history, only after they fused “this protein” arose, and therefore we focus on determining the phylogenetic time point of this fusion and on the evolution thereafter. Although orthologous sequences in species that diverged before the fusion likely exist (having only one of the domains), these are probably not very informative. An example of this third case is R-spondin, a protein of the Wnt signaling pathway, that only came into existence after two Fu domains joined a TSP1 domain in the ancestor of deuterostomes [183].
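The inconsistency mentioned above, hits from different taxa aligning to different regions of the query, can be made visible by recording which query domains each hit covers. The following is a minimal sketch; the domain names, coordinates, overlap threshold and hits are hypothetical placeholders, and in practice the domain boundaries would come from a domain annotation of the query.

```python
def domains_covered(hits, domains, min_overlap=0.5):
    """Report per taxon which query domains are covered by its hits.

    hits: list of (taxon, query_start, query_end), 1-based inclusive.
    domains: {name: (start, end)} on the query, 1-based inclusive.
    A domain counts as covered if a hit spans at least min_overlap of it.
    """
    coverage = {}
    for taxon, qstart, qend in hits:
        for name, (dstart, dend) in domains.items():
            overlap = min(qend, dend) - max(qstart, dstart) + 1
            if overlap >= min_overlap * (dend - dstart + 1):
                coverage.setdefault(taxon, set()).add(name)
    return coverage

# Hypothetical query with two domains and made-up hit coordinates.
toy_domains = {"domain_A": (50, 200), "domain_B": (520, 790)}
toy_hits = [("fungi", 1, 180), ("plants", 500, 800), ("animals", 1, 800)]
for taxon, names in sorted(domains_covered(toy_hits, toy_domains).items()):
    print(taxon, sorted(names))
```

If some taxa cover only one domain and other taxa only the other, that pattern suggests analysing the domains separately before drawing conclusions about the full-length protein.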
Exploring and validating grey zone hits
As discussed in the taxonomically limited scenario, quite often we expect a protein to be widely present, e.g. because it is involved in a core cellular process, but end up with a set of sequences from only a limited set of species. This often occurs if the protein under investigation is largely unstructured, and therefore its amino acid sequence is poorly conserved. In this situation, we often scrutinize hits with a higher E-value than the default cutoff, which we call ‘grey zone’ hits. Such a grey zone hit may be supported by additional, less conventional similarity searches. For example, we may search iteratively during multiple rounds, thereby starting the search not with the initial (e.g. human) query sequence, but with another trusted ortholog that may serve as a ‘bridge’. In our experience, searching with another sequence occasionally yields orthologs in many species that previously had appeared to lack them. We select such ‘bridging’ queries by the taxonomic position of their species. To verify whether indeed no ortholog exists in a given species, we search with a trusted ortholog in a species that is closely related, for example in another ascomycete in case of budding yeast. Sometimes, newly found sequences themselves will trace even other, previously undetected sequences (homology is a transitive feature, Box 2). For example, in order to establish whether the Sos7 protein (fission yeast) is orthologous to the Kre28 protein (budding yeast), we searched with various other ascomycete fungal hits of Sos7, and this indeed brought us to Kre28 (Figure 3). Similarly, stepping from one orthologous sequence to another also helped us to identify orthology between Sos7 and Zwint-1 (human).
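Because homology is transitive, a chain of ‘bridging’ searches can be represented as a path through a graph in which each sequence points to the sequences it hits. The sketch below finds the shortest such chain with a breadth-first search. The graph, the sequence names and the function name are hypothetical; in practice each edge would be a hit that was individually judged trustworthy.

```python
from collections import deque

def bridging_path(hit_graph, start, target):
    """Shortest chain of hits linking start to target, or None if there is none.

    hit_graph: {sequence_id: [sequence_ids it hits in a similarity search]}.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbor in hit_graph.get(path[-1], ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Hypothetical hit graph: the query does not hit the target directly.
toy_graph = {
    "query": ["bridge_1"],
    "bridge_1": ["query", "bridge_2"],
    "bridge_2": ["bridge_1", "target"],
}
print(bridging_path(toy_graph, "query", "target"))
```

A chain is only as strong as its weakest link, so every step deserves the same scrutiny as a direct hit.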
We regularly attempt to increase confidence in the grey zone hit with other information, for example from the actual alignment: does the grey zone hit contain certain functionally characterized sequence motifs? It may be truncated, due to which scores are lower than expected. Other information that is worthwhile checking includes whether the grey zone hit is a bidirectional best hit to the query, whether it has been reported to execute the same function, or whether it interacts with a protein that is an ortholog of the query’s interaction partner [184]. The grey zone hit may have a similar secondary or higher-level structure as the query sequence, which can be assessed with various tools. Using this multimodal approach we were able to determine a much broader species distribution of the RPGRIP1 proteins, which are crucial components of the ciliary transition zone [185], than previously found [186].
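The bidirectional best hit check mentioned above can be sketched as follows: the candidate must be the best hit of the query in the candidate’s proteome, and the query must be the best hit of the candidate in the query’s proteome. The function names, identifiers and E-values below are hypothetical, and ranking hits by E-value is one possible choice (bit scores could be used instead).

```python
def best_hit(hits):
    """hits: list of (subject_id, evalue); return the id with the lowest E-value."""
    return min(hits, key=lambda hit: hit[1])[0] if hits else None

def is_bidirectional_best_hit(query_id, candidate_id, forward_hits, reverse_hits):
    """forward_hits: hits of the query in the candidate's proteome.
    reverse_hits: hits of the candidate in the query's proteome."""
    return (best_hit(forward_hits) == candidate_id
            and best_hit(reverse_hits) == query_id)

# Made-up example: a grey zone hit that is nevertheless reciprocal.
forward = [("candidate", 0.02), ("other_protein", 3.1)]
reverse = [("query", 0.05), ("query_paralog", 0.9)]
print(is_bidirectional_best_hit("query", "candidate", forward, reverse))
```

A reciprocal result raises confidence but does not prove orthology, which is why it is combined here with the other lines of evidence.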
[Figure 3 graphic. Species tree of the species whose sequences were used. Animals and animal-related species: Homo sapiens (Zwint-1), Oncorhynchus mykiss, Strongylocentrotus purpuratus, Capsaspora owczarzaki. Fungi: the ascomycetes Saccharomyces cerevisiae (Kre28), Zygosaccharomyces rouxii, Wickerhamomyces ciferrii, Bipolaris oryzae and Schizosaccharomyces pombe (Sos7), and Trichosporon asahii. Other eukaryotic species are collapsed. Numbers next to the species give the order of the searches. Legend: Kre28-like, Sos7-like and Zwint-1-like sequences; single similarity search, iterative similarity search, similarity search with E-value > 0.001.]

Figure 3. Orthology established via bridging sequences. Zwint-1 (human), Kre28 (budding yeast) and Sos7 (fission yeast) are orthologous to one another and belong to a single OG. This conclusion was based on successive similarity searches that indicated homology between various Zwint-1-like, Kre28-like and Sos7-like proteins of different fungal, animal and animal-related species [14]. Starting with the Sos7 protein (fission yeast, Schizosaccharomyces pombe), the ordered searches indicate which ‘bridging’ sequences were used to establish orthology to Kre28 and Zwint-1. Note that in one instance, iterative homology searching was needed, and in two others, hits with E-values higher than 0.001 were included, which can be considered grey zone hits. The species tree indicates which species’ sequences were used to establish this OG.
Iterative searches may be risky
Note that sometimes, iterative searches do not converge at all and instead continue to return new hits. In this case, non-homologous sequences might get included, a risk typically faced by e.g. highly charged proteins and by coiled-coil proteins [187-189]. Such protein sequences are prone to evolve convergently due to compositional bias in their amino acid frequencies or simple, repeated sequence motifs. In fact, inferring common descent for such sequences is notoriously difficult [187], and therefore we think that sets of orthologs that are only based on sequence similarity of coiled-coil regions should be treated with suspicion. However, if a candidate sequence shares a coiled-coil region with the protein of interest in addition to other putative homologous regions, the coiled-coil may serve as additional evidence for their homology.
Gene trees may be incorrect
In each of the scenarios discussed, the evolutionary analysis is aided by making a gene tree, and a gene tree is particularly essential in the scenarios involving duplications. What sort of problems might a gene tree encounter? Quite regularly, the tree suffers from poor statistical support, as indicated by low bootstrap values, approximate likelihood-based measures or posterior probabilities. These undermine the tree’s reliability as a representation of the proteins’ evolutionary history. Another, often coinciding reason for distrusting the tree could be that it is hard to reconcile with the species tree. Such reconciliation might require an inconceivable number of duplications and losses, or multiple horizontal gene transfers. Note that, even for proteins that are present in single copy in most of the species studied, the gene tree in fact seldom exactly follows the species tree [190]. Problematic trees may arise if the aligned sequences are not homologous across the full length, if sequences are very short and hence contain little information, or if they are highly divergent, which may lead to so-called long-branch attraction (the tendency of rapidly evolving sequences to cluster together in the gene tree, although they are in reality not closely related) [191, 192]. The first problem can easily be solved by selecting the homologous positions of the multiple sequence alignment, for which specific tools also exist [193]. The other problems are more difficult to overcome, although it may help to add more sequences, particularly ones that increase the phylogenetic diversity. Moreover, sequences that align poorly and/or have very long branches could be removed from the multiple sequence alignment, and possibly replaced with a more conserved ortholog from a closely related species.
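Selecting alignment positions can be approximated, very crudely, by discarding columns that consist mostly of gaps. The sketch below assumes a gap-fraction criterion with an arbitrary threshold of 0.5; the dedicated tools referred to in [193] use more refined criteria, and the function name and toy alignment are hypothetical.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Keep alignment columns whose gap fraction is at most max_gap_fraction.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("aligned sequences must have equal length")
    keep = [
        i for i in range(length)
        if sum(alignment[n][i] == "-" for n in names) / len(names) <= max_gap_fraction
    ]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}

# Toy alignment, invented for this example.
toy = {
    "seqA": "MK-LV--T",
    "seqB": "MKALV--S",
    "seqC": "MR-LIG-T",
}
print(trim_gappy_columns(toy))
```

Gap content is only a proxy for doubtful homology, so the trimmed alignment should still be inspected by eye before tree building.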
Mixed scenarios, multiple problems
In contrast to the relatively clear outlines of the previously discussed scenarios, reality might be a mix. For example, in investigating the evolution of the plant Polycomb repressive complex 1 (PRC1) component EMF1, through sensitive sequence analysis we first discovered multiple flowering plant-specific duplicates before we discovered divergent orthologs in gymnosperms [194]. Here, lineage-specific duplications and ‘invisible’ orthologs thus conspired to make the previous inferences in the literature flawed. Moreover, even proteins with common domains, such as certain kinases, might have diverged so extensively that the initial gene tree is very likely to lack orthologs. As a result, some lineages may incorrectly appear to have lost the protein. In this case, again, iterative similarity searching with a sequence profile might help to find sequences that should be added to the tree. Such a search is most sensitive if it starts with an OG-specific profile, i.e. a profile that is based on the multiple sequence alignment of the OG that was already identified from an earlier tree.
Truly absent, truly present?
Quite often, an evolutionary analysis yields unexpected absences of orthologs in certain species, or unexpected presences. Are these unexpected observations true or not? Absences may be false if the similarity searches fail to detect homologous sequences, or if the predicted proteome or assembled genome is incomplete. We found that protein-coding sequences are quite frequently not predicted, or predicted incompletely [195]. It might therefore pay off to search for homologous sequences in the (six-frame translated) genomic DNA. Less frequently, unexpected presences are observed, for example if a protein that seemed eukaryote-specific is found in a few bacteria. These presences could be true, for example resulting from a eukaryote-to-bacterium horizontal gene transfer. They may also be false, for example if the bacterial genome is contaminated with eukaryotic sequences [196]. It is not always easy to distinguish between these possibilities, because A) contamination can be difficult to prove, and B) although horizontal transfer is rare in eukaryotes, it does occur [197].
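To make the six-frame translation mentioned above concrete, the following sketch translates a DNA fragment in all three forward and all three reverse-complement frames using the standard genetic code. It is an illustration only: the function names are ours, and in practice one would normally let a translated search program such as TBLASTN do this step rather than translate by hand.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate dna starting at its first base; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frame_translations(dna):
    """Return the six translations keyed by frame label (+1..+3, -1..-3)."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

Stop codons appear as `*`, so candidate coding stretches show up as long runs without a stop in one of the six frames.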
When and from where did my protein originate?
Above, we referred to ‘the origin of a protein’ mostly as the position on the species tree at which the protein came into existence. Logically, errors in the proposed evolutionary history of a protein, as discussed above, may also result in errors in the inferred origin. A protein may be estimated to be too young due to undetected homologs in more distantly related species, which is mainly observed in the taxonomically limited scenario. In some instances, the age of a protein might simply be underestimated because crucial genomes that point to an older age are not (yet) available. The TPR domain of the protein kinase Mps1 had been inferred to have originated in the ancestor of deuterostomes, and its presence in the oomycete Albugo laibachii was regarded as contamination or horizontal gene transfer [198]. However, when additional species were included, this TPR domain was also found in various other clades across the eukaryotic tree of life (Figure 4). Hence, it likely was already present in LECA but lost in multiple eukaryotic lineages in parallel [199]. In other cases, a protein may be estimated to be too old if a wide taxonomic distribution is actually due to horizontal gene transfer. Incorrect estimation of a protein’s age may of course also result from an incorrect species tree. Sometimes we are not just interested in the time point of origin, but also in the ‘source’ of its origin: where did the protein come from? Was it invented de novo? Did it arise by duplication, and if so, what may have been the function of the pre-duplication protein? Does the protein have prokaryotic homologs, and may it thereby be derived from these prokaryotes? Determining this ‘source’ may not be so straightforward. For example, some proteins previously identified as eukaryote-specific turned out to have homologs in prokaryotes after all (Table 1), and other core eukaryotic proteins may seem to have been invented de novo just prior to LECA, but do in fact have eukaryotic homologs, hence they arose by a pre-LECA duplication. In these cases, after speciation (from prokaryotes) or duplication (in the pre-LECA lineage), the sequences evolved rapidly, so that sequence similarity searches sometimes fail to uncover these distantly related homologs.

[Figure 4 graphic: two species trees rooted at LECA, labelled ‘Mps1-TPR study with limited species set’ and ‘Mps1-TPR study with expanded species set’, with deuterostomes and protostomes indicated and the label ‘contamination? transfer?’ at Albugo laibachii. Legend: gain/presence of the Mps1 protein, loss of the Mps1 protein, gain/presence of the TPR domain, loss of the TPR domain; LECA: last eukaryotic common ancestor.]
Figure 4. More ancient origin of a protein’s domain after addition of species. Left panel: Evolutionary analysis of the protein kinase Mps1 initially showed that this kinase was likely ancient, present in LECA, and seemed to have fused to a TPR domain recently. This TPR domain was inferred to have fused to the kinase in the common ancestor of deuterostomes [198]. The presence of this domain in the Mps1 protein of Albugo laibachii was hypothesized to result from contamination of the genome or from horizontal gene transfer. Right panel: After genomes of other species were included in the analysis, various early branching species turned out to possess the TPR domain in their Mps1 proteins [199]. Hence, the Mps1 protein of LECA likely already had this domain. As a result, it must have been lost in various eukaryotic clades. The presented phylogenies are species trees with the branches colored according to the eukaryotic supergroups to which the species belong. Purple: Opisthokonta (animals and fungi), blue: Amoebozoa, green: Archaeplastida (land plants, algae), red: SAR (Stramenopila, Alveolata and Rhizaria), orange: Excavata (see Box 3). The species that were important for inferring the ancient origin of Mps1’s TPR domain are indicated in bold.
Box 4. A species selection for eukaryotes
We consider it useful to work with a selection of genomes as a search database, because this often gives more comprehensible results. Using such a subset prevents the sequence similarity search output from showing only closely related species, which may be the case in large databases such as ‘nr’, the non-redundant protein database from NCBI [15]. A smaller selection also facilitates checking for species that lack hits. We always suggest including the species of the query in this database, because the search results can then reveal lineage-specific or ancient duplications. Moreover, we suggest including species that represent the diversity of the taxon that one studies, such as eukaryotes. In doing so, one should make sure that all major clades are represented by multiple species. Ideally, one includes for each of these major clades species that do not have strongly reduced genomes. For this reason, we recommend, for example, avoiding having only pathogenic species represent a given clade. Such species often lack orthologs of the protein of interest, and may therefore lead to erroneous conclusions, for example about the age of a protein (‘When and from where did my protein originate?’). Moreover, it may be helpful to include model organisms if these harbor additional experimental data, which often applies to budding and fission yeast, and sometimes to Arabidopsis thaliana. Experimental data may give hints about conservation of function, and/or indicate candidate orthologs (‘Exploring and validating grey zone hits’). Although for specific proteins other subsets may be more useful (‘Using a tailor-made database yields more comprehensible and more interesting results’), as an initial search database we suggest the species in this table. Of course, if any species not in this list is likely to be relevant for the protein of interest, one may add it or let it replace closely related species.

Species                        Taxon              Eukaryotic supergroup  Taxonomy ID
Homo sapiens                   Chordata           Opisthokonta           9606
Xenopus tropicalis             Chordata           Opisthokonta           8364
Drosophila melanogaster        Arthropoda         Opisthokonta           7227
Salpingoeca rosetta            Choanoflagellida   Opisthokonta           946362
Saccharomyces cerevisiae       Ascomycota         Opisthokonta           4932
Spizellomyces punctatus        Chytridiomycota    Opisthokonta           109760
Thecamonas trahens             Apusozoa           -                      529818
Acanthamoeba castellanii       Longamoebia        Amoebozoa              5755
Dictyostelium discoideum       Mycetozoa          Amoebozoa              44689
Amborella trichopoda           Streptophyta       Archaeplastida         13333
Klebsormidium flaccidum        Streptophyta       Archaeplastida         3175
Chlamydomonas reinhardtii      Chlorophyta        Archaeplastida         3055
Bathycoccus prasinos           Chlorophyta        Archaeplastida         41875
Cyanidioschyzon merolae        Rhodophyta         Archaeplastida         45157
Ectocarpus siliculosus         Stramenopila       SAR                    2880
Phytophthora infestans T30-4   Stramenopila       SAR                    403677
Plasmodium falciparum          Alveolata          SAR                    5833
Paramecium tetraurelia         Alveolata          SAR                    5888
Plasmodiophora brassicae       Rhizaria           SAR                    37360
Naegleria gruberi              Discoba            Excavata               5762
Bodo saltans                   Discoba            Excavata               75058
Giardia intestinalis           Metamonada         Excavata               5741
Trichomonas vaginalis          Metamonada         Excavata               5722

The suggested species can be selected when using BLASTP or phmmer online, or their proteome sequences can be downloaded from NCBI when running similarity searches locally. The species are colored according to the supergroup to which they belong (Box 3).

Using a tailor-made database yields more comprehensible and more interesting results
Since public databases nowadays contain tons of sequence data, it is computationally and practically quite a challenge to interpret the results of a simple similarity search. Furthermore, this output does not reveal which species lack a hit. For this reason, in the previous section we recommended selecting a subset of species as a search database (see also Box 4). As an alternative, we prefer to use a tailored, local proteome database of which we know the species and the (approximate) species tree beforehand. Although it may seem like quite some work, we think it pays off if one intends to study the evolution of multiple proteins. An in-house database also facilitates the detection of co-evolution among different proteins (Table 1). The database may be revised or recompiled if a protein turns out to have a particularly interesting function or evolutionary history in a specific clade. Various studies demonstrated that zooming in on the species tree and adding species at key phylogenetic positions yields highly interesting patterns, such as when the protein turns out to be present in species having a certain (cellular) biological feature. A nice example is provided by the centromeric histone variant CenH3, a protein that was found to be absent from species that have a specific type of centromere (a so-called holocentromere, which runs along the length of the chromosome) in the insect lineage [101].
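One practical benefit of knowing the species in the search database beforehand is that species without any hit can be listed explicitly. The sketch below shows this bookkeeping under simple assumptions: hits are given as (species, sequence identifier) pairs, the species names are taken from Box 4, and the function name and the hits themselves are invented for illustration.

```python
from collections import Counter

def summarize_hits(database_species, hits):
    """Summarize a similarity search per species.

    database_species: list of all species in the search database.
    hits: iterable of (species, sequence_id) pairs returned by the search.
    Returns (species_without_hits, hit_counts).
    """
    hit_counts = Counter(species for species, _ in hits)
    missing = [s for s in database_species if hit_counts[s] == 0]
    return missing, hit_counts

database_species = [
    "Homo sapiens",
    "Saccharomyces cerevisiae",
    "Dictyostelium discoideum",
    "Naegleria gruberi",
    "Giardia intestinalis",
]
# Made-up hits, for illustration only.
hits = [
    ("Homo sapiens", "hypothetical_seq_1"),
    ("Homo sapiens", "hypothetical_seq_2"),
    ("Saccharomyces cerevisiae", "hypothetical_seq_3"),
    ("Naegleria gruberi", "hypothetical_seq_4"),
]
missing, hit_counts = summarize_hits(database_species, hits)
print("No hit in:", missing)
print("Multiple hits in:", [s for s in database_species if hit_counts[s] > 1])
```

Species with no hit are candidates for loss or for a missed divergent ortholog, while species with several hits point to possible duplications that a gene tree should resolve.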
Conclusions & Outlook
With this article, we hope to have given relevant insights into our approaches to evolutionary analysis, and we hope that these insights serve as a guide for doing this analysis. Essentially, it is most often a process that entails many feedback loops that revise the initial evolutionary scenario. This process is often unfinished: new data that become available, such as a resolved 3D structure or newly sequenced genomes, may alter the scenario, as may technological advances, such as in homology detection and improved tree-building models and algorithms. The manually solved evolutionary history should therefore be considered the best estimate at the moment, not necessarily the definitive one. To address this uncertainty, we suggest that authors report doubtful cases: which set of orthologs possibly contains false positives or false negatives, and why? Which evolutionary events may require reexamination?
Various studies demonstrated that new information or new tools revise the evolutionary history of a protein. For example, genomes of Archaea showed that proteins previously labeled eukaryote-specific are in fact older than eukaryotes [200]. Very likely, certain scientific trends, like the increasing diversity of available genomes and evolutionary cell biology, will bring in information that alters existing scenarios for the evolution of certain proteins. Various evolutionary biologists called for a better representation of eukaryotic diversity by studying and sequencing non-parasitic species, including unicellular heterotrophs [201, 202]. Indeed, the first sequenced non-parasitic excavate (Box 3), Naegleria gruberi, turned out to contain many more genes present in other eukaryotic lineages than parasitic excavates do, and as a result many of these genes could be inferred to have been present in LECA [203]. In light of the up-and-coming field of ‘evolutionary cell biology’, more non-model organisms will be studied at the cellular and molecular level, which allows us to validate our evolutionary predictions. Does the predicted ortholog indeed fulfill the same function in this organism, as postulated by the ortholog conjecture? Or, if no ortholog was found in the genome of this species, does it have another, analogous protein fulfilling this role? By answering these and related questions, evolutionary cell biology will shed light on the association between the evolution of proteins and the evolution of function.
The common efforts of many researchers generated detailed hypotheses on the evolution of a wide array of proteins. Likely, many cell biologists would be interested in quickly retrieving this information. Unfortunately, these data are now often hidden in research articles. If we and others shared manually defined orthologs on a wider, homogeneously formatted platform, these would be easier to access. This does not necessarily need to be a new platform: maybe our research community could help to improve existing databases such as EggNOG, PANTHER or TreeBASE by adding the knowledge of the proteins we studied [34, 168, 204, 205].
Acknowledgments
We kindly thank Carlos Sacristan, Simona Antonova and Mathilde Galli for their critical and useful feedback on our manuscript.

Author contributions
BS designed the manuscript. JJEH, ET, TJPD, GJPLK and BS wrote the manuscript.
3. Evolutionary dynamics of the kinetochore network as revealed by comparative genomics
Jolien JE van Hooff, Eelco Tromer, Leny M van Wijk, Berend Snel# and Geert JPL Kops#
# joint senior authors
EMBO Reports, 2017
Abstract
During eukaryotic cell division, the sister chromatids of duplicated chromosomes are pulled apart by microtubules, which connect via kinetochores. The kinetochore is a multiprotein structure that links centromeres to microtubules, and that emits molecular signals in order to safeguard the equal distribution of duplicated chromosomes over daughter cells. Although microtubule-mediated chromosome segregation is evolutionarily conserved, kinetochore compositions seem to have diverged. To systematically inventory kinetochore diversity and to reconstruct its evolution, we determined orthologs of 70 kinetochore proteins in 90 phylogenetically diverse eukaryotes. The resulting ortholog sets imply that the last eukaryotic common ancestor (LECA) possessed a complex kinetochore and highlight that current-day kinetochores differ substantially. These kinetochores diverged through gene loss, duplication and, less frequently, invention and displacement. Various kinetochore components co-evolved with one another, albeit in different manners. These co-evolutionary patterns improve our understanding of kinetochore function and evolution, which we illustrated with the RZZ complex, TRIP13, the MCC and some nuclear pore proteins. The extensive diversity of kinetochore compositions in eukaryotes poses numerous questions regarding the evolutionary flexibility of essential cellular functions.
Key words: kinetochore, co-evolution, eukaryotic diversity, gene loss, evolutionary cell biology
Introduction
During mitotic cell division, eukaryotes physically separate duplicated sister chromatids using microtubules within a bipolar spindle. These microtubules pull the sister chromatids in opposite directions, toward the spindle poles from which they emanate [206]. Current knowledge indicates that all eukaryotes use microtubules for chromosome separation, suggesting that the last eukaryotic common ancestor (LECA) also did. Microtubules and chromatids are connected by the kinetochore, a multi-protein structure that is assembled on the centromeric chromatin [207, 208]. Functionally, the kinetochore proteins can be subdivided into three main categories: proteins that connect to the centromeric DNA (inner kinetochore), proteins that connect to the spindle microtubules (outer kinetochore), and proteins that perform signaling functions at the kinetochore in order to regulate chromosome segregation. These signaling functions consist of the spindle assembly checkpoint (SAC), which prevents sister chromatids from separating before all have stably attached to spindle microtubules, and attachment error correction, which ensures that these sister chromatids are attached by microtubules that emanate from opposite poles. Together, the SAC and error correction machineries ensure that both daughter cells acquire a complete set of chromosomes.
Although microtubule-mediated chromosome segregation is conserved across eukaryotes, their mitotic mechanisms differ. For example, some species, such as those in animal lineages, disassemble the nuclear envelope during mitosis (‘open mitosis’), while others, such as yeasts, completely or partially maintain it (‘(semi-)closed mitosis’) [209]. Species differ also in their kinetochore composition, both in the inner and in the outer kinetochore. For example, Drosophila melanogaster and Caenorhabditis elegans lack most components of the constitutive centromere-associated network (CCAN), a protein network in the inner kinetochore. In the outer kinetochore, diverse species employ either the Dam1 (e.g. various Fungi, Stramenopila and unicellular relatives of Metazoa) or the Ska complex (most Metazoa and Viridiplantae and some Fungi) for tracking depolymerizing microtubules [210]. The kinetochore of the Excavate species Trypanosoma brucei mostly consists of proteins that do not seem homologous to the ‘canonical’ kinetochore proteins [113, 116]. Studying the evolution of kinetochore proteins revealed how kinetochore diversity was shaped by different modes of genome evolution: The inner kinetochore CenpB-like proteins were recurrently domesticated from transposable elements [211], the outer kinetochore protein Knl1 displays recurrent repeat evolution [212], SAC proteins Bub1/BubR1/Mad3 (MadBub) duplicated and subfunctionalized multiple times in eukaryotic evolution [143, 144] and the SAC protein p31comet was recurrently lost [213].
Prior comparative genomics studies reported on kinetochore compositions in eukaryotes [213, 214]. These studies raised various questions, such as: Are kinetochores in general indeed highly diverse? How often do kinetochore proteins evolve in a recurrent manner in different lineages? How frequent is loss of kinetochore proteins? Does the kinetochore consist of different evolutionary modules? To address these and other questions, we studied the eukaryotic diversity of the kinetochore by scanning a large and diverse set (90) of eukaryotic genomes for the presence of 70 kinetochore proteins. We deduced the kinetochore composition of LECA and shed light on how, after LECA, eukaryotic kinetochores diversified. To understand this evolution functionally, we detected co-evolution among kinetochore complexes, proteins and short linear motifs: Co-evolving kinetochore components are likely functionally interdependent. Furthermore, we found that certain species contain as yet unexplained kinetochore compositions, such as absences of proteins that are crucial in model organisms. We nominate such species for further investigation into their mitotic machineries.
Results
Eukaryotic diversity in the kinetochore network
We selected 70 proteins that compose the kinetochore (see Materials and Methods). For comparison, we also included proteins that constitute the Anaphase-Promoting Complex/Cyclosome (APC/C), which is targeted by kinetochore signaling. We identified orthologous sequences of these kinetochore and APC/C proteins in 90 diverse eukaryotic lineages by performing in-depth homology searches. Our methods were aimed at maximizing detection of a protein’s orthologs even if it evolves rapidly, which is the case for many kinetochore proteins (as we discuss below). The resulting sets of orthologous sequences are available (Dataset 1). We projected the presences and absences of proteins (‘phylogenetic profiles’) across eukaryotes (Figure 1, Materials and Methods). In spite of our thorough homology searches, for some proteins the ortholog in a given species might have diverged too extensively to be recognized, resulting in a ‘false’ absence. We nevertheless think that, globally, our analysis gives an accurate representation of kinetochore proteins in eukaryotes (Discussion).

[Figure 1 graphic: presence/absence matrix. Columns: the 90 species, grouped by supergroup (Opisthokonta, Amoebozoa, Excavata, Stramenopila-Alveolata-Rhizaria, Archaeplastida); the individual species labels are not recoverable from the extracted text. Rows, in clustered order, with • marking proteins inferred to have been present in LECA: Mad2 •, MadBub •, Borealin •, Cdc20 •, Spc24 •, Zwint-1 •, Mad1 •, Mps1 •, Knl1 •, Nnf1 •, Mis12 •, Nsl1 •, Dsn1 •, Bub3 •, ZW10 •, CenpX •, CenpS •, Spc25 •, CenpC •, CenpA •, Ndc80 •, Nuf2 •, Zwilch •, Rod •, Spindly, ARHGEF17, Survivin •, Cep57 •, Plk •, CenpI •, CenpN •, CenpP •, CenpH •, CenpK •, CenpT •, CenpL •, CenpO •, CenpW •, CenpU, CenpQ, CenpF, CenpM •, Astrin, SKAP, CenpR, Apc15 •, Sgo •, Ska3 •, Ska1 •, Ska2 •, BugZ •, TRIP13 •, p31comet •, CenpE •, Aurora •, Dad2, Duo1, Spc19, Hsk3, Spc34, Dad3, Dad4, Ask1, Dam1, Dad1, Ctf13, Ndc10, Cep3, Incenp •, Skp1 •. Legend: present, absent.]
Figure 1. The kinetochore network across 90 eukaryotic lineages. Presences and absences (“phylogenetic profiles”) of 70 kinetochore proteins in 90 eukaryotic species. Top: Phylogenetic tree of the species in the proteome set, with colored areas for the eukaryotic supergroups. Left side: Kinetochore proteins clustered by average linkage based on the pairwise Pearson correlation coefficients of their phylogenetic profiles. Protein names have the same colors if they are members of the same complex. Proteins inferred to have been present in LECA are indicated (•). The orthologous sequences (including sets of APC/C subunits, NAG, RINT1, HORMAD, Nup106, Nup133, Nup160) are available as fasta files in Dataset 1, allowing full usage of our data for further evolutionary cell biology investigations.
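Phylogenetic profiles of this kind can be compared with the Pearson correlation coefficient, the measure on which the clustering in Figure 1 is based. The sketch below computes it for toy presence/absence vectors; the protein labels and profiles are hypothetical, and the average-linkage clustering step itself is not shown.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long presence/absence profiles (1/0)."""
    if len(x) != len(y):
        raise ValueError("profiles must cover the same species")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a protein present (or absent) everywhere
    return cov / math.sqrt(var_x * var_y)

# Toy profiles across six hypothetical species (1 = present, 0 = absent).
profiles = {
    "protA": [1, 1, 0, 0, 1, 0],
    "protB": [1, 1, 0, 0, 1, 0],  # same pattern as protA
    "protC": [0, 0, 1, 1, 0, 1],  # mirror image of protA
    "protD": [1, 1, 1, 0, 1, 0],  # differs from protA in one species
}
for name in ("protB", "protC", "protD"):
    print("protA vs", name, round(pearson(profiles["protA"], profiles[name]), 2))
```

Identical profiles score 1 and mirror-image profiles score -1, so proteins that are gained and lost together end up close to each other when the scores are used for clustering.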
[Figure EV1 graphic: presence/absence matrix. Columns: the 90 species, grouped by supergroup (Opisthokonta, Amoebozoa, Excavata, Stramenopila-Alveolata-Rhizaria, Archaeplastida); the individual species labels are not recoverable from the extracted text. Rows, in clustered order: Cdh1, Apc8, Apc6, Apc2, Apc3, Apc5, Apc4, Apc13, Apc7, Apc1, Apc16, Apc15, Apc12, Cdc20, Apc10, Apc11, Apc9. Legend: present, absent.]
Figure EV1. Anaphase-promoting complex/cyclosome (APC/C) subunits across 90 eukaryotic lineages. Presences and absences (“phylogenetic profiles”) of APC/C subunits in 90 eukaryotic species. Top: Phylogenetic tree of the species in the genome set, with colored areas for the eukaryotic supergroups. Left side: APC/C proteins clustered by average linkage based on the pairwise Pearson correlation coefficients of their phylogenetic profiles. The orthologous sequences are available as fasta files in Dataset 1, allowing full usage of our data for further evolutionary cell biology investigations.

We inferred the evolutionary histories of the proteins by applying Dollo parsimony, which allows only for a single invention and infers subsequent losses based on maximum parsimony. Of the 70 kinetochore proteins, 49 (70%) were inferred to have been present in LECA (Figure 1, Figure 2A, C). CenpF, Spindly and three subunits of the CenpO/P/Q/R/U complex probably originated more recently. The Dam1 complex likely originated in early fungal evolution and may have propagated to non-fungal lineages via horizontal gene transfer [210].
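The Dollo parsimony reasoning used in this section can be sketched in a few lines: the single invention is placed at the last common ancestor of all species that have the protein, and every maximal clade below that ancestor without the protein counts as one loss. The code below is a simplified illustration, not the implementation used for the analysis; the nested-tuple tree format, the function names and the five species labels are assumptions made for the example.

```python
def leaves(node):
    """Return the set of leaf names under node (leaf = str, clade = tuple)."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node))

def dollo(tree, present):
    """Return (leaves of the clade of origin, number of losses) under Dollo parsimony."""
    present = set(present) & leaves(tree)
    if not present:
        return None, 0
    # Walk down to the last common ancestor of all species with the protein.
    node = tree
    while not isinstance(node, str):
        holders = [child for child in node if leaves(child) & present]
        if len(holders) > 1:
            break
        node = holders[0]

    def count_losses(n):
        # A clade without any presence is explained by one loss at its base.
        if not leaves(n) & present:
            return 1
        if isinstance(n, str):
            return 0
        return sum(count_losses(child) for child in n)

    return frozenset(leaves(node)), count_losses(node)

# Hypothetical species tree with five species.
tree = ((("A", "B"), "C"), ("D", "E"))
print(dollo(tree, {"A", "B"}))       # origin in the ancestor of A and B, no losses
print(dollo(tree, {"A", "B", "E"}))  # origin at the root, losses in C and D
```

Adding a single distant species with the protein moves the inferred origin to the root and converts the other absences into losses, which is why taxon sampling affects the inferred age of a protein so strongly.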
[Figure 2 graphic: four kinetochore schematics. A: Homo sapiens. B: Tetrahymena thermophila. C: Saccharomyces cerevisiae. D: Cryptococcus neoformans. Each schematic is divided into outer kinetochore, inner kinetochore, centromeric DNA and inner centromere. Legend: protein in LECA, colored by occurrence frequency (0-0.4, 0.4-0.6, 0.6-0.8, 0.8-1.0); protein not in LECA; protein present; protein absent.]
Figure 2. Kinetochores of model and non-model species. A. The human kinetochore. The colors of the proteins indicate if they were inferred to be present in LECA and their occurrence frequency across eukaryotes (see Materials and Methods). B. The predicted kinetochore of Tetrahymena thermopila projected onto the human kinetochore. C. The budding yeast kinetochore. Similar to panel (B). D. The predicted kinetochore of Cryptococcus neoformans projected onto the budding yeast kinetochore.
Kinetochore proteins are less conserved than APC/C subunits (Figure EV1, Appendix Table S1, [215]). Species on average possess 48% of the kinetochore proteins, compared to 70% of the APC/C subunits. Species that we predict to contain relatively few kinetochore proteins include Tetrahymena thermophila (Figure 2B) and Cryptococcus neoformans (Figure 2D). Some kinetochore proteins are absent from many different lineages, likely resulting from multiple independent gene loss events. We counted losses of kinetochore and APC/C proteins during post-LECA evolution using Dollo parsimony. On average, kinetochore proteins were lost 16.5 times since LECA, while APC/C proteins were lost 13.1 times (not significantly different for kinetochore vs. APC/C). Our homology searches hinted that some kinetochore proteins also evolve rapidly at the sequence level. The kinetochore proteins indeed have relatively high dN/dS values, a common measure for sequence evolution: when comparing mouse and human gene sequences, kinetochore proteins scored an average dN/dS of 0.24, compared to 0.06 for the APC/C proteins (p=0.0016) and 0.15 for all human proteins (p=4.8e-5). The loss frequency and sequence evolution seem to be correlated, suggesting a common underlying cause for poor conservation (Figure EV2, Discussion). Overall, the kinetochore seems to evolve more flexibly than the APC/C.
[Figure EV2: two scatter plots of kinetochore and APC/C proteins. Panel A: dN/dS human-mouse (y-axis, 0.0 to 0.6) against losses (x-axis, 0 to 35). Panel B: % identity human-mouse (y-axis, 40 to 100) against losses (x-axis, 0 to 35).]
Figure EV2. Loss frequencies and sequence evolution of kinetochore and APC/C proteins. A, B. Scatter plots for loss frequencies and dN/dS values (A) and percent identity (B) of human–mouse orthologs for the kinetochore and APC/C proteins that were inferred to have been present in LECA. Loss frequencies and dN/dS values positively correlate (P = 3.9e-5, Spearman correlation), whereas loss frequencies and percent identity negatively correlate (P = 0.0005, Spearman correlation).
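The loss counting by Dollo parsimony mentioned above can be illustrated with a minimal sketch. It assumes the protein was present at the root (LECA) and counts one loss for every maximal clade in which no species has the protein. The toy tree, the species labels and the function names are hypothetical and are not taken from the thesis.

```python
# Toy species tree as nested tuples; leaves are hypothetical species labels.
TREE = ((("sp_A", "sp_B"), "sp_C"), (("sp_D", "sp_E"), ("sp_F", "sp_G")))


def clade_has_gene(node, present):
    """True if at least one leaf below this node carries the gene."""
    if isinstance(node, str):
        return node in present
    return any(clade_has_gene(child, present) for child in node)


def count_dollo_losses(node, present):
    """Count losses under Dollo parsimony, assuming presence at the root.

    A clade without any presence is explained by a single loss on the
    branch leading to it; clades with a presence are searched further.
    """
    if not clade_has_gene(node, present):
        return 1
    if isinstance(node, str):
        return 0
    return sum(count_dollo_losses(child, present) for child in node)


# Hypothetical presence set: the gene is found in three of seven species.
losses = count_dollo_losses(TREE, {"sp_A", "sp_D", "sp_E"})
print(losses)
```

In this toy case the absences in sp_B, in sp_C and in the clade (sp_F, sp_G) each count as one loss. The sketch recomputes clade presence at every level, which is fine for small trees but would need caching for large ones.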
We not only mapped the presences and absences of kinetochore proteins, we also counted their copy number in each genome (Figure EV3). As observed before, MadBub and Cdc20 are often present in multiple copies. These proteins likely duplicated in different lineages and subsequently the resulting paralogs subfunctionalized [143, 144, 213]. CenpE, Rod, Survivin, Sgo and the mitotic kinases Aurora and Plk also have elevated copy numbers. Possibly these proteins also underwent (recurrent) duplication and subfunctionalization, as for example suggested for Sgo: In the lineages of Schizosaccharomyces pombe, Arabidopsis thaliana and mammals, Sgo duplicated and likely subsequently subfunctionalized in a recurrent manner [216-218].
[Figure EV3: heatmap of copy numbers (color scale 0, 2, 4, 6, 8, ≥10). Columns: the species tree of the 90 genomes, grouped by eukaryotic supergroup. Rows: the kinetochore proteins.]
Figure EV3. Copy numbers of kinetochore proteins. Heatmap indicating the copy numbers of each kinetochore protein in the 90 eukaryotic lineages. Please note that these copy numbers might contain some over- and underestimates due to unpredicted or imperfectly predicted genes and database errors.
Co-evolution within protein complexes of the kinetochore
Subunits of a single kinetochore complex tend to co-occur across genomes: they have similar patterns of presences and absences (‘phylogenetic profiles’, Figure 1A). Such co-occurring subunits likely co-evolved as a functional unit [28]. To quantify how similar phylogenetic profiles are, we calculated the Pearson correlation coefficient (r) for each kinetochore protein pair. We defined a threshold of r=0.477 for protein pairs likely to be interacting, based on the scores among established interacting kinetochore pairs (Appendix Figure S1). All pairwise scores were used to cluster the proteins (Figure 1 including Dataset 1 and Dataset 2) and to visualize the proteins using t-Distributed Stochastic Neighbor Embedding (t-SNE, Appendix Figure S2) [12]. Many established interacting proteins correlate well and, as a result, cluster together and are in close proximity in our t-SNE map. Examples include the SAC proteins Mad2 and MadBub, centromere proteins (CENPs) located in the inner kinetochore (discussed below), the Ska complex and the Dam1 complex. Such complexes, with subunits having highly similar phylogenetic profiles, evolved as a functional unit.
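As a minimal sketch of the profile comparison described above, the following computes the Pearson correlation coefficient between binary presence/absence profiles and reports the pairs that reach the threshold of r=0.477 given in the text. The protein names and the profiles are invented for illustration; only the threshold value comes from the text.

```python
from itertools import combinations
from math import sqrt

THRESHOLD = 0.477  # threshold stated in the text for likely interacting pairs

# Hypothetical presence (1) / absence (0) profiles over eight species.
profiles = {
    "ProtA": [1, 1, 0, 0, 1, 0, 1, 1],
    "ProtB": [1, 1, 0, 0, 1, 0, 1, 0],
    "ProtC": [0, 0, 1, 1, 0, 1, 0, 0],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a protein present everywhere or nowhere
    return cov / sqrt(var_x * var_y)


# Collect all pairs whose profile similarity reaches the threshold.
candidate_pairs = [
    (p, q)
    for p, q in combinations(sorted(profiles), 2)
    if pearson(profiles[p], profiles[q]) >= THRESHOLD
]
print(candidate_pairs)
```

In this toy set ProtC is the exact inverse of ProtA, so their correlation is -1, the kind of inverse profile that is relevant when looking for functional analogs.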
While some kinetochore proteins have highly similar phylogenetic profiles, others lack similarity, pointing to a more complex interplay between evolution and function. First, two proteins might have strongly dissimilar, or inverse, phylogenetic profiles, potentially because they are functional analogs [132]. In the kinetochore network, phylogenetic dissimilarity is observed for proteins of the Dam1 complex and of the Ska complex, which are indeed analogous complexes [210, 219, 220]. Second, proteins that do interact in a complex might nevertheless have little similarity in their phylogenetic profiles. Either such a complex did not evolve as a functional unit because its subunits started to interact only recently [221], or one of its subunits serves a non-kinetochore function and thus also co-evolves with non-kinetochore proteins [222]. An example of a potentially recently emerged interaction is BugZ-Bub3: these proteins form a kinetochore complex in human [223, 224], but have little similarity in their phylogenetic profiles, as measured by their low correlation (r=0.187). In general, BugZ’s phylogenetic profile is different from those of other kinetochore proteins, hence this protein might have been added to the kinetochore recently [225, 226]. An example of a kinetochore protein that co-evolves with non-kinetochore proteins is ZW10, which joins Rod and Zwilch in the RZZ complex. The phylogenetic profile of ZW10 is dissimilar from those of Rod and Zwilch (r=0.218 for Rod, r=0.236 for Zwilch), while those are very similar to each other (r=0.859, Figure 3), due to ZW10 being present in various species that lack Rod and Zwilch. In those species, ZW10 might not localize to the kinetochore but perform only its role in vesicular trafficking, in a complex with NAG and RINT1 (NRZ complex [227]). Indeed, the ZW10 phylogenetic profile is much more similar to those of NAG (r=0.644) and RINT1 (r=0.512) than to those of Rod and Zwilch. Hence, ZW10 more strongly co-evolves with NAG and RINT1.
The Rod and Zwilch phylogenetic profiles are similar to that of Spindly (r=0.730 for Rod, r=0.804 for Zwilch), a confirmed RZZ-interacting partner [228-230]. These similarities argue for an evolutionary ‘Rod-Zwilch-Spindly’ (RZS) module, rather than an RZZ module.
The phylogenetic profiles of kinetochore proteins shed new light on these proteins’ (co-)evolution and on their function, examples of which are discussed in detail below.
The CCAN evolved as an evolutionary unit that is absent from many lineages
The kinetochore connects the centromeric DNA, mainly via CenpA, to the spindle microtubules, mainly via Ndc80. In human and yeast, CenpA and Ndc80 are physically linked by the constitutive centromere-associated network (CCAN, reviewed in [231]). Physically, the CCAN comprises multiple protein complexes (Figure 2). Evolutionarily, however, it comprises a single unit, as the majority of CCAN proteins have highly similar phylogenetic profiles (Figure 1, average r=0.513). Four CCAN proteins are very different from the others: CenpC, CenpR, CenpX and CenpS. CenpC is widely present and is sufficient to assemble at least part of the outer kinetochore in D. melanogaster and humans [104, 232]. CenpR seems a recent gene invention in animals. CenpX and CenpS have a more ubiquitous distribution compared to other CCAN proteins, possibly due to their non-kinetochore role in DNA damage repair [233, 234].
Our study confirmed that most CCAN proteins have no (detectable) homologs in C. elegans and D. melanogaster. The CCAN is not only absent from these model species, but also from many other lineages, such as various animals and fungi, and all Archaeplastida. Because the CCAN is found in three out of five eukaryotic supergroups, it likely was present in LECA, and subsequently lost multiple times in diverse eukaryotic lineages. Alternatively, the CCAN was invented more recently and horizontally transferred among eukaryotic supergroups. However, under both scenarios the CCAN was recently lost in various lineages, for example in the basidiomycete fungi: while Ustilago maydis has retained the CCAN, its sister clade C. neoformans eliminated it (Figure 2D). The finding that most of the CCAN (with the exception of CenpC) is absent in many eukaryotic lineages poses questions about kinetochore architectures in these species. Since they generally possess a protein binding to the centromeric DNA (CenpA, see Figure EV4 for details on identifying the orthologs of CenpA) and a protein binding to the spindle microtubules (Ndc80), their kinetochore is not wholly unconventional. Is the bridging function of the CCAN simply dispensable, as proposed for D. melanogaster [235], or is it carried out by other, non-homologous protein complexes? In order to answer these questions, the kinetochores of diverse species that lack the CCAN should be experimentally examined in more detail.
[Figure EV4: gene tree of histone H3 homologs (tree scale: 0.1; bootstraps: 1-100), with the centromeric H3 variants and the CenpA cluster highlighted.]
Figure EV4. Gene phylogeny of histone H3 homologs. To find the putative orthologs of CenpA, we first aligned candidate orthologous sequences, which were experimentally identified centromeric H3 variants in divergent species (indicated with a pink branch in this phylogeny). From this alignment, we constructed a profile HMM and performed multiple HMM searches through our local proteome database. From these searches, we selected 831 sequences (belonging to the histone H3 family), aligned these and constructed the gene phylogeny, which is presented in this figure (see also Materials and Methods). We rooted the phylogeny on the cluster that contained all of these experimentally identified centromeric H3 variants and some additional sequences that, based on best blast hits, were also likely to be orthologous to CenpA. The cluster did not contain the candidate orthologs in Toxoplasma gondii [18]. We do not know whether this is due to an error in the gene phylogeny, or to parallel invention of a centromeric H3 variant in this species, which would mean that it is not orthologous to CenpA. Nevertheless, we included these sequences in the orthologous group. The candidate centromeric H3 variants that are part of the CenpA cluster include sequences from all five eukaryotic supergroups: Homo sapiens [30], Saccharomyces cerevisiae [37], Drosophila melanogaster [38], Caenorhabditis elegans [40], Schizosaccharomyces pombe [41] (Opisthokonta), Dictyostelium discoideum [43] (Amoebozoa), Arabidopsis thaliana [44] (Archaeplastida), Tetrahymena thermophila [45], Plasmodium falciparum [46] (SAR), Giardia intestinalis [48] and Trichomonas vaginalis [51] (Excavata). The original gene tree in newick format is provided (Dataset 3).
Absence of co-evolution between RZS and its putative kinetochore receptor Zwint-1
Various studies suggested that the RZZ/RZS complex is recruited to the kinetochore primarily by Zwint-1. Zwint-1 itself localizes to the kinetochore by binding to Knl1 [236, 237]. We compared the phylogenetic profile of Zwint-1 to the profiles of these interaction partners: RZZ/RZS and Knl1 (Figure 3). While we searched for orthologs of Zwint-1, we concluded that Zwint-1, Kre28 (S. cerevisiae) and Sos7 (S. pombe) likely belong to the same orthologous group [238, 239], collectively referred to as ‘Zwint-1’. Although these sequences are only weakly similar, they can be linked by multidirectional homology searches (Appendix Figure S3).
[Figure 3: presence/absence matrix (legend: present, absent) for NAG, ZW10, RINT1, Zwint-1, Knl1, Zwilch, Rod and Spindly, with lineage labels Metazoa, Fungi, Stramenopila, Alveolata, Embryophyta and Chlorophyta. Right: heatmap of pairwise Pearson correlation coefficients r (scale 0 to 1, threshold t marked).]
Figure 3. Phylogenetic profiles of the Rod–Zwilch–ZW10 (RZZ) complex, its mitotic interaction partners (Knl1, Zwint-1, and Spindly), and ZW10’s interphase interaction partners in the NRZ (NAG and RINT1) complex. Presences and absences across eukaryotes of the RZZ subunits, Spindly, Zwint-1, and Knl1, and of the NRZ subunits, NAG and RINT1. Colored areas indicate eukaryotic supergroups as in Figure 1. Right side: Pairwise Pearson correlation coefficients (r) between the phylogenetic profiles including a heatmap. The indicated threshold t represents the value of r for which we found a sixfold enrichment of interacting protein pairs (see Appendix Figure S1). See also Appendix Figure S3 for the procedure by which homology between Zwint-1, Sos7, and Kre28 was detected.
Our set of 90 species contains many species that possess a Zwint-1 ortholog but lack RZS (36 species), and vice versa (11 species, -0.065 < r < 0). This lack of correlation strongly suggests that, at least in a substantial number of lineages, RZZ/RZS is not recruited to kinetochores by Zwint-1, but by another, yet unidentified factor. Support for this inference was recently presented in studies using human HeLa cells [240, 241]. Compared to RZS, the phylogenetic profile of Zwint-1 is more similar to that of Knl1 (Figure 3, r=0.506), and of Spc24 and Spc25 (Figure 1, r=0.529 for Spc24, r=0.499 for Spc25), two subunits of the Ndc80 complex that are located in close proximity to Knl1-Zwint-1 [242]. Perhaps Zwint-1 stabilizes the largely unstructured protein Knl1 [241], thereby indirectly affecting the recruitment of RZZ/RZS.
Higher-order co-evolution between the AAA+ ATPase TRIP13 and HORMA domain proteins
SAC activation and SAC silencing are both promoted by the AAA+ ATPase TRIP13. TRIP13 operates by using the HORMA domain protein p31comet to structurally inactivate the SAC protein Mad2, also a HORMA domain protein (Figure 4A). Since the SAC requires Mad2 to continuously cycle between inactive and active conformations, TRIP13 enables SAC signaling in prometaphase. In metaphase, however, when no new active Mad2 is generated, TRIP13 stimulates SAC silencing [243-245]. The TRIP13 ortholog of budding yeast, Pch2, probably has a molecularly similar function in meiosis: Pch2 is proposed to bind oligomers of the HORMA domain protein Hop1 (HORMAD1 and HORMAD2 in mammals, hereafter referred to as ‘HORMAD’) and to structurally rearrange one copy within the oligomer, resulting in its redistribution along the chromosome axis. HORMAD, p31comet and Mad2 are homologous and belong to the family of HORMA-domain proteins that also includes Rev7 [246] and autophagy-related proteins Atg13 and Atg101 [247, 248]. All of these proteins likely descend from an ancient HORMA-domain protein that duplicated before LECA.
[Figure 4: panel A, schematic of TRIP13 (NTD and AAA+ domains, ATP to ADP) acting on Mad2 via p31comet and on HORMAD; panel B, presence/absence profiles of TRIP13, p31comet, HORMAD and Mad2, with lineage labels Metazoa, Fungi, Stramenopila, Alveolata, Embryophyta and Chlorophyta; panel C, numbers of concordant and discordant species for TRIP13 versus p31comet (r=0.526), versus HORMAD (r=0.517) and versus the joint HORMAD_p31comet profile (r=0.766).]
Figure 4. The co-evolutionary patterns of the multifunctional protein TRIP13. A. Model for the mode of action of TRIP13 as recently suggested [11]. By hydrolyzing ATP, TRIP13 would change the conformation of HORMAD and Mad2 from closed to open, the latter via binding to co-factor p31comet, which forms a heterodimer with Mad2. TRIP13 has a C-terminal AAA+ ATPase domain (AAA+) and an N-terminal domain (NTD) and forms a hexamer [20]. B. Presences and absences of TRIP13 and of its interaction partners p31comet and HORMAD. Colored areas indicate eukaryotic supergroups as in Figure 1. C. Numbers of lineages in which TRIP13 is present or absent, compared to the presences of p31comet, HORMAD or their joint presences. The Pearson correlation coefficient of the phylogenetic profiles as in (B) is also given.
Although the TRIP13 phylogenetic profile is relatively similar to both the profiles of p31comet (r=0.526) and HORMAD (r=0.517), TRIP13 does not co-occur with these proteins in multiple species (Figure 4B). These exceptions to the co-occurrences of TRIP13/p31comet and TRIP13/HORMAD can be explained by the dual role of TRIP13, which is to interact with both p31comet and with HORMAD. If we combine profiles of p31comet and HORMAD, the similarity with TRIP13 increases: the joint p31comet and HORMAD profile strongly correlates with the TRIP13 profile (r=0.766, Figure 4C). TRIP13 was indeed expected to co-evolve with both of its interaction partners, as has been demonstrated for other multifunctional proteins [222]. Based on the phylogenetic profiles, we conclude that TRIP13 is only retained if at least p31comet or HORMAD is present (with the exception of the diatom Phaeodactylum tricornutum). We predict that TRIP13-containing species that lost p31comet but retained HORMAD, such as S. cerevisiae and Acanthamoeba castellanii, only use TRIP13 during meiosis and not in mitosis.
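The joint profile used above can be sketched as follows, assuming that "joint" means a species counts as positive when it has at least one of the two partners (a logical OR, consistent with the conclusion that TRIP13 is retained if at least p31comet or HORMAD is present). The three profiles are invented toy data, not the profiles of Figure 4.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def joint_profile(a, b):
    """Presence in the joint profile if either partner is present (assumed OR rule)."""
    return [int(x or y) for x, y in zip(a, b)]


def discordant(a, b):
    """Number of species in which the two profiles disagree."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical profiles over eight species (1 = present, 0 = absent).
trip13 = [1, 1, 1, 0, 0, 1, 1, 0]
p31 = [1, 0, 1, 0, 0, 0, 1, 0]
hormad = [0, 1, 0, 0, 0, 1, 1, 0]

joint = joint_profile(p31, hormad)
for name, partner in (("p31comet", p31), ("HORMAD", hormad), ("joint", joint)):
    print(name, round(pearson(trip13, partner), 3), discordant(trip13, partner))
```

In this toy example each single partner disagrees with TRIP13 in two species, whereas the joint profile disagrees in none, which mirrors the direction of the effect reported in the text.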
The phylogenetic profiles of SAC proteins predict a role for nuclear pore proteins in the SAC
Because similar phylogenetic profiles reflect the functional interaction of proteins, similar phylogenetic profiles also predict such interactions. We applied this rationale by comparing the phylogenetic profiles of the kinetochore proteins (Figure 1) to those of proteins of the genome-wide PANTHER database in search of unidentified connections. PANTHER is a database of families of homologous proteins from complete genomes across the tree of life. We assigned all proteins present in our eukaryotic proteome database to these homologous families (see Materials and Methods). For each kinetochore protein in Figure 1, we listed the 30 best matching (with the highest Pearson correlation coefficient) families in PANTHER, and screened which PANTHER families occur often in these lists (Appendix Table S3). Within this list, we considered the nuclear pore protein Nup160 an interesting candidate, because it is part of the Nup107-Nup160 nuclear pore complex that localizes to the kinetochore [249, 250]. The phylogenetic profile of Nup160 (as defined by PANTHER) was particularly similar to that of the SAC protein MadBub (r=0.718). In order to improve the phylogenetic profile of Nup160, we manually determined the orthologous group of Nup160 in our own proteome dataset. We also determined those of Nup107 and Nup133, two other proteins of the Nup107-Nup160 complex. The Nup160, Nup133 and Nup107 phylogenetic profiles strongly correlated to those of SAC proteins MadBub (0.541 < r < 0.738) and Mad2 (0.528 < r < 0.715, Figure 5), even more strongly than these three nuclear pore proteins correlated with one another (0.475 < r < 0.601). Furthermore, Nup160, Nup133 and Nup107 correlate better with MadBub and Mad2 than these SAC proteins do with the other SAC proteins (MadBub: average r=0.563, Mad2: average r=0.511) and far better than these SAC proteins do with all kinetochore proteins (MadBub: average r=0.290, Mad2: average r=0.239). While previous studies have shown that the Nup107-Nup160 complex localizes to the kinetochore in mitosis, our analysis in addition suggests that these proteins may function in the SAC and that they potentially interact with Mad2 and MadBub.
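The screening step described above (listing the best matching families per kinetochore protein and counting which families recur across the lists) might be sketched as below. The query names, the family identifiers and the profiles are hypothetical placeholders rather than real PANTHER entries, and the list length k is set to 2 here instead of the 30 used in the text.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns None when a profile has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def top_matches(query, families, k):
    """Names of the k families whose profiles correlate best with the query."""
    scored = []
    for name, profile in families.items():
        r = pearson(query, profile)
        if r is not None:
            scored.append((r, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


def recurrent_families(queries, families, k):
    """Count how often each family appears in the top-k lists of all queries."""
    counts = Counter()
    for profile in queries.values():
        counts.update(top_matches(profile, families, k))
    return counts


# Hypothetical query proteins and families over six species.
queries = {"Q1": [1, 1, 0, 0, 1, 0], "Q2": [1, 1, 0, 0, 1, 1]}
families = {
    "FAM_X": [1, 1, 0, 0, 1, 0],
    "FAM_Y": [0, 0, 1, 1, 0, 1],
    "FAM_Z": [1, 0, 0, 0, 1, 1],
    "FAM_W": [1, 1, 0, 1, 1, 1],
}

counts = recurrent_families(queries, families, k=2)
print(counts.most_common())
```

A family that recurs in the lists of many query proteins, like FAM_X here, would be a candidate for closer manual inspection, as was done for Nup160 in the text.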
[Figure 5: clustered heatmap of pairwise Pearson correlation coefficients r (color scale 0.2 to 1, threshold t marked) for Bub3, Cdc20, Mad1, Mps1, Nup107, Nup160, Nup133, Mad2 and MadBub.]
Figure 5. Correlations between proteins of the Nup107-160 complex and proteins of the SAC. Heatmap indicating the pairwise Pearson correlation coefficients (r) of the phylogenetic profiles of proteins of the Nup107-160 complex and of the SAC. The clustering (average linkage) on the left side of this heatmap was also based on these correlations. The indicated threshold t represents the Pearson correlation coefficient for which we found a 6-fold enrichment of interacting protein pairs (see Appendix Figure S1).
[Figure EV5: panel A, Viridiplantae phylogeny (Streptophyta and Chlorophyta) with the MIM type per species (canonical MIM, land plant MIM, ‘transition’ MIM) and the mutation of the canonical MIM marked; species shown: Arabidopsis thaliana, Aquilegia coerulea, Oryza sativa, Amborella trichopoda, Selaginella moellendorffii, Physcomitrella patens, Klebsormidium flaccidum, Chlorokybus atmophyticus, Chlamydomonas reinhardtii, Volvox carteri, Coccomyxa subellipsoidea, Chlorella variabilis*, Ostreococcus lucimarinus, Bathycoccus prasinos, Micromonas species*. Panel B, sequence logos of the Mad1 MIM and the Cdc20 MIM with the MIM definition. Panels C and D, co-occurrence counts of Mad2 with Mad1 (protein: r=0.492; MIM: r=0.609) and of Mad2 with Cdc20 (protein: r=0.440; MIM: r=0.519); X marks an absent protein.]
Figure EV5. Evolution of the Mad2-interacting motif (MIM) in green plants and co-occurrences of Mad2 with the MIM under a less strict motif definition. A. Viridiplantae (green plants) phylogeny [3] and the occurrences of the canonical MIM or the ‘land plant’ MIM in Mad1 orthologs of the associated species. *Species lacking an aligned MIM, possibly caused by incomplete gene prediction of Mad1 orthologs. B. The sequence logos of the MIMs of Mad1 (upper panel) and Cdc20 (lower panel) based on the alignments of the motifs present in the right-sided panels of C and D. Below is indicated the required amino acid sequence of the MIM (+: positive residue, Φ: hydrophobic residue, P: proline). In contrast to Figure 6, the MIM is considered present if it agrees with the pattern [ILV](2)X(3,7)P or [RK][ILV](2), so that the land plant motif suffices. C, D. Left side: numbers of presences and absences of Mad2 in 90 eukaryotic species and its interaction partners Mad1 (C) and Cdc20 (D). Right side: frequencies of Mad2 and MIM (according to definition in B) occurrences in species having Mad1 (C) or Cdc20 (D), respectively. Also the Pearson correlation coefficients (r) for the corresponding phylogenetic profiles are shown.
[Figure 6: panel A, sequence logos of the Mad1 MIM and the Cdc20 MIM with the MIM definition; panel B, co-occurrence counts of Mad2 with Mad1 (protein: r=0.492; canonical MIM: r=0.512); panel C, co-occurrence counts of Mad2 with Cdc20 (protein: r=0.440; canonical MIM: r=0.755). Legend: X, protein absent; Mad2-interacting motif (MIM).]
Figure 6. Phylogenetic co-occurrence of Mad2 with its interaction partners Mad1 and Cdc20 and their Mad2-interacting motifs (MIMs). A. The sequence logos of the MIMs of Mad1 (upper panel) and Cdc20 (lower panel) based on the multiple sequence alignments of the motifs. Below is indicated the required amino acid sequence of the MIM (+: positive residue, Φ: hydrophobic residue, P: proline) which is restricted by the pattern [RK][ILV](2)X(3,7)P. B, C. Left side: numbers of presences and absences of Mad2 in 90 eukaryotic species and its interaction partners Mad1 (B) and Cdc20 (C). Right side: frequencies of Mad2 and canonical MIM occurrences in species having Mad1 (B) or Cdc20 (C), respectively. Also the Pearson correlation coefficients (r) for the corresponding phylogenetic profiles are shown.
The Mad2-interacting motif (MIM) in Mad1 and Cdc20 is coupled to Mad2 presence
While interacting proteins are expected to co-evolve at the protein-protein level, as exemplified by many complexes within the kinetochore, interacting proteins might also co-evolve at different levels, such as protein-motif. Co-evolution between a protein and a protein motif has been incidentally detected before, for example in case of CenpA and its interacting motif in CenpC [251] and in case of MOT1 and four critical phenylalanines in TBP [160]. We here explore potential co-evolution of Mad2 with the protein motif it interacts with in Cdc20 and Mad1: the Mad2-interacting motif (MIM). Both the Mad2-Mad1 and the Mad2-Cdc20 interactions operate in the SAC [252, 253]. We defined the phylogenetic profiles of the MIM in Mad1 and Cdc20 [254, 255] (Figure 6A) by inspecting the multiple sequence alignments of Mad1 and Cdc20. These alignments revealed that the MIM is found at a similar position across the Mad1 and Cdc20 orthologs, hence the motif likely predates LECA in both these proteins. Notable differences exist between the MIMs of Cdc20 and Mad1, which could reflect differences in binding strength to Mad2.
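The motif patterns given in the figure legends, [RK][ILV](2)X(3,7)P for the canonical MIM and [ILV](2)X(3,7)P or [RK][ILV](2) for the less strict definition, translate directly into regular expressions. The sketch below assumes that X stands for any residue and scans a whole sequence, whereas the text describes inspecting the motif at its aligned position; the example sequences are invented.

```python
import re

# Canonical definition: positive residue, two hydrophobic residues,
# three to seven arbitrary residues, then a proline.
CANONICAL_MIM = re.compile(r"[RK][ILV]{2}.{3,7}P")

# Less strict definition, so that the land plant motif also qualifies.
RELAXED_MIM = re.compile(r"[ILV]{2}.{3,7}P|[RK][ILV]{2}")


def has_canonical_mim(sequence):
    """True if the canonical pattern occurs anywhere in the sequence."""
    return CANONICAL_MIM.search(sequence) is not None


def has_relaxed_mim(sequence):
    """True if either branch of the relaxed pattern occurs in the sequence."""
    return RELAXED_MIM.search(sequence) is not None


# Hypothetical peptide fragments, for illustration only.
examples = ["GSKILAAAAPGS", "GSAILAAAAPGS", "GSKVLGG", "GGGGSGGG"]
for seq in examples:
    print(seq, has_canonical_mim(seq), has_relaxed_mim(seq))
```

Scanning entire sequences with such short patterns will produce chance matches, so restricting the search to the aligned motif region, as the text does, is the safer choice in practice.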
The phylogenetic profiles of Mad2 and of the MIM in Cdc20 or Mad1 orthologs correlated more strongly than those of the full-length proteins (Figure 6B,C). In particular, species lacking Mad2, but having Mad1 and/or Cdc20, never contained the canonical MIM in either their Cdc20 or their Mad1 sequences (hypergeometric test: p<10⁻⁴ and p<10⁻⁹ for Mad1 and Cdc20, respectively). Such species hence likely lost Mad2 and subsequently lost the MIM in Mad1 and Cdc20, because it was no longer functional. Moreover, absence of the MIM in Mad1/Cdc20 supports that in these species Mad2 is indeed absent. While we expected to only find a MIM in species that actually have Mad2, we also expected the reverse: that species that have Mad2 also have a MIM in their Mad1/Cdc20. This is however not the case, most notably for Mad1: many lineages (14) have both Mad1 and Mad2 but lack the MIM in Mad1. A substantial fraction (six) of this group belongs to the land plant species that have a somewhat different motif in Mad1 that is conserved within this lineage (Figure EV5A). This altered land plant motif might mediate the Mad1-Mad2 interaction, which has been reported in A. thaliana [256]. If we consider this plant motif to be a ‘valid’ MIM, the Mad1-MIM and Mad2 correlate substantially better (Figure EV5B-D). Overall, under both motif definitions the protein-motif correlations are higher than the protein-protein correlations. Hence, including protein motifs can expose that interaction partners co-evolve, albeit at a different level, and may aid to predict functional interactions between proteins de novo.
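A hypergeometric test of the kind mentioned above can be sketched with exact binomial coefficients. This is one plausible formulation, not necessarily the one used for the reported p-values: among N species that have the partner protein, K carry the motif and n lack Mad2, and we ask how likely it is that at most k of those n carry the motif by chance. All counts below are made up for illustration.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(X = k) when drawing n of N species, K of which carry the motif."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def lower_tail(k, N, K, n):
    """P(X <= k): probability of seeing at most k motif carriers among the n."""
    smallest = max(0, n - (N - K))  # fewest carriers that a draw can contain
    return sum(hypergeom_pmf(i, N, K, n) for i in range(smallest, k + 1))


# Hypothetical counts: 60 species with the partner protein, 45 of them with
# the motif, 8 of them lacking Mad2, and none of those 8 carrying the motif.
N_SPECIES, WITH_MOTIF, WITHOUT_MAD2, OBSERVED = 60, 45, 8, 0

p_value = lower_tail(OBSERVED, N_SPECIES, WITH_MOTIF, WITHOUT_MAD2)
print(f"{p_value:.2e}")
```

With these invented counts the probability is far below 10⁻⁴, which illustrates why a complete absence of the motif in species lacking Mad2 is unlikely to arise by chance.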
Discussion
Our evolutionary analyses revealed that since LECA, the kinetochores of different lineages strongly diverged by different modes of genome evolution: kinetochore proteins were lost, duplicated and/or invented, or diversified on the sequence level. In addition to straightforward protein-protein co-evolution, we found alternative evolutionary relationships between proteins that hint at a more complex interplay between evolution and function. Some established interacting proteins have not co-evolved (Zwint-1 and RZS, Bub3 and BugZ), which has been previously shown for other interaction partners to reflect evolutionary flexibility [221]. Lack of co-evolution may also reflect that a protein has multiple different functions, for which it interacts with different partners. The phylogenetic profile of such a multifunctional protein differs from that of either of its interaction partners, and instead is similar to the combined profiles of its interaction partners [222], as we showed for HORMAD and p31comet with TRIP13. Some co-evolutionary relationships predicted novel protein functions, such as nuclear pore proteins operating in the SAC, which should be confirmed with experiments. Finally, not only proteins, but also functional protein motifs co-evolved with their interaction partner, as we found for Mad2 and the MIMs in Cdc20/Mad1. Probably, including more proteins and (known and de novo predicted) motifs/domains will not only improve the correlation between known interaction partners, but will also enhance the prediction of yet unknown interactions and functions.
While we carefully curated the orthologous groups of each of the kinetochore proteins, their phylogenetic profiles might contain some false positives and/or false negatives: incorrectly assigned presences (because a protein sequence in fact is not a real ortholog) and incorrectly assigned absences (because a species does contain an ortholog, but we did not detect it). For the majority of kinetochore proteins, we estimate the chance of false negatives to be larger than that of false positives, mainly because they likely are vulnerable to homology detection failure, given that their sequences evolve so rapidly (Appendix Table S1, Results). Such false negatives of a particular protein will result in falsely inferred gene loss events. A failure to detect homology might therefore also cause sequence divergence to correlate with loss frequency (Figure EV2). Specific examples of suspicious absences (potential false negatives) include the inner centromere protein Borealin in S. cerevisiae and the KMN network proteins Spc24, Spc25, Nsl1/Dsn1 in D. melanogaster and C. elegans, and possibly Ndc80 in T. brucei, since functional counterparts of these proteins have been characterized in these species [113, 257-262]. Moreover, species that we predicted to have very limited kinetochore compositions, such as T. thermophila (Figure 2B), might actually contain highly divergent orthologs that we could not detect. If such a species’ kinetochore were examined biochemically, its undetected orthologs might be uncovered. Although the phylogenetic profiles of the kinetochore proteins presented here might contain some such errors, we think that our manual curation of the orthologous groups (see Materials and Methods) yields an accurate global representation of the presences and absences of these proteins among eukaryotes. We think this accuracy is supported by the high similarity of phylogenetic profiles of interacting proteins.
The set of kinetochore proteins we studied here is strongly biased towards yeast and animal lineages; lineages that are relatively closely related on the eukaryotic tree of life. This bias is due to the extensive experimental data available for these lineages. Highly different kinetochores might exist, such as the kinetochore of T. brucei [113, 116]. If in the future we know the experimentally validated kinetochore compositions of a wider range of eukaryotic species, we could sketch a more complete picture of kinetochore evolution and could potentially expand and improve our functional predictions.
Since the kinetochore seems highly diverse across species, several questions arise. Is the kinetochore less conserved than other core eukaryotic cellular systems/pathways, as comparing it to the APC/C suggested? And if so, why is it allowed to be less conserved, or are many of the alterations adaptive to the species? Why do certain lineages (such as multicellular animals and plants) contain a particular kinetochore submodule (such as the Ska complex) while others (such as most fungi) lack it, or have an alternative system (such as Dam1)? Do these genetic variations among species have functional consequences for kinetochore-related processes in their cells? To answer such questions, our dataset should be expanded with specific (cellular) features and lifestyles, when this information becomes available for the species in our genome dataset. Together with biological and biochemical analyses of processes in unexplored species, an expanded dataset may reveal the true flexibility of the kinetochore in eukaryotes and show how chromosome segregation is executed in diverse species. The comparative genomics analysis that we presented here provides a starting point for such an integrated approach into studying kinetochore diversity and evolution, since it allows for informed decisions about which species to study.
Materials and Methods
Constructing the proteome database
To study the occurrences of kinetochore genes across the eukaryotic tree of life, we constructed a database containing the protein sequences of 90 eukaryotic species. This size was chosen because we consider it to be sufficiently large to represent eukaryotic diversity, but also small enough to allow for manual detection of orthologous genes. We selected the species for this database based on four criteria. First, the species should have a unique position in the eukaryotic tree of life, in order to obtain a diverse set of species. Second, if available we selected two species per clade, which facilitates the detection of homologous sequences and the construction of gene phylogenies. Third, widely used model species were preferred over other species. Fourth, if multiple proteomes and/or proteomes of different strains of a species were available, the most complete one was selected. Completeness was measured as the percentage of core KOGs (248 core eukaryotic orthologous groups [263]) found in that proteome. If multiple splice variants of a gene were annotated, the longest protein was chosen. A unique protein identifier was assigned to each protein, consisting of 4 letters and 6 numbers. The letters combine the first letter of the genus name with the first three letters of the species name. The versions and sources of the selected proteomes can be found in Appendix Table S2.
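The identifier scheme and the completeness measure described above can be sketched as follows. This is a minimal illustration: the function names, the upper-casing of the prefix, the use of a zero-padded running number and the example KOG identifiers are assumptions that are not specified in the text.

```python
def make_protein_id(genus, species, index):
    """Build a 4-letter + 6-digit identifier (hypothetical formatting).

    The prefix combines the first letter of the genus name with the first
    three letters of the species name; upper-casing and zero-padding are
    assumptions made for this sketch.
    """
    prefix = (genus[0] + species[:3]).upper()
    return f"{prefix}{index:06d}"

def kog_completeness(found_kogs, core_kogs):
    """Percentage of core KOGs that are found in a proteome."""
    core = set(core_kogs)
    return 100.0 * len(core & set(found_kogs)) / len(core)

print(make_protein_id("Homo", "sapiens", 42))

# Toy example with 4 made-up core KOG labels instead of the 248 used in the text.
toy_core = {"KOG0001", "KOG0002", "KOG0003", "KOG0004"}
toy_found = {"KOG0001", "KOG0003", "KOG9999"}
print(kog_completeness(toy_found, toy_core))
```

When several proteomes are available for one species, the one with the highest completeness value would be kept under the fourth criterion.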
Ortholog detection
The set of kinetochore proteins we studied was selected based on three criteria: (1) localizing to the kinetochore, (2) being present in at least three lineages and (3) having an established role, supported by multiple studies, in the kinetochores and/or kinetochore signaling in human or in budding yeast. We applied a procedure comprising two different methods to find orthologs for the kinetochore proteins in our set within our database of 90 eukaryotic proteomes, and the same procedure was followed for determining orthologs of the APC/C proteins, NAG, RINT1, Nup107, Nup133, Nup160 and HORMAD. The method of choice depended on whether or not it was straightforward to find homologs across different lineages for a specific protein. In both methods, initial searches started with the human sequence, or, if the protein is not present in humans, with the budding yeast sequence.
Method 1. If many homologs were easily found, the challenge was to distinguish orthologs from outparalogs. Here we defined an orthologous group as comprising proteins that result from speciation events and that can be traced back to a single gene in LECA, whereas outparalogs are related proteins that resulted from a pre-LECA duplication. For example, Cdc20 and Cdh1 are homologous proteins, both having their own orthologous groups among the eukaryotes. They resulted from a duplication before LECA, therefore members of the Cdc20 and Cdh1 groups are outparalogs to each other. To find homologs, we used blastp online to search through the non-redundant protein sequences (nr) as a database [264]. We aligned the sequences found with MAFFT [265] (version v7.149b, option einsi, or linsi in case of expected different architectures) to make a profile HMM (www.hmmer.org, version HMMER 3.1b1). If the homologs are known to share only a certain domain, that domain was used for the HMM, otherwise we used the full-length alignment. This HMM was used as input for hmmsearch to detect homologs across our own database of 90 eukaryotic proteomes. From the hits in this database, we took a substantial number of the highest scoring hit sequences, up to several hundreds. We aligned the hit sequences using MAFFT and trimmed the alignment with trimAl [266] (version 1.2, option automated1).
Subsequently, RAxML version 8.0.20 [267] was used to build a gene tree (settings: varying substitution matrices, GAMMA model of rate heterogeneity, rapid bootstrap analysis of 100 replicates). We interpreted the resulting gene tree by comparing it to the species tree and thereby determined which clusters form orthologous groups. These orthologous groups were identified by finding the cluster that contained sequences from a broad range of eukaryotic species and had a sister cluster that also has sequences from this broad range of species. The cluster that contained the initial human query sequence was the orthologous group of interest, while the sister cluster is the group of outparalogs. In our search for orthologs of CenpA, we applied this first method. CenpA is part of the large family of histone H3 proteins and has long been recognized to diverge rapidly, which makes it a challenge to reconstruct CenpA’s evolution [268]. We determined this orthologous cluster with the help of experimentally identified centromeric histone H3 variants in a wide range of species and we included two Toxoplasma gondii sequences that were not part of this cluster. For details, see Figure EV4. The tree in this figure was visualized using iTOL [269].
Method 2. If homologs were not easily found, no outparalogs were obtained by these searches and hence the homologs defined the orthologous group. For these cases we used a different strategy to find the orthologous group in our database. Iterative searching methods (jackhmmer and/or psi-blast) were applied to find homologs across the nr and UniProt databases [270]. In specific cases we cut the initial query sequence, for example to remove putative coiled-coil regions. If a protein returned very few hits, we tried to expand the set of putative homologous sequences by using some of the initially obtained hits as a query. If candidate orthologous proteins were reported in experimental studies in species other than human or budding yeast, but not found by initial searches, we specifically searched using those as a query. If this search yielded hits overlapping with previous searches, these candidate orthologous sequences were added to the set of hits. The sequences in this set were aligned to obtain a refined profile HMM. In addition, we searched for conserved motifs in the hit sequences using MEME [271] (version 4.9.0), which aided in recognizing conserved positions that could characterize the homologs. The obtained profile HMM was used to search for homologs across our local database. The resulting hits were checked for motifs identified by MEME and applied to online (iterative) homology searches to check whether we retrieved sequences already identified as orthologous. Based on this evaluation of individual hits, we defined a scoring threshold for the hmmsearch with this profile HMM and searched our database until no new hits were found. The resulting set of sequences was the orthologous group of interest. The sequences of the orthologous groups can be found in Dataset 1.
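The automated part of Method 1 (align, build a profile HMM, search the proteome database, align and trim the hits, build a bootstrapped gene tree) could be strung together roughly as sketched below. The sketch only assembles command lines for the tools named in the text and does not run them. All file names, the random seeds and the `PROTGAMMAAUTO` model string are assumptions made for illustration (the text states only that substitution matrices were varied), and the step that extracts the top-scoring hit sequences from the hmmsearch table is not shown.

```python
from pathlib import Path

def method1_commands(homologs_fasta, proteome_db, workdir="work", bootstraps=100):
    """Return (argv, stdout_file) pairs for a Method 1-like pipeline (nothing is executed)."""
    w = Path(workdir)
    seed_aln = w / "seed.aln.fasta"
    profile = w / "profile.hmm"
    hits_tbl = w / "hits.tbl"
    hits_fasta = w / "hits.fasta"        # top hits, extracted by a step not shown here
    hits_aln = w / "hits.aln.fasta"
    trimmed = w / "hits.trimmed.fasta"
    return [
        # MAFFT E-INS-i alignment of the homologs; MAFFT writes to standard output
        (["einsi", str(homologs_fasta)], seed_aln),
        # profile HMM from the alignment
        (["hmmbuild", str(profile), str(seed_aln)], None),
        # search the local proteome database, keeping a tabular hit list
        (["hmmsearch", "--tblout", str(hits_tbl), str(profile), str(proteome_db)], None),
        # align the selected hit sequences
        (["einsi", str(hits_fasta)], hits_aln),
        # automated trimming of the alignment
        (["trimal", "-in", str(hits_aln), "-out", str(trimmed), "-automated1"], None),
        # rapid bootstrap analysis plus ML tree search (seeds and model are assumptions)
        (["raxmlHPC", "-f", "a", "-m", "PROTGAMMAAUTO", "-p", "12345", "-x", "12345",
          "-#", str(bootstraps), "-s", str(trimmed), "-n", "genetree"], None),
    ]

commands = method1_commands("homologs.fasta", "proteomes90.fasta")
for argv, stdout_file in commands:
    redirect = f" > {stdout_file}" if stdout_file else ""
    print(" ".join(argv) + redirect)
```

The interpretation of the resulting gene tree against the species tree, which defines the orthologous group, remains a manual step and is not captured by this sketch.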
Calculating correlations between phylogenetic profiles
In order to study the co-evolution of the kinetochore proteins and to infer potential functional relationships of these genes based on co-evolution, we derived the phylogenetic profiles of these genes. The phylogenetic profile of a gene is a list of its presences and absences across our set of 90 eukaryotic genomes based on the composition of the orthologous groups. The phylogenetic profile consists of a string of 90 characters containing a “1” if the gene is present in a particular species (either single- or multi-copy), and a “0” if it is absent. To reveal whether two genes often co-occur in species, we measured how similar their phylogenetic profiles were using the Pearson correlation coefficient [4]. All pairwise scores can be found in Dataset 2. To identify pairs of proteins that potentially have a functional association, we applied a threshold of r=0.477. Appendix Figure S1 clarifies why the Pearson correlation coefficient was chosen and how the threshold was set. The Pearson correlation coefficients of all gene pairs were converted into distances (d = 1-r) and the genes were clustered based on their phylogenetic profiles using average linkage. The Pearson correlation coefficients were also used to map the kinetochore proteins in 2D by Barnes-Hut t-SNE (Appendix Figure S2) [12].
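The profile comparison described above can be illustrated with a small sketch. The three 8-species profiles and their names are made up for the example (the real profiles have 90 positions), and treating the threshold as inclusive (r >= 0.477) is an assumption. The clustering and t-SNE steps are not shown.

```python
from itertools import combinations
from math import sqrt

THRESHOLD = 0.477  # threshold on r reported in the text

def pearson(profile_a, profile_b):
    """Pearson correlation coefficient between two presence/absence strings of '0'/'1'."""
    xs = [int(c) for c in profile_a]
    ys = [int(c) for c in profile_b]
    if len(xs) != len(ys):
        raise ValueError("profiles must cover the same species")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when a gene is present everywhere or nowhere
    return cov / sqrt(var_x * var_y)

# Toy profiles over 8 hypothetical species
profiles = {
    "geneA": "11011000",
    "geneB": "11010000",
    "geneC": "00100111",
}

for (name_a, prof_a), (name_b, prof_b) in combinations(profiles.items(), 2):
    r = pearson(prof_a, prof_b)
    d = 1 - r  # distance used as input for average-linkage clustering
    linked = r >= THRESHOLD
    print(f"{name_a}-{name_b}: r={r:.3f} d={d:.3f} associated={linked}")
```

In this toy example geneA and geneB pass the threshold, whereas geneC is perfectly anti-correlated with geneA and therefore ends up at the maximum distance of 2.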
Detecting the MIM in Mad1 and Cdc20 orthologs
We made multiple sequence alignments of the Cdc20 and Mad1 orthologous groups using MAFFT (option einsi). We used these alignments to search for the Mad2-interacting motif (MIM). The typical MIM is defined by [KR][IVL](2)X(3,7)P for both Mad1 and Cdc20 [254, 255], but we also used an alternative definition: [ILV](2)X(3,7)P or [RK][ILV](2). We inferred that the location of the motif in the protein is conserved in Mad1 as well as in Cdc20, because the position of the MIM in the multiple sequence alignments was the same in highly divergent species (e.g. plants and animals). For all orthologous sequences, we checked whether the motif, either the typical MIM (Figure 6) or the alternative MIM (Figure EV5), was present at these conserved positions.
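The two motif definitions can be translated into regular expressions, as in the minimal sketch below. The translation itself is an assumption: X is read as any residue, (2) and (3,7) as repeat counts, and the "or" in the alternative definition as an alternation. The sequences are made-up toy examples, and the sketch scans ungapped sequences only; checking that a match falls at the conserved alignment position, as described above, is not implemented.

```python
import re

# Typical MIM: [KR][IVL](2)X(3,7)P
TYPICAL_MIM = re.compile(r"[KR][IVL]{2}.{3,7}P")
# Alternative MIM: [ILV](2)X(3,7)P or [RK][ILV](2)
ALTERNATIVE_MIM = re.compile(r"[ILV]{2}.{3,7}P|[RK][ILV]{2}")

def find_motifs(sequence, pattern):
    """Return (0-based start, matched text) for each non-overlapping match.

    Gap characters must be removed first, because '.' would also match '-'.
    """
    return [(m.start(), m.group()) for m in pattern.finditer(sequence.upper())]

# Made-up toy sequences, not real Mad1 or Cdc20 fragments
toy_sequences = {
    "with_typical": "MSTAKILAAAAPGGS",
    "only_alternative": "MSTAAILAAAAPGGS",
    "no_motif": "MSTGGGDDEE",
}

for name, seq in toy_sequences.items():
    typical = find_motifs(seq, TYPICAL_MIM)
    alternative = find_motifs(seq, ALTERNATIVE_MIM)
    print(name, "typical:", typical, "alternative:", alternative)
```

A presence/absence profile for the motif would then record a "1" only for orthologs in which a match is found at the conserved position.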
Finding novel proteins functioning in the kinetochore
To find new proteins performing essential roles at the kinetochore by phylogenetic profiling, a reference protein set was needed. This reference set was based on the protein families present in PANTHER. More specifically, we assigned the proteins within our proteome database of 90 eukaryotic species to PANTHER (sub)families [272] (version 10). This assignment was done by applying hmmscan to the protein sequences of our database, using the complete set of PANTHER family and subfamily HMMs as a search database. Each protein was assigned to the PANTHER (sub)family to which it had the highest hit, if significant. If a protein was assigned to a subfamily, it was also assigned to the full family to which that subfamily belongs. For each PANTHER (sub)family, a phylogenetic profile was constructed and compared to the phylogenetic profiles of the kinetochore proteins. For each kinetochore protein, the best 30 matches of PANTHER (sub)families were selected. The proteins often occurring in these top lists can be found in Appendix Table S3.
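The assignment rule and the tallying of top lists can be sketched as follows. The sketch assumes that subfamily identifiers have the form seen in Appendix Table S3 (family ID, a dot, then the subfamily suffix) and that the correlations have already been computed; the family labels and scores below are invented, and a top list of 2 is used instead of 30 to keep the example small.

```python
from collections import Counter

def families_for_hit(best_hit_id):
    """(Sub)families a protein is assigned to, given the ID of its best significant HMM hit."""
    family = best_hit_id.split(".")[0]
    if best_hit_id == family:
        return [family]
    return [best_hit_id, family]  # a subfamily hit also counts for the full family

def top_matches(scores, n=30):
    """The n (sub)families whose profiles correlate best with one kinetochore protein."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

def tally_top_lists(scores_by_protein, n=30):
    """Count how often each (sub)family appears in the top-n lists of all kinetochore proteins."""
    counts = Counter()
    for scores in scores_by_protein.values():
        counts.update(top_matches(scores, n))
    return counts

# Invented correlations between three kinetochore proteins and three families
toy_scores = {
    "CenpN": {"FAM_A": 0.90, "FAM_B": 0.50, "FAM_C": 0.10},
    "Dam1": {"FAM_A": 0.20, "FAM_B": 0.80, "FAM_C": 0.70},
    "Ska1": {"FAM_A": 0.60, "FAM_B": 0.65, "FAM_C": 0.30},
}

print(families_for_hit("PTHR11444.SF3"))
print(tally_top_lists(toy_scores, n=2).most_common())
```

Families that recur in many top lists are the candidates reported in a table such as Appendix Table S3.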
Comparing diversity of kinetochore and APC/C proteins
For the kinetochore and APC/C proteins in this dataset, we calculated their occurrence frequencies and entropies across 90 eukaryotic species. The entropy reflects a protein’s diversity of presences and absences across species: a protein that is present in half of the species has the highest entropy. We also calculated and compared all pairwise Pearson correlation coefficients of the phylogenetic profiles for both of these protein datasets. To assess how complete the kinetochores and APC/C complexes of the species in our dataset are, we calculated the percentage of present kinetochore proteins in species having Ndc80 and CenpA (because those species are expected to have a kinetochore), and we calculated the percentage of present APC/C proteins in species having the main APC/C enzyme Apc10. Loss frequencies were inferred from Dollo parsimony for all kinetochore and APC/C proteins inferred to have been present in LECA. Transitions (also a measure for the evolutionary dynamics of proteins) were measured for each protein by counting all changes in state (so from present to absent, or from absent to present) along a phylogenetic profile. Since the ordering of the species in the phylogenetic profile is an indication of their relatedness, these transitions are expected to reflect the evolutionary flexibility of proteins as well. dN/dS and percent identity scores for human and mouse sequences were derived from Ensembl [273] (downloaded via Ensembl BioMart on November 24, 2016). If multiple one-to-one orthologs for a single orthologous group/family exist, the average dN/dS or percent identity was taken. The results of these kinetochore-APC/C comparisons can be found in Appendix Table S1.
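Three of the per-protein measures described above (occurrence frequency, entropy and transitions) can be computed directly from a presence/absence string, as sketched below. The use of base-2 Shannon entropy is an assumption that is consistent with a maximum at 50% presence, and the transition count is returned as a raw number because any normalization is not specified in this section. Loss inference by Dollo parsimony needs the species tree and is not covered.

```python
from math import log2

def frequency(profile):
    """Fraction of species in which the protein is present."""
    return profile.count("1") / len(profile)

def entropy(profile):
    """Binary Shannon entropy (base 2, an assumption) of presence versus absence."""
    p = frequency(profile)
    if p == 0.0 or p == 1.0:
        return 0.0  # no diversity when present everywhere or nowhere
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def transitions(profile):
    """Number of state changes between neighbouring species along the profile."""
    return sum(a != b for a, b in zip(profile, profile[1:]))

# Toy profiles over 8 hypothetical species, ordered by relatedness
for profile in ("11110000", "10101010", "11111111"):
    print(profile, frequency(profile), round(entropy(profile), 3), transitions(profile))
```

The first two toy profiles have the same frequency and entropy but very different transition counts, which is why transitions add information about how patchy a distribution is along the species ordering.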
Acknowledgments
We thank the members of the Kops and Snel labs for critical reading and helpful discussion on the manuscript. We thank John van Dam for his contribution to compiling the eukaryotic genome database. This work was supported by the UMC Utrecht and the Netherlands Organisation for Scientific Research (NWO-Vici 865.12.004 to GK).
Author contributions
BS and GK designed the research. JH and ET performed the research. LW contributed the eukaryotic genome database. JH, BS and GK analyzed the data and wrote the paper.
Supplementary Material
Appendix Table S2 and Datasets 1-3 can be found online: http://bioinformatics.bio.uu.nl/jolien/thesis/chapter3_eukaryotic_kinetochore_evolution/
Appendix Table S1. Measures of protein diversity in the set of kinetochore and APC/C proteins. Scores present the average across the proteins, except in the case of completeness (average across species). Statistical validity was assessed by performing an unpaired, two-sided t-test.
| Diversity feature | p-value (kinetochore vs. APC/C) | Kinetochore (average) | APC/C (average) |
|---|---|---|---|
| Frequency | 0.0051 | 0.463 | 0.689 |
| Entropy | 0.0248 | 0.731 | 0.578 |
| Pearson correlation coefficient | 0.0006 | 0.219 | 0.267 |
| Completeness | 6.98e-13 | 0.485 | 0.701 |
| Losses (Dollo parsimony) | 0.146 | 16.4 | 13.1 |
| Transitions | 0.672 | 0.173 | 0.184 |
| % Identity (human-mouse) | 0.0016 | 74.8% | 89.2% |
| dN/dS (human-mouse) | 1.80e-5 | 0.245 | 0.059 |
Appendix Table S3. Phylogenetic profiles of the kinetochore proteins were compared to PANTHER10 (sub)families. The PANTHER10 (sub)families that have similar phylogenetic profiles as many kinetochore proteins (as measured by the Pearson correlation coefficient, indicated by their frequency in the top 30 of each kinetochore protein) are shown here.
| PANTHER10 (sub)family ID | PANTHER10 (sub)family name | Frequency (top 30) | Pearson correlation coefficient (r, average) | Protein | Information from |
|---|---|---|---|---|---|
| PTHR11444.SF3 | ARGININOSUCCINATE LYASE | 12 | 0.491 | ARGININOSUCCINATE LYASE | human |
| PTHR28080 | FAMILY NOT NAMED | 12 | 0.486 | Pex3: Peroxisomal Biogenesis Factor 3 | human |
| PTHR12309.SF12 | CENTROMERE PROTEIN N | 12 | 0.770 | CenpN | human |
| PTHR14582 | FAMILY NOT NAMED | 11 | 0.683 | CenpO | human |
| PTHR23342.SF0 | N-ACETYLGLUTAMATE SYNTHASE, MITOCHONDRIAL | 11 | 0.621 | NAGS: N-Acetylglutamate Synthase | human |
| PTHR28262 | FAMILY NOT NAMED | 10 | 0.773 | Spc19 | human |
| PTHR12856 | TRANSCRIPTION INITIATION FACTOR IIH-RELATED | 10 | 0.568 | GTF2H1: General Transcription Factor IIH Subunit 1 | human |
| PTHR14401 | FAMILY NOT NAMED | 10 | 0.715 | CenpK | human |
| PTHR10606.SF39 | 6-PHOSPHOFRUCTO-2-KINASE/FRUCTOSE-2,6-BISPHOSPHATASE YLR345W-RELATED | 10 | 0.773 | Similar to 6-phosphofructo-2-kinase enzymes | yeast |
| PTHR14778 | FAMILY NOT NAMED | 10 | 0.642 | Dsn1 | human |
| PTHR31749 | FAMILY NOT NAMED | 10 | 0.624 | Nsl1 | human |
| PTHR18460 | UNCHARACTERIZED | 10 | 0.544 | TTI1: TELO2 Interacting Protein 1 | human |
| PTHR34832 | FAMILY NOT NAMED | 10 | 0.636 | CenpW | human |
| PTHR10555.SF136 | VACUOLAR PROTEIN SORTING-ASSOCIATED PROTEIN 17 | 10 | 0.773 | Vps17 | yeast |
| PTHR28017 | FAMILY NOT NAMED | 10 | 0.773 | Dad3 | yeast |
| PTHR21286 | NUCLEAR PORE COMPLEX PROTEIN NUP160 | 10 | 0.533 | Nup160 | human |
| PTHR31382.SF4 | NA(+)/H(+) ANTIPORTER | 9 | 0.764 | NHA1: Na+/H+ antiporter | yeast |
| PTHR24343.SF137 | SERINE/THREONINE-PROTEIN KINASE RTK1-RELATED | 9 | 0.761 | RTK1 | yeast |
| PTHR11689.SF93 | ANION/PROTON EXCHANGE TRANSPORTER GEF1 | 9 | 0.761 | Gef1 | yeast |
| PTHR11266.SF8 | MPV17-LIKE PROTEIN 2 | 9 | 0.592 | MPV17L2: Mitochondrial Inner Membrane Protein Like 2 | human |
| PTHR31740.SF2 | CENTROMERE PROTEIN L | 9 | 0.667 | CenpL | human |
| PTHR28051.SF1 | RESISTANCE TO GLUCOSE REPRESSION PROTEIN 1 | 9 | 0.709 | REG1: Regulatory subunit of type 1 protein phosphatase Glc7p | yeast |
| PTHR28113 | FAMILY NOT NAMED | 9 | 0.799 | Dam1 | yeast |
| PTHR23139.SF60 | CENTROMERE PROTEIN I | 9 | 0.661 | CenpI | human |
| PTHR12064.SF29 | PROTEIN MAM3 | 9 | 0.756 | Mam3 | yeast |
| PTHR28036 | FAMILY NOT NAMED | 9 | 0.787 | Dad2 | yeast |
| PTHR23168 | MITOTIC SPINDLE ASSEMBLY CHECKPOINT PROTEIN MAD1 MITOTIC ARREST DEFICIENT-LIKE PROTEIN 1 | 9 | 0.550 | Mad1 | human |
| PTHR28662 | FAMILY NOT NAMED | 9 | 0.655 | CenpH | yeast |
| PTHR28077 | FAMILY NOT NAMED | 9 | 0.731 | Kei1: Kex2-cleavable protein Essential for Inositol phosphorylceramide synthesis | yeast |
| PTHR11365.SF11 | SUBFAMILY NOT NAMED | 9 | 0.731 | OXP1: OxoProlinase | yeast |
[Appendix Figure S1, plot. x-axis: Coverage (0 to 500); y-axis: Enrichment (0 to 15). Legend, phylogeny-insensitive measures: chance co-occurrence probability distribution, Jaccard index, mutual information, Pearson correlation coefficient. Legend, phylogeny-sensitive measures (Dollo parsimony): Dollo Fisher's exact, Differential Dollo, Dollo overall.]
Appendix Figure S1. Performance of various measures that compare phylogenetic profiles in predicting physically interacting proteins. Various metrics quantify the similarity between phylogenetic profiles, such as the Pearson correlation coefficient, Hamming distance, chance co-occurrence probability distribution, Jaccard index, mutual information [4, 5] and various phylogeny-sensitive measures such as those based on Dollo parsimony [28]. We compared these metrics by assessing how well they return known physically interacting genes. A set of physically interacting proteins was obtained for our proteins of interest (kinetochore proteins) using BioGRID [35]. For each metric, we calculated the enrichment of these confirmed interacting protein pairs among pairs having a given phylogenetic profile similarity score (converted into the coverage of all possible protein pairs at that similarity score). Across most scores, the Pearson correlation coefficient returns the highest number of interacting pairs. For this Pearson correlation coefficient (r), the threshold t was set at the r value that yields 6-fold enrichment of interacting pairs relative to pairs for which no interaction is observed.
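One possible reading of the enrichment and coverage described in this caption is sketched below. The exact formula is not given here, so the definition used (the fraction of known interacting pairs scoring at or above a threshold, divided by the same fraction for pairs without an observed interaction) is an assumption, as are the toy scores and the toy interaction set.

```python
def enrichment_and_coverage(threshold, scores, interacting):
    """Fold enrichment of interacting pairs and coverage at a similarity threshold.

    scores: dict mapping a protein pair to its profile similarity score.
    interacting: set of pairs with a known physical interaction.
    Assumed definition: (fraction of interacting pairs at or above the threshold)
    divided by (fraction of non-interacting pairs at or above the threshold).
    """
    pos = [s for pair, s in scores.items() if pair in interacting]
    neg = [s for pair, s in scores.items() if pair not in interacting]
    frac_pos = sum(s >= threshold for s in pos) / len(pos)
    frac_neg = sum(s >= threshold for s in neg) / len(neg)
    coverage = sum(s >= threshold for s in scores.values())  # pairs retained
    enrichment = frac_pos / frac_neg if frac_neg else float("inf")
    return enrichment, coverage

# Toy similarity scores for all pairs among four hypothetical proteins
toy_scores = {
    ("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): 0.6,
    ("A", "D"): 0.3, ("B", "D"): 0.2, ("C", "D"): 0.1,
}
toy_interacting = {("A", "B"), ("A", "C")}

print(enrichment_and_coverage(0.5, toy_scores, toy_interacting))
```

Scanning thresholds from high to low and recording both values would trace a curve of enrichment against coverage; under this reading, the threshold on r would be the score at which the enrichment reaches 6.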
Figure labels (plot not recoverable): Borealin, Incenp, Mad2, Cdc20, Bub3, TRIP13, MadBub, Mis12, Mps1, p31comet, ZW10, Mad1, Nnf1, BugZ, Ska1, Knl1, Dsn1, Ska3, Spc24, Zwint-1, Nuf2, Nsl1, Ska2