PDF hosted at the Radboud Repository of the Radboud University Nijmegen

The following full text is a publisher's version.

For additional information about this publication click this link. http://hdl.handle.net/2066/26996

Please be advised that this information was generated on 2021-10-08 and may be subject to change.

Origin and evolution of the mitochondrial proteome

Applications for function prediction in the

Toni Gabaldón

A mis Padres, por hacer de mis ilusiones las suyas.

A Marta, por nuestra bonita simbiosis.

Cover: Der Baum des Lebens (Gustav Klimt)

ISBN: 90‐9019731‐1 (ISBN‐13: 9789090197319)

Printed by PrintPartners Ipskamp, Nijmegen.

Origin and evolution of the mitochondrial proteome.

Applications for protein function prediction in the eukaryotes

Een wetenschappelijke proeve op het gebied van de

Medische Wetenschappen

Proefschrift

ter verkrijging van de graad van doctor aan de Radboud Universiteit Nijmegen op gezag van de Rector Magnificus prof. Dr. C.W.P.M. Blom, volgens besluit van het College van Decanen

in het openbaar te verdedigen op dinsdag, 11 oktober 2005 des namiddags om 1:30 uur precies

door

Juan Antonio Gabaldón Estevan

Geboren op 8 november 1973 Te Valencia (Spanje)

Promotor: prof. dr . Martijn A. Huynen

Manuscriptcommissie: prof. dr. Wilfried de Jong dr. Frank Holstege UMC-Utrecht prof dr. Charles Kurland. Lund Univ. (Sweden)

Origin and evolution of the mitochondrial proteome.

Applications for protein function prediction in the eukaryotes

A scientific essay in the medical sciences

Doctoral thesis

to obtain the degree of doctor from Radboud University Nijmegen on the authority of the Rector, Prof. C.W.P.M. Blom, according to the decision of the Council of Deans to be defended in public on Tuesday, 11 October 2005 at 13:30 p.m. precisely

by

Juan Antonio Gabaldón Estevan

Born in Valencia (Spain) On the 8th of November 1973

Index of contents

Chapter 1 General Introduction 9

Chapter 2 Prediction of protein functions and pathways in the genome era 23

Chapter 3 Reconstruction of the proto‐mitochondrial metabolism 51

Chapter 4 Lineage‐specific gene loss following mitochondrial endosymbiosis and its applications for function prediction in the eukaryotes 57

Chapter 5 Tracing the evolution of a large protein complex in the eukaryotes, NADH:Ubiquinone oxidoreductase (Complex I) 73

Chapter 6 Evolution of mitochondrial metabolism 97

Chapter 7 Shaping the mitochondrial proteome 119

Chapter 8 The evolutionary origin of the peroxisome 135

Chapter 9 Summarizing discussion 151 Nederlandse samenvatting en discusie 156 Resumen y discusión 160

Appendix 1 An aerobic that produces hydrogen 167

Appendix 2 Color figures 179

Acknowledgements 195 Curriculum Vitae 199 List of publications 201

Chapter 1

Uno no es lo que es por lo que escribe, sino por lo que ha leído*.

Jorge Luis Borges

*One is not what he is because of what he writes, but because of what he has read

8 General introduction

Chapter 1

General introduction

9 Chapter 1

1. Mitochondria as eukaryotic organelles

1.1 The Eukaryotic cell All known cellular life on earth can be divided into three major domains, namely , and Eukaryota. The first two differ in many features such as the structure and chemistry of their cell walls and membranes and their molecular biology. However, both Bacterial and Archaeal domains share the absence of a cell nucleus and are therefore referred to as prokaryotes, from the Greek “before the nucleus”. On the contrary, Eukaryotes, “true nucleus” in Greek, do posses a nucleus. This membrane‐bound structure confines the genome and the processes of replication, transcription and RNA maturation. Besides the presence of a nucleus, eukaryotic cells display several other distinctive characteristics, some of which I will briefly mention here. Eukaryotes have much larger cells than prokaryotes, commonly by a factor of a thousand or more. To regulate the shape and movement of such large cells, as well as to control the place of the internal structures, eukaryotes posses a network of protein filaments that is called the cytoskeleton. Embedded within this cytoskeleton are several membranous structures or organelles (Figure 1). These include, among others, the Endoplasmic Reticulum (ER), the Golgi apparatus, peroxisomes, hydrogenosomes, choloroplasts and mitochondria. Remarkably, the presence of some of these organelles is restricted to specific phyla. This is the case of chloroplasts in photosynthetic eukaryotes and the hydrogenosomes in some anaerobic protozoa and fungi. By targeting specific to different organelles, eukaryotes have achieved a high level of compartmentalization of their metabolism. For example, in the Golgi apparatus occurs the synthesis of lipids for the cell membrane while the ER is responsible for the synthesis and insertion of cell membrane . Other vesicles are devoted to specific biochemical pathways such as intracellular digestion in the lysosomes or peroxide degradation and fatty acid oxidation in the peroxisomes. Nevertheless, despite this high compartmentalization, all organellar functions are ultimately orchestrated by the cell nucleus because, with only a handful of exceptions, the proteins that are targeted to the organelles are encoded by the nuclear DNA. Mitochondria, which are introduced in more detail in the following section, play a central role in the eukaryotic cell and are widely distributed among eukaryotes. Indeed, it seems that all known eukaryotic species posses either mitochondria or one of their evolutionary related organelles such as mitosomes or hydrogenosomes. Thus, mitochondria are essential to understand the function and evolution of the eukaryotic cell.

1.2 Mitochondria as diverse organelles Mitochondria are organelles surrounded by a double membrane system and are found in virtually all eukaryotic cells. They show a striking heterogeneity in terms of their number, location, and shape in the different cell types and conditions [1]. In addition, mitochondria are constantly moving, fusing and dividing and can often be regarded as a highly interconnected network [2]. Regardless of their morphology, the structure of mitochondria defines various functional spaces within the organelle (Figure 1). Firstly, each of the two lipid bi‐layers that surround the

10 General introduction mitochondrion, the inner and outer mitochondrial membranes, has different permeability properties and contains a unique collection of proteins. Secondly, these membranes define two separate compartments, the mitochondrial matrix and the narrower inter‐membrane space, which constitute distinct microenvironments in which soluble proteins perform their functions. In addition, the inner membrane is usually highly convoluted with a series of invaginations, or cristae, that maximize its surface.

M L P N

ER

G

Figure 1 – Schematic representation of a eukaryotic cell. Surrounding the nucleus (N), and embedded in the cytoplasm, several organellar structures can be seen. These include the Endoplasmic Reticulum (ER), the Golgi apparatus (G), peroxisomes (P), lysosomes (L) and mitochondria (M). A mitochondrion is scaled‐up and illustrated in greater detail (see description in the text). Note that parts of the outer and inner membrane are not drawn to show the mitochondrial interior. (Drawn by T. Gabaldón)

The structure and number of these cristae also varies greatly between different tissues. For instance, cristae from brown fat tissue mitochondria display a lamellar structure [3], whereas those from steroidogenic tissue mitochondria are tubular [4]. In terms of their function, mitochondria are often described in the textbooks as the “power plants” of the cell. Indeed, they produce more than 80% of the ATP needed by a normal human adult [5]. In most eukaryotic organisms, the end products of glycolysis and other catabolic pathways are transported to mitochondria in order to be further oxidized in the krebs cycle and the oxidative phosphorylation (OXPHOS) pathway (Figure 2). OXPHOS pathway is often considered the hallmark of mitochondria and consists of the transfer of electrons from NADH or FADH2 to O2 through a series of membrane‐embedded protein complexes, the so‐called electron transport chain (ETC). The transfer of electrons through the ETC is coupled to the pumping of protons across the mitochondrial inner membrane and the generation of a membrane potential, which provides the required for the synthesis of ATP. The evolution of some components of this pathway, and specifically of the first

11 Chapter 1 multi‐protein complex of the ETC, the mitochondrial NADH:ubiquinone oxidoreductase or Complex I, is analyzed in detail in chapter 5 of the present thesis. Variations from the canonical aerobic mitochondria electron transport chain described above, can be found in various eukaryotic lineages. In some species, such as the yeasts Sacharomyces cerevisiae and Schizosaccharomyces pombe, some of the ETC components have been lost. Other lineages harbor fully anaerobically functioning mitochondria, which carry an alternative terminal oxidase that allows them to use substrates other than oxygen as final electron acceptor [6].

Figure 2 – Schematic representation of the mitochondrial electron transport chain. The five multi‐protein complexes (I to V) are embedded in the mitochondrial inner membrane. The electron flux from NADH or succinate to O2 is indicated by small arrows. Protons are pumped‐out from the mitochondrial matrix by Complexes I, III and IV and enter back through Complex V (long arrows crossing the membrane). (Drawn by T. Gabaldón)

These organisms can use an electron acceptor present in the environment, such as NO3‐, or an endogenously produced organic electron acceptor, such as fumarate. In addition, hydrogenosomes, which are evolutionary related to mitochondria, can be considered as highly specialized anaerobic mitochondria that produce hydrogen [7]. Finally, in some species in which the organelle’s main function is not the production of ATP the whole electron transport chain is lost. These highly derived mitochondria, which are known as mitosomes or relict mitochondria, have recently been described in several unicellular eukaryotes, including microsporidia [8], cryptosporidia [9], amoebozoans [10] and diplomonads [11]. Besides the OXPHOS pathway, mitochondria can compartmentalize many other pathways such as the urea cycle and fatty acid oxidation. The presence or absence in the mitochondria of such pathways shows a striking diversity among the different phyla. For instance, most of the fatty acid beta oxidation pathway is mitochondrial in human whereas in yeasts is peroxisomal [12]. Moreover,

12 General introduction mitochondria can play a pivotal role in many intermediate metabolism processes such as the synthesis of heme group [5], steroids [13], aminoacids, and iron‐sulfur clusters [14]. In addition, they harbor metabolic pathways that are specific for certain species or genus [5, 6, 15], such as folate synthesis in plant mitochondria [16]. Thus the role of mitochondria within the eukaryotic metabolism is manifold and not restricted to the generation of energy. Finally, mammalian mitochondria have recently been proven to be involved in regulatory and developmental processes such as calcium signaling, apoptosis and aging [5, 17]. In this context, it is not surprising that the disruption of mitochondrial pathways may lead to serious dysfunctions resulting in disease or even death.

1.3 Mitochondria‐related diseases Mitochondrial dysfunction can be fatal for human health and there is a growing list of mitochondria‐related diseases, most of which have no cure to date [17]. Mitochondrial diseases often affect tissues in which the role of mitochondria as energy‐producing organelle is crucial. These include, among others, neural and muscular tissues, kidney, liver and tissues from the endocrine and respiratory systems. Many of the mitochondria‐related diseases are caused by mutations in mitochondrially‐encoded genes and have therefore a typical maternal inheritance. However, a growing number of disease‐causing mutations are being described that affect nuclear genes coding for mitochondrial proteins. The list of mitochondria‐ related diseases include Parkinson´s [18], Alzheimer´s [19] and Huntington´s [20] diseases, as well as lateral sclerosis, Leigh syndrome [21] and Friedreich’s ataxia [22]. Through its role in the pancreatic beta cell and the secretion of insulin in response to glucose, mitochondrial impairment can also give rise to diabetes [17, 23]. Moreover mitochondria have been related to some types of deafness [24] and autism [25]. A more comprehensive survey of mitochondria‐related diseases can be found at the Mitochondrial Research Society website (http://www.mitoresearch.org/diseaselist.html), the recent review of Michael R. Duchen [17] and references therein. Given that the details of the molecular basis of many of these diseases are yet to be understood, and that the identification of possible key players may prove to be helpful for the understanding and treatment of mitochondrial diseases, the work performed during this thesis has always had as a parallel goal the identification of roles and partners of disease‐related proteins (see chapters 4 and 5).

1.4 Evolution of mitochondria and its connection to the origin of the eukaryotic cell According to the fossil record and some phylogenetic studies [26], the advent of the eukaryotes may have occurred more than 2 billion years ago. It is, however, largely unclear how the complexity of the eukaryotic cell arose from pre‐existent prokaryotic organisms and several hypotheses have been proposed. One such hypothesis [27] posits a chimeric model, in which a fusion between an archaeobacterium and a eubacterium would have created the eukaryotic cell. This model accounts for the findings that, in eukaryotes, informational genes (involved in transcription, translation and related processes) are usually most closely related to archaeal genes, whereas operational genes (involved in metabolic processes) are most

13 Chapter 1 closely related to bacterial genes. Nevertheless, the issue is far from being solved, with the exact scenario for the origin of the eukaryotes remaining elusive. The history of the different hypotheses on the origin of mitochondria, which I briefly recapitulate below, is tightly linked to major developments in fields such as microscopy, molecular biology and genomics. The discovery of mitochondria dates back to the second half of the nineteenth century, when several independent microscopic observations of granular bodies within the cell, which are likely to represent mitochondria, were reported. For instance, in 1859, Albert von Kölliker described conspicuous ʺgranulesʺ aligned between the striated myofibrils of muscle cells [28] and, forty years later, Walther Flemming observed ʺfilamentsʺ in the cytoplasm of several cell types [29]. However, it was the German pathologist Richard Altmann who, in 1890, first adventured an hypothesis on the origin of mitochondria [30]. Altmann postulated that these granules, which he named bioblasts, were independent units that corresponded to living microorganisms. Since then, the issue has remained a matter of speculation and hot debate among scientists, including some polemic events such as Ivan Wallin’s claim of having cultured mitochondria outside the cell [31]. The discussion revived in 1970, when Lyn Margulis formulated the endosymbiotic theory to explain the origin of several eukaryotic cell structures [32, 33], including mitochondria. Nowadays, with a growing body of evidence supporting it, the endosymbiotic origin of mitochondria is undisputed. The primary evidence for the bacterial origin of mitochondria is the presence of an organellar DNA (mtDNA), a small genome that has sequence similarities to bacterial DNA [34]. The sequence and phylogenetic analyses of the few genes that are encoded in the mitochondrial DNA have revealed that (1) all extant mitochondria are monophyletic, thus originating from a single endosymbiotic event and (2) the closest relatives of mitochondria among extant bacteria belong to the alpha subdivision of the proteobacteria [34, 35]. While the phylogeny of several mitochondrial genes suggests that mitochondria might have originated from within the order of Ricketsiales [36, 37] (Figure 3), other analyses point to different groups within the alpha‐proteobacteria [38].

Alpha‐proteobacteria are a broad and diverse group with genomes encoding from 800 to more than 8,000 proteins. They display a broad range of life‐styles from intracellular parasites to free‐living soil bacteria and thus it is difficult to infer the appearance of the mitochondrial ancestor just by studying modern alpha‐ proteobacteria. The question of what the mitochondrial ancestor looked like can be re‐formulated as to what extent did the proto‐mitochondrion resemble modern alpha‐proteobacteria. An attempt to answer this question is presented in chapters 3 and 6.

14 General introduction

Figure 3: Phylogenetic relationship among mitochondria and bacteria inferred from the analysis of LSU rRNA sequences. Bootstrap supports (%) values for the Maximum Likelihood, Distance based and Maximum Parsimony trees, respectively. Based on this and other phylogenies, Viktor Emelyanov proposed the so‐called full canonical pattern of mitochondrial ancestry, in which the order of emergence of the different taxa is: free‐living alphaproteobacteria, Rickettsia‐like endosymbiont (RLE), Rickettsiaceae and, finally, mitochondria. The figure is taken from [36] and reproduced with the kind permission of V. Emelyanov.

The issue of describing the mitochondrial ancestor is important because it can shed light on the circumstances by which the original relationship between the endosymbiont and its host was established. The mitochondrial endosymbiotic theory, as proposed by Margulis [33], argued that the initial rationale for the mitochondrial endosymbiosis was based on the mutually beneficial exchange of ATP and glycolisis end‐products between the host and the endosymbiont. However, this idea was questioned when the initial phylogenetic analyses on the mitochondrial ATP/ADP exchanger revealed a eukaryotic origin of this transporter. Two recent hypotheses propose alternative scenarios for the origin of mitochondria, one in which the proto‐mitochondrion would have been an oxygen scavenger [39] and the other in which it would have been a hydrogen‐producing, facultatively anaerobic species [40]. The latter, also known as “the hydrogen hypothesis for the first ”, considers that mitochondrial endosymbiosis was the original event that

15 Chapter 1 created the eukaryotic cell. This is opposed to the classic view of eukaryotic evolution, in which the advent of mitochondria followed the origin of the eukaryotic cell, thus assuming that the host for the mitochondrial endosymbiont was a proto‐ eukaryotic cell displaying already many eukaryotic cell features except for the absence of mitochondria. This view was mainly based on the supposed existence of amitochondriate eukaryotes that were thought to have diverged from the main eukaryotic stream prior to the acquisition of mitochondria [41]. However, this view was challenged by the advent of molecular phylogeny, which indicated that amitochondriate organisms were polyphyletic, as well as by the detection through microscopy of either hydrogenosomes or mitosomes within their cells. The existence of truly amitochondriate organisms is now questioned as the accumulating evidence suggests that all extant eukaryotes once possessed mitochondria. Indeed, double membrane organelles derived from mitochondria, such as hydrogenosomes or mitosomes, have been found in all of the eukaryotic organisms carefully examined to date. Moreover, these organelles, which are mutually exclusive, appear to be evolutionary related [42]. Taking all this into account, the emerging picture is that of the mitochondrial endosymbiosis occurring very early, if not at the origin, in the eukaryotic evolution.

2. Using comparative genomics to study the function and evolution of the mitochondrial proteome Evolution is a difficult process to study experimentally. It is usually not fast enough to be observed directly and only exceptionally is it possible to find physical evidence, such as fossils or ancient DNA, of past states. This is specially true for the study of very ancient events such as the origin of mitochondria. Fortunately, evolution leaves its footprint in the genomes and the distribution of traits among living organisms. By studying this footprint, one can infer the evolution of organisms through the successive splitting of ancestral lineages, a process depicted in phylogenetic trees. Once a phylogenetic tree is reconstructed, is it also possible to reconstruct the evolutionary history of individual traits of interest. The use of phylogenetic methods and ancient‐state inference have been central for the present thesis. For instance a novel large‐scale phylogenetic (phylogenomic) approach was used to reconstruct the ancestral mitochondrial proteome (Chapter 3 and 6), and more detailed phylogenetic analyses, as well as parsimonious reconstruction of ancestral character states were used to trace the evolution of specific pathways or proteins such as the NADH:Ubiquinone oxidoreductase complex (Chapter 5). Therefore a brief introduction to the phylogenetic methods used may be useful to the reader. Other bioinformatics methods that have been used in the present thesis, such as the use of genomic context to infer protein function, are extensively described in Chapter 2 and the material and methods sections of the remaining chapters.

2.1 Phylogenetics and the study of protein evolution The study of the evolution of a protein usually starts with the detection of other members of its family. This is done by comparing the sequence of the protein of interest with other sequences stored in the databases, and subsequently selecting the

16 General introduction hits that are significantly similar. The assumption is that proteins with similar sequences are derived from a common ancestral protein. In other words, they are considered to be homologous proteins [43, 44]. Several algorithms have been developed that allow efficient automatic detection of homologous proteins in large databases. These include, among others, Smith‐Waterman [45], Psi‐Blast [46] and HMMER [47] algorithms. Once the sequences of a protein family are retrieved, they can be aligned. The alignment of multiple sequences basically aims to place ‘homologous’ residues of different proteins on top of each other. This constitutes a crucial step for phylogenetic reconstructions because it is assumed that all positions in a column of a multiple sequence alignment derive from a common ancestral residue. Here again, the variation of algorithms, such as ClustalW [48] or MUSCLE [49], is broad. Besides phylogenetic analyses, multiple sequence alignments are useful for many other bioinformatics methods, such as the detection of domains and motifs and efficient database searches. By applying a specific evolutionary model to explain the aminoacid substitutions observed in the multiple sequence alignment, the evolutionary distances between all pairs of proteins can be computed. This evolutionary distance, which reflects the expected mean number of changes per site that have occurred since two sequences diverged from their common ancestor, is used by the so‐called distance‐methods for phylogenetic inference. One of this methods is Neighbor Joining (NJ), which constitutes a good and fast heuristic algorithm that estimates the “minimal evolution” tree, a phylogenetic tree which minimizes the sum of the lengths (evolutionary distances) of all its branches [50]. NJ and variations of it have been extensively used in the work presented here and have long been proven to be quite efficient in finding the “right” tree topologies, given a set of homologous sequences [51, 52]. Compared to other methods, NJ has the advantage of being very fast, which allows the construction of large trees including hundreds of sequences, and therefore it has been used in the large‐scale approaches presented here. A different approach for phylogenetic inference is that of Maximum Likelihood (ML) [53]. Here, the concept of likelihood refers to the probability that a certain tree with a set of parameters (topology, branch‐lenghts, etc) produces, assuming a specific evolutionary model, a given set of data (sequences). The ML‐ methods try to find the tree with the maximal likelihood to have produced the given set of data. However computing the likelihood of all possible trees for a decent number of proteins is a very computationally‐intense task and becomes unfeasible for large sets of sequences. Therefore, all practical methods rely on heuristics that are able to reduce the search‐space and find good sub‐optimal trees in a reasonable time. For instance PhyML [54] uses a simple hill‐climbing algorithm to optimize a seed NJ‐ tree whereas MrBayes [55] uses bayesian inference with a Markov Chain Monte Carlo (MCMC) algorithm. Yet another type of phylogenetic inference is that of Maximum Parsimony (MP) [56]. This method has not been used as a method for protein phylogeny inference in the present thesis because, contrary to ML and NJ methods, MP approach does not allow the correction for multiple mutations per site and it is more prone to the so‐called long‐branch attraction effect. However, Maximum Parsimony has been used at a different level, to trace the evolution of presence/absence of genes

17 Chapter 1 in a set of genomes (Chapter 4) or proteins in a complex (chapter 5), based on the observed distribution of traits. In the genomic era it has been possible to move from the evolutionary analysis of single protein families to that of complete genomes and proteomes By deriving the phylogeny of every single protein encoded in a genome, one obtains a set of phylogenetic trees, which is called the phylome [57]. The value of complete phylomes is great since it allows us to screen for genes that may have unusual origins, such as genes originated by lateral gene transfers (LGT), or genes derived from a specific evolutionary event, such as the ones derived from the mitochondrial and plastidial endosymbiosis.

3. Outline of the present thesis This PhD thesis focuses on the origin and evolution of the mitochondrial proteome, with a special emphasis on the evolution of disease‐related pathways and its applications for protein function prediction. Chapter 2 presents an overview of the context‐based function‐prediction techniques that had been applied during the present thesis. In Chapter 3 the first metabolic reconstruction of the proto‐ mitochondrion is presented. We used the fact that mitochondria are derived from the endosymbiosis of an alpha‐proteobacterium and the availability of several fully‐ sequenced alpha‐proteobacterial genomes to develop a new large‐scale phylogenetic approach that allowed the reconstruction of the ancestral proto‐mitochondrial proteome and, therefore, its metabolism. The results are consistent with a multifaceted benefit for the ancestral host and the evidence that the proto‐ mitochondrial contribution to the modern eukaryotic metabolism was much larger than previously thought, and not restricted to mitochondria. In Chapter 4 we analyze the co‐evolution of the proto‐mitochondrially derived proteins among the different eukaryotic lineages, and combine this information with other function‐prediction approaches to predict functional interactions among them. These include, among others, the prediction of a functional interaction between a human protein of unknown function and Complex I. Chapter 5 focuses on the evolution of the first component of the mitochondrial respiratory chain the NADH:Ubiquinone oxidoreductase (Complex I) and, again, this information is used to predict new interaction‐partners. The availability of new fully‐sequenced alpha‐proteobacterial and eukaryotic genomes, combined with the publication of large proteomics sets of human [58] and yeast [59] mitochondria allowed us to reconstruct two modern mitochondrial metabolisms and compare them with an updated version of the inferred proto‐mitochondrial metabolism (Chapter 6). Chapter 7 recapitulates our results and those of others and provides an overview of our current understanding on how the mitochondrial proteome evolved. As we mentioned above, a major result from our analyses is that many proto‐mitochondrial proteins were eventually re‐ targeted to the cytoplasm and other compartments. One of such compartments is the peroxisome, which apparently has recruited a significant part of its proteome from the mitochondrion. A phylogenetic analysis on the rat and yeast peroxisomal proteome is shown in chapter 8, together with its implications on the evolutionary origin of this organelle. The dissertation is closed by a summarizing discussion in chapter 9.

18 General introduction

Finally the appendix includes a related study [7], which focuses on the hydrogenosomes of the ciliate Nyctotherus ovalis and their evolutionary relationship with mitochondria. A combination of bioinformatics and experimental approaches shows that these organelles are the result of a secondary adaptation of mitochondria. My participation in this joint project consisted in the prediction of genes encoded by the hydrogenosomal genome of N. ovalis, the reconstruction of their phylogenies and the detection of orthologs of mitochondrial proteins in a set of expressed sequenced tags (ESTs) from this species.

References

1 Rube, D. A. and van der Bliek, A. M. 2004. ʺMitochondrial morphology is dynamic and variedʺ Mol Cell Biochem (256‐257) 1‐2:331‐9 2 Collins, T. J., et al. 2002. ʺMitochondria are morphologically and functionally heterogeneous within cellsʺ Embo J (21) 7:1616‐27 3 Perkins, G. A., et al. 1998. ʺElectron tomography of mitochondria from brown adipocytes reveals crista junctionsʺ J Bioenerg Biomembr (30) 5:431‐42 4 Prince, F. P. and Buttle, K. F. 2004. ʺMitochondrial structure in steroid‐producing cells: three‐dimensional reconstruction of human Leydig cell mitochondria by electron microscopic tomographyʺ Anat Rec A Discov Mol Cell Evol Biol (278) 1:454‐ 61 5 Scheffler, I. E. 2001. ʺMitochondria make a come backʺ Adv Drug Deliv Rev (49) 1‐2:3‐ 26 6 Tielens, A. G., et al. 2002. ʺMitochondria as we donʹt know themʺ Trends Biochem Sci (27) 11:564‐72 7 Boxma, B., et al. 2005. ʺAn anaerobic mitochondrion that produces hydrogenʺ Nature (434) 7029:74‐9 8 Williams, B. A., et al. 2002. ʺA mitochondrial remnant in the microsporidian Trachipleistophora hominisʺ Nature (418) 6900:865‐9 9 Riordan, C. E., et al. 2003. ʺCryptosporidium parvum Cpn60 targets a relict organelleʺ Curr Genet (44) 3:138‐47 10 Tovar, J., Fischer, A. and Clark, C. G. 1999. ʺThe mitosome, a novel organelle related to mitochondria in the amitochondrial parasite Entamoeba histolyticaʺ Mol Microbiol (32) 5:1013‐21 11 Tovar, J., et al. 2003. ʺMitochondrial remnant organelles of Giardia function in iron‐ sulphur protein maturationʺ Nature (426) 6963:172‐6 12 van Roermund, C. W., et al. 2003. ʺFatty acid metabolism in Saccharomyces cerevisiaeʺ Cell Mol Life Sci (60) 9:1838‐51 13 Miller, W. L. 1995. ʺMitochondrial specificity of the early steps in steroidogenesisʺ J Steroid Biochem Mol Biol (55) 5‐6:607‐16 14 Lill, R. and Kispal, G. 2000. ʺMaturation of cellular Fe‐S proteins: an essential function of mitochondriaʺ Trends Biochem Sci (25) 8:352‐6 15 Ohta, S. 2003. ʺA multi‐functional organelle mitochondrion is involved in cell death, proliferation and diseaseʺ Curr Med Chem (10) 23:2485‐94 16 Neuburger, M., et al. 1996. ʺMitochondria are a major site for folate and thymidylate synthesis in plantsʺ J Biol Chem (271) 16:9466‐72 17 Duchen, M. R. 2004. ʺMitochondria in health and disease: perspectives on a new mitochondrial biologyʺ Mol Aspects Med (25) 4:365‐451 18 Corti, O., et al. 2005. ʺParkinsonʹs disease: from causes to mechanismsʺ C R Biol (328) 2:131‐42

19 Chapter 1

19 Zhu, X., et al. 2004. ʺMitochondrial failures in Alzheimerʹs diseaseʺ Am J Alzheimers Dis Other Demen (19) 6:345‐52 20 Grunewald, T. and Beal, M. F. 1999. ʺBioenergetics in Huntingtonʹs diseaseʺ Ann N Y Acad Sci (893) 203‐13 21 Shadel, G. S. 2004. ʺCoupling the mitochondrial transcription machinery to human diseaseʺ Trends Genet (20) 10:513‐9 22 Napier, I., Ponka, P. and Richardson, D. R. 2005. ʺIron trafficking in the mitochondrion: novel pathways revealed by diseaseʺ Blood (105) 5:1867‐74 23 Schapira, A. H. 2000. ʺMitochondrial disordersʺ Curr Opin Neurol (13) 5:527‐32 24 Pickles, J. O. 2004. ʺMutation in mitochondrial DNA as a cause of presbyacusisʺ Audiol Neurootol (9) 1:23‐33 25 Lombard, J. 1998. ʺAutism: a mitochondrial disorder?ʺ Med Hypotheses (50) 6:497‐ 500 26 Hedges, S. B., et al. 2004. ʺA molecular timescale of eukaryote evolution and the rise of complex multicellular lifeʺ BMC Evol Biol (4) 1:2 27 Gupta, R. S. and Golding, G. B. 1996. ʺThe origin of the eukaryotic cellʺ Trends Biochem Sci (21) 5:166‐71 28 Kölliker, A. 1856. Zeitschrift für wissenschaftl Zoologie (VIII) 311‐318 29 Flemming, W. 1899. ʺUber Zell Strukturenʺ Anat. Anz. (16) 2‐12 30 Altmann, R. 1890 ʺDie Elementarorganismen und ihre Beziehungen zu den Zellen.ʺ edited by Veit & Comp Liepzig 31 Wallin, I. E. 1927 ʺSymbionticism and the origin of speciesʺ edited by Williams and Wilkins Baltimore 32 Margulis, L. 1970 ʺThe origin of the eukaryotic cellʺ edited by Yales University Press New HAven 33 Margulis, L. 1981 ʺSymbioses in Cell Evolutionʺ edited by W. H. Freeman W.H. Freeman San Francisco 34 Gray, M. W., Burger, G. and Lang, B. F. 2001. ʺThe origin and early evolution of mitochondriaʺ Genome Biol (2) 6:REVIEWS1018 35 Gray, M. W., Burger, G. and Lang, B. F. 1999. ʺMitochondrial evolutionʺ Science (283) 5407:1476‐81 36 Emelyanov, V. V. 2001. ʺEvolutionary relationship of Rickettsiae and mitochondriaʺ FEBS Lett (501) 1:11‐8 37 Emelyanov, V. V. 2003. ʺCommon evolutionary origin of mitochondrial and rickettsial respiratory chainsʺ Arch Biochem Biophys (420) 1:130‐41 38 Esser, C., et al. 2004. ʺA genome phylogeny for mitochondria among alpha‐ proteobacteria and a predominantly eubacterial ancestry of yeast nuclear genesʺ Mol Biol Evol (21) 9:1643‐60 39 Kurland, C. G. and Andersson, S. G. 2000. ʺOrigin and evolution of the mitochondrial proteomeʺ Microbiol Mol Biol Rev (64) 4:786‐820 40 Martin, W. and Müller, M. 1998. ʺThe hydrogen hypothesis for the first eukaryoteʺ Nature (392) 6671:37‐41 41 Muller, M. 1992. ʺEnergy metabolism of ancestral eukaryotes: a hypothesis based on the biochemistry of amitochondriate parasitic protistsʺ Biosystems (28) 1‐3:33‐40 42 Dyall, S. D. and Johnson, P. J. 2000. ʺOrigins of hydrogenosomes and mitochondria: evolution and organelle biogenesisʺ Curr Opin Microbiol (3) 4:404‐11 43 Fitch, W. M. 1970. ʺDistinguishing homologous from analogous proteinsʺ Syst Zool (19) 2:99‐113 44 Fitch, W. M. 2000. ʺHomology a personal view on some of the problemsʺ Trends Genet (16) 5:227‐31

20 General introduction

45 Smith, T. F. and Waterman, M. S. 1981. ʺIdentification of common molecular subsequencesʺ J Mol Biol (147) 1:195‐7 46 Altschul, S. F. and Koonin, E. V. 1998. ʺIterated profile searches with PSI‐BLAST‐‐a tool for discovery in protein databasesʺ Trends Biochem Sci (23) 11:444‐7 47 Eddy, S. R. 1998. ʺProfile hidden Markov modelsʺ Bioinformatics (14) 9:755‐63 48 Thompson, J. D., Higgins, D. G. and Gibson, T. J. 1994. ʺCLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position‐specific gap penalties and weight matrix choiceʺ Nucleic Acids Res (22) 22:4673‐80 49 Edgar, R. C. 2004. ʺMUSCLE: a multiple sequence alignment method with reduced time and space complexityʺ BMC Bioinformatics (5) 1:113 50 Saitou, N. N., M. 1987. ʺThe neighbor‐joining method: a new method for reconstructing phylogenetic treesʺ Mol Biol Evol (4) 4:406‐25 51 Takahashi, K. and Nei, M. 2000. ʺEfficiencies of fast algorithms of phylogenetic inference under the criteria of maximum parsimony, minimum evolution, and maximum likelihood when a large number of sequences are usedʺ Mol Biol Evol (17) 8:1251‐8 52 Kuhner, M. K. and Felsenstein, J. 1994. ʺA simulation comparison of phylogeny algorithms under equal and unequal evolutionary ratesʺ Mol Biol Evol (11) 3:459‐68 53 Schadt, E. E., Sinsheimer, J. S. and Lange, K. 1998. ʺComputational advances in maximum likelihood methods for molecular phylogenyʺ Genome Res (8) 3:222‐33 54 Guindon, S. and Gascuel, O. 2003. ʺA simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihoodʺ Syst Biol (52) 5:696‐704 55 Ronquist, F. and Huelsenbeck, J. P. 2003. ʺMrBayes 3: Bayesian phylogenetic inference under mixed modelsʺ Bioinformatics (19) 12:1572‐4 56 Felsenstein, J. 1996. ʺInferring phylogenies from protein sequences by parsimony, distance, and likelihood methodsʺ Methods Enzymol (266) 418‐27 57 Eisen, J. A. and Fraser, C. M. 2003. ʺPhylogenomics: intersection of evolution and genomicsʺ Science (300) 5626:1706‐7 58 Cotter, D., et al. 2004. ʺMitoProteome: mitochondrial protein sequence database and annotation systemʺ Nucleic Acids Res (32 Database issue) D463‐7 59 Sickmann, A., et al. 2003. ʺThe proteome of Saccharomyces cerevisiae mitochondriaʺ Proc Natl Acad Sci U S A (100) 23:13207‐12

21 Chapter 2

I argue that the theory of evolution does not make predictions, (…..), but is instead a logical formula which can be used only to classify empiricisms and to show the relationships which such a classification implies. The essence of the argument is that these theories are actually tautologies and, as such, cannot make empirically testable predictions.

Peters, R. H., (Tautology in Evolution and Ecology) American Naturalist, vol. 110 (January/February 1976)

22 Prediction of protein function and pathways

Chapter 2

Prediction of protein function and pathways in the genome era

Toni Gabaldón and Martijn A. Huynen

Cell. Mol. Life Sci. 2004 Apr;61(7‐8):930‐44

23 Chapter 2

Abstract The growing number of completely sequenced genomes adds new dimensions to the use of sequence analysis to predict protein function. Compared to the classical knowledge transfer from one protein to a similar sequence (homology‐ based function prediction), knowledge about the corresponding genes in other genomes (orthology‐based function prediction) provides more specific information about the protein’s function while the analysis of the sequence in its genomic context (context‐based function prediction) provides information about its functional context. Whereas homology‐based methods predict the molecular function of a protein, genomic context methods predict the biological process in which it plays a role. These complementary approaches can be combined to elucidate complete functional networks and biochemical pathways from the genome sequence of an organism. Here we review recent advances in the field of genomic‐context based methods of protein function prediction. Techniques are highlighted with examples including an analysis that combines information from genomic‐context with homology to predict a role of the RNase L inhibitor (RLI) in the maturation of ribosomal RNA.

Introduction Since the completion of the first bacterial genome, that of Haemophilus influenzae in 1995 [1], the number published genome sequences has been growing exponentially [2] (Figure 1). By the time we are writing this review there are 144 fully sequenced genomes (excluding viral and organellar ones), of which 18 are of eukaryotic species, with at least an additional 134 prokaryotes and 33 eukaryotes in the pipeline [3]. The completion of a new genome sequence is followed by a process known as genome annotation to predict, among others, its protein coding regions and, to the extent that that is possible, their functions. This assigning of functions to predicted genes constitutes a major goal in the genomic era. Despite the development of new advances in experimental techniques such as DNA microarrays [4, 5], yeast two‐hybrid system [6], RNAi [7, 8] or large scale systematic deletions [9], experimental characterization of proteins to elucidate their function lags far behind the availability of new sequences, and the annotation of newly sequenced genomes relies mostly on computational methods. The most ancient and straightforward computational method for assigning function to a protein is based on the detection of homologs with known function (homology‐based function prediction). 70% to 90% of the genes of a genome have homologs in other species [10] and these fractions will likely increase as new genomes are sequenced.

24 Prediction of protein function and pathways

Figure 1: Cumulative number of fully sequenced genomes deposited in the public databases. Fraction of bacterial, archaeal and eukaryotic species are indicated with different grey scales. Data for year 2003 only represents fully sequenced genomes by August.

However there is no clear functional prediction for about a 40% of genes in most genomes [11] and, for many of the rest, the predictions that can be made are only very general. As the number and variability of sequenced genomes were growing, so did the ways of exploiting genomics data to predict protein function. Firstly they have lead to the development of methods for large‐scale orthology detection, allowing more specific function prediction than “just” homology. Secondly they allowed the development of methods, known as genomic context [12] or non‐ homology [13] methods, that exploit information about the relations between genes on the genome such as gene fusions, chromosomal proximity, their distribution across species or conserved gene co‐expression to predict functional interactions between their proteins (Figure 2). Here we review the conceptual and technical advances in function prediction based on the above methods.

25 Chapter 2

Figure 2: Main types of genomic context associations between two genes of a certain genome (centre of the figure). a) Gene fusion: both sequences are encoded in a single gene in another genome. b) Genomic neighborhood: both genes are close in the chromosome in several distantly related species, generally prokaryotes. c) 1.Similar phylogenetic pattern: both genes have a similar pattern of presence/absence across species or 2. complementary phylogenetic pattern d) Conserved co‐expression: both genes have a similar pattern of expression under different conditions and this co‐expression is conserved between species.

From homology‐based to orthology‐based function prediction Homologous proteins are proteins derived from a common ancestral sequence [14], they have a similar 3D structure and are likely to perform a similar function [15], at least at the molecular level. This is the basis of homology‐based function prediction, in which one infers the function of a protein by extrapolating the knowledge from its experimentally characterized homologs. Initial characterization of new protein sequences starts by searching a protein database like SWISS‐PROT [16] with algorithms like Smith‐Waterman [17] and its faster approximation BLAST [18], to detect experimentally characterized homologous sequences and obtain their functions. The sensitivity of such homology searches has been more than doubled [19] by profile‐based methods such as PSI‐BLAST [18] or Hidden Markov Models [20]. The genome era has had two main effects on homology‐based function prediction. First of all, more sequences allow us to make better sequence profiles and have led to the development of domain databases like SMART [21] and PFAM [22], and to collections of those like InterPro [23]. For a recent review of the challenges and opportunities of protein domain analysis in the genome era see [24]. Secondly, and more importantly from a conceptual point of view, we can now do function prediction at a higher level of resolution, that of orthologous relations between genes [14]. Orthology is, like homology, primarily an evolutionary concept, and not a functional one. Sequences are orthologous when their independent evolution reflects

26 Prediction of protein function and pathways a speciation event rather than a gene duplication event: i.e. they were one gene at the moment of speciation (Figure 3).

Orthology is relevant for function prediction as orthologs are, relative to paralogs, more likely to perform the same function. Orthology is essential to genome comparison because it allows us to compare genomes in terms of their gene content. It is therewith also essential for methods that predict functional interactions between proteins based on the comparison of genomes’ gene content (see below). Aside from being imperative for comparing genomes, orthology also relies on having complete genomes: techniques for large‐scale orthology prediction like “best bi‐directional hits” [26], and multiple‐genome extensions thereof like the COG database [27] depend on knowing the similarity levels between all genes in the genomes that are compared. The large‐scale prediction of orthologous groups of proteins using such best‐hit approaches is far from trivial. Aside from the technical issues like homology detection, gene fusion and fission, and highly variable rates of evolution, there is the conceptual issue of how to handle gene duplication [25] (Figure 3), which is rampant in eukaryotes [28].

Figure 3: Orthology, paralogy and the conceptual issues when more than 2 species are compared. A parsimonious explanation of the tree of Gamma‐butyrobetaine, 2‐oxoglutarate dioxygenase (BODG) and its close homologs H. sapiens, D. melanogaster and S.cerevisiae is that there has been one gene duplication in the metazoa, and further duplications in Drosophila. The yeast gene can be considered orthologous to all the other genes in this tree, yet the human genes BODG (the last step of carnitine biosynthesis) and Q9NVH6 (the first step of carnitine biosynthesis) are not orthologous to each other (marked with a cross) because their independent evolution reflects a gene‐duplication event at the root of the metazoan. Strictly speaking, orthology, in contrast to homology, is therefore non‐transitive (if A is orthologous to B and B is orthologous to C, A and C are not necessarily orthologous to each other). In a phylogeny based orthology database that includes all eukaryotes would consider all these

27 Chapter 2

genes to be part of one orthologous group as they were all one gene in the last common ancestor of the fungi and metazoa. A taxon specific orthology database, e.g one that is specific for the metazoa, would classify QNVH6 and BODG in different orthologous groups, providing a higher resolution for function prediction. Example taken from[25].

Orthology databases that implicitly or explicitly “trace‐back” to the last common ancestor of life, in which all genes that were one gene in the last common ancestor are considered orthologous to each other, necessarily have a low level of evolutionary and functional resolution. The recent update of the COG database has a separate set of eukaryotic orthologous groups (KOGs), and has a much higher level of resolution for the eukaryotes than the original COG database [29]. Nevertheless, orthology determination by best‐hit approaches is more prone to errors than the classical method of inferring orthology from phylogenetic trees, especially when there is variation in the rate of sequence evolution within an orthologous group [30]. Although the evidence that it leads to better function prediction is still anecdotal, phylogenetic analyses for orthology prediction should in principle improve function prediction [30‐32], and there are promising steps to implement these methods on a large scale [33‐36].

Genomic‐context based function prediction Apart from symbiotic or parasitic cases, being encoded in the same genome is a pre‐requisite for two proteins to interact. It can therewith in principle also be used to predict interactions, but the information that two genes are encoded in the same genome provides of course only a very weak signal that they interact. A number of methods that use genomics comparison to predict functional interactions between proteins increase that signal by (a combination of) three strategies. 1) Detecting a more direct association of the genes on the genome, e.g. a close physical association of the genes on the genome or the similar performance of two genes in genome‐wide experiments, 2) Detecting evolutionary conservation of that association between species “horizontal comparative genomics”, and 3) Detecting the same association in different types of genomic context “vertical comparative genomics”. The type of information that such “genomic‐context” based prediction provides is qualitatively different from that of the homology‐based one, because proteins that are part of the same biological process tend to occur in each other’s genomic context, regardless of their sequence similarity. The type of information it provides is also very different from homology searches. While the identification of domains or homologs of a protein can give us a clue about its molecular function, the analyses of its genomic context instead allow us to identify interacting partners and the biological process in which is playing a role.

Gene fusion The finding of two or more proteins encoded by separate genes of which orthologs in a different species are encoded in a single gene, reveals a gene fusion or gene fission event [37]. This is the most direct form of genomic context and, from a functional point of view, the fusion of two proteins can result in an enhancement of

28 Prediction of protein function and pathways the interaction between their respective biochemical activities to facilitate, for example, the channelling of a substrate [38]. This process has already been observed for several enzymes, and the inference of a functional link between two fused genes has been intuitively used on a small scale for many years, the most widely known example being the fusion of alpha and beta subunits of Tryptophan synthetase in fungi [39]. An extension of this intuitive approach to the analyses of complete genomes was introduced in 1999 by Marcotte et al [13] and Enright et al [40]. By showing that many of the observed gene fusions events involved genes known to functionally interact, they proposed detecting gene fusions to predict interactions on a large scale. In concordance with the above mentioned substrate channelling effect, most of the observed fusions events involve metabolic enzymes, although the fusions do not always involve subsequent steps in the pathway [40, 41], in Escherichia coli three quarters of the total of gene fusions affect metabolic genes [42]. More recently a comparative study of 30 microbial genomes revealed that on average as much as 72 % of annotated genes linked by fusion events belong to the same functional category [41].

Although gene fusion can be considered a rare event when comparing few genomes, the number of genes linked by fusions is likely to increase with the number of genomes that are compared: In the studies mentioned above we observe a ~114 fold increase from the 88 gene fusions events detected by comparing three bacterial genomes [40] to the 10,073 events when comparing thirty [41]. Not all of the detected gene fusions are equally informative and one should be aware of the existence of promiscuous domains, like the ones involved in signal transduction, that tend to appear in a variety of functional and genomic contexts [13].

29 Chapter 2

Figure 4: (Note: a color version of this figure is shown in appendix 2) Gene fusions within the Tryptophan synthesis pathway: A) L-Tryptophan synthesis pathway and enzymatic activities associated to each step: E (yellow): Anthranilate/para-aminobenzoate component I. G (red): Anthranilate/para- aminobenzoate component II. D (blue): Anthranilate phosphorybosil transferase. F (pink): Phosphorybosilanthranilate isomerase. C (green): Indole-3-glycerol phosphate synthase. A (cream): Tryptophan synthase alpha chain. B (brown): Tryptophan synthase beta chain. B) Gene fusions observed in several fully sequenced genomes as provided by STRING database [43], only one species for each fusion is chosen as an example. C) Functional network derived from the fusion events within genes of the Tryptophan synthesis pathways, enzymatic functions are linked if they are observed in the same polypeptidic chain in at least one genome.

One interesting property of the gene fusion approach is its transitiveness, which allows expanding the functional association to larger groups of genes that are interconnected. In other words, if gene B is fused with gene A in one genome and with gene C in another genome, than A, B and C form a functional network. Once again the tryptophan synthesis pathway can serve us to illustrate this example [44]. Whereas in E. coli we find a fusion between trpG and trpD and between trpC and trpF, in Saccharomyces cerevisiae we can detect, apart from the above‐mentioned trpA‐ trpB fusion in tryptophan synthetase, a fusion involving trpG and trpC. From these pairwise relationships we can infer that trpD, trpG, trpC and trpF are part of the same functional network, in this case the Tryptophan synthesis pathway. Extending the analyses to more genomes all the proteins in the Tryptophan synthesis pathway can be related by fusion events (Fig 4).

30 Prediction of protein function and pathways

Chromosomal proximity in prokaryotes The first pairwise genome‐wide sequence comparisons revealed that even closely related species lack large scale conservation of gene order [26, 45‐47], indicating that in the course of evolution genomes are rapidly rearranged and shuffled. Yet in prokaryotes some clusters of genes appear conserved in evolution, including the relative location of the genes within them, over large evolutionary distances. Further inspection of these genes revealed that they tend to encode proteins that functionally interact [48, 49], and that they tend to be part of the same operon [50]. As in the case of gene fusion, since conservation of chromosomal proximity has functional meaning it can be used to predict functional interaction between the components of conserved gene clusters. This was proposed in 1998 by Overbeek et al [49, 51] and Dandekar et al [48], by measuring conservation of genes in runs (sets of genes encoded in the same strand and separated by less than 300 bases) and conservation of neighboring genes respectively.

Chromosomal proximity in eukaryotes Although operons are typically bacterial, some eukaryotes also use proximity in the genome to coordinate regulation [52, 53]. An analysis of gene clustering in five eukaryotic genomes [54] revealed that 30% (in Drosophila melanogaster) to 98% (in Saccharomtces cerevisiae) of the pathways in KEGG show a significant clustering of their genes on the chromosomes. Furthermore, gene‐order conservation between S. cerevisiae and Candida albicans appears, at least for divergently transcribed genes, to be correlated with co‐expression [55‐57]. The signals for function prediction in gene order and gene order conservation in eukaryotes are however weak, and have not been employed for function prediction, although it can be argued that this is in part the result of the still relatively small number of sequenced eukaryotic genomes. Prokaryotic chromosomal proximity can be used for eukaryotic proteins with homologs in prokaryotic species, as in the case of the identification of the human Methylmalonyl‐CoA racemase [58]. Previous to its biochemical characterization, the function of this gene was first inferred based on the conserved chromosomal neighborhood of its prokaryotic homologs with genes involved in propionyl‐CoA metabolism.

Similar phylogenetic distribution Although the fact that two genes are encoded together in one genome provides only a very weak signal that they to interact, when they are encoded in a considerable number of genomes, and are both absent from others this signal becomes strong enough for function prediction. This technique, called gene co‐ occurrence of phylogenetic profiles, was proposed [26, 59] and verified by studies showing that proteins with a similar distribution across species have a high tendency to functionally interact [26, 59‐61]. Distribution across species is usually expressed by means of a phylogenetic pattern or profile: a string of letters or numbers that describe the presence or absence of a given gene in a set of genomes, and then detecting genes with a similar profile. Distances between profiles can be measured using a simple count of differences (Hamming distance) or more sophisticated scores like mutual information [62], or Pearson correlation coefficient [63].

31 Chapter 2

A typical example of a successful function prediction using phylogenetic distribution is the one of the frataxin gene. Although the mutation in this gene was known to cause the human neurodegenerative disease Friedreich´s ataxia [64], its molecular function remained unclear. In 2001 Huynen and coworkers [65] indicated that frataxin had the same phylogenetic distribution (Figure 5) as several iron‐sulfur cluster protein assembly genes, suggesting a role in the same process for frataxin. The experimental confirmation that frataxin is actually involved in this process came a year later [66‐68].

Figure 5: Phylogenetic distribution of Frataxin (cyaY) and that of hscB and hscA: grey and white boxes indicate respectively presence and absence of the genes in a certain species (names on the right). The species phylogeny is indicated with thick and thin lines respectively indicating presence or absence of the genes

Because of its large potential for function prediction, the use of phylogenetic patterns to predict protein interactions is continuously undergoing technical improvements. A recent variation includes the use of phylogenetic patterns of neighboring gene pairs [69], this combination of gene neighborhood and chromosomal proximity was shown to be more accurate than the single‐gene phylogenetic profile, at a cost of coverage. Other modifications attempt to filter out the phylogenetic bias in the sequenced genomes (genomes of certain taxa are

32 Prediction of protein function and pathways overrepresented) by using evolutionary information to measure the distance between profiles [70] or collapsing into a single node parts of the profile that represent related species that share the presence or absence of a certain gene [71].

Complementary phylogenetic distribution A reverse use of phylogenetic profiles to predict function is the identification of proteins with complementary or anti‐correlated profiles [10] to detect non‐ orthologous gene displacements [72]. Cases of experimentally confirmed function prediction are that of a new Thymidilate synthase [73], and of seven enzymes involved in Thiamine biosynthesis [74]. In general, the detection of non‐orthologous gene displacement by complementary phylogenetic profiles is combined with gene‐ order conservation to increase the signal: i.e. does the “new” gene occur in conserved operons with the other genes with which it is supposed to interact, replacing the old gene not only in terms of functional context but also in terms of genomic context.

Correlated gain and loss of genes One methodological issue in the comparison of phylogenetic profiles is that there is a strong phylogenetic signal in the genes two genomes share [75]: i.e. the fact that two genes occur together in a number of closely related genomes does not necessarily imply a functional interaction. One can solve this by detection of profiles that are not in agreement with the species phylogeny, indicating that a pair of genes have been lost or gained together in a genome. This method was applied to the prediction of genes responsible for pathogenicity by identifying genes present in a pathogenic species that are absent in closely related non‐pathogenic species or strain [76, 77], or genes responsible for host specificity by detecting differences between similar pathogens that affect different hosts [78]. A functional linkage between genes that have been lost in the same lineages has also been shown in Eukaryotes by comparing the genomes of the fungi S. cerevisiae and Schizosaccaromyches pombe [79] and in Archaea by comparing the three sequenced genomes from the Pyrococcus genus [80].

Co‐evolution of sequences Another variant of the use of co‐evolution to predict protein function uses the evolutionary information that is contained at a lower level than the distribution across species: that of the sequences themselves. For specific cases of protein families known to interact, such as insulin and its receptors [81] or the chemokine‐receptor system [82, 83], their phylogenetic trees are more similar to each other than expected based on the general divergence between the corresponding species. This was interpreted as an indication of correlated evolution reflecting similar evolutionary constrains. Valencia and colleagues [84, 85] made use of this property to search for interaction partners within the E. coli proteome by measuring the correlation between the distance matrices used to build the phylogenetic trees (Figure 6). Provided a good species coverage and quality of the multiple sequence alignment, the technique can indeed distinguish statistically true interactions among many possible alternatives. Ramani and Marcotte [86] used a similar approach (Figure 6) to predict the binding

33 Chapter 2

specificities among members of 18 ligand and receptor families that posses many paralogs in the human genome. The co‐evolution of interacting partners can be followed more closely by searching for mutations that are correlated in both protein families (they occur in the same species), these positions may correspond to residues on the interface that undergo compensatory mutations in one protein to compensate the effects of mutations in the other. This information was initially used to detect proximal residues to predict protein folding [87] or discriminate between different structural models [88], and was later extended to the prediction of interacting partners based on the finding of pairs of proteins with correlated mutations [89, 90].

Figure 6: Valencia (A) and Ramani (B) methods for predicting protein interactions: A) Similarity between the phylogenies of two protein families is estimated by measuring the correlation between their respective distance‐matrices. B) The interaction specificity between the members of two interacting protein families is predicted by means of shuffling the columns and rows of the distance matrix of one of the families until the agreement with the other one is maximized, the interactions are then predicted for proteins heading the same columns.

The method has the advantage that provides not only the prediction of interacting partners but also of potentially interacting residues. Currently, 3D information about proteins is rarely used to predict which proteins interact with which ones, an exception is the use of 3D information combined with sequence co‐ variation on the potential interaction surfaces to predict whether homologs of

34 Prediction of protein function and pathways interacting proteins will interact in the same way [91]. 3D information is also used by docking procedures for interaction prediction, but these methods predict how and where two proteins interact with each other rather than that they predict which ones interact with which ones, see [92] for a recent update.

Conservation of co‐expression Besides genome sequences, the only experimental context data to date that are truly genome wide are expression data like Microarrays [4, 5] or SAGE [93]. These are similar to genome‐context data in the sense that they reflect functional interactions between proteins [94‐97], and can therewith also be used to predict them [98‐102]. The relative high level of noise that is usually encountered in the expression profiles can be reduced by the detection a correlation between two genes among different experiments [99, 102], still this technique detects functional interactions of a general kind [98, 103]. Recently the observation from other types of genomic‐context methods, that evolutionary conservation dramatically increases the reliability of the prediction, has been shown to apply to co‐expression as well [36, 104, 105] . An interesting extension to the idea that conservation of co‐expression increases the likelihood of functional interaction is to apply it to conservation after gene duplication: the likelihood that two genes (A and B) that are co‐expressed interact increases when their paralogs (A’ and B’) are also co‐expressed [36]. The concept of evolutionary conservation of interaction has also been applied to yeast‐2‐hybrid data to detect pathways conserved between Helicobacter pylori and S. cerevisiae or duplicated within S. cerevisiae [106].

Accuracy and coverage of context‐based methods Large‐scale analyses of context‐based methods indicate a high accuracy, especially gene fusion, with estimates ranging from 72% to 95% [41, 71], gene‐order conservation (80% to 95%) [71, 107] and conserved co‐expression (> 95%) [36]. In general the likelihood that any prediction is true can be increased, at a cost of coverage, by either combining different context‐based methods (“vertical comparative genomics”) [71] (see also [108] for the combination of other types of genomics data) or increasing the required evolutionary conservation (“horizontal comparative genomics”) [56, 71, 107]. In terms of coverage, context‐based methods will likely surpass that of the homology‐based ones in the near future, at least for prokaryotes: in 2002 conserved neighborhood could provide reliable prediction for roughly a 60% of the E. coli genes, a fraction approaching that of genes with an annotated homolog in Swiss‐Prot (70%) [109]. When compared to experimental genome‐wide methods, genome‐context predictions have a higher coverage and accuracy than several genomics experimental techniques such as yeast two‐hybrid or simple (not‐conserved) co‐expression [110]. More important than such “beauty contests”, is that combining experimental and computational techniques increases the accuracy while maintaining a reasonable coverage [110]. In other words, it makes sense, after a high throughput experiment to combine the results with genomic‐ context analyses of one’s proteins to identify the most likely interactions, e.g. with a public domain database like STRING [71].

35 Chapter 2

Benchmarking and experimental verification One of the bottlenecks in the development of methods for the prediction of functional interaction is the availability of benchmarks with experimentally determined interactions. Manually curated databases of protein interactions [111‐ 113] have a relatively small coverage and especially the fraction of false negatives (what fraction of true interactions are not predicted by a method) is therefore generally hard to estimate. It should thereby be noted that “functional interaction” is, even more than “function”, not a strictly defined term that can range from a direct stable physical interaction to less direct ones like “being part of the same biological process”. Any reference set used for benchmarking should therefore be categorized into different types of interactions allowing not only quantitative (“what is the fraction of false positives”) but also qualitative evaluation (“what types of interaction do we detect”) of function prediction. Manual analyses can of course be more thorough in the evaluation of the experimental evidence and can make such a distinction between different types of interactions, but they are necessarily limited to relatively small sets of proteins, like the 480 proteins of M. genitalium [12]. Large‐ scale benchmarking of context‐based functional annotation use classifications such as presence of the same key words describing function in SWISSPROT [114], having the same Gene Ontology annotation [115], belonging to the same functional class according to a database such as COG [116], or being part of the same pathway according to KEGG [117]. It should thereby be noted that benchmarking is often done in terms of enrichment of proteins of a certain functional class, e.g. when proteins in a cluster have a statistically significantly higher than average probability of being involved in the same pathway. Such a significant pattern does not necessarily imply that highly reliable predictions can be made [114]. Reliable prediction can only be made when a large fraction (e.g. 90%) of the proteins with known function in a cluster belong to the same pathway. Experimental verification of predictions made by genomic context methods still lags far behind the many predictions that have been made. In a recent survey we identified 13 cases of predictions that were experimentally verified [109], which is a small fraction compared to the literally hundreds that have been made, e.g. [12, 47, 118]. We expect that improving accessibility of genomic‐context data [71] will facilitate the usage of experimental groups to exploit them and to couple the predictions directly to experimental verification.

From functional interactions to biochemical pathways and networks By combining pair‐wise interactions one can derive networks of protein interactions. The study of such networks, in which nodes are connected when they are (predicted to be) involved in the same biological process, has revealed that they are so‐called scale‐free networks. This means that there is not a typical number of connections per node, rather the distribution of the number of connections (k) per node (N) follows a power law (N(k) ~ k‐γ). In other words, there are many nodes with few connections and a small but still significant number of nodes with many interactions. These highly connected nodes tend to be relatively the essential to an organism [119] and to evolve relatively slowly [120]. Protein interaction networks can

36 Prediction of protein function and pathways be analyzed for structures that reflect function and selection at a level higher than pair‐wise interactions, e.g. that of pathways and protein complexes, and that can therewith also be used for higher‐order function prediction. Such “functional modules” are represented in the network by sets of proteins that are locally highly inter‐connected to each other and less connected to other proteins [121, 122]. Detecting that an uncharacterized protein is part of such a highly connected cluster of proteins allows the prediction that the protein is part of that module [121, 123]. Furthermore, finding interactions between an uncharacterized protein and multiple proteins from the same module increases the likelihood that the predicted involvement is indeed correct [25, 122, 124].

Global constraints on network structure? Besides the high local clustering coefficient, another aspect of protein networks that has been argued to reflect selection at a level higher than pair‐wise interactions is the diameter, i.e. what is, on average, the minimal number of steps one needs to get from any node to any other node. It should therewith be noted that the direction of this argument has been rather arbitrary: both the relatively small diameter of metabolic networks [125] as well as a relatively large diameter of protein interaction networks [126] have been argued to be the result of selection. Subsequent analyses have however shown that in either case the networks were more random than proposed and that the observed biases in the diameter size were either due to the choice of the network nodes [127] or experimental bias in the underlying dataset [128]. Whether the global architecture of biological interaction networks is relevant for function, or merely puts boundaries on the evolutionary process that created it is still open to debate. Neutral models for the evolution of the networks are able to capture aspects like their scale‐free architecture [129] and high cliquishness [130], and in our view there is no convincing evidence for selection in the evolution of the global network architecture. This does of course not imply that evolution has not capitalized on the neutrally evolved network structure as such, or that the individual links between proteins are meaningless. It means that the scale‐free architecture as such is no evidence for selection.

Metabolic pathway prediction A special case of functional networks are the metabolic pathways. Their reconstruction from a species genome sequence has been possible through the combination of homology‐based methods, to determine the molecular function of the proteins, with the identification of their functional partners by context‐based techniques [131‐134]. The comparison of reconstructed central metabolic pathways such as glycolysis [135] and cycle [136] from different organisms revealed a surprising plasticity with the existence of many species‐specific variations. Such deviations from the canonical pathway can be used to identify drug targets when certain alternative enzymes or by‐passes are specific of the pathogen [137].

Combining homology and context for function prediction, a case story A typical example of what is, and what is not possible in function prediction by the methods discussed here is the protein RNase L inhibitor (RLI). Experimental

37 Chapter 2

results have indicated that this protein reversibly associates with RNase L which it inhibits [138], although the exact mechanism of inhibition has not been elucidated. Furthermore RLI interacts with the HIV‐1 protein VIF [139]. Functionally these two activities of RLI are not likely to be the whole story, however. RLI is present in all eukaryotes and all Archaea that have been sequenced so far (Figure 7), but an examination of the SMART database [21] indicates that only mammals have proteins with the domain organization of its interaction partner in human, RNase L. Furthermore the interaction with HIV‐1 proteins can hardly be the original reason of the proteins existence. We examined information from genomic context data and from homology to derive a new hypothesis for the function of RLI. First we examined whether there are other orthologous groups that specifically tend to co‐ occur with RLI. An examination of the STRING database indicates that only 55 orthologous groups have a phylogenetic distribution that is identical to RLI. For the orthologous groups in this set of which we know the function (44), nearly all are either 33 (60%) involved in translation or ribosome biogenesis, 7 (13%) in transcription and 3(5%) in DNA replication, recombination and repair, matching the general pattern that the proteins that the eukaryotes obtained from the Archaea are mainly involved in the duplication and processing of DNA and RNA. These correlations point to a role of RLI in DNA duplication/transcription or RNA processing. From a second type of genomic context, the conservation of co‐regulation [36, 105], comes a prediction that is consistent with this observation, but that is more specific. Between S.cerevisiae and C.elegans RLI is conservedly co‐expressed with a number of proteins involved the processing of ribosomal RNA like the nucleolar protein SIK1 (NOP56) from yeast that is involved in rRNA methylation [36]. A conservation of co‐expression between four species study [105] hints furthermore at interactions with 1) another nucleolar protein involved in rRNA processing, NOP4, 2) a DEAD box RNA helicase, DBP2 and 3) the B and C subunits of RNA polymerase I (Figure 7). A possible link with an RNA helicase can also be inferred from genomic context information: in one Archaeal genus, Methanosarcina, RLI is located in a potential operon with a DEAD box helicase.

38 Prediction of protein function and pathways

Figure 7: Genomic context (phylogenetic distribution and conserved co‐expression) and sequence homology (domain organization and conserved residues in the sequence alignment) hint at a role of the Ribonuclease L inhibitor (RLI) in the processing of ribosomal RNA. A) Conserved co‐expression from http://cmgm.stanford.edu/~kimlab/multiplespecies [105]: The genes (names from S. cerevisiae) that are conservedly coexpressed with RLI (P palue < 0.001) in S.cerevisiae, C.elegans, Drosophila melanogaster and H.sapiens and for which functional information is available can all be linked to transcription and processing of rRNA. B) The orthologous groups (right panel) that have the same phylogenetic distribution as RLI (present in all Archaea and all eukaryotes, left panel), based on the COGs [27] as implemented STRING http://string.embl‐heidelberg.de [71]. Most genes are involved in replication, transcription and translation. The overlap with the conservedly co‐expressed set of genes (A) is NOP56 (in red). C) The domain organization of RLI from SMART http://smart.embl‐ heidelberg.de [21] with an alignment of the N‐terminal, 4‐cysteine domains of a representative set of sequences, constructed with clustalx [142]. In each sequence both sets of 4 cysteines contain at least one intercysteine loop (between the first two cysteines and between the last two cysteines) with a positively charged residue (Lysine). See text for further details.

39 Chapter 2

A third type of information, that of the domain structure of the protein, is quite specific. The protein contains two domains with each four conserved cysteines, one unique to RLI, and one a 4Fe‐4S binding domain, and furthermore 2 ATPase domains (Figure 7). One possibility to link this domain organization to RNA interaction lies in the 4Fe‐4S binding domain. Aside from playing a role in redox reactions, these domains have also been observed in DNA binding proteins endonuclease III [140] and the DNA glycosylase MutY [141]. The 4Fe‐4S cluster is hypothesized to stabilize the fold, presenting a loop that extends from to the backbone of the DNA [140]. Consistent with a role of both the 4‐cysteine domains of RLI is that they both contain conserved Lysines (positively charged) that could interact with the phosphate backbone of rRNA (negatively charged).

Examples like this one show the potential and limits of using comparative genomics in protein function prediction. We can pinpoint to a role in a process, but cannot always predict what exactly that role is. In the case of metabolic pathways a molecular function like enzymatic activity and context function like the pathway can often be matched to obtain to a specific prediction. In the case or RNase L inhibitor case the situation is less obvious.

Discussion The usage of genome sequences to predict protein function is in its adolescence. As we have reviewed here, a number of concepts have been introduced, compared and combined for the prediction of function and of functional interaction. Furthermore, benchmarking against a variety of databases indicates the generally high reliability of the predictions. These advances have not only been made possible by the availability of genomics data, they also contribute to the exploitation of these data. The use of context‐based techniques has proven to be useful to improve the annotation of complete genomes [12, 143], and annotation at the level of orthologous groups is included in genome annotation [116]. There are, however, a number of challenges that we will need to tackle if genomic context wants to reach the same level of success as homology‐based function prediction. On the practical side we need better integration of the various signals, both from homology as well as from genomic context. Not to make more predictions, but rather to make more specific predictions that are directly amenable for experimental testing: i.e. we would like to predict not only in which biological process does a protein play a role, but what does the protein do there. As the example from the RNase L inhibitor shows one can use many different sources of information that are relevant to function, combining those in a (semi‐) automatic way will not be trivial. On the theoretical side we need a better understanding of why our methods actually work. While vertical comparative genomics (the comparing of different types of data from one species), appears a straightforward way of filtering out noise, in horizontal comparative genomics (the comparison of the same type of data from different species), a second effect is likely to play a role: the filtering out of species‐specific interactions with little general relevance. Classic examples of such species‐specific interactions are the co‐regulation of ribosomal genes with genes in

40 Prediction of protein function and pathways the glycolysis in Halobacterium and S.cerevisiae [144, 145]. Such interactions do not fit into our platonic view of what a functional interaction constitutes, yet we can very well imagine why genes for the glycolysis and the ribosome are co‐regulated. This does confront us with the question how to define functional interactions between proteins, when protein function itself is already a concept that can be defined at many levels [146]. The genome era has made this question much less academic than it used to be, leading to important, if simplifying standarisation [147], and that itself is an important and unanticipated achievement in the field of biology to which the concept of function is as central as ever.

References

1 Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.F., Dougherty, B.A., Merrick, J.M. and et al. ʺWhole‐genome random sequencing and assembly of Haemophilus influenzae Rdʺ Science 1995 269:496‐512 2 van Nimwegen, E. ʺScaling laws in the functional content of genomesʺ Trends Genet 2003 19:479‐84 3 Benson, D.A., Karsch‐Mizrachi, I., Lipman, D.J., Ostell, J. and Wheeler, D.L. ʺGenBankʺ Nucleic Acids Res 2003 31:23‐7 4 Schena, M., Shalon, D., Davis, R.W. and Brown, P.O. ʺQuantitative monitoring of gene expression patterns with a complementary DNA microarrayʺ Science 1995 270:467‐70 5 Lockhart, D.J., Dong, H., Byrne, M.C., Follettie, M.T., Gallo, M.V., Chee, M.S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H. and Brown, E.L. ʺExpression monitoring by hybridization to high‐density oligonucleotide arraysʺ Nat Biotechnol 1996 14:1675‐80 6 Chien, C.T., Bartel, P.L., Sternglanz, R. and Fields, S. ʺThe two‐hybrid system: a method to identify and clone genes for proteins that interact with a protein of interestʺ Proc Natl Acad Sci U S A 1991 88:9578‐82 7 Fire, A., Xu, S., Montgomery, M.K., Kostas, S.A., Driver, S.E. and Mello, C.C. ʺPotent and specific genetic interference by double‐stranded RNA in Caenorhabditis elegansʺ Nature 1998 391:806‐11 8 Kamath, R.S. and Ahringer, J. ʺGenome‐wide RNAi screening in Caenorhabditis elegansʺ Methods 2003 30:313‐21 9 Vidan, S. and Snyder, M. ʺLarge‐scale mutagenesis: yeast genetics in the genome eraʺ Curr Opin Biotechnol 2001 12:28‐34 10 Galperin, M.Y. and Koonin, E.V. ʺWhoʹs your neighbor? New computational approaches for functional genomicsʺ Nat Biotechnol 2000 18:609‐13 11 Iliopoulos, I., Tsoka, S., Andrade, M.A., Janssen, P., Audit, B., Tramontano, A., Valencia, A., Leroy, C., Sander, C. and Ouzounis, C.A. ʺGenome sequences and great expectationsʺ Genome Biol 2001 2:INTERACTIONS0001 12 Huynen, M., Snel, B., Lathe, W., 3rd and Bork, P. ʺPredicting protein function by genomic context: quantitative evaluation and qualitative inferencesʺ Genome Res 2000 10:1204‐10 13 Marcotte, E.M., Pellegrini, M., Ng, H.L., Rice, D.W., Yeates, T.O. and Eisenberg, D. ʺDetecting protein function and protein‐protein interactions from genome sequencesʺ Science 1999 285:751‐3

41 Chapter 2

14 Fitch, W.M. ʺDistinguishing homologous from analogous proteinsʺ Syst Zool 1970 19:99‐113 15 Teichmann, S.A. ʺThe constraints protein‐protein interactions place on sequence divergenceʺ J Mol Biol 2002 324:399‐407 16 Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., OʹDonovan, C., Phan, I., Pilbout, S. and Schneider, M. ʺThe SWISS‐PROT protein knowledgebase and its supplement TrEMBL in 2003ʺ Nucleic Acids Res 2003 31:365‐70 17 Smith, T.F. and Waterman, M.S. ʺIdentification of common molecular subsequencesʺ J Mol Biol 1981 147:195‐7 18 Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. ʺGapped BLAST and PSI‐BLAST: a new generation of protein database search programsʺ Nucleic Acids Res 1997 25:3389‐402 19 Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T. and Chothia, C. ʺSequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methodsʺ J Mol Biol 1998 284:1201‐10 20 Baldi, P., Chauvin, Y., Hunkapiller, T. and McClure, M.A. ʺHidden Markov models of biological primary sequence informationʺ Proc Natl Acad Sci U S A 1994 91:1059‐63 21 Letunic, I., Goodstadt, L., Dickens, N.J., Doerks, T., Schultz, J., Mott, R., Ciccarelli, F., Copley, R.R., Ponting, C.P. and Bork, P. ʺRecent improvements to the SMART domain‐based sequence annotation resourceʺ Nucleic Acids Res 2002 30:242‐4 22 Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths‐ Jones, S., Howe, K.L., Marshall, M. and Sonnhammer, E.L. ʺThe Pfam protein families databaseʺ Nucleic Acids Res 2002 30:276‐80 23 Mulder, N.J., et al. ʺInterPro: an integrated documentation resource for protein families, domains and functional sitesʺ Brief Bioinform 2002 3:225‐35 24 Copley, R.R., Doerks, T., Letunic, I. and Bork, P. ʺProtein domain analysis in the era of complete genomesʺ FEBS Lett 2002 513:129‐34 25 Sonnhammer, E.L. and Koonin, E.V. ʺOrthology, paralogy and proposed classification for paralog subtypesʺ Trends Genet 2002 18:619‐20 26 Huynen, M.A. and Bork, P. ʺMeasuring genome evolutionʺ Proc Natl Acad Sci U S A 1998 95:5849‐56 27 Tatusov, R.L., Natale, D.A., Garkavtsev, I.V., Tatusova, T.A., Shankavaram, U.T., Rao, B.S., Kiryutin, B., Galperin, M.Y., Fedorova, N.D. and Koonin, E.V. ʺThe COG database: new developments in phylogenetic classification of proteins from complete genomesʺ Nucleic Acids Res 2001 29:22‐8 28 Lespinet, O., Wolf, Y.I., Koonin, E.V. and Aravind, L. ʺThe role of lineage‐specific gene family expansion in the evolution of eukaryotesʺ Genome Res 2002 12:1048‐59 29 Tatusov, R.L., et al. ʺThe COG database: an updated version includes eukaryotesʺ BMC Bioinformatics 2003 4:41 30 Eisen, J.A. and Fraser, C.M. ʺPhylogenomics: intersection of evolution and genomicsʺ Science 2003 300:1706‐7 31 Saier, M.H., Jr., et al. ʺPhylogenetic characterization of novel transport protein families revealed by genome analysesʺ Biochim Biophys Acta 1999 1422:1‐56 32 Eisen, J.A., Sweder, K.S. and Hanawalt, P.C. ʺEvolution of the SNF2 family of proteins: subfamilies with distinct sequences and functionsʺ Nucleic Acids Res 1995 23:2715‐23 33 Storm, C.E. and Sonnhammer, E.L. ʺAutomated ortholog inference from phylogenetic trees and calculation of orthology reliabilityʺ Bioinformatics 2002 18:92‐9

42 Prediction of protein function and pathways

34 Yuan, Y.P., Eulenstein, O., Vingron, M. and Bork, P. ʺTowards detection of orthologues in sequence databasesʺ Bioinformatics 1998 14:285‐9 35 Arvestad, L., Berglund, A.C., Lagergren, J. and Sennblad, B. ʺBayesian gene/species tree reconciliation and orthology analysis using MCMCʺ Bioinformatics 2003 19 Suppl 1:I7‐I15 36 van Noort, V., Snel, B. and Huynen, M.A. ʺPredicting gene function by conserved co‐ expressionʺ Trends Genet 2003 19:238‐42 37 Snel, B., Bork, P. and Huynen, M. ʺGenome evolution. Gene fusion versus gene fissionʺ Trends Genet 2000 16:9‐11 38 Welch, G.R. and Easterby, J.S. ʺMetabolic channeling versus free diffusion: transition‐ time analysisʺ Trends Biochem Sci 1994 19:193‐7 39 Burns, D.M., Horn, V., Paluh, J. and Yanofsky, C. ʺEvolution of the tryptophan synthetase of fungi. Analysis of experimentally fused Escherichia coli tryptophan synthetase alpha and beta chainsʺ J Biol Chem 1990 265:2060‐9 40 Enright, A.J., Iliopoulos, I., Kyrpides, N.C. and Ouzounis, C.A. ʺProtein interaction maps for complete genomes based on gene fusion eventsʺ Nature 1999 402:86‐90 41 Yanai, I., Derti, A. and DeLisi, C. ʺGenes linked by fusion events are generally of the same functional category: a systematic analysis of 30 microbial genomesʺ Proc Natl Acad Sci U S A 2001 98:7940‐5 42 Tsoka, S. and Ouzounis, C.A. ʺPrediction of protein interactions: metabolic enzymes are frequently involved in gene fusionʺ Nat Genet 2000 26:141‐2 43 Snel, B., Lehmann, G., Bork, P. and Huynen, M.A. ʺSTRING: a web‐server to retrieve and display the repeatedly occurring neighbourhood of a geneʺ Nucleic Acids Res 2000 28:3442‐4 44 Yanofsky, C. ʺComparison of regulatory and structural regions of genes of tryptophan metabolismʺ Mol Biol Evol 1984 1:143‐61 45 Mushegian, A.R. and Koonin, E.V. ʺGene order is not conserved in bacterial evolutionʺ Trends Genet 1996 12:289‐90 46 Watanabe, H., Mori, H., Itoh, T. and Gojobori, T. ʺGenome plasticity as a paradigm of eubacteria evolutionʺ J Mol Evol 1997 44 Suppl 1:S57‐64 47 Wolf, Y.I., Rogozin, I.B., Kondrashov, A.S. and Koonin, E.V. ʺGenome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic contextʺ Genome Res 2001 11:356‐72 48 Dandekar, T., Snel, B., Huynen, M. and Bork, P. ʺConservation of gene order: a fingerprint of proteins that physically interactʺ Trends Biochem Sci 1998 23:324‐8 49 Overbeek, R.F., M. DʹSouza, M. Pusch, G.D. Maltsev, N. ʺUse of contiguity on the chromosome to infer functional couplingʺ In Silico Biol. 1998 2:93‐108 50 Moreno‐Hagelsieb, G., Trevino, V., Perez‐Rueda, E., Smith, T.F. and Collado‐Vides, J. ʺTranscription unit conservation in the three domains of life: a perspective from Escherichia coliʺ Trends Genet 2001 17:175‐7 51 Overbeek, R., Fonstein, M., DʹSouza, M., Pusch, G.D. and Maltsev, N. ʺThe use of gene clusters to infer functional couplingʺ Proc Natl Acad Sci U S A 1999 96:2896‐901 52 Blumenthal, T. ʺGene clusters and polycistronic transcription in eukaryotesʺ Bioessays 1998 20:480‐7 53 Spieth, J., Brooke, G., Kuersten, S., Lea, K. and Blumenthal, T. ʺOperons in C. elegans: polycistronic mRNA precursors are processed by trans‐splicing of SL2 to downstream coding regionsʺ Cell 1993 73:521‐32 54 Lee, J.M. and Sonnhammer, E.L. ʺGenomic gene clustering analysis of pathways in eukaryotesʺ Genome Res 2003 13:875‐82 55 Hurst, L.D., Williams, E.J. and Pal, C. ʺNatural selection promotes the conservation of linkage of co‐expressed genesʺ Trends Genet 2002 18:604‐6

43 Chapter 2

56 Huynen, M.A. and Snel, B. Exploiting the variations in the genomic association of genes to predict pathways and reconstruct their evolution 2003 3:145‐166 57 Huynen, M.A., Snel, B. and Bork, P. ʺInversions and the dynamics of eukaryotic gene orderʺ Trends Genet 2001 17:304‐6 58 Bobik, T.A. and Rasche, M.E. ʺIdentification of the human methylmalonyl‐CoA racemase gene based on the analysis of prokaryotic gene arrangements. Implications for decoding the human genomeʺ J Biol Chem 2001 276:37194‐8 59 Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D. and Yeates, T.O. ʺAssigning protein functions by comparative genome analysis: protein phylogenetic profilesʺ Proc Natl Acad Sci U S A 1999 96:4285‐8 60 Huynen, M.A. and Snel, B. ʺGene and context: integrative approaches to genome analysisʺ Adv Protein Chem 2000 54:345‐79 61 Gaasterland, T. and Ragan, M.A. ʺMicrobial genescapes: phyletic and functional patterns of ORF distribution among prokaryotesʺ Microb Comp Genomics 1998 3:199‐ 217 62 Huynen, M., Snel, B., Lathe, W. and Bork, P. ʺExploitation of gene contextʺ Curr Opin Struct Biol 2000 10:366‐70 63 Wu, J., Kasif, S. and DeLisi, C. ʺIdentification of functional links between genes using phylogenetic profilesʺ Bioinformatics 2003 19:1524‐30 64 Campuzano, V., Montermini, L., Molto, M.D., Pianese, L., Cossee, M., Cavalcanti, F., Monros, E., Rodius, F., Duclos, F., Monticelli, A. and et al. ʺFriedreichʹs ataxia: autosomal recessive disease caused by an intronic GAA triplet repeat expansionʺ Science 1996 271:1423‐7 65 Huynen, M.A., Snel, B., Bork, P. and Gibson, T.J. ʺThe phylogenetic distribution of frataxin indicates a role in iron‐sulfur cluster protein assemblyʺ Hum Mol Genet 2001 10:2463‐8 66 Muhlenhoff, U., Richhardt, N., Ristow, M., Kispal, G. and Lill, R. ʺThe yeast frataxin homolog Yfh1p plays a specific role in the maturation of cellular Fe/S proteinsʺ Hum Mol Genet 2002 11:2025‐36 67 Chen, O.S., Hemenway, S. and Kaplan, J. ʺInhibition of Fe‐S cluster biosynthesis decreases mitochondrial iron export: evidence that Yfh1p affects Fe‐S cluster synthesisʺ Proc Natl Acad Sci U S A 2002 99:12321‐6 68 Duby, G., Foury, F., Ramazzotti, A., Herrmann, J. and Lutz, T. ʺA non‐essential function for yeast frataxin in iron‐sulfur cluster assemblyʺ Hum Mol Genet 2002 11:2635‐43 69 Zheng, Y., Roberts, R.J. and Kasif, S. ʺGenomic functional annotation using co‐ evolution profiles of gene clustersʺ Genome Biol 2002 3:RESEARCH0060 70 Liberles, D.A., Thoren, A., von Heijne, G. and Elofsson, A. ʺThe use of phylogenetic profiles for gene predictionʺ Current Genomics 2002 3:131‐137 71 von Mering, C., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P. and Snel, B. ʺSTRING: a database of predicted functional associations between proteinsʺ Nucleic Acids Res 2003 31:258‐61 72 Koonin, E.V., Mushegian, A.R. and Bork, P. ʺNon‐orthologous gene displacementʺ Trends Genet 1996 12:334‐6 73 Myllykallio, H., Lipowski, G., Leduc, D., Filee, J., Forterre, P. and Liebl, U. ʺAn alternative flavin‐dependent mechanism for thymidylate synthesisʺ Science 2002 297:105‐7 74 Morett, E., Korbel, J.O., Rajan, E., Saab‐Rincon, G., Olvera, L., Olvera, M., Schmidt, S., Snel, B. and Bork, P. ʺSystematic discovery of analogous enzymes in thiamin biosynthesisʺ Nat Biotechnol 2003 21:790‐5

44 Prediction of protein function and pathways

75 Snel, B., Bork, P. and Huynen, M.A. ʺGenome phylogeny based on gene contentʺ Nat Genet 1999 21:108‐10 76 Perna, N.T., et al. ʺGenome sequence of enterohaemorrhagic Escherichia coli O157:H7ʺ Nature 2001 409:529‐33 77 Blattner, F.R., et al. ʺThe complete genome sequence of Escherichia coli K‐12ʺ Science 1997 277:1453‐74 78 Buchrieser, C., Rusniok, C., Kunst, F., Cossart, P. and Glaser, P. ʺComparison of the genome sequences of Listeria monocytogenes and Listeria innocua: clues for evolution and pathogenicityʺ FEMS Immunol Med Microbiol 2003 35:207‐13 79 Aravind, L., Watanabe, H., Lipman, D.J. and Koonin, E.V. ʺLineage‐specific loss and divergence of functionally linked genes in eukaryotesʺ Proc Natl Acad Sci U S A 2000 97:11319‐24 80 Ettema, T., van der Oost, J. and Huynen, M. ʺModularity in the gain and loss of genes: applications for function predictionʺ Trends Genet 2001 17:485‐7 81 Fryxell, K.J. ʺThe coevolution of gene family treesʺ Trends Genet 1996 12:364‐9 82 Goh, C.S., Bogan, A.A., Joachimiak, M., Walther, D. and Cohen, F.E. ʺCo‐evolution of proteins with their interaction partnersʺ J Mol Biol 2000 299:283‐93 83 Hughes, A.L. and Yeager, M. ʺCoevolution of the mammalian chemokines and their receptorsʺ Immunogenetics 1999 49:115‐24 84 Pazos, F. and Valencia, A. ʺSimilarity of phylogenetic trees as indicator of protein‐ protein interactionʺ Protein Eng 2001 14:609‐14 85 Valencia, A. and Pazos, F. ʺPrediction of protein‐protein interactions from evolutionary informationʺ Methods Biochem Anal 2003 44:411‐26 86 Ramani, A.K. and Marcotte, E.M. ʺExploiting the co‐evolution of interacting proteins to discover interaction specificityʺ J Mol Biol 2003 327:273‐84 87 Ortiz, A.R., Kolinski, A., Rotkiewicz, P., Ilkowski, B. and Skolnick, J. ʺAb initio folding of proteins using restraints derived from evolutionary informationʺ Proteins 1999 Suppl 3:177‐85 88 Gobel, U., Sander, C., Schneider, R. and Valencia, A. ʺCorrelated mutations and residue contacts in proteinsʺ Proteins 1994 18:309‐17 89 Pazos, F., Helmer‐Citterich, M., Ausiello, G. and Valencia, A. ʺCorrelated mutations contain information about protein‐protein interactionʺ J Mol Biol 1997 271:511‐23 90 Pazos, F. and Valencia, A. ʺIn silico two‐hybrid system for the selection of physically interacting protein pairsʺ Proteins 2002 47:219‐27 91 Aloy, P. and Russell, R.B. ʺInterrogating protein interaction networks through structural biologyʺ Proc Natl Acad Sci U S A 2002 99:5896‐901 92 Mendez, R., Leplae, R., De Maria, L. and Wodak, S.J. ʺAssessment of blind predictions of protein‐protein interactions: current status of docking methodsʺ Proteins 2003 52:51‐67 93 Velculescu, V.E., Zhang, L., Vogelstein, B. and Kinzler, K.W. ʺSerial analysis of gene expressionʺ Science 1995 270:484‐7 94 Grigoriev, A. ʺA relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiaeʺ Nucleic Acids Res 2001 29:3513‐9 95 Jansen, R., Greenbaum, D. and Gerstein, M. ʺRelating whole‐genome expression data with protein‐protein interactionsʺ Genome Res 2002 12:37‐46 96 Ge, H., Liu, Z., Church, G.M. and Vidal, M. ʺCorrelation between transcriptome and interactome mapping data from Saccharomyces cerevisiaeʺ Nat Genet 2001 29:482‐6 97 DeRisi, J.L., Iyer, V.R. and Brown, P.O. ʺExploring the metabolic and genetic control of gene expression on a genomic scaleʺ Science 1997 278:680‐6

45 Chapter 2

98 Wu, L.F., Hughes, T.R., Davierwala, A.P., Robinson, M.D., Stoughton, R. and Altschuler, S.J. ʺLarge‐scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clustersʺ Nat Genet 2002 31:255‐65 99 Eisen, M.B., Spellman, P.T., Brown, P.O. and Botstein, D. ʺCluster analysis and display of genome‐wide expression patternsʺ Proc Natl Acad Sci U S A 1998 95:14863‐ 8 100 Niehrs, C. and Pollet, N. ʺSynexpression groups in eukaryotesʺ Nature 1999 402:483‐7 101 Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B., Smith, S., Barker, J.L. and Somogyi, R. ʺLarge‐scale temporal gene expression mapping of central nervous system developmentʺ Proc Natl Acad Sci U S A 1998 95:334‐9 102 Hughes, T.R., et al. ʺFunctional discovery via a compendium of expression profilesʺ Cell 2000 102:109‐26 103 Noordewier, M.O. and Warren, P.V. ʺGene expression microarrays and the integration of biological knowledgeʺ Trends Biotechnol 2001 19:412‐5 104 Teichmann, S.A. and Babu, M.M. ʺConservation of gene co‐regulation in prokaryotes and eukaryotesʺ Trends Biotechnol 2002 20:407‐10; discussion 410 105 Stuart, J.M., Segal, E., Koller, D. and Kim, S.K. ʺA Gene Coexpression Network for Global Discovery of Conserved Genetic Modulesʺ Science 2003 302:249‐255 106 Kelley, B.P., Sharan, R., Karp, R.M., Sittler, T., Root, D.E., Stockwell, B.R. and Ideker, T. ʺConserved pathways within bacteria and yeast as revealed by global protein network alignmentʺ Proc Natl Acad Sci U S A 2003 100:11394‐9 107 Yanai, I., Mellor, J.C. and DeLisi, C. ʺIdentifying functional links between genes using conserved chromosomal proximityʺ Trends Genet 2002 18:176‐9 108 Jansen, R., Lan, N., Qian, J. and Gerstein, M. ʺIntegration of genomic datasets to predict protein complexes in yeastʺ J Struct Funct Genomics 2002 2:71‐81 109 Huynen, M.A., Snel, B., Mering, C. and Bork, P. ʺFunction prediction and protein networksʺ Curr Opin Cell Biol 2003 15:191‐8 110 von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S. and Bork, P. ʺComparative assessment of large‐scale data sets of protein‐protein interactionsʺ Nature 2002 417:399‐403 111 Xenarios, I., Salwinski, L., Duan, X.J., Higney, P., Kim, S.M. and Eisenberg, D. ʺDIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactionsʺ Nucleic Acids Res 2002 30:303‐5 112 Zanzoni, A., Montecchi‐Palazzi, L., Quondam, M., Ausiello, G., Helmer‐Citterich, M. and Cesareni, G. ʺMINT: a Molecular INTeraction databaseʺ FEBS Lett 2002 513:135‐ 40 113 Bader, G.D., Betel, D. and Hogue, C.W. ʺBIND: the Biomolecular Interaction Network Databaseʺ Nucleic Acids Res 2003 31:248‐50 114 Marcotte, E.M., Pellegrini, M., Thompson, M.J., Yeates, T.O. and Eisenberg, D. ʺA combined algorithm for genome‐wide prediction of protein functionʺ Nature 1999 402:83‐6 115 Hill, D.P., Blake, J.A., Richardson, J.E. and Ringwald, M. ʺExtension and integration of the gene ontology (GO): combining GO vocabularies with external vocabulariesʺ Genome Res 2002 12:1982‐91 116 Tatusov, R.L., Galperin, M.Y., Natale, D.A. and Koonin, E.V. ʺThe COG database: a tool for genome‐scale analysis of protein functions and evolutionʺ Nucleic Acids Res 2000 28:33‐6 117 Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H. and Kanehisa, M. ʺKEGG: Kyoto Encyclopedia of Genes and Genomesʺ Nucleic Acids Res 1999 27:29‐34

46 Prediction of protein function and pathways

118 Matte‐Tailliez, O., Zivanovic, Y. and Forterre, P. ʺMining archaeal proteomes for eukaryotic proteins with novel functions: the PACE caseʺ Trends Genet 2000 16:533‐6 119 Jeong, H., Mason, S.P., Barabasi, A.L. and Oltvai, Z.N. ʺLethality and centrality in protein networksʺ Nature 2001 411:41‐2 120 Fraser, H.B., Hirsh, A.E., Steinmetz, L.M., Scharfe, C. and Feldman, M.W. ʺEvolutionary rate in the protein interaction networkʺ Science 2002 296:750‐2 121 Snel, B., Bork, P. and Huynen, M.A. ʺThe identification of functional modules from the genomic association of genesʺ Proc Natl Acad Sci U S A 2002 99:5890‐5 122 Spirin, V. and Mirny, L.A. ʺProtein complexes and functional modules in molecular networksʺ Proc Natl Acad Sci U S A 2003 123 Vazquez, A., Flammini, A., Maritan, A. and Vespignani, A. ʺGlobal protein function prediction from protein‐protein interaction networksʺ Nat Biotechnol 2003 21:697‐700 124 Goldberg, D.S. and Roth, F.P. ʺAssessing experimentally derived interactions in a small worldʺ Proc Natl Acad Sci U S A 2003 100:4372‐6 125 Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N. and Barabasi, A.L. ʺThe large‐scale organization of metabolic networksʺ Nature 2000 407:651‐4 126 Maslov, S. and Sneppen, K. ʺSpecificity and stability in topology of protein networksʺ Science 2002 296:910‐3 127 Ma, H. and Zeng, A.P. ʺReconstruction of metabolic networks from genome data and analysis of their global structure for various organismsʺ Bioinformatics 2003 19:270‐7 128 Aloy, P. and Russell, R.B. ʺPotential artefacts in protein‐interaction networksʺ FEBS Lett 2002 530:253‐4 129 Barabasi, A.L. and Albert, R. ʺEmergence of scaling in random networksʺ Science 1999 286:509‐12 130 Pastor‐Satorras, R., Smith, E. and Sole, R.V. ʺEvolving protein interaction networks through gene duplicationʺ J Theor Biol 2003 222:199‐210 131 Paley, S.M. and Karp, P.D. ʺEvaluation of computational metabolic‐pathway predictions for Helicobacter pyloriʺ Bioinformatics 2002 18:715‐24 132 Overbeek, R., Larsen, N., Pusch, G.D., DʹSouza, M., Selkov, E., Jr., Kyrpides, N., Fonstein, M., Maltsev, N. and Selkov, E. ʺWIT: integrated system for high‐throughput genome sequence analysis and metabolic reconstructionʺ Nucleic Acids Res 2000 28:123‐5 133 Schilling, C.H., Covert, M.W., Famili, I., Church, G.M., Edwards, J.S. and Palsson, B.O. ʺGenome‐scale metabolic model of Helicobacter pylori 26695ʺ J Bacteriol 2002 184:4582‐93 134 Schilling, C.H. and Palsson, B.O. ʺAssessment of the metabolic capabilities of Haemophilus influenzae Rd through a genome‐scale pathway analysisʺ J Theor Biol 2000 203:249‐83 135 Dandekar, T., Schuster, S., Snel, B., Huynen, M. and Bork, P. ʺPathway alignment: application to the comparative analysis of glycolytic enzymesʺ Biochem J 1999 343 Pt 1:115‐24 136 Huynen, M.A., Dandekar, T. and Bork, P. ʺVariation and evolution of the citric‐acid cycle: a genomic perspectiveʺ Trends Microbiol 1999 7:281‐91 137 Becker, K., Schirmer, M., Kanzok, S. and Schirmer, R.H. ʺFlavins and flavoenzymes in diagnosis and therapyʺ Methods Mol Biol 1999 131:229‐45 138 Bisbal, C., Martinand, C., Silhol, M., Lebleu, B. and Salehzada, T. ʺCloning and characterization of a RNAse L inhibitor. A new component of the interferon‐ regulated 2‐5A pathwayʺ J Biol Chem 1995 270:13308‐17 139 Zimmerman, C., Klein, K.C., Kiser, P.K., Singh, A.R., Firestein, B.L., Riba, S.C. and Lingappa, J.R. ʺIdentification of a host protein essential for assembly of immature HIV‐1 capsidsʺ Nature 2002 415:88‐92

47 Chapter 2

140 Fromme, J.C. and Verdine, G.L. ʺStructure of a trapped endonuclease III‐DNA covalent intermediateʺ EMBO J 2003 22:3461‐71 141 Porello, S.L., Cannon, M.J. and David, S.S. ʺA substrate recognition role for the [4Fe‐ 4S]2+ cluster of the DNA repair glycosylase MutYʺ Biochemistry 1998 37:6465‐75 142 Chenna, R., Sugawara, H., Koike, T., Lopez, R., Gibson, T.J., Higgins, D.G. and Thompson, J.D. ʺMultiple sequence alignment with the Clustal series of programsʺ Nucleic Acids Res 2003 31:3497‐500 143 Strong, M., Mallick, P., Pellegrini, M., Thompson, M.J. and Eisenberg, D. ʺInference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: a combined computational approachʺ Genome Biol 2003 4:R59 144 Kromer, W.J. and Arndt, E. ʺHalobacterial S9 operon. Three ribosomal protein genes are cotranscribed with genes encoding a tRNA(Leu), the enolase, and a putative membrane protein in the archaebacterium Haloarcula (Halobacterium) marismortuiʺ J Biol Chem 1991 266:24573‐9 145 Santangelo, G.M. and Tornow, J. ʺEfficient transcription of the glycolytic gene ADH1 and three translational component genes requires the GCR1 product, which can act through TUF/GRF/RAP binding sitesʺ Mol Cell Biol 1990 10:859‐62 146 Bork, P., Dandekar, T., Diaz‐Lazcoz, Y., Eisenhaber, F., Huynen, M. and Yuan, Y. ʺPredicting function: from genes to genomes and backʺ J Mol Biol 1998 283:707‐25 147 ʺCreating the gene ontology resource: design and implementationʺ Genome Res 2001 11:1425‐33

48 Prediction of protein function and pathways

49 Chapter 3

Que en libros de ciencia he podido estudiar, que somos microbios venidos a más*.

Extremoduro

*For I could learn from science textbooks that we’re just upstart microbes

50 Reconstruction of the proto‐mitochondrial metabolism

Chapter 3

Reconstruction of the proto‐mitochondrial metabolism

Toni Gabaldón and Martijn A. Huynen

Science. 2003 Aug 1;301(5633):609.

51 Chapter 3

Results and discussion Two new hypotheses explain the initial relationship between the alpha‐ proteobacterial ancestor [1, 2] of the mitochondrion (the proto‐mitochondrion) and its host: one in which the proto‐mitochondrion would have been an oxygen scavenger [3] and one in which the proto‐mitochondrion would have been a hydrogen producing, facultatively anaerobic species [4]. Previous studies on the proto‐mitochondrion’s metabolism have been based on ~50 yeast mitochondrial proteins of alpha‐proteobacterial origin [5]. To also detect non‐mitochondrial proteins of alpha‐proteobacterial origin we compared the proteins encoded by six alpha‐proteobacterial genomes to a total of 77 genomes including 9 eukaryotes, and derived their phylogenies. In the 22.525 reconstructed phylogenies, 630 orthologous groups showed a close evolutionary relationship between alpha‐proteobacterial and eukaryotic proteins and did not indicate more recent horizontal transfer. We consider this a minimal estimate of the proto‐mitochondrial proteome as many of its genes have likely been lost from the sequenced eukaryotic genomes, or do not have a strong enough phylogenetic signal to be detected in large‐scale analyses. This is apparent in the mitochondrial genome of Reclinomonas americana for which our method retrieved 61% of the non‐ribosomal proteins.

The abundance of metabolite transporters suggests a host dependency of the proto‐mitochondrion. Most significant are the finding of both a lipid transporter and a glycerol uptake protein, which can be linked to ‐oxidation and glycerol metabolism respectively. Conversely, the presence of amino acid and peptide transporters is consistent with the large number of gaps in pathways from the amino‐ acid metabolism. Comparing the proto‐mitochondrial proteome to the largest compilation of mitochondrial proteins from yeast and human [7] indicates that mitochondrial proteins represent only a minority of the proteins of proto‐ mitochondrial descent (22% in human and 32% in yeast). The rest, including transporters and metabolic enzymes, have been retargeted to other part of the cell, corroborating a prediction from [4]. Our results confirm findings [5] that only a minor fraction of yeast mitochondrial proteins are of alpha‐proteobacterial origin (16%), and extend this result to the human mitochondria (14%). For a number of pathways the fraction of recovered proteins in the set of 630 is high enough to suggest they were complete in the proto‐mitochondrion (Figure 1): ‐oxidation and a complete electron transport chain, pathways of fructose/mannose metabolism and for the synthesis of lipids, biotine, heme and iron‐sulfur clusters. The [6] and pentose phosphate pathway can only partially be recovered while enzymes for amino‐acid and nucleotide metabolism, although well represented, contain relatively mainly “isolated steps”.

52 Reconstruction of the proto‐mitochondrial metabolism

Fig. 1. (Note: a color version of this figure is shown in appendix 2) An overview of metabolism and transport in the proto‐mitochondrion, as deduced from the orthologous groups present in its estimated proteome. Yellow boxes indicate the number of groups in that COG functional class. Boxes, arrows, and cylinders indicate pathways, enzymes, and transporters, respectively. Blue: proteins are mitochondrial in yeast or human. Orange: human and yeast orthologs have not been observed in mitochondria. Black: there is no human or yeast representative of this orthologous group.

The reconstructed metabolism suggests an aerobic proto‐mitochondrion catabolizing lipids, glycerol and amino acids provided by the host. Although this is most compatible with the oxygen‐scavenger hypothesis, in the absence of a published genome of a hydrogenosomal eukaryote this conclusion cannot be definite. More importantly, a multifaceted benefit of the endosymbiosis for the host seems plausible if one considers the conservation of pathways not directly related ATP production, like fructose/mannose metabolism and the synthesis of lipids, nucleotides and vitamins. Some of these have moved elsewhere in the cell, showing that modern mitochondria are not the only heritage of the ancestral symbiotic relationship between an alpha‐proteobacterium and its host.

Materials and Methods

Data: Genome sequences were obtained from Swissprot (http://www.ebi.ac.uk/swissprot), except for Plasmodium falciparum, Schizosaccharomyces pome, Candida albicans, and Encephalitozoon cuniculi ( GenBank,

53 Chapter 3

http://www.ncbi.nlm.org/Genbank), and Homo sapiens (EBI, http://www.ebi.ac.uk/IPI). The data set for human and yeast mitochondrial proteins was kindly provided by Eric Schon (attached).

Phylogenetic analyses: For every protein of six alpha‐proteobacterial genomes (Rickettsia prowazekii, Rickettsia conorii, Caulobacter crescentus, Rhizobium loti, Rhizobium melilotii, Brucella melitensis) homologous proteins were retrieved using Smith‐Waterman comparisons (E < 0.01) against 77 complete genomes, including the eukaryotes Encephalotozoon cuniculi, Saccharomyces cerevisiae, Candida albicans, Plasmodium falciparum, Homo sapiens, Caernohabditis elegans, Schizosaccharomyces pombe, Arabidopsis thaliana, Drosophila melanogaster. Only sequences that aligned with at least 50% of the query sequence were selected. Sets of homologs were limited to the closest 250 sequences, additional homologs were added if they belonged to an organism not present in the set of 250. Sequences were aligned using clustalW [8]. Phylogenetic trees were constructed with neighbour‐joining using Kimura distances (implemented in ClustalW) and the Dayhoff matrix (implemented in protdist in the Phylip package [9]. Kimura distances were chosen for their speed to perform the bootstrap analyses (100 samples).

Scanning algorithm: Trees were scanned for branches that only contained the query of the Smith‐Waterman search and eukaryotic and alpha‐proteobacterial proteins. Presence of gamma and beta‐proteobacterial proteins was also allowed for reasons of coverage. In case an alpha‐proteobacterial/eukaryotic partition existed, proteins in that branch were regarded as orthologs. The group was further divided if the proteins from the alpha‐proteobacterium formed different sub‐partitions with the eukaryotic ones. We filtered out possible cases of horizontal gene transfer by examining the topology of the trees: 1) In cases where the alpha‐ proteobacterial/eukaryotic group could be rooted by other species we interpreted the branching of a single alpha‐proteobacterial protein within a cluster of eukaryotic proteins as a gene transfer from a eukaryote to the alpha‐proteobacteria and discarded this group. 2) We also discarded cases where only one genus of both eukaryotes and alpha‐proteobacteria was present, eliminating such proteins as the ADP/ATP translocases that are only shared between the parasitic Rickettsia and E. cuniculi, or a set of 99 orthologous groups that are only shared between A. thaliana and the plant symbiont R. meliloti. Only groups supported by both distance measures (Kimura and Dayhoff) were used to reconstruct the metabolism. The orthologous groups that were selected and their phylogenetic trees can be found at www.cmbi.kun.nl/~jagabald/table_groups.html.

Statistical analyses: As expected from the large numbers of sequences, employing a 50% or 75% bootstrap cut‐off value reduced the set of orthologous groups: from 630 to 552 and 394 respectively. More importantly: including a bootstrap value cut‐off did however not affect the non‐mitochondrial set of proteins more than the mitochondrial sub‐set (Using 50%[75%] bootstrap values we retained

54 Reconstruction of the proto‐mitochondrial metabolism

73%[51%] and 74%[62%] of the yeast mitochondrial and non‐mitochondrial proteins respectively). To estimate the amount of false‐negatives of our analyses we used the mitochondrial genome of Reclinomonas americana, of which all proteins presumably have an alpha‐proteobacterial origin. Our procedure retrieved 49% of all the proteins and, when leaving out the generally short ribosomal proteins, 24 (61.5%) of the non‐ribosomal proteins, indicating that due to phylogenetic noise in large‐scale phylogenetic analyses, our estimate is rather minimal. It should be noted that the alpha‐proteobacterial gene content of the eukaryotes is quite variable, varying from 23 orthologous groups in E. cuniculi to 375 orthologous groups in A. thaliana. With an increase of the sampling of both eukaryotes and alpha‐proteobacteria we expect that the estimates will become both larger and more accurate. To estimate the fraction of false positives we repeated the procedure in the bacterium Deinococcus radiodurans, which has no direct relation to the eukaryotes. Here our procedure selected 41 out of 3085 (1.3 %) proteins, compared to 13‐24% in the alpha‐ proteobacteria. Note that the pathways that are discussed in the text are supported by multiple, independently detected proteins.

Acknowledgments We thank E. Schon for the list of mitochondrial proteins. This work was supported by a grant from the Netherlands Organization of Scientific Research

References

1 Andersson, S. G., et al. ʺThe genome sequence of Rickettsia prowazekii and the origin of mitochondriaʺ.Nature. 1998 396:133‐40 2 Gray, M. W., Burger, G. and Lang, B. F. ʺMitochondrial evolutionʺ.Science. 1999 283:1476‐81 3 Kurland, C. G. and Andersson, S. G. ʺOrigin and evolution of the mitochondrial proteomeʺ.Microbiol Mol Biol Rev. 2000 64:786‐820 4 Martin, W., et al. ʺAn overview of endosymbiotic models for the origins of eukaryotes, their ATP‐producing organelles (mitochondria and hydrogenosomes), and their heterotrophic lifestyleʺ.Biol Chem. 2001 382:1521‐39 5 Karlberg, O., et al. ʺThe dual origin of the yeast mitochondrial proteomeʺ.Yeast. 2000 17:170‐87 6 Schnarrenberger, C. and Martin, W. ʺEvolution of the enzymes of the citric acid cycle and the glyoxylate cycle of higher plants. A case study of endosymbiotic gene transferʺ.Eur J Biochem. 2002 269:868‐83 7 Schon, E. ʺGene products present in mitochondria of yeast and animal cells.ʺ 8 Thompson, J. D., Higgins, D. G. and Gibson, T. J. ʺCLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position‐specific gap penalties and weight matrix choiceʺ.Nucleic Acids Res. 1994 22:4673‐80 9 Felsenstein, J. ʺPHYLIP (Phylogeny Inference Package)ʺ.

55 Chapter 4

Use it or lose it

Popular saying

56 Gene loss following mitochondrial endosymbiosis

Chapter 4

Lineage‐specific gene loss following mitochondrial endosymbiosis and its applications for function prediction in the eukaryotes

Toni Gabaldón and Martijn A. Huynen

Bioinformatics (In press)

57 Chapter 4

Abstract The endosymbiotic origin of mitochondria has resulted in a massive horizontal transfer of genetic material from an alpha‐proteobacterium to the early eukaryotes. Using large‐scale phylogenetic analysis we have previously identified 630 orthologous groups of proteins derived from this event. Here we show that this proto‐mitochondrial protein set has undergone extensive lineage‐specific gene loss in the eukaryotes, with an average of three losses per orthologous group in a phylogeny of nine species. This gene loss has resulted in a high variability of the alphaproteobacterial‐derived gene content of present‐day eukaryotic genomes that might reflect functional adaptation to different environments. Proteins functioning in the same biochemical pathway tend to have a similar history of gene loss events, and we use this property to predict functional interactions among proteins in our set. Sequences and trees for the selected orthologous groups can be accessed in the following address: http://www.cmbi.kun.nl/~jagabald/table_groupsb.html

4.1 Introduction In the course of evolution, mechanisms such as gene loss, gene duplication, horizontal gene transfer and gene genesis determine a genome’s gene content. Although the scope and relative weight of these processes have been assessed in various large‐scale studies that considered prokaryotic genomes [1‐5] there is little information regarding eukaryotic genomes, as most of the studies on eukaryotes have focused on pairs of species [6, 7]. Reconstructing genome evolution is not only interesting in itself but also has applications for function prediction. In an analysis of gene loss in one species, Saccharomyces cerevisiae, Aravind and co‐workers [6] noticed that lineage‐specific gene loss affected proteins involved in a limited number of biological processes. They suggested that when genes have a similar loss pattern they are likely to functionally interact, a special instance of inferring functional interactions between proteins that have the same phylogenetic distribution [5, 8]. In this S. cerevisiae gene loss study the question whether differences in gene content between S. cerevisiae and Schizosaccharomyces pombe reflected the gain or the loss of genes was solved by using more distantly related genomes to establish whether a gene was present in the last common ancestor of the species considered [6]. Here, in order to study gene loss in a broad set of eukaryotes we used a set of genes acquired during the endosymbiosis of mitochondria. This event can be regarded as a massive horizontal transfer of genetic material from an alpha‐proteobacterium to the early eukaryotic cell [9, 10], and, because this event preceded the radiation of the eukaryotes considered here, differences in the presence of the genes in the combined material (nuclear and mitochondrial DNA) are likely the result of differential loss. The study of the proto‐mitochondrially derived gene content in eukaryotes is of special importance because of the central role that mitochondria play in the eukaryotic cell [11], and their implication in human disease [12, 13]. Besides their function as energy‐producing organelles, mitochondria are involved in many other processes, some of which are specific to certain species [11, 14, 15] which in turn might reflect variations in the gene set derived from the proto‐mitochondrion. We study the presence of the genes derived from the alpha‐proteobacterial ancestor of

58 Gene loss following mitochondrial endosymbiosis

the mitochondria among nine eukaryotic genomes (Homo sapiens, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Schizosaccaromyces pombe, Candida albicans, Encephalitozoon cuniculi, Plasmodium falciparum and Arabidopsis thaliana). By analyzing the phylogenies of the alpha‐proteobacterial encoded proteins that have homologs in the eukaryotes we have previously identified 630 orthologous groups of eukaryotic proteins whose phylogenies indicate a proto‐mitochondrial origin. This set includes mitochondrial‐encoded as well as nuclear‐encoded proteins whose proto‐mitochondrial origin does not necessarily indicate a mitochondrial localization of the protein [16]. A reconstruction of the proto‐mitochondrion’s metabolism revealed that the ancestor of the mitochondria possessed a large diversity of pathways and transporters many of which are not directly related to ATP production [16]. The distribution of the pool of genes with a proto‐mitochondrial ancestry in the nine fully sequenced eukaryotic genomes reveals a great variability both in the number of genes that are present in one genome as in the proteins they encode. This suggests that extensive lineage‐specific gene loss occurred during the diversification of the eukaryotes as has been suggested for the alpha‐proteobacterial derived, algae specific gene ftsZ [17]. These results are also in line with the recent findings of Richly et al [18] who observed extensive differences between the by sorting algorithms predicted mitochondrial proteomes of 10 eukaryotic species. In that analysis the differences can be the result of either gene gain or loss, aside from differences in (predicted) cellular localization. We mapped the loss events of the proto‐mitochondrial proteome for all the groups on a consensus phylogenetic tree of eukaryotes where discrepancies between different proposed eukaryotic tree topologies were solved by choosing the one that minimized the total number of gene losses in our set. As expected [8], orthologous groups that showed identical patterns of gene loss in the phylogeny tend to function in the same biochemical pathway. We propose this property can be used to predict biological interactions among proteins derived from the proto‐mitochondrion and show an example to illustrate this.

4.2 Methods

Phylome construction. To derive a phylome, the complete set of phylogenetic trees [19], from each alpha‐proteobacterial genome, the following procedure was used: For every protein encoded by that genome, Smith‐Waterman comparisons [20] were used to retrieve a set of homologous proteins with a significant similarity (E<0.01) and with a region of similarity covering more than 50% of the query sequence. Sets of homologs were limited to the most similar 250 sequences but additional homologous proteins were added if they belonged to a species not already present in the most similar 250 sequences. Every set of homologous sequences was aligned using clustalW [21], and Neighbor Joining trees were generated using both Kimura distances and the Dayhoff matrix as implemented in ClustalW and Phylip packages respectively [22, 23].

59 Chapter 4

Selection of orthologous groups: Every tree of a phylome was scanned automatically for partitions that only contained the sequence on which the tree was based, together with eukaryotic proteins and other proteins from alpha, beta and gamma proteobacterial species. Beta‐ and gamma‐ proteobacterial species were allowed in the partition as the alpha‐ beta‐ and gamma‐proteobacteria were not always well resolved within the phylogenetic trees and allowing beta‐ and gamma‐ proteobacterial proteins increased the coverage of the method for likely proto‐ mitochondrial proteins like complex I subunits and proteins from the mitochondrial genome of Reclinomonas americana (data not shown, see [16]). Only groups for which such a eukaryotic‐proteobacterial partition was supported by both distances measures used (Kimura and Dayhoff) were further considered, and proteins within such a partition were regarded as orthologs. This method of large‐scale orthologous group inference by phylogeny approximates the evolutionary definition of orthology [24] more closely than methods that use relative levels of sequence identity, like best‐bidirectional hits. When there were multiple alpha‐proteobacterial/eukaryotes specific partitions within one phylogenetic tree, they were regarded as separate orthologous groups. We filtered out possible cases of horizontal gene transfer after the endosymbiosis by examining the topology of the trees: 1) Whenever a single alpha‐proteobacterial genus branched out lately after the endosymbiosis event and the radiation of eukaryotes we interpreted it as a gene transfer from a eukaryotic species to the alpha‐proteobacteria and this group was therefore discarded. 2) We also discarded orthologous groups that were not present in more than one genus in either Eukaryotes or alpha‐proteobacteria. This combination of filters eliminates such proteins as the ADP/ATP translocases that are only shared between the parasitic Rickettsia and Encephalitozoon cuniculi [25], or a large set of orthologous groups (99) that are only shared between Arabidopsis thaliana and the plant symbiont Mesorhizobium loti. After applying the filters the procedure produced a list of proteins likely derived from the mitochondrial ancestor and their orthologs among the eukaryotic and alpha‐proteobacterial species. Lists derived from the study of the 6 different alpha‐proteobacterial phylomes were merged in a non‐redundant final list of orthologous groups, absences of eukaryotic proteins from these orthologous groups were double‐checked in genomes and proteomes (see bellow). The components of this orthologous groups are listed on line (http://www.cmbi.kun.nl/~jagabald/table_groupsb.html ).

Double‐check of gene losses in the genomes: A Hidden Markov Model [26] of each orthologous group was built and calibrated using the HMMer package [27] and this was used to search against a six‐frame translation of the eukaryotic genomes in which the group was absent using the HFRAME software on a Paracel2 Genematcher. Significant hits (E‐value<10‐10) in the translated proteome that were absent in the original dataset were included in the orthologous groups if they corresponded to proteins in newer proteome releases (most of the cases) or if the manual analysis of the genomic region indicated that it was actually a gene and clustered within the orthologous group when the tree was reconstructed.

60 Gene loss following mitochondrial endosymbiosis

Fig. 1. (Note: a color version of this figure is shown in appendix 2) Schematic representation of the analyses: a) For every protein encoded in one of the six alpha‐proteobacterial genomes (blue rectangle), b) a Smith‐Waterman search was performed against the proteins encoded in a total of 75 genomes (rectangle) including the 6 alpha‐proteobacteria, 9 eukaryotes (in red) and other bacteria and archaea (in yellow), c) The selected homologs were aligned and d) a phylogenetic tree was derived. Repeating this procedure the phylomes of the six alpha‐ proteobacteria were constructed and e) scanned for partitions indicating an alpha‐ protomitochondrial origin of the eukaryotic proteins, f) alpha‐proteobacterial and eukaryotic proteins of this partition constitute an orthologous group whose phylogenetic pattern is derived, g) a Hidden Markov Model is built from an alignment of the members of each orthologous group and used to, h) search against the genomes of the eukaryotic species absent in the profile to detect unpredicted genes and i) search against the eukaryotic proteomes and j) eventually correct the phylogenetic profile, k) this profile is mapped on the species tree to infer the pattern of gene loss events.

Double‐check of gene losses in the proteomes: A new HMM [62,63] for each orthologous group was derived from the columns of the multiple sequence alignment that included the alpha‐proteobacterial proteins, allowing gaps of less than 30 amino acids. This HMM was used to search against all eukaryotic proteomes

61 Chapter 4

using the HMM software on a Paracel2 Genematcher. Significant hits (E‐value<10‐10) that were absent in the original orthologous group were included if they clustered with the orthologous group when the tree was reconstructed.

Co‐evolution measures: For every orthologous group the gene loss events needed to explain the actual phylogenetic distribution were mapped along the edges in the consensus tree. This information was expressed by a simple binary count for the presence (1) or absence (0) of the orthologous group in every edge of the tree. This profile is equivalent to a phylogenetic profile in which the ancestral states for every node in the tree are included. The profile has (2n‐1) bits, where n is the number of species present in the tree and every bit represents a specific edge of the tree. We defined the co‐evolution score of a pair of genes as the amount of concordances of the type (1,1: both genes are present in the edge) minus the amount of discrepancies (0,1 or 1,0: one of the two genes is present in the edge and the other is not) between their profiles. The method is equivalent to superimposing the trees from two orthologous groups, summing the overlapping edges and substracting the unmatched ones. Note that concordances of the type (0,0) are not taken into account, when gene‐pairs have an identical phylogenetic distribution their co‐evolution score increases with the number of species in which they are present. The maximum value of this score is (2n‐1) and represents a perfect match between orthologous groups present in all species.

4.2 Results and discussion 4.2.1 Alpha‐protobacterial derived proteome in eukaryotes From the analyses (Figure 1 and Methods) of the phylomes, the complete set of phylogenies [19], of six alpha‐proteobacterial genomes (Caulobacter crescentus, Brucella melitensis, Rhizobium loti, Rhizobium meliloti, Rickettsia conorii and Rickettsia prowazekii) we selected a total of 630 orthologous groups with a phylogeny that indicates a proto‐mitochondrial origin for the eukaryotic proteins (http://www.cmbi.kun.nl/~jagabald/table_groupsb.html). As we discussed earlier [16] this number is rather an underestimate as revealed by the ~50% of the proteins encoded in the mitochondrial genome of Reclinomonas americana for which the phylogenetic signal that links them to the alpha‐proteobacterial proteins is lost beyond the level of detection of our methodology (see methods). To reduce the level of possible false negatives among the eukaryotic members of the orthologous groups, absences from the (predicted) eukaryotic proteomes were systematically re‐checked by searching with a Hidden Markov Model of the orthologous group against the (predicted) proteomes and against a six‐frame translation of the genomes (see methods). The resulting proto‐mitochondrial derived gene content in eukaryotic genomes is highly variable, both in the number of genes present, ranging from 25 orthologous groups in Encephalotozoon cuniculi to 384 in Arabidopsis thaliana (Figure 2), as in the proteins encoded (see supplementary table). This variation is likely the result of differential loss rather than differential gain, because 1) phylogenetic

62 Gene loss following mitochondrial endosymbiosis

analyses of mitochondrial genes in eukaryotes have never revealed multiple origins, and 2) our method minimizes potential recent transfer between alpha‐proteobacteria and eukaryotes (see methods). In order to identify the gene loss events that determined this variation we need a consensus phylogenetic tree of eukaryotes [28]. Unfortunately there is some controversy about the relative position of Apicomplexa (Plasmodium), Viridaeplantae (Arabidopsis) and opisthokonts (Metazoa and Fungi), and for the relative topology within Metazoa (Homo, Drosophila, Caenorhabditis) [28‐ 30]. We can however escape this dilemma by using a parsimony approach to select from these different topologies the one that minimizes the total amount of losses in the tree: We calculated the amount of gene losses needed to explain the phylogenetic distribution of the groups for all possible topologies of the phyla mentioned above, and chose the phylogenetic tree that minimized the gene losses as the most parsimonious one (Figure 2). For the relative phylogenetic position of Chaenorhabditis elegans, Drosophila melanogaster and Homo sapiens, our results are consistent with a monophyly of C.elegans and D.melanogaster (364 losses), although the difference with the other two possible monophylies, C.elegans with H.sapiens and D.melanogaster with H.sapiens, is rather small (365 and 377 losses respectively) (Figure 2).

Our results also support the grouping of Apicomplexa with Viridiplantae (655 losses), relative to the grouping of either of these with the Ophistokonta (799 and 1057 losses respectively) in a tree that is rooted with the reconstructed proto‐ mitochondrion. Gene content has been used to generate phylogenetic trees based on whole genome comparisons [31]. These trees have been argued to be phenetic rather than phylogenetic because of massive horizontal transfer [32]. The use of a set of proteins derived from a unique event, like the ones used in this paper, can overcome this objection. Another cause of convergent evolution, massive parallel gene loss (e.g. in parasites), can, however, not be overcome. Our approach failed to reconstruct the consensus phylogeny of the eukaryotes when it was allowed to choose among all alternative tree topologies of the nine species included. This is because the unrelated parasites Encephalitozoon cuniculi and Plasmodium falciparum both have suffered massive loss of their proto‐mitochondrial derived gene content, causing the species to cluster together in the most parsimonious gene content tree.

We quantified the extent of gene loss in our set by combining the resulting consensus tree with the phylogenetic distribution of each orthologous group. A total of 1882 gene loss events are needed to explain the observed distribution, resulting in an average of 2,98 losses per orthologous group. Applying more stringent criteria to avoid cases of possible “recent” horizontal transfer from the alpha-proteobacteria to the eukaryotes, by keeping only the 344 groups that are present in at least two of the three main branches of the tree (Metazoa, Fungi and Apicomplexa/Viridaeplantae), did not change the picture of extensive gene loss. In that case the proto-mitochondrial derived gene content varied from 25 groups in E. cuniculi to 292 in A. thaliana, and a total of 1081 loss events were needed to explain the phylogenetic distribution of the groups, but the number of losses per group stayed roughly constant (3,06 losses/group). This indicates that the high frequency of gene loss that we observe in our set does not rather reflect more “recent” horizontal gene transfers between alpha- proteobacteria and eukaryotes.

63 Chapter 4

Fig. 2. Phylogenetic distribution of Eukaryotic proteins that were selected as derived from the proto‐mitochondrion. In ovals are numbers of orthologous groups present in at least one species above it. Following the species name (a/b) are the numbers of a: orthologous groups in this species and b: orthologous groups exclusive for this species.

4.2.2 Parallel gene loss and function prediction. The gene loss events were mapped on the consensus tree to generate a profile (see methods and Figure 1) for every orthologous group. The 1159 yeast proteins included in the PATHWAY database of the Kyoto Encyclopedia of Genes and Genomes [33] was used to benchmark the relationship between co‐evolution and the tendency to function in the same pathway. The average probability of two genes from the complete yeast genome for being in the same map in the PATHWAY database is 0.076. This probability is significantly higher (0.112, p<0.001 in a Chi‐ square test)) for the set of 93 orthologous groups that include yeast proteins and have a proto‐mitochondrial origin (Figure 3), indicating that proteins that share their evolutionary origin are more likely to function in the same biochemical pathway than proteins of diverse origins. The probability of being in the same metabolic map increases further to 0.43 when considering only pairs of orthologous groups with an identical phylogenetic distribution (Figure 3), indicating that proteins gained at the same endosymbiotic event and lost later in the same lineages have a relatively high tendency of functioning in the same pathway. We then investigated different ways of measuring co‐evolution based on the knowledge of the position of the gene‐loss events for every orthologous group in the species tree. The best correlation with a functional relationship was achieved by a score that, given a pair of orthologous groups and the tree in figure 2, resulted from

64 Gene loss following mitochondrial endosymbiosis

summing up the edges in the tree in which both groups were present and subtracting the edges in which only one of them was present.

Fig. 3. Functional interaction expressed as the fraction of gene-pairs functioning in the same metabolic map according to the PATHWAY database: A) Average functional interaction of yeast gene-pairs. B) Average functional interaction of yeast gene-pairs derived from the proto-mitochondrion. C) Average functional interaction of yeast gene-pairs that are derived from the proto-mitochondrion and have the same gene-loss pattern. D) Correlation between the co-evolution score and probability of function in the same pathway map.

This is analogous to counting concordances minus discrepancies after superimposing trees from the two orthologous groups (see methods). Pairs of orthologous groups with a score higher than 25 (see methods) achieved a probability of 0.7 of being in the same biochemical pathway, while groups with negative values have probabilities under the average (Figure 4). These values indicate that the fraction of false‐positives is reduced when considering only groups that have been conserved in most of the lineages, in contrast to methods like mutual information [34] that give equal weight to the correlated absence of genes. The obvious disadvantage of a measure that values “co‐presence” over “co‐absence” is that in a set like the proto‐mitochondrial one that suffered from multiple extensive gene loss events there are very few groups with few losses. Nevertheless, the likelihood that two proteins with the same gene loss pattern function in the same process is high enough to predict interactions when independent supportive evidence is found. Therefore we further investigated the pairs of orthologous groups within our set that followed identical history of gene loss events. This investigation consisted of deeper sequence, literature and genome‐context analyses, making use of several programs and databases such as STRING [35], PubMed (www.ncbi.nlm.nih.gov/PubMed), MITOP [36] , SMART [37] , Pfam [38], Psort [39]. Based on the level of confidence of

65 Chapter 4

the supportive evidence, bootstrap values supporting the alpha‐proteobacterial of the eukaryotic proteins and the importance of the biological processes involved, we select an example that illustrates the application of differential gene‐loss for function prediction.

4.2.3 A case‐history: a novel Complex I interacting protein NADH‐ubiquinone oxidoreductase or complex I is a proton‐pumping of the mitochondrial inner membrane electron transport chain whose impairment causes severe disease in human [40]. Complex I subunits have a dual origin: the core proteins are derived from the alpha‐proteobacteria, while the remainder is eukaryote specific [41]. The set of core components follows a similar evolution in eukaryotes with certain lineage‐specific gene‐loss patterns. We used this property to identify possible ancient interactions that may still play a role in the function of complex I in eukaryotes. In our set of proto‐mitochondrial derived orthologous groups we identified one human protein with unknown function (REFSEQ accession number NM_144736.1, bootstrap 100/100) that followed the same pattern of gene loss as the complex I subunits 20, 27, 49 and 75 kD (Figure 4). This specific pattern is exclusive for these five orthologous groups, and when searching against other fully‐sequenced eukaryotic genomes such as Anopheles gambiae, Neurospora crassa,Yarrowia lipolytic, Chlamydomonas reinhardtii and Mus musculus, they followed the same pattern (present in all of them). Other complex I core components, like 51kD, follow a similar but non‐identical pattern (Figure 4 and results not shown). Moreover NM_144736.1 is significantly co‐expressed (0.6497 uncentered pearson correlation) in a number of experiments in C.elegans [42] with the Y57G11C.12 gene, that encodes a complex I subunit. Based on the co‐evolution with complex I subunits, and its co‐expression with Y57G11C.12 we propose a role for NM_144736.1 protein in complex I function, this role is also consistent with its probable mitochondrial localization (0.967) indicated by MitoProt [36]. Note that the existence of independent kinds of genomic context pointing to the same biological process substantially increases the likelihood of a functional interaction [43]

66 Gene loss following mitochondrial endosymbiosis

Fig. 4. Phylogenetic distribution and gene‐loss events of several complex I subunits and that of the orthologous group containing NM_144736.1. Species are indicated by a three‐letter code, grey‐filled boxes indicate presence of that particular gene in a certain species.

4.3 Concluding remarks Here we have expanded a procedure that uses large‐scale phylogenetic analyses to identify and define orthologous groups derived from the endosymbiosis of proto‐mitochondria [16] to also monitor subsequent losses in eukaryotic lineages. The results indicate that the distribution of the proto‐mitochondrial derived protein pool among present‐day eukaryotes is highly variable, with an average of roughly three gene‐loss events per orthologous group in the phylogeny of nine species. The frequency of gene loss is high for a subset of proteins derived from the ancestor of the mitochondrion, an organell that plays such a central role in the energy metabolism of the eukaryotes. Gene loss has previously been observed to be the most dominant process in prokaryotic gene content evolution, relative to gene duplication, horizontal transfer and gene genesis [1, 44, 45]. In retrospect, in prokaryotes, where genome sizes have reached equilibrium [1, 44, 45] this maybe not so surprising, as gene loss is the only process that balances the three gene‐gain processes. Nearly one third of the gene loss frequency in eukaryotes is explained by the massive gene loss associated with an almost complete secondary loss of mitochondria in the microsporidium species Encephalitozoon cuniculi [25], that only retains 25 of the 630 proto‐mitochondrial derived groups, 6 of which are mitochondrial in yeast and have been detected before [46]. This conservation of proto‐mitochondrial derived proteins is likely related to the existence of a mitochondrial remnant that has been detected in the related microsporidium Trachipleistophora hominis [47]. Considering only

67 Chapter 4

eukaryotic species that possess “true” mitochondria there is still a great variability ranging from 89 groups present in Plasmodium falciparum to 384 in Arabidopsis thaliana, with an invariable core of only 38 orthologous groups that is conserved in all these species. This variability may account for some of the observed differences in the functions performed by mitochondria in different organisms [14], but also in pathways derived from the proto‐mitochondrion that potentially moved elsewhere in the cell. The loss of a proto‐mitochondrial derived ortholog in some species might be explained by either a loss of function or a non‐orthologous gene displacement. Both cases are likely to reflect adaptation to different environments. A remarkable example is the loss of complex I in S. cerevisiae and Schizosaccharomyces pombe. This loss has been proposed to be adaptation to anaerobic environments in S. cerevisiae in which three alternative NADH‐dehydrogenases would perform a similar task with a lower cost of oxidative damage [48]. A wider case of non‐orthologous gene displacement represents the absence of any proto‐mitochondrial derived fatty‐acid metabolism enzyme in the malaria parasite Plasmodium falciparum, while this is conserved in other organisms such as yeast and human. Although the absence of this pathway has been long interpreted as an incapability of de novo fatty acid synthesis by this species, an alternative fatty acid biosynthesis pathway has been recently discovered in Plasmodium [49]. Detection of non–orthologous gene displacement could become an important for the identification of potential drug targets that are only present in a parasitic species. In this sense a potential application of the detection of gene loss with a phylogenetic method could be the subsequent detection of gene displacement by using computational techniques such as anti‐correlating occurrences [50]. As far as the lineage‐specific gene loss among the different eukaryotic lineages is the result of functional adaptation, we can conversely exploit it to predict functional interaction among groups of proteins that have co‐evolved. This approach is an extension of the one proposed by other authors [6], that analyze co‐elimination of proteins during the divergence of two close related species (S.cerevisiae and S. pombe). We limited the study to a set of proteins derived from an early event in the evolution of the eukaryotes, the endosymbiosis of the mitochondria, in order to use a larger and more diverse set of eukaryotic species. With the sequencing of more genomes, the use of lineage‐specific gene loss can become a useful technique for function prediction in eukaryotes, where other context‐based prediction methods such as conserved gene order have a limited value [51]. An obvious extension of this method could be the analysis of lineage‐specific gene loss of the proteins derived from the chloroplast endosymbiosis [52, 53]. Given their size, the number of sequenced eukaryotes is still rather small to use differential loss alone as a signal for function prediction. Increasing this signal by taking genes derived from an endosymbiosis event allows predictions with a reasonable level of confidence to be made.

68 Gene loss following mitochondrial endosymbiosis

Acknowledgements We thank Berend Snel for his critical review of the manuscript. This work was funded in part by a grant from the Netherlands Organization for Scientific Research (NWO).

References

1 Snel, B., Bork, P. and Huynen, M. A. ʺGenomes in flux: the evolution of archaeal and proteobacterial gene contentʺ.Genome Res. 2002 12:17‐25 2 Ochman, H. and Jones, I. B. ʺEvolutionary dynamics of full genome content in Escherichia coliʺ.EMBO J. 2000 19:6637‐43 3 Makarova, K. S., et al. ʺComparative genomics of the Archaea (Euryarchaeota): evolution of conserved protein families, the stable core, and the variable shellʺ.Genome Res. 1999 9:608‐28 4 Koonin, E. V., Makarova, K. S. and Aravind, L. ʺHorizontal gene transfer in prokaryotes: quantification and classificationʺ.Annu Rev Microbiol. 2001 55:709‐42 5 Huynen, M. A. and Bork, P. ʺMeasuring genome evolutionʺ.Proc Natl Acad Sci U S A. 1998 95:5849‐56 6 Aravind, L., et al. ʺLineage‐specific loss and divergence of functionally linked genes in eukaryotesʺ.Proc Natl Acad Sci U S A. 2000 97:11319‐24 7 Katz, L. A. ʺLateral gene transfers and the evolution of eukaryotes: theories and dataʺ.Int J Syst Evol Microbiol. 2002 52:1893‐900 8 Pellegrini, M., et al. ʺAssigning protein functions by comparative genome analysis: protein phylogenetic profilesʺ.Proc Natl Acad Sci U S A. 1999 96:4285‐8 9 Andersson, S. G., et al. ʺThe genome sequence of Rickettsia prowazekii and the origin of mitochondriaʺ.Nature. 1998 396:133‐40 10 Gray, M. W., Burger, G. and Lang, B. F. ʺMitochondrial evolutionʺ.Science. 1999 283:1476‐81 11 Scheffler, I. E. ʺMitochondria make a come backʺ.Adv Drug Deliv Rev. 2001 49:3‐26 12 DiMauro, S. and Schon, E. A. ʺMitochondrial DNA mutations in human diseaseʺ.Am J Med Genet. 2001 106:18‐26 13 Duchen, M. R. ʺMitochondria in health and disease: perspectives on a new mitochondrial biologyʺ.Mol Aspects Med. 2004 25:365‐451 14 Tielens, A. G., et al. ʺMitochondria as we donʹt know themʺ.Trends Biochem Sci. 2002 27:564‐72 15 Gabaldón, T. and Huynen, M. A. ʺShaping the mitochondrial proteomeʺ.Biochim Biophys Acta. 2004 1659:212‐20 16 Gabaldón, T. and Huynen, M. A. ʺReconstruction of the proto‐mitochondrial metabolismʺ.Science. 2003 301:609 17 Beech, P. L., et al. ʺMitochondrial FtsZ in a chromophyte algaʺ.Science. 2000 287:1276‐ 9 18 Richly, E., Chinnery, P. F. and Leister, D. ʺEvolutionary diversification of mitochondrial proteomes: implications for human diseaseʺ.Trends Genet. 2003 19:356‐62 19 Sicheritz‐Ponten, T. and Andersson, S. G. ʺA phylogenomic approach to microbial evolutionʺ.Nucleic Acids Res. 2001 29:545‐52 20 Smith, T. F. and Waterman, M. S. ʺIdentification of common molecular subsequencesʺ.J Mol Biol. 1981 147:195‐7 21 Thompson, J. D., Higgins, D. G. and Gibson, T. J. ʺCLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position‐specific gap penalties and weight matrix choiceʺ.Nucleic Acids Res. 1994 22:4673‐80 22 Saitou, N. N., M. ʺThe neighbor‐joining method: a new method for reconstructing phylogenetic treesʺ.Mol Biol Evol. 1987 4:406‐25

69 Chapter 4

23 Felsenstein, J. ʺPHYLIP (Phylogeny Inference Package)ʺ. 24 Fitch, W. M. ʺDistinguishing homologous from analogous proteinsʺ.Syst Zool. 1970 19:99‐113 25 Katinka, M. D., et al. ʺGenome sequence and gene compaction of the eukaryote parasite Encephalitozoon cuniculiʺ.Nature. 2001 414:450‐3 26 Baldi, P., et al. ʺHidden Markov models of biological primary sequence informationʺ.Proc Natl Acad Sci U S A. 1994 91:1059‐63 27 Eddy, S. R. ʺProfile hidden Markov modelsʺ.Bioinformatics. 1998 14:755‐63 28 Baldauf, S. L., et al. ʺA kingdom‐level phylogeny of eukaryotes based on combined protein dataʺ.Science. 2000 290:972‐7 29 Brown, J. R., et al. ʺUniversal trees based on large combined protein sequence data setsʺ.Nat Genet. 2001 28:281‐5 30 Graham, A. ʺAnimal phylogeny: root and branch surgeryʺ.Curr Biol. 2000 10:R36‐8 31 Wolf, Y. I., et al. ʺGenome trees and the tree of lifeʺ.Trends Genet. 2002 18:472‐9 32 Huynen, M., et al. ʺLateral Gene Transfer, Genome Surveys, and the Phylogeny of Prokaryotesʺ.Science. 1999 286: 33 Ogata, H., et al. ʺKEGG: Kyoto Encyclopedia of Genes and Genomesʺ.Nucleic Acids Res. 1999 27:29‐34 34 Huynen, M., et al. ʺPredicting protein function by genomic context: quantitative evaluation and qualitative inferencesʺ.Genome Res. 2000 10:1204‐10 35 von Mering, C., et al. ʺSTRING: a database of predicted functional associations between proteinsʺ.Nucleic Acids Res. 2003 31:258‐61 36 Scharfe, C., et al. ʺMITOP, the mitochondrial proteome database: 2000 updateʺ.Nucleic Acids Res. 2000 28:155‐8 37 Schultz, J., et al. ʺSMART, a simple modular architecture research tool: identification of signaling domainsʺ.Proc Natl Acad Sci U S A. 1998 95:5857‐64 38 Bateman, A., et al. ʺThe Pfam protein families databaseʺ.Nucleic Acids Res. 2002 30:276‐80 39 Nakai, K. and Horton, P. ʺPSORT: a program for detecting sorting signals in proteins and predicting their subcellular localizationʺ.Trends Biochem Sci. 1999 24:34‐6 40 Schagger, H. ʺRespiratory chain supercomplexes of mitochondria and bacteriaʺ.Biochim Biophys Acta. 2002 1555:154‐9 41 Friedrich, T. and Weiss, H. ʺModular evolution of the respiratory NADH:ubiquinone oxidoreductase and the origin of its modulesʺ.J Theor Biol. 1997 187:529‐40 42 Kim, S. K., et al. ʺA gene expression map for Caenorhabditis elegansʺ.Science. 2001 293:2087‐92 43 von Mering, C., et al. ʺComparative assessment of large‐scale data sets of protein‐ protein interactionsʺ.Nature. 2002 417:399‐403 44 Mirkin, B. G., et al. ʺAlgorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotesʺ.BMC Evol Biol. 2003 3:2 45 Kunin, V. and Ouzounis, C. A. ʺThe balance of driving forces during genome evolution in prokaryotesʺ.Genome Res. 2003 13:1589‐94 46 Karlberg, O., et al. ʺThe dual origin of the yeast mitochondrial proteomeʺ.Yeast. 2000 17:170‐87 47 Williams, B. A., et al. ʺA mitochondrial remnant in the microsporidian Trachipleistophora hominisʺ.Nature. 2002 418:865‐9 48 Joseph‐Horne, T., Hollomon, D. W. and Wood, P. M. ʺFungal respiration: a fusion of standard and alternative componentsʺ.Biochim Biophys Acta. 2001 1504:179‐95 49 Waller, R. F., et al. ʺA type II pathway for fatty acid biosynthesis presents drug targets in Plasmodium falciparumʺ.Antimicrob Agents Chemother. 2003 47:297‐301 50 Morett, E., et al. ʺSystematic discovery of analogous enzymes in thiamin biosynthesisʺ.Nat Biotechnol. 2003 21:790‐5 51 Huynen, M. A., et al. ʺFunction prediction and protein networksʺ.Curr Opin Cell Biol. 2003 15:191‐8

70 Gene loss following mitochondrial endosymbiosis

52 Martin, W., et al. ʺEvolutionary analysis of Arabidopsis, cyanobacterial, and chloroplast genomes reveals plastid phylogeny and thousands of cyanobacterial genes in the nucleusʺ.Proc Natl Acad Sci U S A. 2002 99:12246‐51 53 Martin, W., et al. ʺGene transfer to the nucleus and the evolution of chloroplastsʺ.Nature. 1998 393:162‐5

71 Chapter 5

Afirmava que ell era molt senzill, sense reparar que tenia la mateixa complicació de tothom, amb una anatomia interna composta de moltes i molt meravelloses peces*.

Pere Calders (Invasió subtil I alters contes)

*He used to say he was a very simple guy, without realizing he was as complex as everybody, with an internal anatomy made of many, very wonderful pieces.

72 Tracing the evolution of Complex I

Chapter 5

Tracing the evolution of a large protein complex in the eukaryotes, NADH:ubiquinone oxidoreductase (Complex I)

Toni Gabaldón1, Daphne Rainey1 and Martijn A. Huynen

Journal of Molecular Biology 2005 May 13;348(4):857‐70

(1) Both authors contributed equally

73 Chapter 5

Summary The increasing availability of sequenced genomes enables reconstructing the evolutionary history of large protein complexes. Here we trace the evolution of NADH:ubiquinone oxidoreductase (Complex I), which has increased in size, by so called‐supernumerary subunits, from fourteen subunits in the bacteria to 30 in the plants and algae, 37 in the fungi and 46 in the mammals. Using a combination of pair‐wise and profile based sequence comparisons at the levels of proteins and the DNA of the sequenced eukaryotic genomes, combined with phylogenetic analyses to establish orthology relationships, we were able to 1) trace the origin of six of the supernumerary subunits to the alpha‐proteobacterial ancestor of the mitochondria, 2) detect previously unidentified homology relations between subunits from fungi and mammals 3) detect previously unidentified subunits in the genomes of several species and 4) document several cases of gene duplications among supernumerary subunits in the eukaryotes. One of these, a duplication of N7BM (B17.2), is particularly interesting as it has been lost from genomes that have also lost Complex I proteins, making it a candidate for a Complex I interacting protein. A parsimonious reconstruction of eukaryotic Complex I evolution shows an initial increase in size that predates the separation of plants, fungi and metazoa, followed by a gradual adding and incidental losses of subunits in the various evolutionary lineages. This evolutionary scenario is in contrast to that for Complex I in the prokaryotes, for which the combination of several separate, and previously independently functioning modules into a single complex has been proposed.

Supplementary material for this paper: http://www.cmbi.kun.nl/~jagabald/CI_supplementary.htm

Introduction The energy‐transducing enzyme NADH:ubiquinone oxidoreductase (Complex I; EC 1.6.5.3) couples the oxidation of NADH to the reduction of ubiquinone with the generation of a proton gradient that can be used for the synthesis of ATP. With such a central bioenergetic role, it is not surprising that a number of severe human mitochondrial disorders, including the most frequently encountered, are associated with Complex I deficiencies [1‐3]. In bacteria and archaea, Complex I comprises a minimal set of fourteen subunits that resides in the plasma membrane[4‐6]. A previous study that focused on these two phyla has shown that Complex I subunits follow a modular evolution in which different components that are ascribed different functions, like NADH oxidation or proton pumping, tend to co‐evolve: i.e. the proteins of those modules tend to occur in the same genomes [7]. Eukaryotic Complex I resides in the inner membrane of mitochondria and differs from its bacterial counterpart in that it is larger and more intricate. From an evolutionary perspective, mitochondrial Complex I has a dual origin. The fourteen subunits that are orthologous to the Complex I functioning in bacteria comprise the so‐called bacterial core, which originates from the alpha‐proteobacterial ancestor of the mitochondria [8, 9]. In addition, a variable number of so‐called supernumerary

74 Tracing the evolution of Complex I

subunits found specifically in the eukaryotic Complex I are thought to have originated within the eukaryotic lineages [7, 10].

The experimental determination of the subunit composition of such an intricate protein assembly is a complex task that requires highly pure preparations of the enzyme and the combined use of elaborate techniques [11]. Only recently has the subunit content of Complex I from various eukaryotic species been determined. The species for which the most complete experimental characterization of Complex I subunit composition has been performed is the mammal Bos Taurus [11, 12], that contains 46 subunits, of which 45 have been identified. In other eukaryotes the determination of the Complex I subunit content has been based in a combination of experimental data and sequence comparisons with the B. taurus set. These include the mammal Homo sapiens [3, 13] (45 subunits), the fungi Yarrowia lipolytica [14] (37 subunits), and Neurospora crassa [15] (35 subunits), the algae Chlamydomonas reinhardtii (30 subunits) [16] and several plants including Arabidopsis thaliana [17] (30 Subunits). Notably, most recent studies [14, 16, 17] include sequence comparisons and genomic searches in other Complex I model species to establish homologies between the newly identified subunits and proteins in other species. However little is known on the subunit composition of Complex I from non‐model species and a general overview on the evolution of the eukaryotic Complex I combining the various data remains to be established.

In order to gain further insight into the function, evolution and subunit content for each eukaryotic species we conducted a comprehensive comparative genomics analysis of the Complex I subunits in seventeen eukaryotic species from which both the complete nuclear as well as the organellar genome sequences were available. Representatives are included from different phylogenetic groups including one algae (Chlamydomonas reinhardtii), the nucleomorph genome of one cryptophyte (Guillardia theta), one plant (Arabidopsis thaliana), five fungi (Saccharomyces cerevisiae, Schizosaccharromyces pombe, Candida albicans, Neurospora crassa and Yarrowia lipolytica), one microsporidian (Encephalotozoon cuniculi), two apicomplexa (Plasmodium falciparum and Cryptosporidium hominis) and six metazoa (Caenorhabditis elegans, Drosophila melanogaster, Anopheles gambiae, Takifugu rubripes, Mus musculus and Homo sapiens), representing different life styles from aerobic to (facultatively) anaerobic. The set includes species for which the subunit content of Complex I has been (partially) characterized as well as species thought to lack Complex I such as the fungi S. cerevisiae and S. pombe, and the microsporidian E. cuniculi. The combined use of pair‐wise and profile‐based searches at the DNA and protein sequence levels allowed us not only to detect previously unidentified Complex I subunits in several species, but also to trace back to the bacteria the origin of at least seven of the supernumerary subunits, six of which were likely present in the proto‐mitochondrial ancestor. Moreover, the establishment of homology and orthology relationships, by means of profile‐to‐profile comparisons and phylogenetic reconstructions respectively, enables the comparison of experimental results obtained from different species. In addition, the detection of homologies with proteins of known function and the detection of gene duplications within the eukaryotic lineages may provide

75 Chapter 5

clues for the function of certain Complex I subunits and candidates for new Complex I‐ interacting proteins respectively. Finally we reconstruct the evolutionary history of gains and losses of the Complex I subunits along the eukaryotic lineages.

Results and Discussion

Phylogenetic distribution of Complex I subunits We detect a variable number of genes encoding orthologs of Complex I subunits and of proteins involved on its assembly in the sixteen analyzed species (Table 1 and supplementary material: http://www.cmbi.kun.nl/~jagabald/CI_supplementary.htm). It must be noted that the identification of a Complex I subunit ortholog in a genome does not necessarily imply that the encoded protein is actually part of the complex, a possibility that can only be corroborated experimentally. Besides the species shown we also investigated the presence of Complex I subunits in species like the chordate Ciona intestinalis [18] for which a draft genome is available. However as we detected many unexpected cases of absence of widespread supernumerary subunits (nine in C. intestinalis), and considering that this might be caused by the incompleteness of the genome we did not include these results in the analyses. For the considered species the number of subunits, including the CI assembly proteins CIA30 and CIA84, ranges from 39 subunits (Y. lypolitica and C. albicans) to 47 (T. rubripes, H. sapiens and Mus musculus) in species with Complex I. Compared to the size of the bacterial Complex I, these data show a nearly three‐fold increase in the subunit content. However this size is not that far from the set of 37 subunits that likely formed the primitive Complex I present in the common ancestor of the three major eukaryotic kingdoms (see bellow). As expected, no Complex I subunit was detected in the nucleomorph genome of the algae G. theta nor the nuclear genome of the microsporidian E. cuniculi. In the case of the apicomplexa P. falciparum and C. hominis our results also indicate a complete absence of Complex I subunits from both the nuclear and the organellar genomes. In the case of P. falciparum, this is not consistent with a recent report of Krungkrai and co‐workers about Complex I activity in this species [19]. We believe that, though rotenone‐sensitive, the NADH‐dehydrogenase activity reported by Krungkrai et al is accomplished by an alternative enzyme, as we did not find homology to any of the Complex I subunits in our study, nor is there actually space for classical organellar‐ encoded Complex I genes in the P. falciparum mitochondrial genome.

Also in the case of C. hominis our results are not consistent with the metabolic reconstruction published along with the genome [20] in which a (partial) Complex I is depicted within a mitochondrial‐like organelle. In the C. hominis genome a homolog of a non‐rotenone sensitive NADH dehydrogenases that does not pump protons is however present similar to the situation in S. cerevisiae and P. falciparum. Interestingly, orthologous sequences of Complex I subunits were found in two species that to lack Complex I: S. pombe (3 subunits) and S. cerevisiae (1 subunit) (see bellow). Besides confirming earlier results [14, 16] we detect previously unidentified orthologous sequences of Complex I subunits in several species. This is the case for

76 Tracing the evolution of Complex I

all the nuclear‐encoded Complex I subunits of T. rubripes, six subunits in D. melanogaster (NIGM, NUDM, NB7M, NUOM, NINM and NUPM), two in A. gambiae (NINM, NUOM), seven in C. albicans (NB5M, NIDM, NIPM, NB2M, NB6M, NI9M and NB8M), one in C. elegans (NUOM) and three in M. musculus (NUOM, NB7M and NINM), none of these have been, to our knowledge, identified in previous approaches [21]. Many of these subunits are only detected at the DNA level by profile‐based searches, and likely represent cases of errors in gene predictions in the genome annotation process (see table 1).

The improvement of gene prediction and annotation appears as a common theme in the comparative analyses of pathways [22, 23], underscoring the value of “manual” medium‐scale analyses. Previous, large‐scale analyses based on the phylogenetic distribution of the homologs of the proteins in a pathway or complex [24], or of automatically determined orthologs [25], have pointed to the limited modularity in the evolution of biomolecular systems. Wrong orthology assignment does however appear to a key factor in explaining part of this limited evolutionary modularity [25]. Nevertheless, even after manual analysis of phylogenetic trees to establish orthology, the composition of Complex I remains highly variable between species.

Orthology and homology relationships among Complex I subunits In order to establish orthology relationships among Complex I subunits in different species we complemented the sequence‐comparison data with the construction and analysis of phylogenetic trees, thus using the original definition of orthology [26]. For the Complex I subunits for which homologs with known 3D could be detected by homology detection using Psi‐pred[27] (NUAM:1fdi00, NUBM:1qguB0, NUCM:1h2rL0, NUHM:1f37A, NUIM:1hqfB, NUKM:2frvA, ACPM:1dnyA, NUEM:1sb8A0, N5BM:1pw4A, NI8M:1s3aA, NIDM:1x79B, CI84:1w3bA and NUDM:1j90A), these were included in the considerations of homology by using the DALI 3D‐fold classification tool [28]. These analyses defined 60 different orthologous groups for Complex I subunits and two for the transiently associated proteins CI30 and CI84 (table 1). By using phylogenetic analyses we can distinguish between orthology and paralogy relationships, facilitating the task of comparing experimental results from different species. While the use of protein weights, sedimentation values and sequence fragments are often handy to particular researchers working with a specific organism, their use makes comparison between species unwieldy. For consistency, we have chosen here to use the Swiss‐Prot naming schema that is used for human and Neurospora for their corresponding orthologs in the other species. When discussing specific orthologous groups we mention both the Swiss‐Prot name and the Bos taurus protein name. In several cases our procedure, and specially when using profile‐to‐profile comparisons, detected homology between fungal and metazoan groups with low similarity that were originally considered independent. This is the case of the Y. lipolytica proteins NUVM and NUWM that our analyses assign to the NB5M(B15) and NESM(ESSS) orthologous groups respectively. A multiple sequence alignment for the NB5M/NUVM family is shown in Figure 1.

77 Chapter 5

Figure 1. ClustalX [29] view of the multiple sequence alignment of the NB5M/NUVM orthologous group. The alignment was performed with MUSCLE [30] and then refined manually. The numbers at the start and the end of the sequences indicate the first and last amino acid positions of the alignment. The bar plot at the bottom indicates the level of conservation of the positions in the alignment.

Table 1. Phylogenetic distribution of the 62 different Complex I subunits orthologs and transiently associated proteins. The first column contains, when applicable, the universal SwissProt code for that orthologous group. Light and dark grey cells indicate subunits from the bacterial core and the ancestral eukaryotic core respectively. The second column contains the common names for the corresponding subunits in Bovine Complex I or, in case of being absent from the Bovine Complex I, N. crassa. The last column indicates, with the corresponding name, the existence of bacterial homologs. The COG names are from the COG database [31], the NOG name is from the STRING database [32], CIA30 has not been assigned to an orthologous group yet, the number in the table is the accession number for the Rubrivivax gelatinosus protein shown in the tree of Figure 3. Columns three to sixteen indicate the presence/absence of a certain orthologous group in the combined nuclear and organellar genomes of the species considered. The three types of evidence for the presence of a gene (Experimental evidence, a predicted gene that is orthologous to a known subunit, an unpredicted gene that is orthologous to a known subunit) are indicated by dark‐grey/light‐grey/light‐grey with a stripe, respectively. Species code:Bt, B. taurus; Mm, M. musculus; Tr, T. rubripens, Dm, D. melanogaster; A. gambiae; Ce, C. elegans; Nc, N. crassa; Yl, Y. lipolytica; Sc, S. cerevisiae; Sp, S pombe; At, A. thaliana; Atc A. thaliana (chloroplast genome); Cr, C. reinhardtii.

78 Tracing the evolution of Complex I

79 Chapter 5

Our results, based on MUSCLE alignments [30] and profile‐to‐profile comparisons [33], are consistent with the existence of homology between NI2M(B22) and NB4M(B14) subunits. These subunits have been placed in the same Pfam family (pfam05347) but recent comparisons based on ClustalW alignments have raised doubts about their homology relationships [16]. In addition several homology relationships encountered between Complex I subunits and other proteins of known functions can provide a clue on the specific role of these subunits. Finally we are able to correct wrong orthology assignments regarding fungal and mammal supernumerary subunits in SwissProt. NUMM subunit in mammal corresponds to an unnamed protein in fungi (Gene Bank accession no. 32403914 in N. crassa), while the fungal NUMM proteins are orthologous to the mammalian NIDM (see also ref. 16).

Homology relationships between Complex I subunits and proteins of unknown functions The phylogenetic analyses of Complex I subunits have revealed several cases of gene duplication events during the radiation of eukaryotes (Table 2). Some of the duplicated subunits might remain functionally associated with the complex, something that could be tested experimentally.

Complex I subunit Homolog / function N5BM TIM17/22 family NI8M Mitochondrial ribosomal proteins L43/S525 NUDM KITM, Deoxyribonucleoside kinase NB4M NI2M, Complex I subunit NI2M NB4M, Complex I subunit N7BM Unknown function, only present in species with Complex I CI30 Unknown function, fungal specific. NUML Unknown function, metazoan specific NESM Unknown function, rodent specific NI9M Unknown function, duplicated in Human and Chimp

Table 2. Paralogous relationships found between Complex I supernumerary subunits (first column) and proteins of known and unknown function (second column). Functional annotations refer, when applicable, to the proteins that are paralogous to the known Complex I subunits.

An ancient duplication is that of the N7BM(B17.2) subunit, which is duplicated in all species with complex I with the sole exception of the algae C. reinhardtii (Figure 2). Remarkably both paralogous groups have only been retained in species with Complex I while being lost several times independently from the species that lack the enzyme, like S. pombe, S. cerevisiae, E. cuniculi and the apicomplexa P. falciparum and C. hominis. This similarity between their evolutionary profiles and that

80 Tracing the evolution of Complex I

of Complex I strongly suggests that the new paralogous group is also functionally associated with Complex I. In another case, the phylogenetic reconstructions of the transiently associated protein CI30 (Figure 3) indicate the existence of two paralogous groups in fungi. The group that includes the experimentally characterized CI30 from N. crassa [34] is present in all species with Complex I, including members from fungi, metazoans and photosynthetic eukaryotes. On the contrary, the other paralogous group is restricted to fungi and includes the Complex I‐devoid yeast S. pombe, while it is absent from species with Complex I such as N. crassa and Y. lipolytica, indicating that there is likely no functional relationship between this paralogous group and Complex I. This paralogous relationship solves the puzzle of the presence of a CI30 homolog in the Complex I devoid S. pombe [34]. The fact that the orthologous group containing CI30 has been lost concomitantly with Complex I in several lineages strongly suggests that the association with Complex I observed in N. crassa is conserved in other species. However it must be noted that no physical association with Complex I have so far been detected in other species nor have mutations in this gene have associated with Complex I deficiency in human [35]. The existence of a third CI30 paralogous group in eukaryotes, represented in the tree by an Arabidopsis sequence of unknown function (At4g18810), is intriguing. This sequence is evolutionary related to a Synechocistis sp. protein, suggesting a plastidial origin for the nuclear‐encoded plant gene. Moreover At4g18810 is predicted to be targeted to the chloroplast by P‐Sort (p=0.98) and TargetP (0=0.78) [36, 37]. Based on its homology to CI30 and its likely plastidial origin and localization we hypothesize that At4g18810 function is related to the plastidial Ndh complex and its Synechocistis counterpart sll0096 to the cyanobacterial Ndh complex.

Finally, paralogous groups resulting from more recent duplications can be observed for the NESM(ESSS), NI9M(B9) subunits and the metazoan‐specific subunit NUML(MLRQ). NESM is duplicated in rodents, NI9M in human and chimp and NESM has two paralogous groups restricted to vertebrates (Table 2). These paralogous proteins are potentially associated with Complex I in certain tissues or under specific conditions. The GEO expression database (www.ncbi.nlm.nih.gov/UniGene/ESTProfileViewer/) indicates that the NUML and NESM paralogs are both highly expressed in neural and muscular tissues, which would be consistent with a role in ATP generation. For the NI9M paralog no expression profile is available, and only in fetal tissue has thus far an EST for this gene been observed. That these gene duplications occurred only recently and that the proteins have a high level of sequence identity to their paralogs NUML(66%), NI9M(73%) and NESM(73%), further supports a function as Complex I subunits. For the NUML, NI9M and NESM paralogous groups we can, in the absence of cases of Complex I loss among vertebrates, not use the similarity in the gene‐loss profile as an indication of a functional interaction with Complex I proteins.

81 Chapter 5

Figure 2. Neighbor Joining tree for the N7BM(B17.2) subunit and its paralogous group. Besides C. reinhardtii all species with Complex I have a member from each group, and both groups have been lost in the species without Complex I. Statistical support of the partitions is indicated by bootstrap values. Based on homology to N7BM and its history of gene loss events we predict that also the N7BM paralogous group is somehow involved with Complex I.

Homology relationships between Complex I subunits and proteins of known function In some cases, the homology searches and phylogenetic reconstructions have revealed the existence of homology relationships between Complex I subunits and proteins of known function (Table 2). Most of them have been described before, nevertheless we include them for completeness. This is the case of the N5BM(B14.7) subunit, which is paralogous to the TIM17/22 family. This family participates in the transport of proteins through the mitochondrial inner membrane [38]. Moreover, two Complex I subunits have paralogous relationships with ribosomal proteins, specifically NI8M(B8) belongs to a family of proteins that includes the mitochondrial ribosomal proteins L43 and S25 (Pfam accession PF05047) [39, 40], while the fungal‐ specific subunit NUZM is related to ribosomal protein L2. Remarkably all these homologous proteins are associated with mitochondrial multi‐protein complexes. As the ribosome, the protein translocation machinery and Complex I perform unrelated tasks, the existence of homology with subunits of these complexes would suggest a structural role for N5BM, NI8M and NUZM within the eukaryotic Complex I., e.g acting as scaffold, rather than being involved in Complex I function in other ways

82 Tracing the evolution of Complex I

such as electron transfer or proton‐pumping because there is no known electron transfer nor proton‐pumping in the ribosome or mitochondrial import machinery.

Figure 3. Neighbor Joining tree for CI30 Complex I associated protein and its fungal paralogous group. Statistical support of the partitions is indicated by bootstrap values. Note that the fungi‐specific paralogous group is present in species without Complex I (S. pombe) as well as absent from species with Complex I (e.g Y. lipolytica, N. crassa), indicating that it is likely not involved in the synthesis of Complex I.

A primitive eukaryotic Complex I composed of, at least, 33 subunits and 2 assembly proteins The observed subunit composition of Complex I is variable, not only in terms of the number of subunits, but also in terms of the specific polypeptide chains contained (Table 1 and supplementary material). Nevertheless a core of 32 subunits common to all functional eukaryotic Complex I can be defined. As expected, all fourteen subunits of the so‐called bacterial core (NU1M‐NU6M, NULM, NUAM, NUBM, NUCM, NUGM, NUHM, NUIM, NUKM) are ubiquitous to all species with

83 Chapter 5

Complex I. In addition, a group of eighteen supernumerary subunits (Table 1) are found in all eukaryotic species with Complex I, see also [16]. Besides this common set we detect subunits that are restricted to a single or a few species of our set. Mapping this distribution of the orthologous groups of Complex I subunits (Table 1) onto the phylogenetic tree of the eukaryotes [41, 42] and considering always the most parsimonious scenario, we obtain an overview of the history of gains and losses that affected Complex I during the diversification of eukaryotes (Figure 4). Mitochondrial Complex I originated from the bacterial enzyme present in the alpha‐proteobacterial ancestor of mitochondria [43], as can clearly be observed in the phylogenetic reconstructions of the bacterial‐core subunits done in previous studies [8, 9]. After the emergence of the mitochondrial Complex I several subunits, mostly of eukaryotic origin, were recruited into the complex. This gain must have predated the diversification of Fungi, Plants and Metazoa, since we lack any intermediary example in the species considered. In addition to the abovementioned 32 common subunits, subunits that are common to both plants and fungi to the exclusion of metazoans (NUXM) must have been recruited before the divergence of the three kingdoms. Furthermore, subunits that are only missing in C.elegans (NIMM, CI84) would have been lost from the respective species, although we cannot exclude the possibility that we failed to detect them because of extensive sequence divergence. Alternatively this could be due to incompleteness of the genome, however the C.elegans genome is regarded as complete (Daniel Lawson, pers. communication) furthermore searches for NIMM and CI84 against C. elegans EST databases and the unfinished genome of the related species Caenorhabditis briggsae produced no significant hits.

The evolutionary reconstruction thus provides a minimal estimate of 35 subunits, including 2 assembly proteins, for the Complex I that was likely present in the common ancestor of the three main eukaryotic lineages. This set includes all 31 widespread subunits found by Cardol et. al.[16]. When compared to the fourteen subunits of the alpha‐proteobacterial Complex I, 35 subunits represent a substantial gain in complexity. This increase in complexity that followed the endosymbiotic event is not uncommon to other electron transport chain complexes [44] and has been related to a possible implication of the subsidiary subunits in the biogenesis and stabilization of the complexes [12, 44].

Most of the supernumerary subunits originated within the eukaryotic lineages. Nevertheless our phylogenetic reconstructions for all Complex I subunits reveal an alpha‐proteobacterial origin for six supernumerary subunits (ACMP, NUEM, NI8M, CI30, NUMM and N7BM), suggesting that, being already present in the proto‐mitochondrion, these proteins were eventually recruited to function in Complex I. At which stage they actually became part of Complex I cannot be decided from these data. Examining the genomic context of the supernumerary subunits in alpha‐proteobacteria does not hint at an interaction with Complex I proteins, as, in the sequenced genomes, they never occur in operons with Complex I proteins.

84 Tracing the evolution of Complex I

Figure 4. Reconstruction of eukaryotic Complex I evolution based on the most parsimonious scenario of gene gains and losses. Incoming and outgoing arrows from the tree represent, respectively, gains and losses of the subunits indicated in the box. Names of subunits with bacterial homologs are indicated in bold. Dashed lines indicate lineages in which (most) of the Complex I subunits have been lost. Numbers accompanying the species names indicate the number of Complex I subunits found by this study in their genomes. After an initial surge in rise of complexity in early eukaryotic evolution of Complex I, there has been a gradual adding of proteins with incidental losses.

85 Chapter 5

The situation of these subunits is somewhat similar to the subunits NIMM(MWFE) and NUMM(13kD) whose genes have been found in fungal genomes but whose proteins have not been detected as part of Complex I in fungi. In the case of the latter two their phylogenetic distribution mimics that of other Complex I subunits (Table 1) arguing that they are involved in its function [14]. At least one of the supernumerary proteins, the acyl carrier protein appears also to function independently (ACPM). This protein is found in Neurospora, human and bovine complexes [45, 46] and presumably is a part of all eukaryotic complexes. Similarly, the NADPH‐bound subunit NUEM (39kD) has previously been related to short‐chain isomerases/reductases in bacteria [47].

Diversification of Complex I along eukaryotic lineages Subsequent to the gain of this primary set of, at least, 21 subunits, Complex I diversified along the various eukaryotic lineages and a number of lineage‐specific gain and loss events can be observed. Before the divergence of Fungi and metazoa the complex slightly increased its size with the recruitment of the NIAM(ASHI), NI9M(B9) and NB5M(B15) subunits that are exclusive to both lineages. The second largest gain of additional subunits occurs at the level of metazoa, prior to the bifurcation of nematodes and eumetazoans. At this stage seven additional subunits: NISM(SGDH), NIGM(AGGG), NUDM(42kD), N4AM(B14.5a), NB7M(B17), NUOM(10kD) and N4BM(B14.5b) are recruited to the metazoan Complex I, while a single subunit (NUXM) is lost from the complex. One of these metazoan‐specific subunits, NUDM, is homologous to the bacterial orthologous group of deoxynucleaside kinases (COG1428). This orthologous group is absent from the alpha‐proteobacteria and the closest relatives in the bacteria are from gram‐positives. A proto‐mitochondrial origin for this protein therefore appears unlikely. Recruitment of new subunits has continued along the diversification of eukaryotes with the acquisition of NUML(MLRQ) and NINM(MNLL) before the branching of coelomata (insecta in our study) and that of NIKM(KFYI) predating the diversification of fishes and mammals. Gains in fungi have apparently been more modest with the recruitment of NUZM(21.3a) prior to the radiation of fungi, followed by the gain of NURM(17.8kD) in the Neurospora lineage. Finally a gain of five subunits common to the photosynthetic eukaryotes occurred before the bifurcation of plant and algae while additional gains of two and four subunits occurred in the lineages leading to Chlamydomonas and Arabidopsis respectively. When the gain and loss events are examined in terms of Complex I structural domains, a picture emerges suggesting that the gains and losses spread out over the complex (Figure 5, see also [48]). In this way Complex I retained its characteristic L‐ shaped structure while significantly increasing its size and molecular mass [48, 49]. This pattern is in accordance with the hypothesis that supernumerary subunits are involved in the efficiency/stability of the complex. Nevertheless modifications of the complex are also important to other aspects of development in the individual species. For example, Complex I has been shown to be essential to sexual development in

86 Tracing the evolution of Complex I

Neurospora crassa [50]. In C. elegans, mutations in Complex I subunits have been shown to alter development as well as to produce marked effects on aging [51].

Figure 5. (Note: a color version of this figure can be found in appendix 2) Schematic structure of human/bovine Complex I, indicating which subunits were gained at every evolutionary stage. Circles and ovals represent different subunits, with their size being roughly proportional to their molecular weight. The position of the different subunits within the structure is in accordance with their presence/absence in the different elution fractions and sub‐complexes, to gene fusions and to other data described in the literature, see Methods for details. The color code indicates the evolutionary stage in which a certain subunit was gained. The figure illustrates that subunits have been gained in the evolutionary lineage leading to H.sapiens in all fractions of Complex I and in all stages.

Remnant subunits in species without a Complex I Besides the abovementioned lineage‐specific losses of certain subunits, Complex I was lost as a whole independently in a number of lineages. Nevertheless our analysis reveals the existence of remnants of Complex I in the genomes of the yeasts S. cerevisiae (ACMP subunit) and S. pombe (ACMP, NUBM and NUHM) (see also[34]). These findings raise questions about the possible biological role of these subunits in the context of Complex I and as isolated entities. In S. cerevisiae, ACMP is located in the mitochondrial matrix where it participates in the synthesis of octanoic acid as a precursor of lipoic acid [52, 53]. The different phenotypic effects that the inactivation of the ACMP gene has in N. crassa and S. cerevisiae suggests that, placed in the context of Complex I, ACMP might play a different role. For instance the disruption of this gene in S. cerevisiae produced respiratory‐deficient mutants devoid

87 Chapter 5

of lipoic acid [54], while in N. crassa it resulted in misassembled Complex I without any effect on lipoic acid synthesis [53]. The other two subunits that are found in S. pombe (NUBM and NUHM), belong to the bacterial core of Complex I and correspond to the bacterial 51 and 24kD, which are part of the flavoprotein fraction [55]. Notably these two subunits are proposed to be adjacent in Complex I [11] and they have already been found to functionally interact in the absence of the rest of the Complex I. For instance homologs of these pair of subunits are fused in the gene encoding the alpha subunit of the NAD‐reducing hydrogenase of Ralstonia eutrofus [56]. Also in eukaryotes a fusion between paralogs of the 51 and 24kD subunits is found in the hydrogenosomal ciliate Nyctotherus ovalis, likely coupling the oxidation of NADH to the reduction of protons, as both proteins are also fused with a hydrogenase [57]. There is no information, however, on what could be the electron acceptor for the NUBM and NUHM subunits in S. pombe.

The plastidial homolog of Complex I in photosynthetic eukaryotes Photosynthetic eukaryotes constitute a special case in that, besides the mitochondrial Complex I, they generally posses a Complex I homolog in the thylakoid membranes of their chloroplasts, the so‐called Ndh complex. Ndh complex is derived from the secondary endosymbiotic origin of plastids and is related to the Ndh complexes present in Synechocystis [58]. The eleven subunits of the Ndh complex that are encoded in the plastidial genome of A. thaliana were detected by sequence‐comparison analyses and clearly classified as different orthologous groups by the subsequent phylogenetic analyses. These subunits include homologs for all the bacterial‐core subunits of Complex I except those forming the NADH‐binding oxidizing module: NUAM, NUBM and NUHM (Table 1). As previously reported [59], no Ndh gene is detected in the plastid genome of C. reinhardtii, indicating a loss of the complex in this algal species. Consistently we do not detect homologs of the recently identified Ndh subunits in Synechocistis, NdhM and NdhN [58], in the algal genome. The absence of these genes from the genome of C. reinhardtii and their presence in the genomes of Ndh‐containing plant species such as A. thaliana, Zea mays [58] and Oryza sativa (results not shown) is consistent with the proposed role of these proteins within the Ndh complex of plants. Using profile‐to‐profile comparisons [33] we found no evidence for homology between B13/NUFM and the slr1623/NdhM subunits from A. thaliana and Synechocystis respectively, that has been previously suggested [58].

Concluding remarks The comparative genomics analysis presented here combines the most recent experimental data and computational techniques to gain insight into the evolution of eukaryotic Complex I in terms of its subunit content. The emerging picture is that of a multi‐protein enzyme of alpha‐proteobacterial descent that tripled its size by recruiting new subunits of both alpha‐proteobacterial and eukaryotic origins. We have no indication that the proteins that were added to Complex I in eukaryotic evolution function(ed) as separate multi‐protein complexes. The scenario of Complex I evolution in Eukaryotes, where new proteins are added to an existing complex,

88 Tracing the evolution of Complex I

therefore contrasts with the scenario in the more early, prokaryotic evolution of Complex I, where separate multi‐protein complexes appear to have been combined to a single one [10, 60]. The major upgrade of Complex I in eukaryotic evolution occurred prior to the divergence of the three main eukaryotic domains, resulting in a primitive eukaryotic complex of, at least, 35 subunits including two proteins involved in its assembly. Subsequently Complex I diversified along the different eukaryotic lineages with a continuous recruitment of new subunits and eventually a few losses. One intriguing question is whether the addition of new subunits in evolution mimics that in the assembly of Complex I. This is an appealing hypothesis, but the evidence is, so far, not convincing. On the one hand, in H. sapiens the supernumerary subunits NB8M (B18), NIPM(15kD) and NUEM(39kD) appear indeed to be added to an earlier assembled complex consisting mainly of the prokaryotic core proteins [61]. On the other hand, the supernumerary, metazoan specific subunit NB7M(B17) is added at an early stage in the assembly of the membrane subcomplex [61]. A definite answer to this question awaits a higher resolution of the reconstruction of Complex I assembly than is presently available. With more genomes being sequenced and the experimental characterization of Complex I being extended to other taxa, we expect to gain deeper insight on the evolution of the eukaryotic Complex I. The experimental determination of Complex I in more diverse taxa, such as insects or fishes, will likely detect new lineage specific subunits like there are present for fungi, algae and mammals. Moreover the analysis of tissue‐specific and transiently associated Complex I components is necessary to obtain a wider view of the evolution of eukaryotic Complex I that includes possible tissue‐specific adaptations and the origin of additional assembly factors.

Materials and Methods

Sequence data Acquisition of Genomes: 128 complete bacterial and archaeal genomes as well as the genomes of the sixteen eukaryotic species mentioned above were used in this study. Sequence data (DNA genomic sequence and predicted proteins) were obtained from the EBI databases (http://www.ebi.ac.uk/proteomes/ and http://www.ebi.ac.uk/genome) except for A gambiae, C. albicans, C. reinhardii and P. falciparum which were retrieved from Genbank (http://www.ncbi.nlm.nih.gov/genomes/static/euk.html), H. sapiens and M. musculus from the IPI set at EBI (http://www.ebi.ac.uk/IPI), N. crassa from MIT (http://www‐genome.wi.mit.edu/cgi‐ bin/annotation/neurospora/download_license.cgi), Y. lipolytica from Genolevures (http://www.cbi.labri.fr/genolevures/), C. hominis from [20] (http://www.hominis.mic.vcu.edu) and T. rubripes from IMCB Fugu genome project (www.fugu‐sg.org). For most eukaryotic species the organellar genomes were included in the gene set per species, otherwise those were downloaded separately from NCBI eukaryotic organelles site (http://www.ncbi.nlm.nih.gov/genomes/static/euk_o.html).

89 Chapter 5

Homology and Orthology detection: For each experimentally identified Complex I subunit in either Bos taurus[11, 12], N. crassa [15], Y. lipolytica [14], A. thaliana [17] or C. reinhardtii [16], as well as for the two intermediate associated proteins, CI30 and CI84, identified in N. crassa [34], a Smith‐Waterman search [62] was performed against the above‐mentioned database containing all proteomes to retrieve a set of homologous proteins with a significant similarity (E<0.01) and a region of similarity covering more than 50% of the query sequence. The sets of homologous sequences were aligned using MUSCLE [30]. Hidden Markov models (HMMs) were created from the abovementioned alignments using the HMMER program [63]. HMMs were subsequently used to screen the proteome database to recover more distantly related homologs potentially missed by the Smith‐Waterman search. In addition, and whenever a subunit was not found in the predicted proteome of a eukaryotic organism, the same HMMs were used to screen the genomic sequences (at the DNA level), using HFRAME software on a Paracel2 Genematcher. The latter screen served as a check for missing subunits that represent unpredicted genes in the genome annotation process. All sequences recovered by the HMMs screens were added to the initial sets of homologous proteins and new alignments were created. Neighbor‐Joining trees were derived from these alignments using Kimura distances as implemented in ClustalW [64, 65]. A bootstrap analysis using 1000 samples was conducted for each phylogenetic tree. The selection of the orthologous group for each Complex I subunit was subsequently done by manually examining all trees generated by the above‐ mentioned techniques. The orthology relationships among the eukaryotic sequences were assigned based on the species‐phylogeny for eukaryotes[41, 42], and the presence of these sequences within a monophyletic branch of the tree that contained known Complex I subunits. In parallel sequence comparisons were preformed using PSI‐Blast [66] (E‐value cut‐off 0.005). For those cases where “conspicuous proteins” (e.g. Complex I subunits, or proteins from species from which the query Complex I protein was missing) appeared in the PSI‐Blast “hit list”, but above the E‐value threshold (0.005< E < 100), we did reciprocal searches. The alignment from the results of the original search was then compared with that from the reciprocal search with COMPASS [33] using an E‐value cut‐off of 1E‐7.

Construction of Figure 5: In Figure 5 the subunits are drawn according to results from purification and co‐elution chromatography studies obtained from bacteria and bovine heart mitochondria [12, 67, 68], combined with bioinformatics analyses of the Complex I genes with respect to gene fusions, determined by STRING [32], and the presence of transmembrane regions predicted by the TMHMM server [69]. Sazanov et al., and later Carroll et al., recovered and named several subcomplexes from the bovine heart complex. A hydrophilic subcomplex, I‐lambda, extends into the mitochondrial matrix and consists of approximately sixteen subunits. Within this subcomplex, subunits were placed according to information from gene fusions in prokaryotes (NUBM with NUHM in Streptomyces coelicor, a paralog of NUAM with a paralog of

90 Tracing the evolution of Complex I

NUHM in Thermotoga maritima, NUCM with NUGM in the gammaproteobacteria, NUGM with NUKM in Archaeoglobus fulgidus), assuming the encoded proteins physically interact in the protein complex in eukaryotes. Subunits predicted to have TM regions along the entire length of the sequence (NU1M‐NU5M) are shown as embedded within the mitochondrial membrane, while those containing only an N‐ or C‐terminal prediction are shown to be anchored in the membrane (NB5M, NESM, NIMM). The gene fusion associations were confirmed and supplemented with crosslinking studies [67, 68] showing associations between NUCM/NUGM, NUCM/NUMM and NUCM/NUYM. I‐lamba is likely anchored to the membrane by proteins that are also presnt in the larger subcomplex I‐alpha (NB2M, NB6M, NIMM, NUOM, NUPM, and NUYM). Two mainly hydrophobic subcomplexes, I‐beta and I‐ gamma, were found, which represent the majority of subunits that are embedded in the membrane. I‐gamma consists of subunits NU1M, NU2M, NU3M, NIKM, NUDM, NUEM, NULM, N5BM. I‐beta is composed of membrane bound subunits such as NU4M, NU5M, NISM, NUJM, NI2M, NB8M. Within these subcomplexes, Sazanov et al. found clusters of two to three subunits that always co‐precipitated. For instance, from the I‐beta subcluster, NU4M and NU5M always co‐precipitated and from I‐ gamma, NU1M, NU3M, and NULM formed a tight cluster.

Acknowledgements We would like to thank the referees for their comments. This work was supported by a grant from the Netherlands organization for scientific research (NWO) and by the European Community’s sixth Framework Programme for Research, Priority 1 “Life sciences, genomics and biotechnology for health”, contract number LSHM‐CT‐2004‐503116.

References

1 Schapira, A. H. "Mitochondrial disorders".Curr Opin Neurol. 2000 13:527-32 2 Robinson, B. H. "Human complex I deficiency: clinical spectrum and involvement of oxygen free radicals in the pathogenicity of the defect".Biochim Biophys Acta. 1998 1364:271-86 3 Smeitink, J., et al. "Human NADH:ubiquinone oxidoreductase".J Bioenerg Biomembr. 2001 33:259-66 4 Weidner, U., et al. "The gene locus of the proton-translocating NADH: ubiquinone oxidoreductase in Escherichia coli. Organization of the 14 genes and relationship between the derived proteins and subunits of mitochondrial complex I".J Mol Biol. 1993 233:109-22 5 Xu, X., Matsuno-Yagi, A. and Yagi, T. "DNA sequencing of the seven remaining structural genes of the gene cluster encoding the energy-transducing NADH-quinone oxidoreductase of Paracoccus denitrificans".Biochemistry. 1993 32:968-81 6 Sazanov, L. A., et al. "A role for native lipids in the stabilization and two-dimensional crystallization of the Escherichia coli NADH-ubiquinone oxidoreductase (complex I)".J Biol Chem. 2003 278:19483-91 7 Friedrich, T. and Weiss, H. "Modular evolution of the respiratory NADH:ubiquinone oxidoreductase and the origin of its modules".J Theor Biol. 1997 187:529-40 8 Gabaldón, T. and Huynen, M. A. "Reconstruction of the proto-mitochondrial metabolism".Science. 2003 301:609

91 Chapter 5

9 Emelyanov, V. V. "Common evolutionary origin of mitochondrial and rickettsial respiratory chains".Arch Biochem Biophys. 2003 420:130-41 10 Friedrich, T. "Complex I: a chimaera of a redox and conformation-driven proton pump?"J Bioenerg Biomembr. 2001 33:169-77 11 Hirst, J., et al. "The nuclear encoded subunits of complex I from bovine heart mitochondria".Biochim Biophys Acta. 2003 1604:135-50 12 Carroll, J., et al. "Analysis of the subunit composition of complex I from bovine heart mitochondria".Mol Cell Proteomics. 2003 2:117-26 13 Bourges, I., et al. "Structural organization of mitochondrial human complex I: role of the ND4 and ND5 mitochondria-encoded subunits and interaction with the prohibitin".Biochem J. 2004 Pt: 14 Abdrakhmanova, A., et al. "Subunit composition of mitochondrial complex I from the yeast Yarrowia lipolytica".Biochim Biophys Acta. 2004 1658:148-56 15 Videira, A. and Duarte, M. "From NADH to ubiquinone in Neurospora mitochondria".Biochim Biophys Acta. 2002 1555:187-91 16 Cardol, P., et al. "Higher plant-like subunit composition of mitochondrial complex I from Chlamydomonas reinhardtii: 31 conserved components among eukaryotes".Biochimica et Biophysica Acta (BBA) - Bioenergetics. 2004 1658:212-224 17 Heazlewood, J. L., Howell, K. A. and Millar, A. H. "Mitochondrial complex I from Arabidopsis and rice: orthologs of mammalian and fungal components coupled with plant- specific subunits".Biochim Biophys Acta. 2003 1604:159-69 18 Dehal, P., et al. "The draft genome of Ciona intestinalis: insights into chordate and vertebrate origins".Science. 2002 298:2157-67 19 Krungkrai, J., et al. "Mitochondrial NADH dehydrogenase from Plasmodium falciparum and Plasmodium berghei".Exp Parasitol. 2002 100:54-61 20 Xu, P., et al. "The genome of Cryptosporidium hominis".Nature. 2004 431:1107-12 21 Tsang, W. Y. and Lemire, B. D. "The role of mitochondria in the life of the nematode, Caenorhabditis elegans".Biochim Biophys Acta. 2003 1638:91-105 22 Huynen, M. A., Dandekar, T. and Bork, P. "Variation and evolution of the citric-acid cycle: a genomic perspective".Trends Microbiol. 1999 7:281-91 23 Gille, C., et al. "A comprehensive view on proteasomal sequences: implications for the evolution of the proteasome".J Mol Biol. 2003 326:1437-48 24 Peregrin-Alvarez, J. M., Tsoka, S. and Ouzounis, C. A. "The phylogenetic extent of metabolic enzymes and pathways".Genome Res. 2003 13:422-7 25 Snel, B. and Huynen, M. A. "Quantifying modularity in the evolution of biomolecular systems".Genome Res. 2004 14:391-7 26 Fitch, W. M. "Distinguishing homologous from analogous proteins".Syst Zool. 1970 19:99- 113 27 McGuffin, L. J., Bryson, K. and Jones, D. T. "The PSIPRED protein structure prediction server".Bioinformatics. 2000 16:404-5 28 Holm, L. and Sander, C. "Dali: a network tool for protein structure comparison".Trends Biochem Sci. 1995 20:478-80 29 Aiyar, A. "The use of CLUSTAL W and CLUSTAL X for multiple sequence alignment".Methods Mol Biol. 2000 132:221-41 30 Edgar, R. C. "MUSCLE: a multiple sequence alignment method with reduced time and space complexity".BMC Bioinformatics. 2004 5:113 31 Tatusov, R. L., et al. "The COG database: an updated version includes eukaryotes".BMC Bioinformatics. 2003 4:41 32 von Mering, C., et al. "STRING: a database of predicted functional associations between proteins".Nucleic Acids Res. 2003 31:258-61 33 Sadreyev, R. and Grishin, N. "COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance".J Mol Biol. 2003 326:317-36 34 Schulte, U. "Biogenesis of respiratory complex I".J Bioenerg Biomembr. 2001 33:205-12 35 Janssen, R., et al. "CIA30 complex I assembly factor: a candidate for human complex I deficiency?"Hum Genet. 2002 110:264-70 36 Nakai, K. and Horton, P. "PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization".Trends Biochem Sci. 1999 24:34-6 37 Emanuelsson, O., et al. "Predicting subcellular localization of proteins based on their N- terminal amino acid sequence".J Mol Biol. 2000 300:1005-16

92 Tracing the evolution of Complex I

38 Wiedemann, N., Frazier, A. E. and Pfanner, N. "The protein import machinery of mitochondria".J Biol Chem. 2004 279:14473-6 39 Gan, X., et al. "Tag-mediated isolation of yeast mitochondrial ribosome and mass spectrometric identification of its new components".Eur J Biochem. 2002 269:5203-14 40 Koc, E. C., et al. "A proteomics approach to the identification of mammalian mitochondrial small subunit ribosomal proteins".J Biol Chem. 2000 275:32585-91 41 Hedges, S. B., et al. "A molecular timescale of eukaryote evolution and the rise of complex multicellular life".BMC Evol Biol. 2004 4:2 42 Baldauf, S. L., et al. "A kingdom-level phylogeny of eukaryotes based on combined protein data".Science. 2000 290:972-7 43 Gray, M. W., Burger, G. and Lang, B. F. "Mitochondrial evolution".Science. 1999 283:1476- 81 44 Berry, S. "Endosymbiosis and the design of eukaryotic electron transport".Biochim Biophys Acta. 2003 1606:57-72 45 Runswick, M. J., et al. "Presence of an acyl carrier protein in NADH:ubiquinone oxidoreductase from bovine heart mitochondria".FEBS Lett. 1991 286:121-4 46 Sackmann, U., et al. "The acyl-carrier protein in Neurospora crassa mitochondria is a subunit of NADH:ubiquinone reductase (complex I)".Eur J Biochem. 1991 200:463-9 47 Schulte, U., et al. "A reductase/isomerase subunit of mitochondrial NADH:ubiquinone oxidoreductase (complex I) carries an NADPH and is involved in the biogenesis of the complex".J Mol Biol. 1999 292:569-80 48 Guenebaut, V., et al. "Consistent structure between bacterial and mitochondrial NADH:ubiquinone oxidoreductase (complex I)".J Mol Biol. 1998 276:105-12 49 Grigoriev, A. "A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae".Nucleic Acids Res. 2001 29:3513-9 50 Duarte, M. and Videira, A. "Respiratory chain complex I is essential for sexual development in neurospora and binding of iron sulfur clusters are required for enzyme assembly".Genetics. 2000 156:607-15 51 Tsang, W. Y., et al. "Mitochondrial respiratory chain deficiency in Caenorhabditis elegans results in developmental arrest and increased life span".J Biol Chem. 2001 276:32240-6 52 Harington, A., et al. "Identification of a new nuclear gene (CEM1) encoding a protein homologous to a beta-keto-acyl synthase which is essential for mitochondrial respiration in Saccharomyces cerevisiae".Mol Microbiol. 1993 9:545-55 53 Schneider, R., et al. "Different respiratory-defective phenotypes of Neurospora crassa and Saccharomyces cerevisiae after inactivation of the gene encoding the mitochondrial acyl carrier protein".Curr Genet. 1995 29:10-7 54 Brody, S., et al. "Mitochondrial acyl carrier protein is involved in lipoic acid synthesis in Saccharomyces cerevisiae".FEBS Lett. 1997 408:217-20 55 Galante, Y. M. and Hatefi, Y. "Purification and molecular and enzymic properties of mitochondrial NADH dehydrogenase".Arch Biochem Biophys. 1979 192:559-68 56 Pilkington, S. J., et al. "Relationship between mitochondrial NADH-ubiquinone reductase and a bacterial NAD-reducing hydrogenase".Biochemistry. 1991 30:2166-75 57 Akhmanova, A., et al. "A hydrogenosome with a genome".Nature. 1998 396:527-8 58 Prommeenate, P., et al. "Subunit composition of NDH-1 complexes of Synechocystis sp. PCC 6803: identification of two new ndh gene products with nuclear-encoded homologues in the chloroplast Ndh complex".J Biol Chem. 2004 279:28165-73 59 Maul, J. E., et al. "The Chlamydomonas reinhardtii plastid chromosome: islands of genes in a sea of repeats".Plant Cell. 2002 14:2659-79 60 Mathiesen, C. and Hagerhall, C. "The 'antiporter module' of respiratory chain complex I includes the MrpC/NuoK subunit -- a revision of the modular evolution scheme".FEBS Lett. 2003 549:7-13 61 Ugalde, C., et al. "Human mitochondrial complex I assembles through the combination of evolutionary conserved modules: a framework to interpret complex I deficiencies".Hum Mol Genet. 2004 13:2461-72 62 Smith, T. F. and Waterman, M. S. "Identification of common molecular subsequences".J Mol Biol. 1981 147:195-7 63 Eddy, S. R. "Profile hidden Markov models".Bioinformatics. 1998 14:755-63

93 Chapter 5

64 Thompson, J. D., Higgins, D. G. and Gibson, T. J. "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice".Nucleic Acids Res. 1994 22:4673-80 65 Felsenstein, J. "PHYLIP (Phylogeny Inference Package)". 66 Altschul, S. F., et al. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs".Nucleic Acids Res. 1997 25:3389-402 67 Sazanov, L. A., et al. "Resolution of the membrane domain of bovine complex I into subcomplexes: implications for the structural organization of the enzyme".Biochemistry. 2000 39:7229-35 68 Yamaguchi, M. and Hatefi, Y. "Mitochondrial NADH:ubiquinone oxidoreductase (complex I): proximity of the subunits of the flavoprotein and the iron-sulfur protein subcomplexes".Biochemistry. 1993 32:1935-9 69 Krogh, A., et al. "Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes".J Mol Biol. 2001 305:567-80

94 Tracing the evolution of Complex I

95 Chapter 6

Lʹévolution ne tire pas ses nouveautés du néant. Elle travaille sur ce qui existe déjà, soit quʹelle transforme un système ancien pour lui donner une fonction nouvelle, soit quʹelle combine plusieurs systèmes pour en échafauder un autre plus complexe *.

François Jacob (Le jeu des possibles, 1981)

* Evolution does not draw its innovations from nothing. It works on what already exists, either by transforming an old system to give it a new function, or by combining several systems to create a more complex one.

96 Evolution of the mitochondrial metabolism

Chapter 6

Evolution of the mitochondrial metabolism

Toni Gabaldón and Martijn A. Huynen

(A modified version of this chapter will be submitted for publication)

97 Chapter 6

Abstract Mitochondria are eukaryotic organelles derived from the endosymbiosis of an alpha‐proteobacterium. Since then the mitochondrial proteome, and therefore its metabolism, underwent major changes while adapting to different circumstances in the various eukaryotic lineages. Here we trace this transition by comparing past and present states of the mitochondrial metabolism. First, we perform large‐scale phylogenomic analyses of eleven alpha‐proteobacterial genomes, among a total of 144 complete genomes, to reconstruct the ancestral proto‐mitochondrial metabolism. We subsequently compare it to the modern yeast and human mitochondrial metabolisms as derived from recent high‐throughput proteomics datasets. The comparison shows a functional bias towards energy metabolism and protein synthesis in the proto‐mitochondrion, which is amplified in modern mitochondria despite an extensive turnover of the proteome. Although several proto‐mitochondrial pathways have been retained in both yeast and human mitochondria, differential gain, loss and re‐targeting of others have have resulted in significant metabolic differences between the two.

Introduction Current data support the view that all mitochondria derive from a single alpha‐proteobacterial endosymbiont, the so‐called proto‐mitochondrion [1]. However, the exact scenario in which the original relationship between the endosymbiont and its host was established is still a matter for debate [2‐6]. Since the endosymbiotic event the mitochondrial proteome has undergone several changes, gaining, losing and modifying biochemical pathways, in an interesting evolutionary process that lead the transformation of the early bacterial endosymbiont into a eukaryotic organelle [7]. In present‐day eukaryotes, mitochondria play a central role in the metabolism of the cell, comprising a diversity of functions that vary from species to species [8, 9]. Although large changes in the metabolism are thought to have occurred during mitochondrial evolution [10, 11], no comprehensive analysis has been performed yet to quantify that transition. It is still largely unknown, for example, to what extent the metabolism of modern mitochondria resembles that of its bacterial ancestor or how the current diversity observed in mitochondria from different organisms was achieved. To address these questions, we performed a comparative analysis of ancient and modern mitochondrial metabolisms. This analysis exploits both the availability of fully sequenced genomes as well as that of recent proteomics surveys of highly pure mitochondria.

To infer the metabolism of present‐day mitochondria we use two recent comprehensive mitochondrial proteomics sets from of human [12] and yeast [13]. The comparison of the two sets reveals a number of lineage‐specific mitochondrial pathways, as well as a shared core of metabolic pathways. We subsequently compare this set with the reconstructed metabolism of the proto‐mitochondrion, which is an updated version of a previous reconstruction [14]. Here we extend the phylogenomic analysis from 70 genomes to a broader set of 144 genomes, including those of eleven alpha‐proteobacteria and sixteen eukaryotes. The comparison of the

98 Evolution of the mitochondrial metabolism three reconstructed metabolisms, reveals that mitochondrial evolution was accompanied by major changes not only in the distribution of functional groups represented in the proteome but also in their specific composition. The metabolism got more biased towards energy conversion and protein synthesis by diminishing functional classes such as amino acid and nucleotide metabolism, and losing others like carbohydrate metabolism and transport or cell envelope biogenesis. Confirming earlier results [14], some of these pathways have not been completely lost from the cell, instead they moved to the cytosol or other cell compartments. This re‐targeting process involved more than half of the proto‐mitochondrial derived proteins encoded in yeast and human genomes but affected some functional classes more than others. Moreover, the low fraction (10‐16%) of ancient proto‐mitochondrial proteins in modern mitochondria, indicates an extensive renewal of proteins in all functional classes. This fraction is not uniformly distributed, with coenzyme biosynthesis (40‐60%) and energy metabolism (30‐40%) showing the highest content in proto‐mitochondrial derived proteins, while signal transduction or ion transport and metabolism have no detectable trace of a mitochondrial ancestry.

Material and Methods Genome sequence data: A total of 144 publicly available complete proteome sequences was retrieved from the EBI Proteome database (http://www.ebi.ac.uk/proteome), except for Plasmodium falciparum from PlasmoDB (http://plasmodb.org), Candida albicans from CandidaDB (http://genolist.pasteur.fr/CandidaDB), Takifugu rubripes, from Fugu Genome Project (http://www.fugu‐sg.org), Danio rerio from Sanger Sequencing project (http://www.sanger.ac.uk), Neurospora crassa from Center for Genome Research (http://www.broad.mit.edu) Homo sapiens , Mus musculus and Rattus norvergicus from the IPI set at EBI (http://www.ebi.ac.uk/IPI), and Anopheles gambiae from ensembl (http://www.ensembl.org). For the eukaryotic species the organellar genomes were included in the gene set per species.

Mitochondrial proteomics data: Sequences and annotation data for experimentally identified mitochondrial proteins in human and yeast were retrieved from the mitoproteome (http://www.mitoproteome.org) [12] and SGD [15] databases as well as the supplementary material from Sickmann et al.[13].

Reconstruction of the proto‐mitochondrial proteome: For every protein encoded in each of the eleven alpha‐proteobacterial genomes, Smith‐Waterman comparisons [16] were used to retrieve, from the complete proteomes, a set of homologous proteins with a significant similarity (E<0.01) and with a region of similarity covering more than 50% of the query sequence. The sets of homologs that included proteins from eukaryotic genomes were further analyzed. These sets were first limited to the most similar 250 sequences and additional homologous proteins were added only if they belonged to a species not already present in the initial 250 sequences. Every set of homologous sequences was aligned using MUSCLE [17]. Protein families with a likely proto‐mitochondrial origin were selected by a two‐step procedure: First, Neighbor Joining (NJ) trees were generated using Kimura distances

99 Chapter 6

as implemented in ClustalW [18], using 100 samples to perform the bootstrap analyses. Resulting phylogenetic trees were scanned by an algorithm (see below) for partitions indicating a monophyly of eukaryotic and alphaproteobacterial proteins, the resulting set of selected orthologous groups is referred to as NJ‐set. Secondly, all original alignments selected in the NJ‐set were used to generate Maximum likelihood (ML) trees using PhyML v2.1b1 [19], with a four rates gamma‐distribution model. The tree‐scanning algorithm was used for a second time over these ML trees, and the resulting selected OGs conformed the ML‐set. Note that OGs included in the ML‐set have an alpha‐proteobacterial descent that is supported by both NJ and ML tree‐ reconstruction techniques. ML‐set is thus a subset of NJ‐set and is expected to include the OGs with the strongest phylogenetic signal of an alpha‐proteobacterial origin.

Tree scanning algorithms

Selection of orthologous groups derived from the proto‐mitochondrion: Phylogenetic trees were scanned for partitions that contained eukaryotic and alpha‐ proteobacterial proteins. Presence of gamma and beta‐proteobacterial proteins in the partition was also allowed for reasons of coverage (results not shown). In case an alpha‐proteobacterial/eukaryotic partition existed, proteins in that branch were regarded as orthologs. The group was further divided if the proteins from the alpha‐ proteobacteria formed different sub‐partitions with the eukaryotic ones. We filtered out possible cases of horizontal gene transfer by examining the topology of the trees: 1) In cases where the alpha‐proteobacterial/eukaryotic group could be rooted by other species we interpreted the branching of a single alpha‐proteobacterial protein within a cluster of eukaryotic proteins as a gene transfer from a eukaryote to the alpha‐proteobacteria and discarded this group. 2) We also discarded cases where only one genus of both eukaryotes and alpha‐proteobacteria was present, eliminating proteins such as the ADP/ATP translocases that are only shared between the parasitic Rickettsia and E. cuniculi. Finally, new proteins were added to the selected orthologous groups by the OG extension algorithm (see below).

OG extension algorithm: We noticed that, in many cases, true members of the selected orthologous groups were not included in the selected tree partition. This was often caused by a single sequence from an unrelated bacterial species (not belonging to the alpha, beta or gamma subdivisions of proteobacteria), which branched within the orthologous group. This caused some true members from the selected OG to fall out of the selected partition. In order to minimize this effect, we extended the selected tree partitions to include close eukaryotic sequences in the OG if the number of sequences from unrelated species in the extended partition represented less than one fifth of the total number of sequences from alpha‐ proteobacterial species. Note that this extension algorithm, which was applied to both NJ‐ and ML‐sets, does not affect, in any manner, the number of selected orthologous groups nor the reconstructed proto‐mitochondrial metabolism. Hence, its effect is limited to the species coverage of the previously selected OGs.

100 Evolution of the mitochondrial metabolism

Metabolic reconstructions: Annotated biochemical and cellular functions of the mitochondrial proteins or proto‐mitochondrial derived orthologous groups were mapped onto metabolic KEGG maps [20] and assigned to a functional class as defined by the COG database [21]. In the case of the proto‐mitochondrial metabolism, only pathways that have several consecutive steps represented in the ML‐set were included. Additional adjacent reactions were added if they were present in the NJ‐set. In the case of the human proteome we excluded enzymes from the glycolitic pathway that is present in this set, as they likely result from contamination [22].

Results and discussion

Reconstruction of the proto‐mitochondrial proteome The phylogenetic analyses (see methods) selected a total of 1026 (NJ‐set) and 832 (ML‐ set) orthologous groups whose likely proto‐mitochondrial origin is supported, respectively, by one (NJ) and two (NJ and ML) tree reconstruction methods (see table 1 and supplementary material). Although these numbers are considerably larger than our previous estimate [14], we still consider them to be minimal estimates of the ancestral proto‐mitochondrial proteome because: (1) Many transferred genes may have diverged too much to be identified by phylogenetic analyses and (2) our procedure cannot recover genes that have been lost from either all the eukaryotic or all the alpha‐proteobacterial genomes considered. Such is the case for the bacterial division protein FtsZ, which is present in several protists, where it is presumably involved in the division of the mitochondrial inner membrane. Although FtsZ phylogeny is clearly consistent with a proto‐mitochondrial origin [23], its complete absence from all sixteen eukaryotic genomes considered here prevents its detection by our procedure. Other proteins might have been lost from all eukaryotic genomes and therefore will never be found. To provide a rough estimate of the coverage and sensitivity of our method, we benchmarked our procedure by using the mitochondrial genome of Reclinomonas americana and the genome of the bacterium Deinococcus radiodurans. The jakobid R. americana possesses one of the most gene‐rich and eubacterial‐like mitochondrial genomes known to date [24]. It encodes 67 proteins that presumably have an alpha‐ proteobacterial origin and thus can be used as a “golden set” to test the coverage of our method (table 1).

Our procedure retrieved 71,6% (NJ‐set) and 62,7% (ML‐set) of all the R. americana mitochondrial‐encoded proteins, indicating that, even when proteins occur in both alpha‐proteobacteria and eukaryotes, the phylogenetic analyses fail to recover a significant fraction of the true cases. In addition, to estimate the fraction of false positives we used the bacterium Deinococcus radiodurans, which has no direct relation to the eukaryotes [25]. Here our procedure selected, out of a total of 3085 proteins, only 34 (1.1%) in the NJ‐set and 1 (0.03%) in the ML‐set. Taken together these results indicate that both sets are minimal but highly accurate estimates of the composition of the proto‐mitochondrial proteome. Compared to our previous reconstruction [14], in which we obtained 49% and 1.3% coverages of the R. americana

101 Chapter 6

and D. radiodurans genomes, the broader genome sampling used here increases both the accuracy and the number of selected orthologous groups. The increase in selected orthologous groups is most apparent in the NJ‐set, which selection procedure can be considered largely equivalent to that used in our previous approach [14]. The inclusion of the scanning of ML trees as a second filter greatly reduces the number of false positives, at a reasonable cost of coverage.

NJ‐ Species proteins 2003 NJ‐set ML‐set Reclinomonas Americana (mitochondrion) 67 49 71,6 62,70 Deinococus radiodurans 3084 1,3 1,1 0,03 Agrobacterium tumefaciens (Cereon) 5392 ‐ 13,03 7,59 Agrobacterium tumefaciens (Washington) 5298 ‐ 13,11 11,76 Bradyrhizobium japonicum 8257 ‐ 11,18 8,25 Brucella melitensis 3186 18,1 16,1 11,08 Brucella suis 3247 ‐ 15,86 9,67 Caulobacter crecentus 3718 17,9 13,23 8,85 Magnetococcus magnetotacticum 4280 ‐ 11,36 8,74 Rhizobium loti 7259 13,3 13,08 8,94 Rhizobium meliloti 6149 13,3 13,67 9,17 Rickettsia conorii 1374 17,1 20,3 16,59 Rickettsia prowazekii 834 23,5 25,06 19,78 Total selected 630 1026 832

Table1: Selected OGs and Benchmarking of the different approaches. First column indicate the three‐letters codes and the names of the species considered. Under proteins it is stated the number of protein‐coding genes contained by each genome. Last three columns indicate the percentage of selected proteins in each genome by each approach and the total number of selected orthologous groups (bottom row). NJ‐2003 refers to the selected set of our previous approach [14], NJ‐ and ML‐sets are described in the text.

Finally, to investigate whether our procedure is reaching saturation in terms of the number of genomes used, we calculated the number of orthologous groups selected for every chronological subset of the alpha‐proteobacterial and eukaryotic genomes. The results (Figure 1) suggest that the approach has not yet reached saturation, and thus the inclusion of new alpha‐proteobacterial or eukaryotic genomes is likely to reveal new protein families of alpha‐proteobacterial descent. Nevertheless, the decrease in the curve slope suggests that the expected increase in the number of selected OGs by the inclusion of few new genomes will not be large. It must be noted, however, that the inclusion of organisms that are evolutionarily distant to the current set might reveal considerably higher numbers of new OGs, because they have likely retained different gene pools. The inclusion of genomes from anaerobic eukaryotic organisms that harbour hydrogenosomes [26, 27] will be particularly interesting.

102 Evolution of the mitochondrial metabolism

1200

1000

800

600

400 selected OGs

200

0 1234567891011 1 2 3 4 5 6 7 8 910111213141516 eukaryotic genomes alpha-proteobacterial genom es

Figure 1: Number of selected orthologous groups (y axis) in the NJ‐ (dots connected by dashed lines) and ML‐sets using A) every chronological subset of the eleven alpha‐ proteobacterial genomes compared against the rest of the proteomes and, B) every chronological subset of the sixteen eukaryotic genomes compared against the rest of the proteomes (x axes). Chronological order in the alpha‐proteobacteria: Ricketsia prowazekii, Mesorhizobium loti, Caulobacter crescentus, Rizhobium meliloti, Rickettsia conorii, Agrobacterium tumefaciens (Washington) , A. tumefaciens (cereon), Brucella melitensis, Bradyrhizobium japonicum, Brucella suis, Magnetococcus magnetotacticum. The chronological order in the eukaryotes is: Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Homo sapiens, Gillardia theta, Encephalitozoon cuniculi, Schizosaccharomyces pombe, Candida albicans, Takifugu rupripes, Plasmodium falciparum, Anopheles gambiae, Mus musculus, Neurospora crassa, Rattus norvergicus, Danio rerio.

Reconstruction of the proto‐mitochondrial metabolism Mapping the annotated functions of the selected orthologous groups onto the metabolic maps of the KEGG pathways database [20], we reconstructed the proto‐ mitochondrial metabolism (Figure 2). Pathways that are shown in the figure have several consecutive steps present in the most stringent ML‐set and have been completed or extended with adjacent reactions from the NJ‐set. The resulting reconstructed metabolism is, in general terms, similar to our previous reconstruction [14]. However, a significant reduction in the number of gaps in the pathways indicates a higher coverage. For some of the pathways the fraction of recovered steps is high enough to suggest they were complete in the proto‐mitochondrion. This is the case of Oxidative phosphorylation (28 OGs) and beta‐oxidation (7 OGs), which clearly indicate that the proto‐mitochondrion had a (facultatively) aerobic metabolism, in which the latter could provide the former with NADH and FADH2. These two pathways together with lipid synthesis, biotin, Vitamine B6 and heme synthesis and iron‐sulfur cluster assembly can be reconstructed almost completely. In contrast, some mitochondrial pathways like the citric acid cycle appear quite incomplete while the urea cycle is absent. Previous work on the origin of the citric

103 Chapter 6

acid cycle in yeast [28, 29] show a complex phylogeny for this group of proteins and are consistent with our results. Notably, the incomplete citric acid cycle predicted by our analyses also exists in present day organisms like Chlamydia [29, 30], and can be used for the catabolism of glutamate via 2‐oxoglutarate. As expected [14, 31], the glycolitic pathway is not recovered but we do find some steps from fructose and mannose metabolism like fructose‐2,6‐biphosphatase and mannose‐6‐P isomerase, and a considerable number of connected steps from the pentose phosphate pathway such as transketolase and deoxyribokinase (Fig 2). Pathways from pentose phosphate metabolism could have provided the proto‐ mitochondrion with intermediates for the anabolism of amino acids, vitamins and nucleotides. Indeed, the synthesis of erithrose‐4‐P from glucose provides the link between the reconstructed pentose phosphate pathway and Vitamine B6 synthesis (Fig 2). Nucleotide metabolism (23 OGs) is indeed well represented in the proto‐ mitochondrion but, in contrast to the above‐mentioned pathways, contains mainly “isolated enzymes” (for the exceptions see Figure 2) and are far from complete. From aminoacid metabolism (60 OGs) we recover many stretches of interconnected steps, separated by some gaps. Some of the amino acid metabolism enzymes in the proto‐ mitochondrion such as Threonine synthase, Threonine dehydratase and L‐Serine dehydratase, are specifically involved in the interconversion of amino acids, indicating a potential to convert certain amino acids to others. Furthermore, the abovementioned Vitamine B6 is needed as a by enzymes that catalize transaminations and other reactions of the amino acid metabolism, thus indicating consistency of the reconstructed metabolism. Particularly relevant is the presence of transporters in the proto‐ mitochondrial proteome, because they provide essential information about the proto‐ mitochondrion’s relationship with the host, specifically in the cases where they can be linked to the pathways in the proto‐mitochondrion. In general, the abundance of metabolite transporters, which was already described in our previous work [14], suggests a host dependency of the proto‐mitochondrion. Of the cation transporters the Fe2+ importer is particularly interesting since it could have provided the iron for the FeS cluster assembly pathway. Notably, the protein importing iron into present day mitochondria has not been identified, the candidates from this analysis have, however, only been observed in the cellular membrane [32]. Also the protein that exports FeS clusters from mitochondria (ATM1) appears to have been present in the proto‐mitochondrion, although it might not have had an FeS export function at that stage of evolution. There are several other cation transporters (Mg2+/Co2+ and K+) that could have been used either to maintain the ion homeostasis or to obtain the cofactors needed for the enzyme activities. The emerging picture thus is that of a (facultatively) aerobic endosymbiont catabolizing lipids, glycerol and amino acids provided by the eukaryotic host. From the host point of view, although energy conversion has been a dominant factor throughout the evolution of the mitochondria, this appears not to have been the sole benefit from the early symbiotic relationship. Therefore, our previous suggestion of a multifaceted benefit for the host [14] still holds on.

104 Evolution of the mitochondrial metabolism

Reconstruction of the yeast and human mitochondrial metabolisms Recent advances in proteomics techniques have provided the means to approximate the full identification of the protein complement of mitochondria [33, 34]. In recent years, large datasets of mitochondrial proteomes of several model organisms have been published [12, 13, 22, 35, 36]. For their relevance and high coverage we have chosen the MitoProteome database of experimentally identified human mitochondrial proteins [12] and a recent mass spectrometry analysis of highly pure isolated yeast mitochondria [13]. From the annotated functions of the proteins in these sets we inferred (see methods) the metabolisms of present‐day mitochondria from yeast (Figure 3) and human (Figure 4). Both reconstructed metabolisms, and specially that of human mitochondria, have to be regarded as incomplete, because the proteomics sets do not present full coverage and some of the identified mitochondrial proteins do not have a (predicted) function. Moreover they might miss complete pathways that are only expressed in certain tissues or under specific conditions. Nevertheless we consider both sets to be highly informative and representative enough to provide an overall picture of the metabolic composition of modern mitochondria. A comparison between both modern metabolism is shown in figure 5 where only major shared and lineage‐specific pathways are represented.

Remarkably, the overlap between the human and yeast mitochondrial proteomes is rather modest (figure 5) with 312 (42%) of the yeast mitochondrial proteins having 400 (47%) orthologs in human mitochondria. This difference can be explained by an incomplete coverage of both sets, by the existence of metabolic differences between human and yeast mitochondria, or by a combination of both effects. At least for the observed large fraction of human mitochondrial proteins that cannot be found in yeast mitochondria, we expect the lack of coverage to have a minor effect because the proteome set used here is estimated to account for about 80-90% of the total proteome [7, 13]. In addition, in many cases, the lack of yeast mitochondrial orthologs of human mitochondrial proteins affects complete pathways, or there is no ortholog in the complete yeast genome. This indicates that, besides a common core of functions, there is a significant metabolic differentiation in the mitochondria of these two species. Examples of specific pathways and complexes that are present in human mitochondria but absent from their yeast counterparts are NADH:Ubiquinone oxidoreductase (Complex I) [37], fatty acid beta-oxidation [38], steroid biogenesis [39], and the apoptotic Bcl2-family signaling pathway [40]. Conversely, glycerone-P metabolism, trehalose synthesis and starch and sucrose degradation appear to be specific for yeast mitochondria. Many other pathways such as most of the oxidative phosphorylation, Fe-S cluster assembly, the ATP/ADP translocator, and the protein synthesis and import machineries are shared by mitochondria from both organisms.

Comparative analysis of present and past mitochondrial metabolism In order to compare the overall functional diversity of the reconstructed metabolisms, and, therefore, trace the metabolic transition of mitochondria, we used the COG functional classification scheme [41, 42] to classify the considered proteomes (Figure 6). In the proto‐mitochondrion, the largest fractions of proteins with known function are devoted to energy conversion (13,8%), amino acid metabolism (14,3%) and protein synthesis (9,6%). Compared to the free‐living alpha‐ proteobacteria Caulobacter crescentus (6%, 8.5%, and 10% respectively) and

105 Chapter 6

Mesorhizobium loti (7.2%, 16%, and 4.4% ) or the parasitic species Rickettsia prowazekii (11.7%, 4.3%, and 20.2%), the major bias in the proto‐mitochondrion is towards energy conversion and the metabolism of amino acids. Conversely, processes such as cell division (1%) or signal transduction (1%) are nearly non‐existent, and were either not very elaborate in the proto‐mitochondrion or, more likely, have been extensively lost in either the eukaryotes or the alpha‐proteobacteria. A similar bias, in terms of which functional classes dominate, is also found in modern mitochondria. However, the bias appears stronger, specifically towards energy conversion and protein synthesis and folding that together represent more than the 50% of the proteins with known function (as compared to 28% in the proto‐mitochondrion). That the functional bias of present‐day mitochondria is more pronounced than that of the proto‐mitochondrion is confirmed by calculating the entropy (H) of the distribution of proteins among functional classes (H = ‐ Σ (Pi x log Pi); being Pi the relative frequency of the class i). The entropy is lower (the distribution is more dominated by a few frequencies) in yeast (H=0.79) and human (H=0.95) than in the proto‐mitochondrion (H=1.09), confirming an increase in the level of specialization, which is most pronounced in yeast. As part of this specialization, the classes Amino acid metabolism and secondary metabolism have been significantly diminished while “carbohydrate metabolism and transport” or “cell envelope biogenesis” have virtually disappeared. A major proteome turnover during mitochondrial evolution The overall similarity in bias between the proto‐mitochondrion and the current mitochondria is particularly striking when one realizes that there has been a massive turnover of proteins. Indeed, previous analysis of modern mitochondrial proteomes [14, 43] have shown that only a minor fraction of it has a clear alpha‐ proteobacterial ancestry. This extensive turnover is confirmed in our present analysis, although the use here of broader proteomics sets and different phylogenetic approaches introduces variations in the estimates. Quantitatively, only 16,3% (human) and 12,6% (yeast) of the mitochondrial proteome can be traced back to the protomitochondrion if we use the NJ‐set as a reference. When the more stringent ML‐ set is used, these percentages are reduced to 13,6% and 10,8%, respectively. The low fraction of proto‐mitochondrial proteins in modern mitochondria is the result of the combination of a proteome reduction and a proteome expansion processes [7, 44]. Firstly, some proto‐mitochondrial pathways such as LPS‐biosynthesis or lipid synthesis have been lost from the mitochondrion. Secondly, new proteins have been recruited to the mitochondrion by the gain of novel pathways, such as the protein import machinery [45] or the ATP/ADP transporter. These processes are accompanied, as well, by a parallel expansion and reduction of the metabolic capacities. In addition, the amelioration of some pathways, such as the recruitment of new subunits to complex I [37] and other electron transport chain complexes [46], would have contributed to the proteome renewal without significantly altering the metabolic capacities. Our results (Figure 6) show that these processes have affected some functional classes more than others. For instance, the fraction of alpha‐

106 Evolution of the mitochondrial metabolism proteobacterial derived proteins is relatively larger in classes such as coenzyme metabolism (57% in yeast and 47% in human) or energy production and conversion (41 and 30,6%) than in classes such as translation (15,7 and 5,4%) or protein turnover and chaperones (12 and 14%).

25%

20%

15%

10%

5%

0%

n . t. . . o rt, d. ti a es s et bol a etab. a ron metb. m ransl chr. P yme T anscription z : Chape J L: Replication and lity and secret & conversion . Trans & Met.en : Tr if. ti : Signal trant. ansp. & K T c tid fus M: Cell envelope Co ll eo : e N: mo mino ac. Trns & M H t. Mod A a I: Lipid tr Q: Secondary met D: C P: Ion transport and met. E: ansl arbohydrates Trans & MeF: Nucl C : osttr C: Energy produG O: P

Figure 6: Relative weights of the functional classes in the proteomes of the proto-mitochondrion (left bars), and modern yeast (middle) and human (right) mitochondrial proteomes. The color in the bars indicate the origin of the proteins in that functional class for that given organism (yellow: alphaproteobacterial origin. red: Other origin). We used the NJ-set as a reference to calculate the fraction that is evolutionary derived from the proto-mitochondrion. Functional classes are derived from COG [42]. Relative numbers of the proto-mitochondrion and yeast are comparable as the number of proteins with known functions is similar. Alpha-proteobacterial derived proteins are a minority in all classes except Coenzyme biosynthesis, the energy production/conversion class is the most “alpha- proteobacterial derived”.

The (almost) complete renewal of classes such as cell division and fusion, transcription, replication and signal transduction is consistent with the fact that a major difference of the early endosymbiont and present‐day mitochondria is that the latter are under the full control of the host. Although chloroplasts usually have retained a bacterial type division machinery, most mitochondria use a completely eukaryotic‐derived system [47], something which could have facilitated the control on the number and shape of mitochondria in a cell. Moreover, the renewal of proteins involved in transcription and replication may account for the different structural features of mitochondrial genes and genomes that distinguish them from their bacterial counterparts. For instance, mitochondrial genes are transcribed by a phage‐like RNA polymerase [48] and can incorporate introns [49].

107 Chapter 6

Sub‐cellular re‐targeting of proto‐mitochondrial proteins As we previously reported [14], a significant fraction of the proto‐ mitochondrial proteins that are not present in modern mitochondria were not completely lost from the genome. Instead, they likely moved to other parts of the cell. Compared to our previous estimates, the use here larger mitochondrial sets reduces, as expected, the non‐mitochondrial fractions of proteins of proto‐mitochondrial descent. In the present set, non‐mitochondrial proteins represent more than 50% (68% in human and 57% in yeast) of the total set of alpha‐proteobacterial derived proteins in the cell. This fraction includes complete pathways such as lipid synthesis, fructose and mannose metabolism and fatty‐acid oxidation in yeast. This extensive re‐targeting of proteins shows that the contribution from the mitochondrial endosymbiosis to the eukaryotic cell goes beyond the pathways that are currently present in mitochondria. Quantitatively, this is similar to what has been observed for proteins derived from the cyanobacteria, of which the majority are not known to be located in the chloroplast [50]. As in the case of the mitochondrial proteome turnover, the process of re‐targeting has also affected some classes more than others. For instance, from the 41 yeast proto‐mitochondrial derived proteins in our NJ‐set whose mutants impair respiration according to a large‐scale analysis [51], 36 (88%) have a mitochondrial localization. This indicates that most of the respiratory metabolism donated by the mitochondrial ancestor has remained inside the organelle.

Concluding remarks As we show here, the estimated proto‐mitochondrial proteome size depends on the number of genomes used and has not yet reached saturation. This, together with the incomplete coverage of our approach suggests that the proto‐mitochondrial proteome might have been composed of, at least, thousand proteins, but probably many more. A massive donation of genes from the mitochondrial ancestor has been proposed to account for the excess of eubacterial genes present in modern eukaryotes [52]. However, it must be noted that our reconstructed proteome is the result of a summation of genes found in different eukaryotic genomes. The extensive gene loss that has affected this proto‐mitochondrial gene pool (see chapter 4) has resulted in small numbers of alpha‐proteobacterial genes in the individual eukaryotic genomes. For instance, the NJ‐set contains only 172 yeast proteins that represent ~3% of the total genome. This seems to reinforce the idea of an archaeal‐eubacterial fusion predating the mitochondrial endosymbiosis, although we cannot discard that extensive pre‐endosymbiotic horizontal gene transfer between the alpha‐ proteobacterial ancestor and other bacteria had already masked the alpha‐ proteobacterial affiliation of most proto‐mitochondrial genes [52]. Regarding its metabolic capacities, the reconstructed proto‐mitochondrion suggests a (facultatively) aerobic endosymbiont with a broad metabolism and a multifaceted benefit for its host. The inferred proto‐mitochondrial metabolism presented here is largely similar to our previous reconstruction [14]. However, the use of a broader genome sampling has increased the accuracy and coverage at which the metabolism can be reconstructed. It is expected that the inclusion in the set of specific organisms such as those harboring hydrogenosomes [53], might reveal new pathways as these

108 Evolution of the mitochondrial metabolism organisms have likely retained different sets of proto‐mitochondrial proteins. The comparison of the reconstructed ancestral metabolism with that of present‐day mitochondria reveals that mitochondrial evolution was accompanied by major changes not only in functional groups represented in the proteome but also in their specific compositions. The metabolism got more biased towards energy conversion and protein synthesis by diminishing functional classes such as amino acid and nucleotide metabolism. Some of these pathways were re‐targeted to other parts of the cell. The processes of protein gain, loss and re‐targeting have acted in a lineage‐ specific manner, thus resulting in the metabolic differences observed between human and yeast mitochondria. Pathways that are common to both organisms and are not of proto‐mitochondrial origin such as protein import machinery and the ATP/ADP translocator have likely played a key role in the bacterium‐to‐organelle transition of mitochondria.

109 Chapter 6

Figure2: An overview of metabolism and transport in the proto‐mitochondrion, as deduced from the orthologous groups present in its estimated proteome (see text). Boxes, arrows, and cylinders indicate pathways, enzymes, and transporters, respectively. Several consecutive steps can be condensed in a bigger arrow with a number indicating the steps included. Single missing steps connecting recovered pathways are indicated as dashed lines.

110 Evolution of the mitochondrial metabolism

Figure3: Reconstructed yeast mitochondrial metabolism, as deduced from the function of the proteins present in the proteomics set [13]. Symbols as in figure 2.

111 Chapter 6

Figure 4: Reconstructed human mitochondrial metabolism, as deduced from the function of the proteins compiled in the MitoProteome dataset [12]. Symbols as in figure 2.

112 Evolution of the mitochondrial metabolism

Figure 5: (Note a color version of this figure is shown in appendix 2) Comparison of the reconstructed human and yeast mitochondrial metabolisms. Pathways shared by the two species are depicted in the middle region of the figure, pathways at the extremes of the dashed lines are exclusive for human (left) or yeast (right) mitochondria. Color codes indicate whether the pathway was likely present in the proto‐mitochondrion (blue) or has a different origin (red). Only those pathways with two or more consecutive steps are depicted. Symbols as in figure 2.

113 Chapter 6

References

1 Gray, M. W., Burger, G. and Lang, B. F. 1999. ʺMitochondrial evolutionʺ Science (283) 5407:1476‐81 2 Andersson, S. G., et al. 1998. ʺThe genome sequence of Rickettsia prowazekii and the origin of mitochondriaʺ Nature (396) 6707:133‐40 3 Margulis, L. 1981 ʺSymbioses in Cell Evolutionʺ edited by W. H. Freeman W.H. Freeman San Francisco 4 Martin, W., et al. 2001. ʺAn overview of endosymbiotic models for the origins of eukaryotes, their ATP‐producing organelles (mitochondria and hydrogenosomes), and their heterotrophic lifestyleʺ Biol Chem (382) 11:1521‐39 5 Martin, W. and Müller, M. 1998. ʺThe hydrogen hypothesis for the first eukaryoteʺ Nature (392) 6671:37‐41 6 Emelyanov, V. V. 2001. ʺRickettsiaceae, rickettsia‐like endosymbionts, and the origin of mitochondriaʺ Biosci Rep (21) 1:1‐17 7 Gabaldón, T. and Huynen, M. A. 2004. ʺShaping the mitochondrial proteomeʺ Biochim Biophys Acta (1659) 2‐3:212‐20 8 Scheffler, I. E. 2001. ʺMitochondria make a come backʺ Adv Drug Deliv Rev (49) 1‐2:3‐ 26 9 Tielens, A. G., et al. 2002. ʺMitochondria as we donʹt know themʺ Trends Biochem Sci (27) 11:564‐72 10 Gray, M. W., Burger, G. and Lang, B. F. 2001. ʺThe origin and early evolution of mitochondriaʺ Genome Biol (2) 6:REVIEWS1018 11 Heazlewood, J. L., et al. 2003. ʺWhat makes a mitochondrion?ʺ Genome Biol (4) 6:218 12 Cotter, D., et al. 2004. ʺMitoProteome: mitochondrial protein sequence database and annotation systemʺ Nucleic Acids Res (32 Database issue) D463‐7 13 Sickmann, A., et al. 2003. ʺThe proteome of Saccharomyces cerevisiae mitochondriaʺ Proc Natl Acad Sci U S A (100) 23:13207‐12 14 Gabaldón, T. and Huynen, M. A. 2003. ʺReconstruction of the proto‐mitochondrial metabolismʺ Science (301) 5633:609 15 Christie, K. R., et al. 2004. ʺSaccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organismsʺ Nucleic Acids Res (32 Database issue) D311‐4 16 Smith, T. F. and Waterman, M. S. 1981. ʺIdentification of common molecular subsequencesʺ J Mol Biol (147) 1:195‐7 17 Edgar, R. C. 2004. ʺMUSCLE: a multiple sequence alignment method with reduced time and space complexityʺ BMC Bioinformatics (5) 1:113 18 Thompson, J. D., Higgins, D. G. and Gibson, T. J. 1994. ʺCLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position‐specific gap penalties and weight matrix choiceʺ Nucleic Acids Res (22) 22:4673‐80 19 Guindon, S. and Gascuel, O. 2003. ʺA simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihoodʺ Syst Biol (52) 5:696‐704 20 Kanehisa, M. and Goto, S. 2000. ʺKEGG: kyoto encyclopedia of genes and genomesʺ Nucleic Acids Res (28) 1:27‐30 21 Tatusov, R. L., et al. 2003. ʺThe COG database: an updated version includes eukaryotesʺ BMC Bioinformatics (4) 1:41

114 Evolution of the mitochondrial metabolism

22 Taylor, S. W., et al. 2003. ʺCharacterization of the human heart mitochondrial proteomeʺ Nat Biotechnol (21) 3:281‐6 23 Kiefel, B. R., Gilson, P. R. and Beech, P. L. 2004. ʺDiverse eukaryotes have retained mitochondrial homologues of the bacterial division protein FtsZʺ Protist (155) 1:105‐ 15 24 Lang, B. F., et al. 1997. ʺAn ancestral mitochondrial DNA resembling a eubacterial genome in miniatureʺ Nature (387) 6632:493‐7 25 Makarova, K. S., et al. 1999. ʺComparative genomics of the Archaea (Euryarchaeota): evolution of conserved protein families, the stable core, and the variable shellʺ Genome Res (9) 7:608‐28 26 Embley, T. M., et al. 2003. ʺMitochondria and hydrogenosomes are two forms of the same fundamental organelleʺ Philos Trans R Soc Lond B Biol Sci (358) 1429:191‐201; discussion 201‐2 27 Dyall, S. D. and Johnson, P. J. 2000. ʺOrigins of hydrogenosomes and mitochondria: evolution and organelle biogenesisʺ Curr Opin Microbiol (3) 4:404‐11 28 Huynen, M. A., Dandekar, T. and Bork, P. 1999. ʺVariation and evolution of the citric‐ acid cycle: a genomic perspectiveʺ Trends Microbiol (7) 7:281‐91 29 Read, T. D., et al. 2000. ʺGenome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39ʺ Nucleic Acids Res (28) 6:1397‐406 30 Schnarrenberger, C. and Martin, W. 2002. ʺEvolution of the enzymes of the citric acid cycle and the glyoxylate cycle of higher plants. A case study of endosymbiotic gene transferʺ Eur J Biochem (269) 3:868‐83 31 Canback, B., Andersson, S. G. and Kurland, C. G. 2002. ʺThe global phylogeny of glycolytic enzymesʺ Proc Natl Acad Sci U S A (99) 9:6097‐102 32 Yun, C. W., et al. 2000. ʺSiderophore‐iron uptake in saccharomyces cerevisiae. Identification of ferrichrome and fusarinine transportersʺ J Biol Chem (275) 21:16354‐9 33 Warnock, D. E., Fahy, E. and Taylor, S. W. 2004. ʺIdentification of protein associations in organelles, using mass spectrometry‐based proteomicsʺ Mass Spectrom Rev (23) 4:259‐80 34 Taylor, S. W., Fahy, E. and Ghosh, S. S. 2003. ʺGlobal organellar proteomicsʺ Trends Biotechnol (21) 2:82‐8 35 Mootha, V. K., et al. 2003. ʺIntegrated analysis of protein composition, tissue diversity, and gene regulation in mouse mitochondriaʺ Cell (115) 5:629‐40 36 Tanaka, N., et al. 2004. ʺProteomics of the rice cell: systematic identification of the protein populations in subcellular compartmentsʺ Mol Genet Genomics 37 Gabaldón, T., Rainey, D. and Huynen, M. A. 2005. ʺTracing the Evolution of a Large Protein Complex in the Eukaryotes, NADH:Ubiquinone Oxidoreductase (Complex I)ʺ J Mol Biol (348) 4:857‐70 38 van Roermund, C. W., et al. 2003. ʺFatty acid metabolism in Saccharomyces cerevisiaeʺ Cell Mol Life Sci (60) 9:1838‐51 39 Miller, W. L. 1995. ʺMitochondrial specificity of the early steps in steroidogenesisʺ J Steroid Biochem Mol Biol (55) 5‐6:607‐16 40 Kuwana, T. and Newmeyer, D. D. 2003. ʺBcl‐2‐family proteins and the role of mitochondria in apoptosisʺ Curr Opin Cell Biol (15) 6:691‐9 41 Tatusov, R. L., et al. 2000. ʺThe COG database: a tool for genome‐scale analysis of protein functions and evolutionʺ Nucleic Acids Res (28) 1:33‐6 42 Tatusov, R. L., et al. 2001. ʺThe COG database: new developments in phylogenetic classification of proteins from complete genomesʺ Nucleic Acids Res (29) 1:22‐8 43 Karlberg, O., et al. 2000. ʺThe dual origin of the yeast mitochondrial proteomeʺ Yeast (17) 3:170‐87

115 Chapter 6

44 Andersson, S. G., et al. 2003. ʺOn the origin of mitochondria: a genomics perspectiveʺ Philos Trans R Soc Lond B Biol Sci (358) 1429:165‐77; discussion 177‐9 45 Wiedemann, N., Frazier, A. E. and Pfanner, N. 2004. ʺThe protein import machinery of mitochondriaʺ J Biol Chem (279) 15:14473‐6 46 Berry, S. 2003. ʺEndosymbiosis and the design of eukaryotic electron transportʺ Biochim Biophys Acta (1606) 1‐3:57‐72 47 Osteryoung, K. W. and Nunnari, J. 2003. ʺThe division of endosymbiotic organellesʺ Science (302) 5651:1698‐704 48 Tracy, R. L. and Stern, D. B. 1995. ʺMitochondrial transcription initiation: promoter structures and RNA polymerasesʺ Curr Genet (28) 3:205‐16 49 Burger, G., Gray, M. W. and Lang, B. F. 2003. ʺMitochondrial genomes: anything goesʺ Trends Genet (19) 12:709‐16 50 Martin, W., et al. 2002. ʺEvolutionary analysis of Arabidopsis, cyanobacterial, and chloroplast genomes reveals plastid phylogeny and thousands of cyanobacterial genes in the nucleusʺ Proc Natl Acad Sci U S A (99) 19:12246‐51 51 Steinmetz, L. M., et al. 2002. ʺSystematic screen for human disease genes in yeastʺ Nat Genet (31) 4:400‐4 52 Esser, C., et al. 2004. ʺA genome phylogeny for mitochondria among alpha‐ proteobacteria and a predominantly eubacterial ancestry of yeast nuclear genesʺ Mol Biol Evol (21) 9:1643‐60 53 Embley, T. M., et al. 2003. ʺHydrogenosomes, mitochondria and early eukaryotic evolutionʺ IUBMB Life (55) 7:387‐95

116 Evolution of the mitochondrial metabolism

117 Chapter 7

It is not the strongest of the species that survives. It is the one that is the most adaptable to change.

Charles Darwin

118 Shaping the mitochondrial proteome

Chapter 7

Shaping the mitochondrial proteome

Toni Gabaldón and Martijn A. Huynen

Biochim. Biophys. Acta.‐ Bioenergetics 2004 Dec 6;1659(2‐3):212‐20.

119 Chapter 7

Summary Mitochondria are eukaryotic organelles that originated from a single bacterial endosymbiosis some 2 billion years ago. The transition from the ancestral endosymbiont to the modern mitochondrion has been accompanied by major changes in its protein content, the so‐called proteome. These changes included complete loss of some bacterial pathways, amelioration of others and gain of completely new complexes of eukaryotic origin such as the ATP/ADP translocase and part of the mitochondrial protein import machinery. This renewal of proteins has been so extensive that only a 15‐16% of modern mitochondrial proteome has an origin that can be traced back to the bacterial endosymbiont. The rest consists of proteins of diverse origin that were eventually recruited to function in the organelle. This shaping of the proteome content reflects the transformation of mitochondria into a highly specialized organelle that, besides ATP production, comprises a variety of functions within the eukaryotic metabolism. Here we review recent advances in the fields of comparative genomics and proteomics that are throwing light onto the origin and evolution of the mitochondrial proteome.

Introduction According to the widely accepted endosymbiotic theory, mitochondria are eukaryotic organelles of bacterial descent. Data supporting this theory point to an ancient alpha‐proteobacterium, most likely an ancestor of the Rickettsiales, establishing a symbiotic relationship inside a primitive eukaryotic cell some 2 billion years ago [1‐3]. Since then the mitochondrion has undergone major changes that transformed it into a highly specialized organelle that plays a key role within the metabolism of most eukaryotic cells. This process involved not only a dramatic reduction at the level of its genome but also an extensive renewal at the level of its proteome that affected more than 80% of the protein content of modern mitochondria. Nowadays mitochondria are present in a multitude of eukaryotic organisms adapted to a variety of niches [4], with modern mitochondrial proteomes reflecting a large diversity in organellar capabilities. Moreover there is increasing evidence that certain organisms that lack mitochondria “sensu‐stricto” actually harbor relicts of these organelles. These, together with the hydrogenosomes, most of which appear of mitochondrial descent [5], reflect extreme forms of mitochondrial adaptation to microarophilic or anaerobic environments. The evolution of the mitochondrial genome in both its structure and gene content has been the focus of several reviews [6‐8], but the scarcity of data has long prevented similar surveys on the evolution of the mitochondrial proteome. Recent advances in the fields of proteomics and comparative genomics begin to give us a picture on the processes that shaped the mitochondrial proteome from the early stages of the endosymbiosis to the modern adaptation of mitochondria to diverse environments and cellular functions. Recent reviews have focused on the evolution of a specific mitochondrial system, such us the mitochondrial import machinery [9], or have analyzed more generally the processes that transformed ancestors of both mitochondria and plastids into modern organelles [10]. Here we review recent advances and data on the proteomes of the mitochondrial ancestor and modern mitochondria and discuss the

120 Shaping the mitochondrial proteome most relevant events that affected the mitochondrial protein content and hence its metabolic potentialities.

The starting point: the proto‐mitochondrial proteome To unravel the evolution of the mitochondrial proteome it is essential to get an idea of what it looked like during the first endosymbiotic stages, the common point from which all modern mitochondrial proteomes evolved. This knowledge of the mitochondrial ancestor’s proteome is also crucial to understand how the initial symbiosis was established and perpetuated, an issue that is hotly debated and for which several hypotheses have been proposed [11‐15]. First approaches to infer the mitochondrial ancestor’s nature were based on the similarity, in terms of metabolic capabilities, of mitochondria with some bacterial groups. In this way the gamma‐ proteobacterium Bdellovibrio and the alpha‐proteobacterium Paracoccus were the first proposed models for the proto‐mitochondrion [16]. Later on, phylogenetic analyses of small subunit ribosomal RNAs and proteins from the respiratory complexes confirmed a monophyletic origin of mitochondria from an alpha‐proteobacterial ancestor [1, 3]. Moreover the sequencing of the genome of the bacterium Rickettsia prowazekii [17] and subsequent phylogenetic analyses [17, 18] identified members of the Rickettsia genus as the closest relatives of modern mitochondria. Some of the Rickettsiales are obligate intracellular parasites, a feature that suggested a similar life‐style for the mitochondrial ancestor [19]. However any attempt to establish parallelisms between the proto‐mitochondrial proteome and those of modern alpha‐ proteobacteria must be cautious, bearing in mind the 2 billion years of evolution that separate them. Indeed the adaptation to an intracellular lifestyle of the modern Rickettsiales is more likely the result of a different event than that of the endosymbiosis of mitochondria and therefore the similarities in both processes are rather the result of convergent evolution [17, 20]. The identification of its phylogenetic affiliation with the alpha‐proteobacteria narrowed the scope of the speculations on the proto‐mitochondrion’s lifestyle but the great diversity of size and composition of modern alpha‐proteobacterial genomes, ranging from 834 to more than 8000 protein‐coding genes, provides still enough room for many alternative models. The problem can be re‐formulated as to which extent the proto‐ mitochondrial proteome resembled that of modern alpha‐proteobacteria. To answer this question it is necessary to distinguish truly common features from those resulting from secondary adaptations. A valid approach is to trace back the origin of the modern proteomes to find out which proteins are directly derived from the endosymbiotic event. Assuming no genetic transfer to the mitochondrial genome, the proteins encoded there constitute a ‘bona‐fide’ subset of proteins derived from the proto‐mitochondrion. But mitochondrial genomes sequenced so far encode only 3‐67 proteins that are involved in few processes, mainly respiration and protein synthesis and, occasionally, transcription, RNA maturation and protein import [6]. The set was extended by a phylogenetic analysis of the yeast mitochondrial proteome [21] that identified an additional number (~20) of nuclear‐encoded proteins whose phylogenies indicated an alpha‐proteobacterial origin. When combined both sets form a reduced core of the proto‐mitochondrial proteome whose deduced metabolic capabilities reflect that of a cell harboring few metabolic pathways, able to couple

121 Chapter 7

electron transport to the production of ATP and synthesizing the required proteins. This picture changed considerably after a large‐scale phylogenetic comparison of alpha‐proteobacterial and eukaryotic genomes [22] identified 630 eukaryotic proteins that were likely derived from the alpha‐proteobacterial ancestor of mitochondria. Mapping the metabolic functions of these orthologous groups, a minimal metabolism for the proto‐mitochondrion ancestor could be reconstructed. Besides the above‐ mentioned processes, other pathways such as the oxidation and synthesis of fatty acids, biotine and heme synthesis, iron‐sulfur cluster assembly, and fructose and sucrose metabolism pathways emerged. Also notable was the presence of many metabolite transporters. Altogether the accumulation of data on the proto‐ mitochondrial proteome points towards a (facultatively) aerobic organism living on several compounds provided by the host. Whether the proto‐mitochondrion was a parasite, something that is compatible with the data available, depends on the existence of potential benefits for the host. In the case of a mutual benefit, the lack of an ATP transporter and the presence of pathways not directly related with the oxidative phosphorylation suggest that ATP was not the main currency used by the proto‐mitochondrion to pay back host’s services. The presence of the Fe‐S cluster assembly pathway and the ABC transporter that is likely involved in the export of Fe‐S clusters from mitochondria (ATM1) indicate an alternative benefit to the host that could have been key in the initial symbiotic relationship. This would be in agreement with the finding that proteins of this pathway are among the few conserved by mitochondrial remnants in the microsporidian Enzephalitozoon cuniculi [23], the protozoan Giardia intestinalis [24] and the apicomplexan Cryptosporidium parvum [25]. Although a secondary loss in these organisms of the pathway that was the original rationale for the symbiosis cannot be discarded.

A crucial step: the origin of the mitochondrial import machinery According to current data, the proto‐mitochondrial proteome soon underwent a series of transformations that shaped it, this was triggered to a great deal by the hallmark acquisition of a mechanism that allowed the import of proteins to the mitochondrion. The above mentioned proto‐mitochondrial reconstruction points to an ancient endosymbiont that was autonomous regarding protein synthesis, with no sophisticated system for the import of proteins synthesized elsewhere. This contrasts with the modern situation in which most mitochondrial proteins are encoded by nuclear genes, synthesized in the cytoplasm and subsequently imported into the organelle. The latter step is carried out by a complex machinery consisting of dozens of proteins located in the inner and outer membranes of the mitochondria [26], as well as many soluble chaperones that assist in the process. This machinery recognizes specific N‐terminal signals that are sufficient and necessary to direct the import of proteins into mitochondria. The existence of such a system constitutes a pre‐requisite not only for the escape from mitochondria to the nucleus of genes whose products should be targeted back but also for the recruitment of proteins of different origin to the organelle. Considering that both processes have been rampant [7, 22], there is little doubt that the emergence of the protein import system was a crucial step in the evolution of mitochondria. Indeed it might well be the event that marked the beginning of the irreversibly transition from endosymbiont to organelle,

122 Shaping the mitochondrial proteome because once genes coding for essential mitochondrial proteins were transferred to the nucleus, the host took definitive control of the mitochondrion.

The phylogenetic analysis of the mitochondrial import machinery [9] reveals that although most of its components are of eukaryotic origin, some of the components in the inner membrane and most of the soluble chaperones that assist the translocation have bacterial homologs. This mixed origin argues in favor of the hypothesis [9] that a rudimentary translocation system already existed in the ancestor and rapidly evolved into a more sophisticated machinery able to import proteins with high efficiency and specificity. Furthermore in its initial stages the protein import system could have been tightly coupled with translation as suggested by the preferably synthesis of ancient mitochondrial proteins in polysomes attached to mitochondria [27]. The upgrading process of the translocation system should have run parallel to the evolution of the pre‐sequences that direct the targeting of the mitochondrial proteins. These pre‐sequences have been proposed to have evolved from proteins with an inherent propensity to be targeted to the mitochondria [28]. Furthermore such targeting sequences are not uncommon in prokaryotic proteins and can easily be matched by random sequences [29]. Once present in a few genes it is easy to conceive that these pre‐sequences were passed to other genes by means of duplication and recombination events [30], allowing evolution to potentially test the mitochondrial localization of any nuclear‐encoded protein. This paved the way to the recruitment of proteins to function in the mitochondrion and therefore conferred a new dimension to the evolution of the mitochondrial proteome because it added expansion to the process of reduction.

Turning mitochondria into the cell’s energy factory Perhaps the gain of function that most radically affected the role of the mitochondrion within the eukaryotic cell was the acquisition of an ATP/ADP translocase and therefore the ability to pump ATP out of the mitochondria into the cell’s cytoplasm. Although the ATP‐production system is derived from the bacterial ancestor the ATP/ADP translocase has a eukaryotic origin [31]. The origin of the ATP/ADP exchanger provided mitochondria with a new function for the cell: that of an energy‐producing organelle. This new role might have favored as well subsequent mitochondrial transformations such us the elaborate folding pattern of the inner membrane (cristae) and the increase in complexity of the ATP‐synthase [32] and the electron transport chain (ETC) complexes [33]. This increase in complexity has affected all mitochondrial ETC complexes with the exception of Succinate dehydrogenease, notably the only ETC complex that does not pump protons across the inner membrane. For the rest a ~3‐fold increase in terms of protein content can be observed when the bacterial components are compared to their mammal counterparts: 3 to 11 subunits in the Cytochrome bc1 complex, 4 to 13 in Cytochrome c oxidase and 14 to 46 NADH:Ubiquinone oxidoreductase (complex I). In the latter, the addition of extra subunits has not altered the characteristic L‐shaped structure of the complex [34]. Notably, not all new subunits recruited to the ETC complexes are of eukaryotic origin. The acyl carrier component of complex I is actually present in alpha‐proteobacteria, but has never been identified as part of complex I in prokaryotes [35]. It is difficult to assess the function of the so‐called subsidiary

123 Chapter 7

subunits and their participation in the biogenesis or stabilization of the complexes is the most prevalent hypothesis [34, 36].

The presence of the mitochondrial‐type ATP/ADP translocase in both aerobic and anaerobic mitochondria studied so far [37] as well as in hydrogenosomes [5, 38] suggests that this gain of function preceded the diversification of mitochondria in terms of the terminal acceptor of electrons used to fuel the synthesis of ATP, and raise doubts on which was the original sink for the electron transport chain [4]. Besides oxygen, modern mitochondria are able to use alternative electron acceptors such as nitrate, nitrite or fumarate. Although some anaerobic mitochondria, like the ones present in parasitic helminths and freshwater snails, are likely the result of secondary adaptations from aerobiosis [39], current data on other anaerobic mitochondria and hydrogenosomes [4] are better explained by an scenario in which main anaerobic traits are verticaly derived by a facultatively anaerobic ancestor. Interestingly, the finding of an evolutionarily related but distinct ATP/ADP translocase in the flagellate Trichomonas gallinae [40, 41] might represent the exception to a single origin for ATP/ADP exchangers.

Modern mitochondrial proteomes Modern mitochondria represent the current extremes of the mitochondrial proteome evolution. There are in fact multiple ends, as mitochondria diverged along the different eukaryotic lineages (figure 1). Defining the modern mitochondrial proteome could seem easier than inferring the ancestral one, since we have the extant organisms at hand. But only recently has it become technically feasible to obtain comprehensive large‐scale data on the protein content of mitochondria, thanks to the implementation and combination of several proteomic techniques to the analyses of organellar proteomes [42, 43]. Analysis of the mitochondrial proteome starts by a purification of the organelle by means of a centrifugation on a density gradient followed by a separation of the protein content to identify the individual proteins.

124 Shaping the mitochondrial proteome

Figure 1‐ Eukaryotic phylogeny according to [2] and [44] (left), symbols next to a phylum’s name represents possession of canonic mitochondria (ovals), mitosomes or mitochondrial remnants (light circles) and hydrogenosomes (dark circles) . Bars on the right represent total estimated proteome size (in grey) according to [45], and fraction of experimentally determined proteome (in black) according to [46] (human), [47] (yeast) and [48] (Arabidopsis).

Different groups have been fine‐tuning every single step of the process to cover an increasing fraction of the mitochondrial proteome. In the separation phase 1D or 2D polyacrylamide gel electroforesis has been combined or used in parallel with other techniques such us liquid‐chromatography or iso‐electric focusing [49]. In the final step, consisting in the identification of proteins in the different gel‐spots or aliquots, a variety of Mass spectrometry techniques (MS) have been applied. These include, among others, matrix‐assisted laser‐desorption ionization MS, electrospray ionization MS and tandem MS. Successive proteomic studies have shown and increasingly higher coverage and we have now reasonably good proteomic sets for the yeast [47] and human [46, 50] mitochondria while great progress has been achieved in other model organisms such as mouse [49], rice [51] and Arabidopsis [48] (figure 1). Alternatively, epitope‐ [52] and GFP‐ [53] protein fusions combined

125 Chapter 7

with (immuno)fluorescence microscopy have been used in a genome‐wide scale to analyze the subcellular localization of the yeast proteome, identifying 332 and 526 mitochondrial proteins respectively. The comparison of the proteomic [47], the GFP‐ tagging genome‐wide analyses [53] sets and that of an expert‐compiled list of experimentally determined mitochondrial proteins [54] (Figure 2) shows a higher coverage of the reference set for proteomics (80%) when compared to the GFP‐ tagging analysis (62%).

Figure 2‐ Partial overlaps between three mitochondrial proteome sets, 526 proteins identified by GFP‐tagging and fluorescence microscopy by Huh et al [53], 750 proteins identified by mitochondrial isolation and proteomics analyses by Sickmann et al. [47] and 566 proteins of a manually curated list of experimentally identified mitochondrial proteins by Eric Schon [54].

This high coverage indicates that we are getting very close to the complete identification of the yeast mitochondrial proteome but the still large fraction of proteins only detected by one of the techniques (49% of the combined set), and specially those nearly hundred for which a mitochondrial localization has been found by small‐scale experiments but not by the high‐throughput ones, make us aware that there is still a part of the mitochondrial proteome resisting identification by massive techniques. These proteins are likely present in low concentrations or only expressed under certain conditions. As expected, the situation in human, mouse and other model organisms is not better in terms of coverage. In the absence of proteomic data mitochondrial‐targeting prediction tools [45, 55] have also been used to obtain a rough estimate of the protein contents of mitochondria of organisms with a fully‐sequenced genome (figure 1). A preliminary analysis of the available sets reveals that modern mitochondrial proteomes are highly variable not only in size but also in content, likely reflecting a parallel diversity in metabolic and regulatory functions among the different phyla. Besides their function as energy producing organelles, mitochondria are pivotal in many other processes such as the synthesis of heme group [56] and iron‐sulfur clusters [57]. In addition they harbor metabolic pathways which are specific for certain species or genus [4, 56, 58]. Many of these lineage‐specific

126 Shaping the mitochondrial proteome functions, such as the role of plant mitochondria in folate synthesis [59], have long been known from physiological studies but the availability of large‐scale proteomic data is likely to uncover many more aspects of mitochondrial diversity. This functional diversity is the result of the two processes that shaped the mitochondrial proteome (figure 3): firstly continuous loss of proteins has reduced the ancient fraction of the proteome and therefore the bacterial identity of the organelle and, secondly, an also continuous recruitment of proteins has expanded the proteome providing it with new functions or adapting the old ones to the new circumstances. Notably the loss from the mitochondrial proteome does not necessary imply loss from the total cell’s proteome, as ancient mitochondrial proteins can be re‐targeted to other locations in the cell [22] as exemplified by the peroxisomal location of part of the beta‐oxidation pathway in yeast.

Figure 3‐ Overview of the processes that transformed the proto‐mitochondrial proteome into those of modern mitochondria.

Gain and loss processes have acted along the various lineages in a different manner in terms of quality and intensity. Mitochondria have selectively loss proteins whose functions were no longer needed in a certain lineage, like the specific complex I loss from yeast, while keeping the required functions. In the yeast mitochondrial proteome, most of what has been kept from the alpha‐proteobacterial ancestor (59 out of 97 proteins) is related to respiration according to a genome‐wide analysis [60], thus indicating an specialization of mitochondria towards respiration when compared with its ancestor. Major expansions in the size of the proteome are observed in vascular plants and metazoans and might reflect the adaptation to multicellularity and tissue differentiation. Indeed many tissue‐specific mitochondrial properties and proteins have been described in plants [61] and mammals [49]. In plants, comparison between photosynthetic and non‐photosynthetic tissues have demonstrated that the composition of mitochondrial proteomes varies in accordance to different metabolic needs, like a greater demand for the mitochondrial oxidation of glycine generated by

127 Chapter 7

photo‐respiration [62]. In mammals mitochondria of certain tissues have been shown to participate in tissue‐specific processes such as the regulation of insulin secretion in pancreatic beta‐cells [63] or steroidogenesis in adrenal‐cortex [64]. Besides these lineage‐specific expansions, the mitochondrial proteome likely gained a common set of proteins before the divergence of most eukaryotic species. These common gains include the above‐mentioned ATP/ADP translocase and part of the mitochondrial import machinery. The total overlap between yeast and human mitochondrial proteomes, 371 proteins in the E. Schon set [54], greatly exceeds the alpha‐proteobacterial fraction, 79 proteins (~20%) [22], indicating that the mitochondrial proteome experienced a major expansion already before the divergence of fungi and metazoan. Consistently the ETC complexes that are present in both organisms share most of the subsidiary subunits of eukaryotic origin. It is worth to remember that the recruitment of proteins to the mitochondrial proteome was not restricted to proteins of eukaryotic origin. In principle any nuclear‐encoded protein could potentially be targeted to mitochondria once the necessary pre‐sequence is attached to the gene. Therefore is not surprising to find mitochondrial proteins of diverse origin like the ones of the oxidative branch of the krebs cycle, phylogenetically close to the Cyptophaga‐flavobacterium‐bacteroides (CFB) group [65] although a lateral gene transfer cannot be ruled out, or the viral‐like mitochondrial RNA‐polymerase. The latter represents an interesting case of non‐ orthologous gene displacement, as the proto‐mitochondrial ancestor likely harbored the eubacterial‐type RNA‐polymerase, still encoded in the mitochondrion of the protist Reclinomonas americana [66]. The RNA‐polymerase case is not the only example of a viral‐like protein working in mitochondria, the so‐called twinkle protein is a mitochondrial DNA‐helicase with homology to phage‐T7 primase helicase [67]. Phylogenetic analyses of the mitochondrial proteomes did however not indicate other proteins of viral origin (results not shown).

Mitochondrial proteome remnants in “amitochondriate” eukaryotes The monophyletic origin of mitochondria and its presence in a broad range of eukaryotic species suggests that the acquisition of these organelles occurred very early in the evolution of eukaryotes. How early is still a matter of discussion. The initial placing of most amitochondriate eukaryotes at the base of the reconstructed phylogenies suggested that these diversified previous to the mitochondrial endosymbiosis [68]. Recent phylogenetic reconstructions [69, 70] however support the re‐location of some of the amitochondriates, such as microsporidia, amebozoa and parabasalia, upper in the tree indicating that their lack of mitochondria is rather the result of a secondary adaptation to anaerobic environments. Moreover the existence of sub‐cellular structures that resemble mitochondrial remnants, the so‐ called mitosomes, in microsporidia [71], criptosporidia [72], amoebozoans [73] and diplomonads [24], and the presence of hydrogenosomes in parabasalids, ciliates and anaerobic fungi [74] might represent forms of highly specialized – rather than degraded – mitochondria, thus supporting the hypothesis that the mitochondrial endosymbiosis was, if not the first, an early event in eukaryotic evolution. This view is also supported by the presence of proteins of a likely mitochondrial origin in all fully sequenced eukaryotes so far. A search for proto‐mitochondrial derived proteins

128 Shaping the mitochondrial proteome in nine fully sequenced eukaryotic genomes [22] evidenced an ubiquitous but variable presence of ancient mitochondrial proteins, including 29 proteins in the microsporidian Encephalitozoon cuniculi (Table 1).

Gene name Annotation Mit. C.p Ecu‐CU09_920 Putative ribosomal RNA methyl transferase Ecu‐CU07_1340 Ribosomal RNA methyl transferase Ecu‐CU11_0540* Mitochondrial HSP70 Ecu‐CU03_0240 Similar to ABC transporter Ecu‐CU11_1200* Similar to ABC transporter (ATM1) Ecu‐CU04_0480 ABC transporter 7 Ecu‐CU07_0600* Adrenodoxin Ecu‐CU11_1770* Similar to NFS1 Ecu‐CU05_1308* Similar to hypoth. Prot. (glutaredoxin‐like) Ecu‐CU01_1770* NIFU‐like protein Ecu‐CU05_1250 CDP diacyl glycerol synthase Ecu‐CU05_1190 ABC transporter Ecu‐CU04_0600 Proline aminopeptidase Ecu‐CU05_0310 AcetylCoA synthetase Ecu‐CU10_0710 DNA mismatch repair protein Ecu‐CU06_0730 Ribonucleoside di‐P reductase chain M2 Ecu‐CU11_0300 Similar to Dephospho‐CoA kinase n.p Ecu‐CU07_1320 Similar to YAGE‐SCHPO

Table 1‐ List of Encephalitozoon cuniculi proteins with a likely proto‐mitochondrial origin from [22] and their functional annotation, proteins within the same box belong to the same orthologous group and proteins involved in Fe‐S cluster assembly are marked with an asterisk. Columns on the right express the mitochondrial localization of a yeast mitochondrial ortholog according to either [54] or [47], whenever an orthologous yeast protein exists otherwise this is indicated with a n.p (not present) sign. Finally, grey‐boxes in the last column indicate the presence of a detectable homolog (E‐value < 10‐15, with a BlastP search) in the genome of Cryptosporidium parvum.

Interestingly a large fraction (22) of the proto‐mitochondrial proteins in E. cuniculi have yeast orthologs targeted to mitochondria, suggesting they might indeed be part of the mitosome. Moreover 19 of these, including the 6 proteins involved in Fe‐S cluster assembly, can also be found in the recently‐sequenced genome of Cryptosporidium parvum [75], another parasite with a mitochondrial remnant. In addition other Fe‐S cluster assembly genes and mitochondrial chaperonin‐like genes have also been found in amoebozoans, diplomonads, amitochondriate apicomplexans and parabasalids [76‐79], some of them have been proven to be targeted to the mitochondrial remnants. The apparently ubiquitous presence of proto‐mitochondrial derived proteins in virtually all eukaryotes supports the view of an early endosymbiosis at the root of the eukaryotic tree, although it might be argued that these proteins could have been gained via subsequent horizontal transfers. In the end the higher plasticity of the mitochondrial proteome, when compared to the

129 Chapter 7

genome, makes it less powerful to assess the origin of highly reduced remnants. The discovery of a genome in the hydrogenosomes of the ciliate Nyctoterus ovalis [80], provides a way out to this dilemma with initial analyses proving a mitochondrial ancestry for the hydrogenosomes in these species.

Conclusions Since its endosymbiosis, the mitochondrial proteome has experienced not only an extensive loss of its original content but also, unlike the genome, major gains. The balance of both processes differs from one eukaryotic phylum to another resulting in a great diversity of modern mitochondrial proteomes and, therefore, its respective metabolic capacities. Accumulating data from genomic and proteomic studies provide us with the opportunity to draw the main lines of the mitochondrial proteome evolution. Notably the emerging picture is that there is no extant eukaryote whose amitochondriate state can be categorically considered ancestral, if this is proven to be true, specially when a greater diversity of protozoan genomes are sequenced, it would have tremendous implications on the role that mitochondrial endosymbiosis played in the process of eukaryogenesis. Even considering the existence of ancient amitochondriate eukaryotes the evolutionary success that the acquisition of mitochondria had in eukaryotic evolution is beyond any doubt.

Acknowledgements We are grateful to Charles Kurland for critically reviewing the manuscript. This work was supported in part by a grant from the Netherlands organization for Scientific Research (NWO).

References

1 Gray, M. W., Burger, G. and Lang, B. F. "Mitochondrial evolution".Science. 1999 283:1476- 81 2 Hedges, S. B., et al. "A molecular timescale of eukaryote evolution and the rise of complex multicellular life".BMC Evol Biol. 2004 4:2 3 Yang, D., et al. "Mitochondrial origins".Proc Natl Acad Sci U S A. 1985 82:4443-7 4 Tielens, A. G., et al. "Mitochondria as we don't know them".Trends Biochem Sci. 2002 27:564-72 5 Voncken, F., et al. "Multiple origins of hydrogenosomes: functional and phylogenetic evidence from the ADP/ATP carrier of the anaerobic chytrid Neocallimastix sp".Mol Microbiol. 2002 44:1441-54 6 Burger, G., Gray, M. W. and Lang, B. F. "Mitochondrial genomes: anything goes".Trends Genet. 2003 19:709-16 7 Timmis, J. N., et al. "Endosymbiotic gene transfer: organelle genomes forge eukaryotic chromosomes".Nat Rev Genet. 2004 5:123-35 8 Nosek, J. and Tomaska, L. "Mitochondrial genome diversity: evolution of the molecular architecture and replication strategy".Curr Genet. 2003 44:73-84 9 Herrmann, J. M. "Converting bacteria to organelles: evolution of mitochondrial protein sorting".Trends Microbiol. 2003 11:74-9 10 Dyall, S. D., Brown, M. T. and Johnson, P. J. "Ancient invasions: from endosymbionts to organelles".Science. 2004 304:253-7

130 Shaping the mitochondrial proteome

11 Martin, W. and Müller, M. "The hydrogen hypothesis for the first eukaryote".Nature. 1998 392:37-41 12 Kurland, C. G. and Andersson, S. G. "Origin and evolution of the mitochondrial proteome".Microbiol Mol Biol Rev. 2000 64:786-820 13 Moreira, D. and López-García, P. "Symbiosis between methanogenic archaea and delta- proteobacteria as the origin of eukaryotes: the syntrophic hypothesis".J Mol Evol. 1998 47:517-30 14 Martin, W., et al. "An overview of endosymbiotic models for the origins of eukaryotes, their ATP-producing organelles (mitochondria and hydrogenosomes), and their heterotrophic lifestyle".Biol Chem. 2001 382:1521-39 15 López-García, P. and Moreira, D. "Metabolic symbiosis at the origin of eukaryotes".Trends Biochem Sci. 1999 24:88-93 16 Margulis, L. "Symbioses in Cell Evolution". 17 Andersson, S. G., et al. "The genome sequence of Rickettsia prowazekii and the origin of mitochondria".Nature. 1998 396:133-40 18 Emelyanov, V. V. "Evolutionary relationship of Rickettsiae and mitochondria".FEBS Lett. 2001 501:11-8 19 Emelyanov, V. V. "Rickettsiaceae, rickettsia-like endosymbionts, and the origin of mitochondria".Biosci Rep. 2001 21:1-17 20 Muller, M. and Martin, W. "The genome of Rickettsia prowazekii and some thoughts on the origin of mitochondria and hydrogenosomes".Bioessays. 1999 21:377-81 21 Karlberg, O., et al. "The dual origin of the yeast mitochondrial proteome".Yeast. 2000 17:170- 87 22 Gabaldón, T. and Huynen, M. A. "Reconstruction of the proto-mitochondrial metabolism".Science. 2003 301:609 23 Vivares, C. P., et al. "Functional and evolutionary analysis of a eukaryotic parasitic genome".Curr Opin Microbiol. 2002 5:499-505 24 Tovar, J., et al. "Mitochondrial remnant organelles of Giardia function in iron-sulphur protein maturation".Nature. 2003 426:172-6 25 LaGier, M. J., et al. "Mitochondrial-type iron-sulfur cluster biosynthesis genes (IscS and IscU) in the apicomplexan Cryptosporidium parvum".Microbiology. 2003 149:3519-30 26 Wiedemann, N., Frazier, A. E. and Pfanner, N. "The protein import machinery of mitochondria".J Biol Chem. 2004 279:14473-6 27 Marc, P., et al. "Genome-wide analysis of mRNAs targeted to yeast mitochondria".EMBO Rep. 2002 3:159-64 28 Lucattini, R., Likic, V. A. and Lithgow, T. "Bacterial proteins predisposed for targeting to mitochondria".Mol Biol Evol. 2004 21:652-8 29 Lemire, B. D., et al. "The mitochondrial targeting function of randomly generated peptide sequences correlates with predicted helical amphiphilicity".J Biol Chem. 1989 264:20206-15 30 Kadowaki, K., et al. "Targeting presequence acquisition after mitochondrial gene transfer to the nucleus occurs by duplication of existing targeting signals".EMBO J. 1996 15:6652-61 31 Amiri, H., Karlberg, O. and Andersson, S. G. "Deep origin of plastid/parasite ATP/ADP translocases".J Mol Evol. 2003 56:137-50 32 Boyer, P. D. "The ATP synthase--a splendid molecular machine".Annu Rev Biochem. 1997 66:717-49 33 Berry, S. "Endosymbiosis and the design of eukaryotic electron transport".Biochim Biophys Acta. 2003 1606:57-72 34 Guenebaut, V., et al. "Consistent structure between bacterial and mitochondrial NADH:ubiquinone oxidoreductase (complex I)".J Mol Biol. 1998 276:105-12 35 Sackmann, U., et al. "The acyl-carrier protein in Neurospora crassa mitochondria is a subunit of NADH:ubiquinone reductase (complex I)".Eur J Biochem. 1991 200:463-9 36 Carroll, J., et al. "Analysis of the subunit composition of complex I from bovine heart mitochondria".Mol Cell Proteomics. 2003 2:117-26 37 Palmieri, F. "Mitochondrial carrier proteins".FEBS Lett. 1994 346:48-54 38 van der Giezen, M., et al. "Conserved properties of hydrogenosomal and mitochondrial ADP/ATP carriers: a common origin for both organelles".EMBO J. 2002 21:572-9 39 van Hellemond, J. J., et al. "Biochemical and evolutionary aspects of anaerobically functioning mitochondria".Philos Trans R Soc Lond B Biol Sci. 2003 358:205-13; discussion 213-5

131 Chapter 7

40 Tjaden, J., et al. "A divergent ADP/ATP carrier in the hydrogenosomes of Trichomonas gallinae argues for an independent origin of these organelles".Mol Microbiol. 2004 51:1439- 46 41 Dyall, S. D., et al. "Presence of a member of the mitochondrial carrier family in hydrogenosomes: conservation of membrane-targeting pathways between hydrogenosomes and mitochondria".Mol Cell Biol. 2000 20:2488-97 42 Canovas, F. M., et al. "Plant proteome analysis".Proteomics. 2004 4:285-98 43 Warnock, D. E., Fahy, E. and Taylor, S. W. "Identification of protein associations in organelles, using mass spectrometry-based proteomics".Mass Spectrom Rev. 2004 23:259-80 44 Baldauf, S. L., et al. "A kingdom-level phylogeny of eukaryotes based on combined protein data".Science. 2000 290:972-7 45 Richly, E., Chinnery, P. F. and Leister, D. "Evolutionary diversification of mitochondrial proteomes: implications for human disease".Trends Genet. 2003 19:356-62 46 Cotter, D., et al. "MitoProteome: mitochondrial protein sequence database and annotation system".Nucleic Acids Res. 2004 32 Database issue:D463-7 47 Sickmann, A., et al. "The proteome of Saccharomyces cerevisiae mitochondria".Proc Natl Acad Sci U S A. 2003 100:13207-12 48 Heazlewood, J. L., et al. "Experimental analysis of the Arabidopsis mitochondrial proteome highlights signaling and regulatory components, provides assessment of targeting prediction programs, and indicates plant-specific mitochondrial proteins".Plant Cell. 2004 16:241-56 49 Mootha, V. K., et al. "Integrated analysis of protein composition, tissue diversity, and gene regulation in mouse mitochondria".Cell. 2003 115:629-40 50 Taylor, S. W., et al. "Characterization of the human heart mitochondrial proteome".Nat Biotechnol. 2003 21:281-6 51 Tanaka, N., et al. "Proteomics of the rice cell: systematic identification of the protein populations in subcellular compartments".Mol Genet Genomics. 2004 52 Kumar, A., et al. "Subcellular localization of the yeast proteome".Genes Dev. 2002 16:707-19 53 Huh, W. K., et al. "Global analysis of protein localization in budding yeast".Nature. 2003 425:686-91 54 Schon, E. "Gene products present in mitochondria of yeast and animal cells." 55 Guda, C., Fahy, E. and Subramaniam, S. "MITOPRED: a genome-scale method for prediction of nucleus-encoded mitochondrial proteins".Bioinformatics. 2004 56 Scheffler, I. E. "Mitochondria make a come back".Adv Drug Deliv Rev. 2001 49:3-26 57 Lill, R. and Kispal, G. "Maturation of cellular Fe-S proteins: an essential function of mitochondria".Trends Biochem Sci. 2000 25:352-6 58 Ohta, S. "A multi-functional organelle mitochondrion is involved in cell death, proliferation and disease".Curr Med Chem. 2003 10:2485-94 59 Neuburger, M., et al. "Mitochondria are a major site for folate and thymidylate synthesis in plants".J Biol Chem. 1996 271:9466-72 60 Steinmetz, L. M., et al. "Systematic screen for human disease genes in yeast".Nat Genet. 2002 31:400-4 61 Bowsher, C. G. and Tobin, A. K. "Compartmentation of metabolism within mitochondria and plastids".J Exp Bot. 2001 52:513-27 62 Srinivasan, R. and Oliver, D. J. "Light-dependent and tissue-specific expression of the H- protein of the glycine decarboxylase complex".Plant Physiol. 1995 109:161-8 63 Maechler, P. "Mitochondria as the conductor of metabolic signals for insulin exocytosis in pancreatic beta-cells".Cell Mol Life Sci. 2002 59:1803-18 64 Rosol, T. J., et al. "Adrenal gland: structure, function, and mechanisms of toxicity".Toxicol Pathol. 2001 29:41-8 65 Baughn, A. D. and Malamy, M. H. "A mitochondrial-like aconitase in the bacterium Bacteroides fragilis: implications for the evolution of the mitochondrial Krebs cycle".Proc Natl Acad Sci U S A. 2002 99:4662-7 66 Lang, B. F., et al. "An ancestral mitochondrial DNA resembling a eubacterial genome in miniature".Nature. 1997 387:493-7 67 Korhonen, J. A., Gaspari, M. and Falkenberg, M. "TWINKLE Has 5' -> 3' DNA helicase activity and is specifically stimulated by mitochondrial single-stranded DNA-binding protein".J Biol Chem. 2003 278:48627-32 68 Cavalier-Smith, T. and Chao, E. E. "Molecular phylogeny of the free-living archezoan Trepomonas agilis and the nature of the first eukaryote".J Mol Evol. 1996 43:551-62

132 Shaping the mitochondrial proteome

69 Cavalier-Smith, T. "The phagotrophic origin of eukaryotes and phylogenetic classification of Protozoa".Int J Syst Evol Microbiol. 2002 52:297-354 70 Stechmann, A. and Cavalier-Smith, T. "Rooting the eukaryote tree by using a derived gene fusion".Science. 2002 297:89-91 71 Williams, B. A., et al. "A mitochondrial remnant in the microsporidian Trachipleistophora hominis".Nature. 2002 418:865-9 72 Riordan, C. E., et al. "Cryptosporidium parvum Cpn60 targets a relict organelle".Curr Genet. 2003 44:138-47 73 Tovar, J., Fischer, A. and Clark, C. G. "The mitosome, a novel organelle related to mitochondria in the amitochondrial parasite Entamoeba histolytica".Mol Microbiol. 1999 32:1013-21 74 Embley, T. M., et al. "Mitochondria and hydrogenosomes are two forms of the same fundamental organelle".Philos Trans R Soc Lond B Biol Sci. 2003 358:191-201; discussion 201-2 75 Abrahamsen, M. S., et al. "Complete genome sequence of the apicomplexan, Cryptosporidium parvum".Science. 2004 304:441-5 76 Arisue, N., et al. "Mitochondrial-type hsp70 genes of the amitochondriate protists, Giardia intestinalis, Entamoeba histolytica and two microsporidians".Parasitol Int. 2002 51:9-16 77 Germot, A., Philippe, H. and Le Guyader, H. "Presence of a mitochondrial-type 70-kDa heat shock protein in Trichomonas vaginalis suggests a very early mitochondrial endosymbiosis in eukaryotes".Proc Natl Acad Sci U S A. 1996 93:14614-7 78 Germot, A. and Philippe, H. "Critical analysis of eukaryotic phylogeny: a case study based on the HSP70 family".J Eukaryot Microbiol. 1999 46:116-24 79 Slapeta, J. and Keithly, J. S. "Cryptosporidium parvum mitochondrial-type HSP70 targets homologous and heterologous mitochondria".Eukaryot Cell. 2004 3:483-94 80 Akhmanova, A., et al. "A hydrogenosome with a genome".Nature. 1998 396:527-8

133 Chapter 8

The evolutionary trees that adorn our textbooks have data only at the tips of their branches; the rest is inference.

Stephen Jay Gould

134 The evolutionary origin of the peroxisome

Chapter 8

The evolutionary origin of the peroxisome

Toni Gabaldón, Berend Snel, Frank van Zimmeren, Wieger Hemrika, Henk Tabak and Martijn A. Huynen.

(Submitted for publication)

135 Chapter 8

Abstract Peroxisomes are ubiquitous eukaryotic organelles involved in various oxidative reactions. We examined their evolutionary origin through large‐scale phylogenetic analyses of the yeast and rat peroxisomal proteomes. Most peroxisomal proteins (38‐56%) are of eukaryotic origin, comprising all proteins involved in organelle biogenesis or maintenance. A significant number of proteins (17‐18%) can be traced to the alpha‐proteobacteria, the progenitors of mitochondria. This group consists of enzymes and appears the result of recruitment of proteins originally targeted to mitochondria. Our results indicate that peroxisomes do not have an endosymbiotic origin and that its proteins were recruited from pools existing within the primitive eukaryote itself.

Results and discussion Peroxisomes were first isolated from liver and biochemically characterized by the group of De Duve [1]. These organelles contain various oxidative enzymes together with catalase to decompose the H2O2 produced during these reactions, hence their name: peroxisomes. Later it became clear that these single membrane bound organelles can differ substantially with respect to their enzymatic content depending on the species involved. The conversion of fatty acids in carbohydrate by way of the glyoxylate cycle is the hallmark of glyoxysomes present in plants, protozoa and yeasts. Part of the glycolytic pathway is compartmentalized in the glycosomes of Trypanosomatids. Photorespiration is typical for plant peroxisomes while peroxisomes of various yeasts can oxidize alkanes or methanol. Despite of these enzymatic differences all these organelles are part of the same microbody family. This became clear with the discovery that they all share the same targeting codes (PTS1 and PTS2) for the import of proteins into the organel and the identification of similar sets of proteins involved in the biogenesis and maintenance of the organelles [2]. Although the unity within the microbody family is thus firmly established, the evolutionary origin of the organelles is still a matter of debate [3]. There are strong arguments to consider peroxisomes as autonomous organelles with an endosymbiotic origin: i), matrix enzymes are synthesized on free polyribosomes and posttranslationally imported into the organelles and ii), peroxisomes have their own protein import machinery, like mitochondria and chloroplasts, and iii) peroxisomes have been shown to divide [4].

Recent discoveries, however, have challenged the view of peroxisomes as autonomous organelles. Firstly, after several generations the lacking of peroxisomes in some mutants is reversible upon the introduction of the wild‐type gene [5]. Secondly, it has recently been observed that new peroxisomes originate from the ER [6]. These observations are at odds with peroxisomes being autonomous organelles and therefore weaken the case for a supposedly endosymbiotic origin. Here we address the issue of peroxisomal evolution by phylogenetic analysis of peroxisomal proteins. To this end we collected an exhaustive set of proteins with an experimentally determined peroxisomal location in the yeast Saccharomyces cerevisiae and the rodent Rattus norvegicus, and performed a phylogenetics analysis to investigate whether there is significant evolutionary signal in the peroxisomal proteome just as has been shown for mitochondria [7, 8].

136 The evolutionary origin of the peroxisome

From databases and experimental literature we collected 62 yeast and 50 rat proteins with a peroxisomal location or function (table 1). Since our lists include proteins from various large‐scale proteomics analyses [9‐11], as well as from individual studies under various conditions, we consider them to be representative samples of peroxisomal proteomes. Phylogenies (for methods see supplementary material) of the peroxisomal proteins were reconstructed to determine their origin. We consider a protein to be of eukaryotic origin when it has no homologs in prokaryotes, or when the prokaryotic branches within the tree are mono‐phyletic (Figure 1a). A protein is considered of bacterial or archaeal origin when it clusters “within” a prokaryotic branch (Figure 1b). A summary of the results is presented in the form of a pie chart (Figure 1c). Most of the proteins are clearly of eukaryotic origin: 56.4% of the yeast proteome, 38% of the rat proteome (Figure 1c). A significant fraction of these belong to the so called Pex proteins involved in the biogenesis and maintenance of peroxisomes. This is also the group of proteins that is most consistently present in all microbodies, underlining their essential functions. Interestingly, the most ancient Pex proteins (see below) show homology with the ERAD (Endoplasmatic Reticulum Associated Decay) system, which pulls proteins from the ER membrane and ubiquitinylates them in preparation to degradation in the proteasome [12]. Pex1, a AAA cassette containing protein, has evolved from Cdc48/p97 (Figure 1a) [13]; Pex2 and Pex10, ubiquitin ligase domain containing proteins are homologous to the ubiquitin ligase Hrd1; Pex5, a TPR containing protein, is homologous to the Hrd1 interacting protein Hrd3; Pex4 contains an E2 ubiquitin conjugating enzyme domain and is homologous to Ubc1, Ubc6 and Ubc7.

These similarities in amino acid sequence extend into similarities in function and subcellular location. Pex1 and Pex6 (both AAA containing proteins) are needed to extract the cycling PTS1 receptor Pex5 from the peroxisome membrane to facilitate a new cycle of Pex5 mediated protein import [14]. Ubiquitinylation of Pex5 is part of this recycling process. In both cases, the ERAD and the peroxisomal AAA proteins operate in the cytoplasm and are recruited to the membrane by organel‐specific anchor proteins: Cdc48/p97 to the ER membrane by VIMP [15], Pex1 and Pex6 to the peroxisomal membrane by Pex15 (in yeast) and Pex26 (in mammals). This resemblance in very old proteins with similar functions and the link with the universal endomembrane compartment of the eukaryotic cell suggest that the peroxisome is an invention that took place within the eukaryotic lineage itself.

137 Chapter 8

Figure 1 A: Maximum likelihood phylogenetic tree of the CDC48 orthologous group and its paralogs, including PEX1 and PEX6. The crenarchaeon Pyrobaculum aerophilum and euryarchaeon Archaeoglobus fulgidus sequences cluster together, consistent with an ancient eukaryotic origin of this protein family rather than an origin from an horizontal transfer, and they are used as outgroup. PEX1/6 appears to have evolved from CDC48. B: Maximum likelihood phylogenetic tree of the Npy1p orthologous group and its mitochondrial paralogs. This protein family has a single origin (is monophyletic) in the alpha-proteobacteria. Bootstrap support over 100 replicates of the maximum likelihood tree generated by PhyML [16] is shown in all partitions. C: Pie chart showing the relative distribution of peroxisomal proteins according to their phylogenetic origin in yeast (left) and rat (right). Proteins that do have prokaryotic homologs but for which no reliable tree can de constructed, e.g. due to short stretches of homology, are considered “unresolved”.

The group of proteins of eukaryotic descent also contains certain household proteins with dual or plural functions with respect to organelles. The ER located or associated proteins Erg1, Erg6, Emp24, Rho1 and the multipurpose dynamin Vps1 have also been implicated in peroxisomal functions [17]. We have visualized the

138 The evolutionary origin of the peroxisome retargeting of proteins from various cellular locations to the peroxisome in evolution in a cartoon (Figure 2).

Figure 2. The various scenarios for the origin of peroxisomal proteins. Proteins are derived from the alpha‐proteobacterial ancestor of the mitochondria, including the transfer of their genes to the nucleus (scenario I). Proteins without a (detectable) alpha‐proteobacterial ancestry have been retargeted from the mitochondria (scenario II). Proteins have been retargeted from other compartments of the cell like the Endoplasmatic Reticulum (scenario III).

Remarkably, the second largest fraction of proteins, 17‐18%, has an alpha‐ proteobacterial origin (Figure 1c). This is similar to what has been found for mitochondria [7, 8], and, at first sight appears to be at odds with a eukaryotic origin of the peroxisome. Upon close examination it turns out, however, that these peroxisomal proteins have likely been retargeted from the mitochondria (Figure 2, scenario I), rather than having evolved directly from an independent endosymbiont. This observation is consistent with the high degree of re‐targetting observed for proteins derived from the proto‐mitochondrion [8]. Seven of the eleven S. cerevisiae peroxisomal proteins of alpha‐proteobacterial origin are closely related to mitochondrial proteins, being themselves targeted to that organelle or having mitochondrial paralogs or orthologs. For instance, thioesterase (Tes1p) and the homoaconitase (Lys4p) are located to both peroxisome and mitochondrion of S. cerevisiae [18]. There are also cases in which products of duplicated genes occur in peroxisomes and mitochondria: i), the peroxisomal glycerole‐P dehydrogenase Gpd1p has a paralog in yeast (Gpd2p) with a cytoplasmic and mitochondrial localization [19]; ii) the peroxisomal Fat2p is homologous to the mitochondrial long‐ chain fatty acid CoA ligases iii), the orthologous group consisting of Eci1p, Dci1p and 3,2‐transenoyl CoA isomerase is peroxisomal in yeast and in human but has a mitochondrial homolog in mammals [20]; and iv), the nudix phosphatase family (Npy1p) of which the yeast, human and plant orthologs are peroxisomal has a paralogous group in metazoa that is mitochondrial according to GFP‐fusion studies in mouse [21] and to Mitoprot (p=0.97). The phylogenetic tree (Figure 1b) indicates a single origin from the alpha‐proteobacteria of both mitochondrial and peroxisomal

139 Chapter 8 proteins. The four remaining cases of yeast peroxisomal proteins of alpha‐ proteobacterial origin are Pnc1p, Gto1p, Fox2p and Pcd1p. For these no homologs with experimental evidence of mitochondrial location were found, although Pcd1p does have a bona‐fide mitochondrial targeting signal (P=0.97 in Mitoprot [22]). With respect to the rat peroxisome, there are four proteins of alpha‐ proteobacterial descent that do not have orthologs in the yeast peroxisome. Two of these present cases of dual targeting: the sterol carrier protein SCP2, which is peroxisomal in most cases, is associated with mitochondria in steroidogenic tissues [23] and isoforms of peroxisomal bile acid thioestherase BAAT have been detected in mitochondria and the cytoplasm in human liver [24]. Not only proteins of alpha‐proteobacterial origin have homologs in the mitochondrion and the peroxisome. This is also the case for the peroxisomal proteins without a (detectable) alpha‐proteobacterial origin Idp3p, Cta1p, Faa1p, Cit2p and Faa2p [18, 25] (Figure 2, Scenario II). In contrast to proteins of alpha‐proteobacterial origin, here one cannot simply argue that the mitochondrial localization preceded the peroxisomal one. Using phylogenetic analysis of cellular localization, the original location of some proteins can however be delineated. This is the case for Cit2p, a peroxisomal protein from the family. The other two members of this family in S. cerevisiae, Cit1p and Cit3p, are mitochondrial and so are their homologs in Homo sapiens, Arabidopsis thaliana and Caernohabditis elegans. The phylogeny of the citrate synthase family in the fungi indicates that Cit1p and Cit2p originate from a recent gene duplication, after which Cit2p lost its mitochondrial targeting signal (Figure 3). This suggests an ancestral mitochondrial localization and subsequent re‐ targeting to the peroxisome of the citrate synthase protein family. That the re‐ targeting of proteins between mitochondria and peroxisomes frequently happens during evolution is also indicated by the localization of alanine:glyoxylate aminotransferase (AGT), whose peroxisomal or mitochondrial location is species‐ dependent and related to diet in mammals [26]. In humans, where AGT is peroxisomal, a single point mutation produces a miss‐localization of the protein to the mitochondrion, leading to the hereditary kidney stone disease: primary hyperoxaluria type 1 (PH1) [27]. These examples illustrate the ease with which sub‐ cellular locations can be changed during evolution.

Figure 3. (Note: a color version of this figure is shown in appendix 2) ClustalX [28] view of the multiple sequence alignment of several fungal members of the Cit1/2p orthologous group. The alignment was performed with MUSCLE [29], only the N‐terminal region is shown. Amino acids around the signal‐peptide cleavage‐sites, as predicted by Mitoprot [22] are marked with a rectangle (white arrow indicates the position in the alignment) they correspond to a YS (YA in C. tropicalis) that is missing in Cit2p. No mitochondrial localization nor a cleavage‐site is predicted for Cit2p consistent with its peroxisomal location.

140 The evolutionary origin of the peroxisome

To investigate the order of the recruitment of proteins to the peroxisome in time we reconstructed the evolution of the peroxisomal proteome based on the absence/presence of genes among sequenced genomes and assuming a parsimonious scenario (Figure 4). First we reconstructed the peroxisome of the minimal ancestral ophistokont, the common ancestor of metazoa and fungi, by including proteins present in both yeast and rat peroxisomal proteomes or that are present in only one of the two proteomes but whose orthologs in plants have a (putative) peroxisomal location in the Araperox database [30]. In addition, we reconstructed the protein composition of the common ancestor of all known peroxisomes, glycosomes and glyoxysomes from proteins that, besides being present in the ancestral ophistokont peroxisome, are present in genomes from fungi, mammals, plants and kinetoplastida (Trypanosoma brucei and Leishmania major). This core‐set is formed by six PEX proteins (Pex1p, Pex2p, Pex4p, Pex5p, Pex10p, Pex14p) and four proteins involved in fatty acid metabolism and transport (Pxa1p/Pxa2p, Fox2p, Faa2p). The peroxisomal hallmark protein catalase (Cta1p) was included even though it is absent from most glycosomes [31] and kinetoplastidial genomes because it is found in the glycosomes of the non‐pathogenic trypanosomatid Crithidia [32]. Similarly Fox1p, the enzyme catalyzing the first step of long‐chain fatty acid beta‐oxidation, was included despite its absence from kinetoplastida, because the concomitant loss from peroxisomes of Fox1 (the enzyme generating H2O2) and catalase (the enzyme detoxifying H2O2) has been observed in species such as Neurospora crassa [33].

The results indicate that the gain of proteins of various phylogenetic origins has been continuous in peroxisome evolution. The lineage‐specific recruitment of certain proteins or pathways to the peroxisome and, eventually, the loss of some of the earlier recruited ones would explain the metabolic diversity of peroxisomes. The earliest tractable function of peroxisomes appears to be the beta‐oxidation of fatty acids. This pathway already contains at least one protein of alpha‐proteobacterial descent (FOX2), indicating that the presence of long‐chain fatty acid beta‐oxidation in the peroxisome followed the endosymbiosis of mitochondria and underscoring the centrality of this endosymbiosis in the evolution of the eukaryotic cell. The proteins in the ancestral peroxisome that are not involved in beta‐oxidation are all of eukaryotic origin. Catalase would have been recruited to the peroxisome to function with beta‐oxidation, as the processes are tightly linked biochemically and are also coordinately lost from peroxisomes.

Based on phylogenetic analyses of peroxisomal proteins as well as a reconstruction of the evolution of the peroxisomal proteomes we propose that peroxisomes are a eukaryotic invention. A number of observations support this view. Importantly, all the Pex proteins, mostly required for peroxisome biogenesis and maintenance are of eukaryotic origin and are also the core group that is generally shared by the diverse microbody family as a whole. Most of the species variability is found in the enzymes housed in peroxisomes. These were recruited from various sources (Figure 4). A significant fraction has an alpha‐proteobacterial origin and has entered the primitive eukaryote with the mitochondrial ancestor [8]. Note that the recruitment of proteins with an endosymbiotic origin to the peroxisome is not an

141 Chapter 8 exceptional event. Nine proteins in the glycosomes of the kinetoplastida T. brucei and Leishmania mexicana are derived from chloroplasts from which they can be traced back to the cyanobacteria [34]. Other proteins are homologous to ones functioning elsewhere in the cell, either in organelles such as the ER or in the cytoplasm. Somehow it seems rather easy to (re)locate proteins to microbodies which may be related to the simplicity of the PTS1 targeting code for import of proteins into the organel. This ‘grab what you can get’ principle may well have contributed to the versatility and species variability that we observe today. We find no evidence for an endosymbiotic origin of the peroxisome. Being a eukaryotic invention, how did peroxisomes arise in the first place? Recent work indicates that a few peroxisomal proteins first enter the ER thereby capturing part of the ER membrane for subsequent formation of the organel [6]. These observations are consistent with our findings that the oldest PEX proteins are homologous to proteins of the ERAD pathway, and they indicate that evolutionarily as well as ontogenetically peroxisomes are in fact offshoots from the ER.

142 The evolutionary origin of the peroxisome

Figure 4. (Note: a color version of this figure is shown in appendix 2) Evolution of the peroxisomal proteome. Biochemical pathways reconstructed according to KEGG and annotations of peroxisomal proteins. For details on the reconstruction of ancestral estates see supplemental material. Color code: yellow, eukaryotic origin; green, alpha‐proteobacterial origin; red, actinomycetales origin; blue, cyanobacterial origin; white, origin unresolved.

143 Chapter 8

Table 1. List of proteins localized in the peroxisome in S. cerevisiae and R. norvergicus. SGD, Swissprot or GeneBank gene names are indicated. Proteins in the same row are orthologous to each other, whenever there is a “one to many” orthology relationship this is indicated by boxes containing several rows. Absence of the gene or absence of evidence of a peroxisomal localization of the encoded protein is indicated by a dash. For each orthologous group, the annotated function and the phylogenetic origin is indicated (euk: eukaryotic no bacterial homologs; euk (a.o.): ancient protein likely derived from the common ancestor of eukaryotes and archaea; alpha: alpha‐proteobacterial origin; actinomyc.: derived from the actinomycetales; cyanobac.: cyanobacterial origin; u, unresolved phylogenetic origin.

S. cerevisiae R. norvergicus origin Function PEX1 PEX1 euk (a.o) Peroxisome organization and biogenesis PEX2 PEX2 euk Peroxisome organization and biogenesis PEX3 PEX3 euk Peroxisome organization and biogenesis PEX4 PEX4 euk Peroxisome organization and biogenesis PEX5 PEX5 euk (a.o) Peroxisome organization and biogenesis PEX6 PEX6 euk (a.o) Peroxisome organization and biogenesis PEX7 PEX7 euk Peroxisome organization and biogenesis PEX8 ‐ euk Peroxisome organization and biogenesis PEX10 PEX10 euk Peroxisome organization and biogenesis ‐ PEX11 euk (a.o) Peroxisome organization and biogenesis PEX12 PEX12 euk Peroxisome organization and biogenesis PEX13 PEX13 euk Peroxisome organization and biogenesis PEX14 PEX14 euk (a.o) Peroxisome organization and biogenesis PEX15 ‐ euk Peroxisome organization and biogenesis ‐ PEX16 euk Peroxisome organization and biogenesis PEX17 ‐ euk Peroxisome organization and biogenesis PEX18 ‐ euk Peroxisome organization and biogenesis PEX19 PEX19 euk Peroxisome organization and biogenesis PEX21 ‐ euk Peroxisome organization and biogenesis PEX22 ‐ euk Peroxisome organization and biogenesis PEX25 ‐ euk Peroxisome organization and biogenesis ‐ PEX26 euk Peroxisome organization and biogenesis PEX27 ‐ euk Peroxisome organization and biogenesis PEX28 ‐ euk Peroxisome organization and biogenesis PEX29 ‐ euk Peroxisome organization and biogenesis PEX30 ‐ euk Peroxisome organization and biogenesis PEX31 ‐ euk Peroxisome organization and biogenesis PEX32 ‐ euk Peroxisome organization and biogenesis ANT1 PMP34 euk Adenine nucleotide transporter ‐ PMP24 euk Peroxisomal membrane protein ‐ PMP22 euk Peroxisomal membrane protein ‐ PAHX u Phytanoyl‐CoA dioxygenase ‐ gi‐6912418 u 2‐hydroxyphytanoyl‐CoA lyase peroxisomal long chain acyl‐CoA thioesterase ‐ PTE2B alpha Ib Peroxisomal acyl‐ thioester TES1 PTE1_MOUSE alpha hydrolase 1 CTA1 CATALASE euk (a.o) Catalase A FOX1 OXRTA2 u acyl‐CoA oxidase

144 The evolutionary origin of the peroxisome

gi‐1684747 u ʺ CAO3_RAT u ʺ peroxisomal multifunctional beta‐oxidation FOX2 gi‐13242303 alpha protein putative peroxisomal 2,4‐dienoyl‐CoA gi‐4105269 alpha reductase gi‐5052204 alpha putative short‐chain dehydrogenase/reductase FOX3 gi‐6978429 u peroxisomal 3‐oxoacyl CoA thiolase ‐ ECHP_RAT u Peroxisomal bifunctional enzyme ‐ SCP2 alpha sterol carrier protein‐2 Peroxisomal NADP‐dependent isocitrate IDP3 gi‐13928690 u dehydrogenase ECI1 gi‐6755026 alpha enoyl‐CoA isomerase DCI1 alpha bile acid‐Coenzyme A: amino acid N‐ ‐ BAAT alpha acyltransferase ‐ gi‐12002203 actinomyc. alkyl‐dihydroxyacetonephosphate synthase ‐ DAPT_RAT actinomyc. Dihydroxyacetone phosphate acyltransferase ‐ AGT cyanobac. alanine‐glyoxylate aminotransferase ‐ gi‐6679507 alpha pipecolic acid oxidase ‐ URIC_RAT u Urate oxidase PXA1 PMP70 u fatty acid transport ALDP u ATP‐binding cassette ALDPR u ATP‐binding cassette FAA1 ‐ euk FAA2 LCF2 u Long‐chain‐fatty‐acid‐‐CoA ligase LACS u ʺ FAT1 VLACS u Fatty acid transport ‐ gi‐14091775 u Hydroxyacid oxidase 3 (medium‐chain) ‐ gi‐6754156 u Hydroxyacid oxidase 1 ‐ GTK1_RAT u Glutathyhion‐S transferase ‐ AMCR u 2‐arylpropionyl‐CoA epimerase FAT2 ‐ alpha probable AMP‐binding protein CIT2 ‐ u Citrate synthase GPD1 ‐ alpha glycerol‐3‐phosphate dehydrogenase MDH3 ‐ u malate dehydrogenase Lysine biosynthesis, saccharopine LYS1 ‐ euk dehydrogenase LYS4 ‐ alpha Lysine biosynthesis PNC1 ‐ alpha NAD(+) salvage pathway NPY1 ‐ alpha NADH diphosphatase (pyrophosphatase) STR3 ‐ u Sulfur Transfer YGR154C ‐ alpha MLS1 ‐ u Malate synthase 1 MLS2 ‐ u Malate synthase 2 EMP24 ‐ euk Vesicle organization and biogenesis ERG1 ‐ u Ergosterol biosynthesis ERG6 ‐ u Ergosterol biosynthesis

145 Chapter 8

RHO1 ‐ euk GTP‐binding protein SPS19 ‐ u 2,4‐dienoyl‐CoA reductase YOR084W ‐ u Peroxisome organization and biogenesis YMR204C ‐ euk CAT2 ‐ euk Carnitine acetyltransferase PCD1 ‐ alpha Nudix hydrolase AAT2 ‐ euk Aspartate aminotransferase Peroxisomal ATP‐binding cassette, fatty acid PXA2 ‐ u transport VPS1 ‐ euk Dynamin 1

Material and Methods

Data retrieval Manually curated sets of 62 S. cerevisiae and 50 R. norvergicus proteins with experimental evidence of peroxisomal location were compiled from the literature [9‐ 11, 17] and from the Saccharomyces Genome (SGD, http://www.yeastgenome.org/) and Swiss‐Prot (http://www.ebi.ac.uk/swissprot) databases. For the purpose of this paper we consider a protein to be peroxisomal when it permanently resides in the peroxisomal matrix or membrane, or when it is a cytoplasmic protein but has a dedicated function in peroxisomal protein import and/or biogenesis. Protein sequences encoded by 144 publicly available complete genomes were obtained from Swissprot (http://www.ebi.ac.uk/swissprot), except for Plasmodium falciparum, Schizosaccharomyce pombe, Candida albicans, Encephalitozoon cuniculi (Genbank, http://www.ncbi.nlm.nih.org/Genbank), Homo sapiens, Rattus norvergicus and Mus musculus (EBI, http://www.ebi.ac.uk/IPI).

Phylogenetic reconstructions For every yeast and rat peroxisomal protein, homologous sequences (E < 0.01) were retrieved using Smith‐Waterman comparisons against the aforementioned 144 complete proteomes. Only sequences that aligned with at least one third of the query sequence were selected. Sequences were aligned using MUSCLE [29]. Neighbour Joining (NJ) trees were made using Kimura distances as implemented in ClustalW [35]. Positions with gaps were excluded and 1000 bootstrap iterations were performed. Maximum Likelihood (ML) trees were derived using PhyML v2.1b1 [16], with a four rate gamma‐distribution model, before and after excluding from the alignment positions with gaps in 10% or more of the sequences. In all cases NJ and MLA trees were manually examined to search for consistent patterns indicating the origin of the peroxisomal proteins. Trees in which eukaryotic proteins clustered together, within the Bacteria or the Archaea and with a specific prokaryotic out‐ group were classified as having that phylogenetic origin (e.g. Figure 1b).

Reconstruction of yeast, rat peroxisomal metabolisms and their ancestral states Annotated biochemical and cellular functions of the yeast and rat peroxisomal proteins were mapped onto metabolic KEGG maps [36] and are represented in Figure 4, indicating by a color‐code their phylogenetic origin. Proteins

146 The evolutionary origin of the peroxisome known or predicted to be membrane‐associated are depicted close to the membrane. The minimal ancestral ophistokont peroxisome was reconstructed by combining proteins that are present in both yeast and rat peroxisomal proteomes or that are present in only one of the two proteomes but have orthologs in plants with a peroxisomal location or that are described as putative peroxisomal proteins in Araperox database [30]. The minimal ancestral eukaryotic proteome is formed by those proteins of the ancestral ophistokont proteome that are also found in the genomes of plants, Typanosoma brucei and Leishmania major. Catalase and Fox1 that are absent from glycosomes are nevertheless included for the reasons explained in the text section.

References

1 de Duve, D. 1969. "The peroxisome: a new cytoplasmic organelle" Proc R Soc Lond B Biol Sci (173) 30:71-83 2 Subramani, S. 1998. "Components involved in peroxisome import, biogenesis, proliferation, turnover, and movement" Physiol Rev (78) 1:171-88 3 Latruffe, N. and Vamecq, J. 2000. "Evolutionary aspects of peroxisomes as cell organelles, and of genes encoding peroxisomal proteins" Biol Cell (92) 6:389-95 4 Lazarow, P. B. and Fujiki, Y. 1985. "Biogenesis of peroxisomes" Annu Rev Cell Biol (1) 489- 530 5 Erdmann, R. and Kunau, W. H. 1992. "A genetic approach to the biogenesis of peroxisomes in the yeast Saccharomyces cerevisiae" Cell Biochem Funct (10) 3:167-74 6 Hoepfner, D., et al. 2005. "Contribution of the endoplasmic reticulum to peroxisome formation" Cell (122) 1:85-95 7 Kurland, C. G. and Andersson, S. G. 2000. "Origin and evolution of the mitochondrial proteome" Microbiol Mol Biol Rev (64) 4:786-820 8 Gabaldón, T. and Huynen, M. A. 2003. "Reconstruction of the proto-mitochondrial metabolism" Science (301) 5633:609 9 Schafer, H., et al. 2001. "Identification of peroxisomal membrane proteins of Saccharomyces cerevisiae by mass spectrometry" Electrophoresis (22) 14:2955-68 10 Yi, E. C., et al. 2002. "Approaching complete peroxisome characterization by gas-phase fractionation" Electrophoresis (23) 18:3205-16 11 Kikuchi, M., et al. 2004. "Proteomic analysis of rat liver peroxisome: presence of peroxisome- specific isozyme of Lon protease" J Biol Chem (279) 1:421-8 12 Jarosch, E., Lenk, U. and Sommer, T. 2003. "Endoplasmic reticulum-associated protein degradation" Int Rev Cytol (223) 39-81 13 Rabinovich, E., et al. 2002. "AAA-ATPase p97/Cdc48p, a cytosolic chaperone required for endoplasmic reticulum-associated protein degradation" Mol Cell Biol (22) 2:626-34 14 Platta, H. W., et al. 2005. "Functional role of the AAA peroxins in dislocation of the cycling PTS1 receptor back to the cytosol" Nat Cell Biol (7) 8:817-22 15 Ye, Y., et al. 2004. "A membrane protein complex mediates retro-translocation from the ER lumen into the cytosol" Nature (429) 6994:841-7 16 Guindon, S. and Gascuel, O. 2003. "A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood" Syst Biol (52) 5:696-704 17 Marelli, M., et al. 2004. "Quantitative mass spectrometry reveals a role for the GTPase Rho1p in actin organization on the peroxisome membrane" J Cell Biol (167) 6:1099-112 18 Sickmann, A., et al. 2003. "The proteome of Saccharomyces cerevisiae mitochondria" Proc Natl Acad Sci U S A (100) 23:13207-12 19 Valadi, A., et al. 2004. "Distinct intracellular localization of Gpd1p and Gpd2p, the two yeast isoforms of NAD+-dependent glycerol-3-phosphate dehydrogenase, explains their different contributions to redox-driven glycerol production" J Biol Chem (279) 38:39677-85

147 Chapter 8

20 Janssen, U., et al. 1994. "Human mitochondrial 3,2-trans-enoyl-CoA isomerase (DCI): gene structure and localization to chromosome 16p13.3" Genomics (23) 1:223-8 21 Abdelraheim, S. R., Spiller, D. G. and McLennan, A. G. 2003. "Mammalian NADH diphosphatases of the Nudix family: cloning and characterization of the human peroxisomal NUDT12 protein" Biochem J (374) Pt 2:329-35 22 Scharfe, C., et al. 2000. "MITOP, the mitochondrial proteome database: 2000 update" Nucleic Acids Res (28) 1:155-8 23 McLean, M. P., et al. 1989. "Estradiol regulation of sterol carrier protein-2 independent of cytochrome P450 side-chain cleavage expression in the rat corpus luteum" Endocrinology (125) 3:1337-44 24 Solaas, K., et al. 2000. "Subcellular organization of bile acid amidation in human liver: a key issue in regulating the biosynthesis of bile salts" J Lipid Res (41) 7:1154-62 25 Elgersma, Y., et al. 1995. "Peroxisomal and mitochondrial carnitine acetyltransferases of Saccharomyces cerevisiae are encoded by a single gene" Embo J (14) 14:3472-9 26 Birdsey, G. M., et al. 2004. "Differential enzyme targeting as an evolutionary adaptation to herbivory in carnivora" Mol Biol Evol (21) 4:632-46 27 Danpure, C. J., et al. 2003. "Alanine:glyoxylate aminotransferase peroxisome-to- mitochondrion mistargeting in human hereditary kidney stone disease" Biochim Biophys Acta (1647) 1-2:70-5 28 Aiyar, A. 2000. "The use of CLUSTAL W and CLUSTAL X for multiple sequence alignment" Methods Mol Biol (132) 221-41 29 Edgar, R. C. 2004. "MUSCLE: a multiple sequence alignment method with reduced time and space complexity" BMC Bioinformatics (5) 1:113 30 Reumann, S., et al. 2004. "AraPerox. A database of putative Arabidopsis proteins from plant peroxisomes" Plant Physiol (136) 1:2587-608 31 Opperdoes, F. R. and Michels, P. A. 1993. "The glycosomes of the Kinetoplastida" Biochimie (75) 3-4:231-4 32 Soares, M. J. and De Souza, W. 1988. "Cytoplasmic organelles of trypanosomatids: a cytochemical and stereological study" J Submicrosc Cytol Pathol (20) 2:349-61 33 Thieringer, R. and Kunau, W. H. 1991. "The beta-oxidation system in catalase-free microbodies of the filamentous fungus Neurospora crassa. Purification of a multifunctional protein possessing 2-enoyl-CoA hydratase, L-3-hydroxyacyl-CoA dehydrogenase, and 3- hydroxyacyl-CoA epimerase activities" J Biol Chem (266) 20:13110-7 34 Hannaert, V., et al. 2003. "Plant-like traits associated with metabolism of Trypanosoma parasites" Proc Natl Acad Sci U S A (100) 3:1067-71 35 Thompson, J. D., Higgins, D. G. and Gibson, T. J. 1994. "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position- specific gap penalties and weight matrix choice" Nucleic Acids Res (22) 22:4673-80 36 Kanehisa, M. and Goto, S. 2000. "KEGG: kyoto encyclopedia of genes and genomes" Nucleic Acids Res (28) 1:27-30

148 The evolutionary origin of the peroxisome

149 Chapter 9

Cuando creíamos que teníamos todas las respuestas, de pronto, cambiaron todas las preguntas*.

Mario Benedetti

*When we thought we had all answers, suddenly, all questions were changed.

150 Summarizing discussion

Chapter 9

Summarizing discussion

151 Chapter 9

Summary and conclusions This thesis focuses on the origin and evolution of the mitochondrial proteome, with a special emphasis on its applications for prediction of protein function in mitochondrial pathways of clinical relevance. This chapter summarizes the main conclusions that can be drawn from our results.

Getting to know the proto‐mitochondrion Mitochondria are organelles of bacterial descent that play a key role in the metabolism of the modern eukaryotic cell [1, 2]. Although the exact phylogenetic position of the mitochondrial ancestor, the so called proto‐mitochondrion, is not precisely defined [3‐7], there is a consensus about the monophyletic origin of mitochondria from within the alpha‐proteobacteria. Nevertheless, the exact scenario in which the original relationship between the endosymbiont and its host was established is still a matter of debate [4‐6, 8, 9]. Knowledge of the proto‐ mitochondrion’s metabolic nature is instrumental for settling this debate and we addressed it by searching for eukaryotic genes that were likely acquired through the mitochondrial endosymbiosis [10]. This novel approach involves the automatic reconstruction of complete phylomes and the development of algorithms that search for specific tree sub‐topologies. The analysis of the resulting reconstructed metabolism suggests a (facultatively) aerobic endosymbiont catabolizing substrates provided by the host. The metabolic diversity of the reconstructed proto‐ mitochondrion is much larger than previously thought, although some theories have predicted large numbers of eukaryotic genes derived from the mitochondrial endosymbiosis [5]. Our current estimates on the size of the proto‐mitochondrial proteome range from 800 proteins to more than a thousand. In our view these are minimal boundaries of its real size. The ancestral proteome could very well double this size as suggested by the incomplete coverage of our approach, which misses proteins that have been lost from present‐day genomes and those whose alpha‐ proteobacterial ancestry cannot be clearly identified with current phylogenetic techniques. A broader genome sampling, specially if it includes early divergent taxa, could further extend the reconstructed proteome.

Symbiosis is something for two By definition [11], symbiosis must involve (at least) two organisms. Thus, any hypothesis explaining the initial symbiotic relationships between the proto‐ mitochondrion and its host must consider both parties. We are now starting to have some idea on how things were at the endosymbiont side. At the host side, however, the information is scarce. An archaeal component of the host is undisputed [5, 12‐14], but it is still unclear whether or not a fusion with a different bacterium predated the mitochondrial endosymbiosis and, in case it had, which bacterial group was involved [12, 15, 16]. In this context, in which we know little about the initial endosymbiont’s environment, it is difficult to pin‐point a single pathway that would have been instrumental for the fixation of the endosymbiosis. Indeed the detection in the eukaryotes of a variety of proto‐mitochondrial pathways not directly related with energy metabolism suggests that the benefit for the host was rather multifaceted. From all proto‐mitochondrial pathways, iron‐sulfur cluster assembly seems to be the

152 Summarizing discussion most conserved among all extant eukaryotes, including those harboring mitochondrial relicts such as mitosomes or hydrogenosomes [17‐19]. However this does not necessarily mean that this pathway was the original rationale for the endosymbiosis, it does suggest that it is the most essential function of modern mitochondria. In my view, all proto‐mitochondrial pathways that survived the initial massive gene‐loss that likely followed endosymbiosis, had an important function in the early eukaryotic cell. Perhaps, by looking for a single pathway or exchangeable molecule, we are oversimplifying the actual complex and multifaceted trade that fixed the most successful endosymbiosis in earth’s history.

From bacteria to organelle Since its endosymbiosis, the mitochondrial proteome has experienced not only an extensive loss of its original content but also major gains. The balance of both processes in terms of their distribution among functional classes differs from one eukaryotic phylum to another, resulting in a great diversity of modern mitochondrial proteomes and, therefore, its respective metabolic capacities [20]. We have exploited the availability of data from genomic and proteomic studies to draw the main lines of the mitochondrial proteome evolution. The emerging picture is that of a dynamic evolution involving a series of complete losses of some bacterial pathways, amelioration of others and gains of completely new complexes of eukaryotic origin. This renewal of proteins has been so extensive that only a 10‐16% of modern mitochondrial proteome has an origin that can be traced back to the bacterial endosymbiont. The rest consists of proteins of diverse origin that were eventually recruited to function in the organelle. This shaping of the proteome reflects the transformation of mitochondria into a highly specialized organelle that, besides ATP production, comprises a variety of functions within the eukaryotic metabolism.

Untangling the complex evolution of complex I A fascinating example of a continuous addition of eukaryotic components to a proto‐mitochondrial pathway is the evolution of the respiratory chain [21], and specifically to its first component, the NADH:ubiquinone oxidoreductase (Complex I). This multi‐protein complex increased in size from 14 subunits in the bacterial ancestor to 46 in mammals, and it is largely unclear the function performed by all these “accessory” subunits. To gain insight into the function and evolution of Complex I, we used a combination of comparative genomics tools [22]. Our analyses allow the establishment of orthology relationships among Complex I subunits of different species and the identification of new subunits in some organisms. Moreover a parsimonious reconstruction of eukaryotic Complex I evolution shows an initial increase in size that predates the separation of plants, fungi and metazoans, followed by a gradual adding and incidental losses of subunits in the various evolutionary lineages. This evolutionary scenario is in contrast to that for Complex I in the prokaryotes, for which the combination of several distinct, and previously independently functioning modules into a single complex has been proposed [23, 24]. This difference in the evolutionary mode might be related to the existence of the operon structure in bacterial genomes. The duplication of an operon, containing all components of a complex, would facilitate that one of the complete sets acquires a

153 Chapter 9

new function. In this way the modular evolution of Complex I in bacteria would parallel the modular organization of genes in the genome (operons).

Proteins on the move The comparison of the ancestral proto‐mitochondrial metabolism with that of modern mitochondria also shows that the proto‐mitochondrial contribution to the modern eukaryotic metabolism is not restricted to mitochondria. On the contrary, t seems that the re‐targeting of proteins and complete pathways to other cell compartments has been extensive during mitochondrial evolution, affecting more than half of the proteins of proto‐mitochondrial ancestry in yeast and human. This fact adds a new dimension on the influence that mitochondrial endosymbiosis could have had on the evolution of eukaryotes, and could explain the relative excess of eubacterial genes found in eukaryotic genomes [7, 12].

Are peroxisomes a side effect of mitochondrial endosymbiosis? Particularly interesting has been the recruitment of proto‐mitochondrial enzymes to the peroxisomes, a process which have deep implications for the evolution of the peroxisomes. Peroxisomes are widespread eukaryotic organelles for which an endosymbiotic origin has been proposed [25]. In the light of recent experimental data that challenged the view of peroxisomes as autonomous organelles [26], we have performed large scale analyses of rat and yeast peroxisomal proteomes. Our results show that, besides a majority of proteins of eukaryotic origin, proteins with an alpha‐proteobacterial descent represent a significant fraction of the peroxisomal proteome. Moreover, recruitment from the proto‐mitochondrion seems the most likely origin for them as suggested by some phylogenies, which indicate monophyly of peroxisomal and mitochondrial paralogs. The presence in the peroxisome of proteins of alpha‐proteobacterial descent can be traced back to the last common ancestor of all present‐day peroxisomes. This suggests that the origin of the peroxisomes might have postdated that of mitochondria, although we cannot discard a secondary loss of an ancestral peroxisomal pathway. The emerging picture is that of an ER‐derived organelle compartmentalizing catalase and H2O2‐producing long‐ chain fatty acid oxidation, which was recruited from the mitochondrion.

Protein function prediction Mitochondrial endosymbiosis can be regarded as a massive horizontal gene transfer from the alpha‐proteobacteria to the eukaryotes. We traced the evolution of the proto‐mitochondrial protein pool among the eukaryotes and detected extensive, lineage‐specific gene loss, with an average of three losses per orthologous group when a phylogeny of nine eukaryotic species is used. This gene loss likely reflects functional adaptation to different environments, since we found that within the set of proteins derived from the mitochondrion, those functioning in the same biochemical pathway tend to have a similar history of gene loss events. Using this property, in combination with other function‐prediction approaches [27], we have proposed several functional interactions between mitochondrial proteins [22, 28, 29]. The establishment of fruitful collaborations with other research groups in the NCMLS

154 Summarizing discussion and other centers has been constant during this period. One of our predictions, that RNaseL inhibitor (RLI) plays a role in rRNA maturation [27], has already been proved experimentally [30], and others are now being tested. All in all, this thesis emphasizes the dynamics of genome and organellar evolution, which is fascinating in itself but can also be used for protein function prediction.

References (At the end of the chapter)

155 Chapter 9

Nederlandse samenvatting en conclusies

Samenvatting en conclusies In dit proefschrift wordt de oorsprong en evolutie van mitochondriële eiwitten behandeld, waarbij bijzondere aandacht wordt besteed aan de implicaties met betrekking tot functievoorspelling van mitochondriële eiwitten met klinische relevantie. In dit hoofdstuk worden de belangrijkste conclusies samengevat.

Zicht krijgen op de proto‐mitochondrion Mitochondria zijn organellen van bacteriële oorsprong die een sleutelrol vervullen in het metabolisme van de moderne eukaryotische cel [1, 2]. Hoewel de fylogenetische plaats van de voorouder van het mitochondrium, het proto‐ mitochondrium, niet exact bekend is [3‐7], is er wel consensus over de monofyletische oorsprong van mitochondria, zijnde in de alpha‐protobacteria. Dat neemt niet weg dat het exacte scenario waarmee de oorspronkelijke relatie tussen de endosymbiont en de gastheercel zich heeft ontwikkeld onderwerp is van discussie [4‐ 6, 8, 9]. Kennis over de metabole achtergrond van het proto‐mitochondrium kan hierin klaarheid brengen. Wij hebben dit aangepakt door op zoek te gaan naar eukaryotische genen die met grote waarschijnlijkheid zijn verworven uit de mitochondriële endosymbiose [10]. Deze nieuwe aanpak omvat onder meer de automatische reconstructie van complete fylomen en de ontwikkeling van algoritmen voor het zoeken van specifieke sub‐topologiën. Analyse van het gereconstrueerde metabolisme suggereert dat de endosymbiont (facultatief) aëroob substraten verbrand die worden geleverd door de gastheercel. De metabole verscheidenheid van het proto‐mitochondrium is veel groter dan verwacht, hoewel een enkele theorie voorspelde dat grote aantallen eukaryotische genen afkomstig zouden zijn van de mitochondriële endosymbiont [5]. Onze huidige schatting van de grootte van het proto‐mitochondriële proteoom varieert van 800 tot meer dan duizend eiwitten, waarbij dit, naar onze mening als ondergrens moet worden gezien. Het voorouderlijke proteoom kan heel goed twee maal zo groot zijn, vanwege de niet complete dekking die eigen is aan onze aanpak. Hierdoor worden niet alleen die eiwitten gemist die verloren zijn geraakt in de huidige genomen, maar ook die waarvan de alpha‐proto bacteriële oorsprong met de huidige kennis en fylogenetische technieken niet kan worden vastgesteld. Met meer en bredere genoomdata, vooral van vroeg afgesplitste taxa, zou het gereconstrueerde genoom verder uitgebreid kunnen worden.

Symbiose doe je samen Per definitie [11] heb je voor symbiose altijd (tenminste) twee organismen nodig. Een hypothese die de vroegste symbiotische relatie tussen het proto‐ mitochondrium en zijn gastheer wil verklaren moet dus altijd rekening houden met beide partijen. We beginnen nu langzamerhand een idee te krijgen hoe het er voor de endosymbiont uitzag, maar over de kant van de gastheer hebben we nog erg weinig informatie. Dat er een archaeale component in de gastheer zat is duidelijk [5, 12‐14], maar het is nog onduidelijk of er aan de mitochondriële endosymbiose een fusie met

156 Summarizing discussion een andere bacterie vooraf ging, en, mocht dit het geval zijn, welke groep bacteriën daarbij betrokken was [12, 15, 16]. Omdat we zo weinig weten over de leefomgeving van de eerste endosymbiont, is het ook moeilijk om één enkel proces aan te wijzen dat verantwoordelijk is voor het fixeren van de endosymbiose. Het feit dat er in de eukaryoten verschillende proto‐mitochondriële processen te vinden zijn die niet direct met het energie‐metabolisme te maken hebben, suggereert dat het voordeel voor de gastheer veel verschillende facetten gekend heeft. Van alle proto‐mitochondriële processen lijkt het vormen van ijzer‐zwavel clusters het meest geconserveerd in de bestaande eukaryoten, ook als er slechts sprake is van overblijfselen van de mitochondriën zoals mitosomen of hydrogenosomen [17‐19]. Dit betekent niet dat dit proces de oorsprong van de endosymbiose geweest moet zijn, maar wel dat het de belangrijkste functie van hedendaagse mitochondriën is. Alle processen die zijn overgebleven na de ineenstorting van het proto‐mitochondriële genoom, een proces dat waarschijnlijk vlak na de endosymbiose plaatsvond, hebben mijns inziens een belangrijke rol gespeeld in de vroege eukaryote cel. Zoeken naar één enkel proces of één uitgewisseld molecuul is een te grote simplificatie van de complexe en veelzijdige overeenkomst van deze meest succesvolle endosymbiose in de geschiedenis van de aarde.

Van bacterie tot cel‐organel Sinds de endosymbiose heeft het mitochondriële proteoom niet alleen een substantieel deel van zijn inhoud verloren, maar er zijn ook veel eiwitten bij gekomen. Voor de verschillende funtionele groepen kan de balans hierin vrij sterk verschillen van fylum tot fylum. Dit heeft een grote verscheidenheid aan mitochondriële proteomen, en daarmee aan metabole mogelijkheden tot gevolg [20]. Wij hebben het aanbod aan genoom‐ en proteoomstudies gebruikt om in grote lijnen de evolutie van het mitochondriële proteoom te schetsen. Het beeld dat hieruit naar voren komt is heel dynamisch, met een reeks verdwijningen van bacteriële processen, het verbeteren van andere, en toevoegingen van volstrekt nieuwe complexen van eukaryote oorsprong. De vernieuwing van eiwitten is zo overvloedig geweest dat er maar 10‐16% van het huidige mitochondriële proteoom herleid kan worden tot de bacteriële endosymbiont. De rest zijn eiwitten van verschillende oorsprong die uiteindelijk gerecruteerd zijn voor een functie binnen het organel. Deze vorming van het proteoom weerspiegelt de verandering van de mitochondriën tot zeer gespecialiseerde cel‐organellen, die naast de productie van ATP een verscheidenheid aan functies in het eukaryote metabolisme vervullen.

Het ontrafelen van de evolutie van complex I Een fascinerend voorbeeld van de stapsgewijze toevoeging van eukaryote componenten aan een proto‐mitochondriële route is de evolutie van de ademhalingsketen [21], en wel specifiek in het eerste onderdeel hiervan, het NADH:ubiquinon oxidoreductase (Complex I). Dit uit meerdere eiwitten bestaande complex is in omvang gegroeid van 14 subunits in de bacteriële voorouder tot 46 in zoogdieren, terwijl de functie is van deze toegevoegde subunits grotendeels onbekend is. Om meer inzicht te krijgen in de functie en evolutie van Complex I, hebben we een combinatie van verschillende comparative genomics technieken gebruikt [22]. Onze analyses hebben geresulteerd in het ophelderen van orthologie‐

157 Chapter 9

relaties tussen Complex I subunits van verschillende soorten, alsmede in de identificatie van nieuwe subunits in enkele soorten. Daarnaast is ook de evolutie van Complex I gereconstrueerd waaruit blijkt dat een initiële groei in subunits voorafgaand aan de radiatie tussen planten, schimmels en metazoa wordt gevolgd door stapsgewijze toevoeging en incidenteel verlies van subunits in de verschillende evolutionaire lijnen. Dit evolutionaire scenario contrasteert met dat van Complex I in prokaryoten, waar de combinatie van verscheidene onafhankelijk functionerende modules tot één functioneel complex is voorgesteld [23, 24]. Deze verschillende evolutionaire scenario’s zijn mogelijk gerelateerd aan het bestaan van operon structuren in prokaryote genomen. De duplicatie van een operon dat alle componenten van een complex bevat, zal het ontwikkelen van nieuwe functies door deze nieuwe set genen aanzienlijk vergemakkelijken. Op deze manier weerspeigeld de modulaire evolutie van Complex I in prokaryoten de modulaire organisatie van genen op hun genomen (operons).

Eiwitten in beweging Een vergelijking van het voorouderlijke proto‐mitochondriële metabolisme met dat van moderne mitochondria laat zien dat de proto‐mitochondriële bijdrage aan het moderne eukaryote metabolisme niet beperkt blijft tot de mirochondria zelf. Integendeel, het lijkt erop dat het her‐targeten van eiwitten, en zelfs complete metabole routes, naar andere celcompartimenten op grote schaal is voorgekomen in de evolutie van mitochondria, en betrekking heeft op meer dan de helft van de eiwitten van proto‐mitochondriële afkomst in gist en in mens. Deze observatie voegt een nieuwe dimensie toe aan de invloed die de mitochndriële endosymbiose kan hebben gehad op de evolutie van eukaryoten, en het zou de relatieve overmaat van eubacteriële genen in eukaryote genomen kunnen verklaren.

Zijn peroxisomen een tweede resultaat van mitochondriële endosymbiose? De recrutering van proto‐mitochondriële enzymen naar de peroxisomen is met name interessant omdat het process vergaande consequenties heeft voor de evolutie van de peroxisomen. Peroxisomen zijn wijdverspreide eukaryotische organellen waarvoor een endosymbiotische oorsprong is voorgesteld [25]. In het licht van recente experimentele data, waarbij deze kijk op peroxisomen als zijnde zelfstandige organellen in twijfel wordt getrokken [26], hebben wij een grootschalige analyse gedaan van het proteoom van de peroxisomen van rat en muis. Onze resultaten tonen aan dat naast de eiwitten van eukaryotische origine, welke de meerderheid vormen, eiwitten met een alpha‐proteobacteriele afkomst een significant deel van het peroxisomaal proteoom uitmaken. Daarbij lijkt recrutering van het proto‐mitochondrium de meest waarschijnlijke oorsprong voor peroxisomen, zoals gesuggereerd door een aantal fylogenieën die wijzen op monofylie van peroxisomale en mitochondriële paralogen. De aanwezigheid in het peroxisoom van eiwitten van alpha‐proteobacteriële afkomst kan worden herleid tot de laatst overeenkomstige voorouder van alle huidige peroxisomen. Dit suggereert dat de oorsprong van het peroxisoom gedateerd moet worden ná dat van de mitochondrieën, ofschoon we de mogelijkheid van een secundair verlies van een voorouderlijk peroxisomaal pad niet kunnen uitsluiten. Het beeld dat zich ontvouwt

158 Summarizing discussion

is er één van een ER‐afgeleid organel dat catalase, samen met H2O2‐producerende oxidatie van lange vetzuren, in één compartiment samenbrengt hetgeen werd gerecruteerd uit het mitochondrium.

Eiwit functie voorspelling Mitochondriële endosymbiose kan beschouwd worden als een grootschalige horizontale overdracht van genen van alpha‐proteobacteriën naar eukaryoten. We hebben de evolutie van de protomitochondriële eiwitten bij de eukaryoten getraceerd. Hierbij hebben we uitgebreide, soort specifiek verlies van genen geconstateerd met een gemiddelde van drie verloren genen per orthologe groep wanneer negen eukaryotische soorten worden gebruikt voor de constructie van de fylogenie. Dit verlies van genen reflecteert waarschijnlijk een functionele aanpassing aan verschillende milieus. Dit wordt ondersteund door onze waarneming dat in de set van eiwitten die van het mitochondrium zijn afgeleid, eiwitten die functioneren in eenzelfde pad meestal een overeenkomstige historie van gen‐verlies laten zien. Gebruik makend van deze eigenschap, in combinatie met andere functie voorspellende methoden [27], hebben we een aantal functionele interacties tussen mitochondriële proteinen voorgesteld [22, 28, 29]. Het opzetten van vruchtbare samenwerkingsverbanden met andere onderzoeksgroepen binnen het NCMLS en andere centra ging in deze periode continu door. Bijgevolg kon reeds een van onze voorspellingen, dat RNaseL remmer (RLI) een rol speelt in de rijping van rRNA [27] al experimenteel worden bevestigd [27]. Andere voorspellingen worden momenteel getest. Al met al benadrukt dit proefschrift de dynamiek van genoom en organel evolutie welke facinerend is in zichzelf maar welke ook gebruikt kan worden voor de voorspelling van de functie van een eiwit.

Referenties (zie einde)

159 Chapter 9

Resumen y discusión

Resumen y discusión La presente tesis doctoral trata sobre el origen y evolución del proteoma mitocondrial. Se dedica un interés especial a las aplicaciones que este estudio evolutivo pueda tener en la predicción de función de proteínas mitocondriales de relevancia clínica. En este capitulo se resumen las principales conclusiones de nuestro trabajo.

Acerca de la proto‐mitocondria Las mitocondrias son orgánulos de origen bacteriano que desempeñan un papel central en el metabolismo de la célula eucariota [1, 2]. Aunque la posición filogenética del ancestro mitocondrial, o proto‐mitocondria, no se conoce con exactitud [3‐7], si existe un consenso sobre su origen monofiletico a partir de las alfa‐ proteobacterias. Por el contrario persisten dudas acerca de cual fue el escenario en el que se estableció la relación primordial entre el endosimbionte y la célula hospedadora [4‐6, 8, 9]. Para centrar este debate es imprescindible obtener información sobre las características metabólicas de la proto‐mitocondria. Nosotros adoptamos un enfoque genómico, encaminado a la detección de genes eucarióticos supuestamente adquiridos a través de la endosimbiosis mitocondrial [10]. Esta técnica novedosa consiste en la reconstrucción automática de filomas completos y el desarrollo de algoritmos que detectan determinadas sub‐topologías en árboles filogenéticos. El resultado del análisis permite reconstruir el metabolismo proto‐ mitocondrial. Esta reconstrucción, a su vez, sugiere que el endosimbionte era un organismo aeróbico facultativo que catabolizaba substratos proporcionados por la célula hospedadora. El metabolismo reconstruido presenta una diversidad mucho mayor de la que se le atribuía hasta ahora, pese a que algunas hipótesis habían predicho la existencia de un gran numero de genes derivados de la endosimbiosis mitocondrial [5]. Las estimaciones actuales del tamaño del proteoma proto‐ mitocondrial oscilan entre 800 y mas de 1000 proteínas. A nuestro parecer estas estimaciones representan cotas inferiores del verdadero tamaño del proteoma ancestral. Si tenemos en cuenta el nivel de cobertura alcanzado por nuestra técnica, que no identifica genes que han desaparecido de todos los genomas estudiados o aquellos cuyo origen alfa‐proteobacteriano no puede ser detectado por las técnicas filogenéticas al uso, el proteoma proto‐mitocondrial pudo contener varios miles de proteínas. La inclusión en el futuro de nuevos genomas, especialmente de taxones evolutivamente distantes de los ya utilizados, podría ampliar el tamaño y la diversidad del proteoma reconstruido.

160 Summarizing discussion

La simbiosis es cosa de dos Por definición [11], una simbiosis implica la interacción de, al menos, dos organismos. De este modo, cualquier hipótesis que pretenda explicar la relación simbiótica inicial entre la proto‐mitocondria y su hospedador ha de considerar a ambos organismos. Es ahora cuando empezamos a tener alguna idea acerca de las características del endosimbionte. Sin embargo apenas tenemos información acerca de cómo fue la célula hospedadora. Todo indica que su genoma tenía un fuerte componente arqueo‐bacteriano [5, 12‐14], pero persisten dudas acerca de si participó o no en una fusión con otra bacteria antes de la endosimbiosis mitocondrial y, en caso de que así fuera, a qué grupo perteneció dicha bacteria [12, 15, 16]. En este contexto, en el que sabemos muy poco acerca del entorno en el que se encontraba la proto‐mitocondria, es difícil señalar una única ruta metabólica como la responsable de la fijación simbiótica. Por otra parte, el hecho de que se detecte en genomas eucarióticos una gran variedad de rutas proto‐mitocondriales que no están directamente relacionadas con el metabolismo energético sugiere, más bien, que la endosimbiosis mitocondrial ofreció múltiples beneficios a la célula hospedadora. De todas las rutas metabólicas de origen proto‐mitocondrial, la del ensamblaje de grupos hierro‐azufre parece la más conservada en todos los eucariotas estudiados, incluidos aquellos que no contienen mitocondrias canónicas sino sus derivados como mitosomas o hidrogenosomas [17‐19]. Aun así, este hecho no implica necesariamente que esta ruta fuese la razón última del establecimiento de la simbiosis sino, más bien, que constituye una función esencial dentro del papel que las mitocondrias modernas representan en el metabolismo celular. A mi parecer, toda ruta metabólica proto‐ mitocondrial que sobrevivió a la masiva pérdida de genes ocurrida en los estadios iniciales de la endosimbiosis debió tener una función importante para el establecimiento de la misma. Quizás, hasta ahora, nos hayamos dejado llevar demasiado por la búsqueda de una única ruta metabólica, o el intercambio de un único metabolito, que lo explicase todo. Infravalorando así el complejo intercambio, provisto de múltiples facetas, que supuso la fijación de la relación endosimbiótica más exitosa en la historia de nuestro planeta.

De bacteria a orgánulo Desde los primeros pasos de la endosimbiosis, el proteoma mitocondrial ha experimentado grandes cambios, perdiendo una gran parte de su contenido inicial pero también adquiriendo nuevos componentes. El balance de ambos procesos, en términos de qué clases funcionales se ven más afectadas, difiere en las distintas ramas del árbol evolutivo de los eucariotas. El resultado es una gran diversidad en la composición de los proteomas mitocondriales modernos y, por tanto, en las capacidades metabólicas de las mitocondrias de especies diferentes [20]. Mediante el uso de datos genómicos y proteómicos hemos podido trazar las líneas generales de estos procesos. Se observa que la evolución del proteoma mitocondrial es muy dinámica, sucediéndose tanto pérdidas completas de algunas rutas de origen bacteriano, como la modificación de otras y la ganancia de sistemas completamente nuevos de origen eucariótico. Esta renovación del contenido proteico ha sido tan pronunciada que tan solo un 10‐16% del proteoma de las mitocondrias actuales puede considerarse derivado del endosimbionte bacteriano ancestral. El resto lo

161 Chapter 9

constituyen proteínas de origen diverso que fueron reclutadas para desempeñar distintas funciones en el orgánulo. Este modelado evolutivo del proteoma es un reflejo de la transformación de la bacteria ancestral en un orgánulo altamente especializado que, aparte de en la producción de ATP, está implicado en numerosas funciones clave del metabolismo eucariota.

Desenredando la evolución del complejo I La evolución de la cadena respiratoria mitocondrial [21], y específicamente de su primer componente, la NADH:ubiquinona oxidoreductasa (Complejo I), constituye un ejemplo fascinante de cómo una ruta de origen proto‐mitocondrial ha sido transformada mediante la continua adición de nuevos componentes eucarióticos. El Complejo I ha pasado de tener las 14 subunidades que posee en las a las 46 subunidades encontradas en mitocondrias de mamíferos. La función desempeñada por estas nuevas subunidades, llamadas “accesorias”, sigue siendo una incógnita. Para conocer más acerca de la evolución y función del Complejo I, empleamos una combinación de herramientas de genómica comparativa [22] que nos permitió establecer relaciones de ortología entre subunidades de Complejo I de diferentes especies, así como descubrir nuevas subunidades en algunos organismos. Además, una reconstrucción parsimoniosa de la evolución del Complejo I en eucariotas nos permite trazar la sucesión de eventos que lo modelaron. Se observa un gran aumento inicial en el número de componentes de complejo I que precede a la radiación evolutiva de plantas, hongos y metazoos. A este salto inicial en complejidad, le siguió una ganancia gradual de subunidades específicas en cada uno de los linajes eucariotas, combinada con perdidas incidentales de alguna subunidad o de todo el complejo. Este escenario evolutivo contrasta con lo observado para la evolución del Complejo I en bacterias que se supone surgido de la combinación de varios módulos preexistentes que funcionaban de forma autónoma e independiente [23, 24]. Esta diferencia en el modo evolutivo podría deberse a la organización en operones que presentan los genes bacterianos. La duplicación de un operón que contuviese todos los componentes de uno de los complejos, facilitaría que sus genes adquiriesen una nueva función en la formación del nuevo supra‐complejo. De ser así, la evolución modular del Complejo I bacteriano seria un reflejo de la organización estructural de los genes en el genoma (operones).

Proteínas en movimiento La comparación del metabolismo proto‐mitocondrial con el de las mitocondrias modernas revela que la contribución del endosimbionte a la célula eucariota no queda restringida a estos orgánulos. Más bien parece que la re‐ localización de rutas completas en otros compartimentos celulares ha sido un proceso común y recurrente a lo largo de la evolución mitocondrial. Este proceso ha afectado, en humanos y levadura, a más de la mitad de las proteínas derivadas de la proto‐mitocondria. Esta observación revela una nueva faceta de la contribución de la endosimbiosis mitocondrial a la evolución de los eucariotas, incluso podría explicar el paradójico exceso de genes de origen bacteriano encontrado en genomas eucariotas [7, 12].

162 Summarizing discussion

¿Son los peroxisomas un efecto secundario de la endosimbiosis mitocondrial? Entre las re‐localizaciones de proteínas proto‐mitocondriales, las que tienen como destino los peroxisomas constituyen un caso particularmente interesante, pues tienen enormes implicaciones en la formulación de una teoría acerca del origen de estos orgánulos. Los peroxisomas son orgánulos celulares eucariotas, para los cuales se ha propuesto un origen endosimbiotico [25]. Sin embargo, datos experimentales recientes ponen en duda que los peroxisomas sean realmente orgánulos autónomos [26]. A la luz de estos datos hemos realizado un estudio filogenético del proteoma de los peroxisomas de rata y levadura. Los resultados indican que las proteínas de origen alfa‐proteobacteriano representan una fracción significativa del proteoma total del peroxisoma en ambas especies. El origen más probable de este grupo de proteínas, a juzgar por filogenias que indican un origen monofiletico de proteínas peroxisomales y sus homólogos mitocondriales, seria una re‐localización a partir de la proto‐mitocondria. La presencia de proteínas de origen alfa‐proteobacteriano en el peroxisoma parece haberse iniciado ya en los tiempos del ancestro común de todos los peroxisomas. Esto sugeriría que los peroxisomas tuvieron un origen posterior a la simbiosis mitocondrial, aunque no podemos descartar una pérdida secundaria de rutas peroxisomales más antiguas. En conjunto, los resultados sugieren un orgánulo formado a partir del retículo endoplasmico y que inicialmente compartimentalizó el enzima catalasa junto con enzimas productores de H2O2 de la oxidación de ácidos grasos, reclutados estos últimos de la proto‐mitocondria.

Predicción de función de proteínas La endosimbiosis mitocondrial puede considerarse como una masiva transferencia horizontal de genes de las alfa‐proteobacterias a los eucariotas. Trazando la evolución de este pool de genes transferidos a lo largo de los linajes eucarióticos se detectan altos niveles de pérdida génica específica de linaje, con una media de tres perdidas génicas por grupo ortólogo en una filogenia de nueve especies eucariotas. Esta perdida génica podría reflejar la adaptación de las diversas especies a distintos medios, ya que se observa que genes que codifican proteínas de una misma ruta metabólica suelen tener patrones de perdida génica similares. Usando esta propiedad, en combinación con otras técnicas de predicción de función [27], hemos propuesto varias interacciones funcionales entre proteínas mitocondriales [22, 28, 29]. Durante este periodo hemos mantenido una constante colaboración con grupos experimentales del NCMLS y otros centros de investigación, lo que ha permitido un seguimiento experimental de nuestras predicciones. Una de nuestras predicciones, que el inhibidor de la RNasaL (RLI) está implicado en maduración de rRNA [27], ha sido ya comprobada experimentalmente [30], y otras predicciones se están probando actualmente. De este modo esta tesis pone de relieve que el estudio de la evolución genómica y organular es, además, útil en la caracterización de proteínas de función desconocida.

163 Chapter 9

References Referenties Bibliografía

1 Gray, M. W., Burger, G. and Lang, B. F. 1999. ʺMitochondrial evolutionʺ Science (283) 5407:1476‐81 2 Scheffler, I. E. 2001. ʺMitochondria make a come backʺ Adv Drug Deliv Rev (49) 1‐2:3‐ 26 3 Gray, M. W., Burger, G. and Lang, B. F. 2001. ʺThe origin and early evolution of mitochondriaʺ Genome Biol (2) 6:REVIEWS1018 4 Martin, W., et al. 2001. ʺAn overview of endosymbiotic models for the origins of eukaryotes, their ATP‐producing organelles (mitochondria and hydrogenosomes), and their heterotrophic lifestyleʺ Biol Chem (382) 11:1521‐39 5 Martin, W. and Müller, M. 1998. ʺThe hydrogen hypothesis for the first eukaryoteʺ Nature (392) 6671:37‐41 6 Andersson, S. G., et al. 1998. ʺThe genome sequence of Rickettsia prowazekii and the origin of mitochondriaʺ Nature (396) 6707:133‐40 7 Esser, C., et al. 2004. ʺA genome phylogeny for mitochondria among alpha‐ proteobacteria and a predominantly eubacterial ancestry of yeast nuclear genesʺ Mol Biol Evol (21) 9:1643‐60 8 Margulis, L. 1981 ʺSymbioses in Cell Evolutionʺ edited by W. H. Freeman W.H. Freeman San Francisco 9 Emelyanov, V. V. 2001. ʺRickettsiaceae, rickettsia‐like endosymbionts, and the origin of mitochondriaʺ Biosci Rep (21) 1:1‐17 10 Gabaldón, T. and Huynen, M. A. 2003. ʺReconstruction of the proto‐mitochondrial metabolismʺ Science (301) 5633:609 11 Douglas, A. E. 1994 ʺSymbiotic interactionsʺ edited by Oxford Univ. Press New York 12 Rivera, M. C. and Lake, J. A. 2004. ʺThe ring of life provides evidence for a genome fusion origin of eukaryotesʺ Nature (431) 7005:152‐5 13 Moreira, D. and López‐García, P. 1998. ʺSymbiosis between methanogenic archaea and delta‐proteobacteria as the origin of eukaryotes: the syntrophic hypothesisʺ J Mol Evol (47) 5:517‐30 14 Brown, J. R. and Doolittle, W. F. 1997. ʺArchaea and the prokaryote‐to‐eukaryote transitionʺ Microbiol Mol Biol Rev (61) 4:456‐502 15 Gupta, R. S. and Golding, G. B. 1996. ʺThe origin of the eukaryotic cellʺ Trends Biochem Sci (21) 5:166‐71 16 Emelyanov, V. V. 2003. ʺMitochondrial connection to the origin of the eukaryotic cellʺ Eur J Biochem (270) 8:1599‐618 17 Tovar, J., et al. 2003. ʺMitochondrial remnant organelles of Giardia function in iron‐ sulphur protein maturationʺ Nature (426) 6963:172‐6 18 LaGier, M. J., et al. 2003. ʺMitochondrial‐type iron‐sulfur cluster biosynthesis genes (IscS and IscU) in the apicomplexan Cryptosporidium parvumʺ Microbiology (149) Pt 12:3519‐30 19 Lill, R. and Muhlenhoff, U. 2005. ʺIron‐sulfur‐protein biogenesis in eukaryotesʺ Trends Biochem Sci (30) 3:133‐41 20 Tielens, A. G., et al. 2002. ʺMitochondria as we donʹt know themʺ Trends Biochem Sci (27) 11:564‐72 21 Berry, S. 2003. ʺEndosymbiosis and the design of eukaryotic electron transportʺ Biochim Biophys Acta (1606) 1‐3:57‐72

164 Summarizing discussion

22 Gabaldón, T., Rainey, D. and Huynen, M. A. 2005. ʺTracing the Evolution of a Large Protein Complex in the Eukaryotes, NADH:Ubiquinone Oxidoreductase (Complex I)ʺ J Mol Biol (348) 4:857‐70 23 Friedrich, T. 2001. ʺComplex I: a chimaera of a redox and conformation‐driven proton pump?ʺ J Bioenerg Biomembr (33) 3:169‐77 24 Friedrich, T. and Weiss, H. 1997. ʺModular evolution of the respiratory NADH:ubiquinone oxidoreductase and the origin of its modulesʺ J Theor Biol (187) 4:529‐40 25 Latruffe, N. and Vamecq, J. 2000. ʺEvolutionary aspects of peroxisomes as cell organelles, and of genes encoding peroxisomal proteinsʺ Biol Cell (92) 6:389‐95 26 Tabak, H. F., et al. 2003. ʺPeroxisomes start their life in the endoplasmic reticulumʺ Traffic (4) 8:512‐8 27 Gabaldón, T. and Huynen, M. A. 2004. ʺPrediction of protein function and pathways in the genome eraʺ Cell Mol Life Sci (61) 7‐8:930‐44 28 Huynen, M., Snel, B. and Gabaldón, T. 2005 ʺReliable and Specifc Protein Function Prediction by combining homolgy with genomic(s) contexʺ in Discovering Biomolecular Mechanisms with Computational Biology edited by Landes Biosciences Georgetown

29 Huynen, M. A., et al. 2005. ʺCombining data from genomes, Y2H and 3D structure indicates that BolA is a reductase interacting with a glutaredoxinʺ FEBS Lett (579) 3:591‐596 30 Karcher, A., et al. 2005. ʺX‐ray structure of RLI, an essential twin cassette ABC ATPase involved in ribosome biogenesis and HIV capsid assemblyʺ Structure (Camb) (13) 4:649‐59

165 Appendix 1

166 An anaerobic mitochondrion that produces hydrogen

Appendix 1

An anaerobic mitochondrion that produces hydrogen

Brigitte Boxma, Rob M. de Graaf, Georg W. M. van der Staay, Theo A. van Alen, Guénola Ricard, Toni Gabaldón, Angela H. A. M. van Hoek, Seung Yeo Moon‐van der Staay, Werner J.H. Koopman, Jaap J. van Hellemond, Aloysius G. M .Tielens, Thorsten Friedrich, Marten Veenhuis, Martijn A. Huynen & Johannes H. P. Hackstein.

Nature. 2005 Mar 3;434(7029):74‐9

167 Appendix 1

Abstract Hydrogenosomes are organelles that produce ATP and hydrogen [1], and are found invarious unrelated eukaryotes, such as anaerobic flagellates, chytridiomycete fungi and ciliates [2]. Although all of these organelles generate hydrogen, the hydrogenosomes from these organisms are structurally and metabolically quite different, just like mitochondria where large differences also exist [3]. These differences have led to a continuing debate about the evolutionary origin of hydrogenosomes [4, 5]. Here we show that the hydrogenosomes of the anaerobic ciliate Nyctotherus ovalis, which thrives in the hindgut of cockroaches, have retained a rudimentary genome encoding components of a mitochondrial electron transport chain. Phylogenetic analyses reveal that those proteins cluster with their homologues from aerobic ciliates. In addition, several nucleus‐encoded components of the mitochondrial proteome, such as pyruvate dehy‐drogenase and complex II, were identified. The N. ovalis hydrogenosome is sensitive to inhibitors of mitochondrial complex I and produces succinate as a major metabolic end product —biochemical traits typical of anaerobic mitochondria [3]. The production of hydrogen, together with the presence of a genome encoding respiratory chain components, and biochemical features characteristic of anaerobic mitochondria, identify the N. ovalis organelle as a missing link between mitochondria and hydrogenosomes.

Results and discussion Hydrogenosomes and their highly reduced relatives, mitosomes, generally lack an organelle genome [5‐8], hampering clarification of their origin. Two models for the origin of hydrogenosomes are currently debated. The first posits that the ancestral mitochondrial endosymbiont gave rise to aerobically functioning mitochondria, which subsequently evolved into hydrogenosomes by the acquisition of genes encoding enzymes essential for an anaerobic metabolism [9‐13]. The second hypothesis presumes that hydrogenosomes and mitochondria originated from one and the same ancestral ‐ facultatively anaerobic ‐ (endo)symbiont, followed by specialization to aerobic and anaerobic niches during eukaryotic evolution [11, 14].

To address this issue we investigated DNA in hydrogenosomes of N. ovalis, which was previously identified by immunocytochemical methods [15]. Intact N. ovalis hydrogenosomes isolated by cell fractionation contained DNA between 20 and 40 kilobases (kb) long. Long‐range polymerase chain reaction (PCR) with this DNA with the use of specific primers for the hydrogenosomal small‐subunit (SSU) ribosomal RNA [15] and nad7 (obtained earlier by PCR with degenerated primers) yielded a 12‐kb fragment of the organellar genome. It encodes four genes of a mitochondrial complex I (nad2, nad4L, nad5 and nad7), two genes encoding tyr mitochondrial ribosomal proteins RPL 2 and RPL 14, and a tRNA gene (Fig. 1). Nad2 and nad4L, which are generally poorly conserved among ciliates, could be identified by using multiple sequence alignments and an analysis of their membrane‐spanning domains as described previously [16].

168 An anaerobic mitochondrion that produces hydrogen

Figure 1 A 14,027‐bp fragment (mtg 1) of the hydrogenosomal genome of N. ovalis var. Blaberus Amsterdam. Black boxes, RNA coding genes; shaded boxes, genes significant similarity to mitochondrial genes; white boxes, unknown ORFs (named according to the number of codons); arrows, cDNAs identified so far. The numbers the nucleotide positions on the 14‐kb clone (mtg 1). The longest ORF (4,179– 9,728) contains a stretch with significant similarity to nad5. A potential start codon for a putative nad5 transcript is marked with an asterisk.

Phylogenetic analysis revealed clustering of these genes with their homologues from the mitochondrial genomes of aerobic ciliates (Fig. 2, and Supplementary Information). All genes exhibit a characteristic mitochondrial codon‐ usage and lack amino‐terminal extensions that could function as a mitochondrial targeting signal (Table 1). Complementary DNAs isolated for nad5 and nad7 show that they are transcribed. Translation with a nuclear genetic code from N. ovalis, rather than the ciliate mitochondrial code, leads to numerous stop codons (not shown). Five additional open reading frames (ORFs 236, 262, 71, 161 and 199) do not show significant sequence similarity to ORFs from the mitochondrial genomes accessible in the EMBL database. Two ORFs overlap with neighbouring ORFs as in other mitochondrial genomes [17].

Macronuclear gene‐sized chromosomes encoding the 24‐kDa, 51‐kDa and 75‐ kDa subunits of mitochondrial complex I and the Fp and Ip subunits of mitochondrial complex II were cloned with a PCR‐based approach. These have a nuclear codon usage, are transcribed (Table 1), encode a putative N‐terminal mitochondrial targeting signal and branch with their mitochondrial homologues from aerobic ciliates in phylogenetic analyses (Fig. 2, Table 1 and Supplementary Information). They are similar to the two complex I‐like Ndh51 and Ndh24 proteins discovered in Trichomonas vaginalis [18, 19], because a phylogenetic analysis including the mitochondrial homologues from N. ovalis and certain aerobic ciliates reveals that all these proteins belong to a cluster of mitochondrial complex I homologues (see Supplementary Information). Thus, N. ovalis, 7 of the 14 genes encoding core proteins of mitochondrial complex I, and two of the four proteins of mitochondrial complex II, have been identified so far. They are well conserved, are transcribed, and cluster with the mitochondrial homologues of their aerobic (ciliate) relatives, indicating that the hydrogenosomes of N. ovalis have retained parts of a functional mitochondrial electron‐transport chain.

169 Appendix 1

Table1. N, nucleus; H, hydrogenosome; nuc, nuclear, mt, mitochondria; mtg1, 12‐kb clone of hydrogenosomal genome (AJ871267); ?, low‐probability support so far because full‐

length cDNAs have not yet been isolated (The N terminus might be incomplete, or it might contain an in‐frame intron or alternative start codons). For a complete table see

Supplementary information.

Hydrogenosomes of N. ovalis have typical mitochondrial cristae and contain cardiolipin [13]. They are closely associated with endosymbiotic methanogens, which are biomarkers for hydrogen formation by the N. ovalis hydrogenosomes [20] (Fig. 3a). The organelles stain with Mitotracker Green FM and fluoresce with rhodamine 123, indicating the presence of a membrane potential (Fig. 3). Carbonyl cyanide p‐trifluoromethoxyphenylhydrazone (FCCP) (5 mM) prevented staining with rhodamine 123, indicating the possible presence of a proton gradient. Moreover, staining of the hydrogenosomes with rhodamine 123 was also prevented after incubation of the ciliates with rotenone, piericidin, fenazaquin and 1‐methyl‐4‐ phenylpyridinium (MPPþ) (classical inhibitors of mitochondrial complex I [21]), but not with cyanide (1 mM) or antimycin A (inhibitors of mitochondrial complex III and IV; Fig. 3). Similarly, treatment with cyanide and salicylhydroxamic acid (SHAM), inhibitors of mitochondrial complex IV of the respiratory chain and the plant‐like alternative oxidase known from certain mitochondria [3], respectively, neither killed N. ovalis nor interfered with its oxygen consumption under aerobic conditions (not shown). These observations not only indicate the absence of a functional complex III and IV and the absence of a terminal (plant‐like) alternative oxidase, but also reveal the presence of a functional mitochondrial complex I as the source of the organellar proton gradient [3]. The oxygen consumption of N. ovalis observed under aerobic conditions is most probably a detoxification mechanism, and longer exposure to

170 An anaerobic mitochondrion that produces hydrogen atmospheric oxygen kills the ciliates effectively.

Figure2: Phylogenetic analysis of hydrogenosomal genes. Both the organellar 12S (SSU) rRNA gene (b) and the nuclear hsp60 (c) reveal a ciliate ancestry for the hydrogenosome of N. ovalis. The same is true for the components of a ʹmitochondrialʹ complex I, the nad7 (49 kDa; organellar, a) and 51 kDa (nuclear, d) genes. The phylogenies were derived using MrBayes and neighbour joining: the topologies correspond to the maximum‐likelihood (MrBayes) approach, and the values at the nodes indicate the posterior probability for the partition and its bootstrap value, respectively. Only values higher than 50% are indicated. See Supplementary Information. EB, Eubacteria.

171 Appendix 1

Figure3 Hydrogenosomes of N. ovalis exhibit complex I activity. a, b, Electron micrographs of a hydrogenosome of N. ovalis (a) and a mitochondrion of Euplotes sp. (b). White arrowheads mark cristae; m, endosymbiotic methanogenic archaeon [15, 20]. c, Fluorescence picture of N. ovalis hydrogenosomes (bright dots), which were released from the cell by gentle squashing after being stained in vivo with ethidium bromide. d, Rhodamine 123 (R123) also stains the hydrogenosomes, the only organelles matching the expected size (compare a, and Supplementary Information). e, Mitotracker green FM stains the same organelles. The inserts in d and e (outside the ciliate) show the organelles seen in the box inside the ciliate. f, h, i, Incubation of living cells with inhibitors of mitochondrial complex I (MPP+ (f), piericidin (h) and rotenone (i))21 completely prevents staining of the organelles by R123. g, Incubation with cyanide (1 mM) or antimycin A (not shown) does not interfere with staining by R123. For additional information see the text and Supplementary Information. Scale bars, 1 μm (a, also applies to b); 10 μm (c‐i).

14 Metabolic experiments using tracer amounts of uniformly labelled (U‐) C‐glucose revealed that N. ovalis catabolizes glucose predominantly into acetate, lactate, succinate and smaller amounts of ethanol, in addition to CO2 (Table 2). The presence of oxygen did not cause significant changes in the pattern of excreted end products

172 An anaerobic mitochondrion that produces hydrogen

14 (not shown). Notably, incubations in the presence of [6‐ C]glucose did not result in 14 the formation of labelled CO2. Because C‐labelled CO2 is released from [6‐ 14 C]glucose by successive decarboxylations through multiple rounds in the Krebs 14 cycle, the absence of labelled CO2 after application of [6‐ C]glucose indicates the 14 absence of a complete Krebs cycle. The observed excretion of C‐labelled CO2 after 14 incubation with [U‐ C]glucose could be the result of either pyruvate dehydrogenase (PDH) activity, as in typical aerobic mitochondria, or pyruvate:ferredoxin oxidoreductase (PFO) activity, as in the hydrogenosomes of T. vaginalis. A third possibility for pyruvate catabolism, pyruvate formate lyase activity [22, 23], can be excluded because no detectable amounts of formate were produced from [U‐ 14 C]glucose (Table 2). We failed to identify genes for PFO but succeeded in isolating three of the four PDH genes, namely the E1a, E1b and E2 subunits, which are expressed as cDNA, indicating that N. ovalis uses a mitochondrial PDH for oxidative 14 decarboxylation. Significant amounts of C‐labelled succinate from both [U‐ 14 14 C]glucose and [6‐ C]glucose (Table 2) indicate that endogenously produced fumarate is used as a terminal electron acceptor, as in some anaerobic mitochondria [3]. Fumarate reduction in N. ovalis (to account for the production of succinate) is most probably catalysed by a membrane‐bound complex II (see above; Table 1, and Supplementary Information), which is coupled to complex I through electron transport mediated by quinines [3]. Mass spectrometry coupled to liquid chromatography of lipid extracts from N. ovalis revealed the presence of small amounts of quinones (rhodoquinone 9 and menaquinone 8) at a concentration of about 1 pmol per mg protein (Supplementary Information). This amount is at least two orders of magnitude lower than in other eukaryotes known to possess anaerobic mitochondria producing succinate [24]. The low concentration of quinones in N. ovalis cells might reflect the intermediate state of their hydrogenosomes, occupying a position between mitochondria (which contain a membrane‐bound electron transport chain) and previously characterized hydrogenosomes (which do not)[1, 3‐ 5, 11, 18]. Although an FoF1‐ATP synthase has not yet been identified, the hydrogenosome of N. ovalis has retained certain basal energy‐generating functions of an aerobically functioning mitochondrion [3]. To explore the presence of additional ‘mitochondrial’ traits in N. ovalis, we performed a reciprocal Smith– Waterman sequence comparison between about 2,000 six‐frame‐translated clones from our genomic DNA library of N. ovalis and the yeast [25] and human [26] mitochondrial proteins. We identified 53 additional nuclear genes encoding potential mitochondrial proteins in addition to components of the mitochondrial import machinery (Table 1, and Supplementary Information).

In contrast, the hydrogenase of N. ovalis does not exhibit any mitochondrial traits. This hydrogenase is rather unusual in comparison with other eukaryotic hydrogenases because it seems to be a fusion of a [Fe] hydrogenase with two accessory subunits of different evolutionary origin [4,

173 Appendix 1

15]. These subunits should allow NADH reoxidation in combination with the [Fe] hydrogenase, because they exhibit a significant sequence similarity to the hox F and hox U subunits of b‐proteobacterial [Ni–Fe] hydrogenases, in contrast to similar, recently described hydrogenosomal proteins (24 and 51 kDa) of putative mitochondrial origin from T. vaginalis [4, 15]. The ‘mitochondrial’ 24‐kDa and 51‐kDa genes of N. ovalis are clearly different from the above‐mentioned hydrogenase modules and are likely to function in mitochondrial complex I (Fig. 3, Table 1, and Supplementary Information). Moreover, the catalytic centre of the hydrogenase (the H‐cluster) clusters neither with any of the hydrogenases of Trichomonas, Piromyces and Neocallimastix studied so far, nor with any of the hydrogenase‐related Nar proteins, which seem to be shared by all eukaryotes. Rather, the N. ovalis hydrogenase is more closely related to [Fe] hydrogenases from d‐proteo‐ bacteria [4, 15, 27]. These bservations suggest that the hydrogenase of N. ovalis has been acquired by lateral gene transfer.

It should be realized that the hydrogenosome of N. ovalis is so far unique and not representative of all hydrogenosomes, which seem to have evolved repeatedly and independently—albeit from the same ancestral mitochondrial‐ type organelle. All our data identify the hydrogenosome of N. ovalis as a ciliate‐type mitochondrion that produces hydrogen. The presence of respiratory‐chain activity of mitochondrial complex I and II, in combination with hydrogen formation, characterizes the N. ovalis hydrogenosome as a true missing link in the evolution of mitochondria and hydrogenosomes.

Methods

Strains Nyctotherus ovalis ciliates were isolated from the hindgut of the cockroach Blaberus sp. (strain Amsterdam), taking advantage of their unique (anodic) galvanotactic swimming behaviour [28] After the ciliate’s arrival at the anode, cells were picked up with a micropipette, inspected individually under a dissecting microscope at x40 magnification, collected in an Eppendorf tube and washed three times with anaerobic electromigration buffer. Ciliates belonging to the Euplotes crassus/vannus/minuta complex were cultured in artificial sea water, feeding on bacteria growing on an immersed small piece of meat (approx. 1 g of beef steak). In addition, the ciliates were fed with Escherichia coli, which were added at weekly intervals. Ciliates were harvested by centrifugation.

174 An anaerobic mitochondrion that produces hydrogen

Microscopy Electron microscopy of N. ovalis and Euplotes sp. was performed as described previously [13, 15]. Fluorescence microscopy was performed with a Noran OZ video‐ rate confocal microscope as described previously [29]. Inhibitors were used in concentrations of 1 mM. They were dissolved in N. ovalis culture medium [28]. The rotenone solution contained 10% dimethyl sulphoxide, the fenazaquin solution 1% dimethyl sulphoxide.

Metabolite and quinone determinations Micro‐aerobic incubations with N. ovalis were performed in rotating (20 r.p.m.) sealed incubation flasks containing 5 ml incubation medium (containing 10,000–15,000 cells). All incubations were performed for 48 h and contained either 10 mCi [U‐14C]glucose or 10 mCi [6‐14C]glucose (2.07 GBq mmol [21]), both from Amersham. Incubations were terminated by the addition of 300 ml 6 M HCl to lower the pH from 7.2 to 2.0. N. ovalis cells were separated from the medium by centrifugation (5 min at 500g and 4 8C); excreted end products were analysed by anion‐exchange chromatography on a Dowex 1X8 column. Quinones were separated by liquid chromatography and detected with a Sciex API 300 triple quadrupole mass spectrometer (see Supplementary Information).

Isolation of organellar DNA N. ovalis cells were washed once with isolation buffer (0.35 M sucrose, 10 mM Tris‐HCl pH 7, 2 mM EDTA) and disrupted in a Dounce homogenizer. Nuclei were centrifuged at 3,000g for 5 min, and organelles were pelleted from the supernatant at 10,000g for 30 min. Genomic DNA was isolated by using standard procedures or after lysis of the cells in 8 M guanidinium chloride.

Genomic DNA library Gene‐sized chromosomes were randomly amplified by PCR with telomere‐ specific primers, size‐fractionated in agarose gels, and cloned in pGEM‐T‐easy (Promega). Clones with sizes between 0.5 and 5 kb were end‐sequenced and analysed manually by TBLASTX (http://www.ncbi.nlm.nih.gov/BLAST). Searches were conducted with BLASTN and FASTA. c‐DNA library RNA was isolated with the RNeasy Plant minikit (Qiagen). cDNA was prepared with the Smart Race cDNA amplification kit (Clontech). Expressed sequence tags were amplified by PCR with the universal adapter primer provided with the kit and the various, specific internal primers.

Complete macronuclear gene‐sized chromosomes Telomere‐specific primers in combination with internal gene sequences allow a straightforward recovery of the complete gene [30]. The specific (internal) primers were based on the DNA sequences of internal fragments of the various genes, which were recovered previously by PCR with degenerated primers for conserved parts of the various genes.

175 Appendix 1

Phylogenetic analysis Protein sequences were aligned with ClustalW and Muscle; unequivocally aligned positions were selected with Gblocks or manually. Phylogenies were inferred with maximum likelihood by using a discrete gamma‐distribution model with four rate categories plus invariant positions and the Poisson amino acid similarity matrix, and neighbour joining as implemented in ClustalW, correcting for multiple substitutions with the Gonnet amino acids identity matrix, and bootstrapping with 100 samples. ORFs with a lower size limit of 100 nucleotides were identified with ORF Finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html). tRNAs were identified with tRNAscan‐SE (http://www.genetics.wustl.edu/eddy/tRNAscan‐SE). Potential mitochondrial import signals were detected with MITOP (http://mips.gsf.de/cgi‐ bin/proj/medgen/mitofilter). Sequence searches were performed with BLASTX (http://www.ncbi.nlm.nih.gov/BLAST), BLASTN and FASTA. For references on phylogenetic analysis see Supplementary Information.

Supplementary Information accompanies the paper on www.nature.com/nature.

Acknowledgements We thank L. Landweber, J. Wong and W.‐J. Chang for advice on the cloning of complete minichromosomes and for sharing the first sequence of a PDH gene in N. ovalis; S. van Weelden and H. de Roock for help in the metabolic studies; J. Brouwers for analysis of the quinones; G. Cremers, L. de Brouwer, A. Ederveen, A. Grootemaat, M. Hachmang, S. Huver, S. Jannink, N. Jansse, R. Janssen, M. Kwantes, B. Penders, G. Schilders, R. Talens, D. van Maassen, H. van Zoggel, M. Veugelink and P. Wijnhoven for help with the isolation of various N. ovalis sequences; and K. Sjollema for electron microscopy. G.W.M.v.d.S., S.Y.M.‐ v.d.S. and G.R. were supported by the European Union 5th framework grant ‘CIMES’. This work was also supported by equipment grants from ZON (Netherlands Organisation for Health Research and Development), NWO (Netherlands Organisation for Scientific Research), and the European Union 6th framework programme for research, priority 1 “Life sciences, genomics and biotechnology for health” to W.J.H.K.

References

1 Muller, M. 1993. ʺThe hydrogenosomeʺ J Gen Microbiol (139) 12:2879‐89 2 Roger, A. J. 1999. ʺReconstructing Early Events in Eukaryotic Evolutionʺ Am Nat (154) S4:S146‐S163 3 Tielens, A. G., et al. 2002. ʺMitochondria as we donʹt know themʺ Trends Biochem Sci (27) 11:564‐72 4 Embley, T. M., et al. 2003. ʺHydrogenosomes, mitochondria and early eukaryotic evolutionʺ IUBMB Life (55) 7:387‐95 5 Dyall, S. D., Brown, M. T. and Johnson, P. J. 2004. ʺAncient invasions: from endosymbionts to organellesʺ Science (304) 5668:253‐7 6 Clemens, D. L. and Johnson, P. J. 2000. ʺFailure to detect DNA in hydrogenosomes of

176 An anaerobic mitochondrion that produces hydrogen

Trichomonas vaginalis by nick translation and immunomicroscopyʺ Mol Biochem Parasitol (106) 2:307‐13 7 León‐Ávila, G. and Tovar, J. 2004. ʺMitosomes of Entamoeba histolytica are abundant mitochondrion‐related remnant organelles that lack a detectable organellar genomeʺ Microbiology (150) Pt 5:1245‐50 8 van der Giezen, M., et al. 1997. ʺHydrogenosomes in the anaerobic fungus Neocallimastix frontalis have a double membrane but lack an associated organelle genomeʺ FEBS Lett (408) 2:147‐50 9 Embley, T. M., Horner, D. A. and Hirt, R. P. 1997. ʺAnaerobic eukaryote evolution: hydrogenosomes as biochemically modified mitochondria?ʺ Trends Ecol. Evol. (12) 437‐441 10 Fenchel, T. F., B. 1995 ʺEcology and Evolution in Anoxic Worldsʺ edited by Oxford University Press Oxford, UK 11 Martin, W., et al. 2001. ʺAn overview of endosymbiotic models for the origins of eukaryotes, their ATP‐producing organelles (mitochondria and hydrogenosomes), and their heterotrophic lifestyleʺ Biol Chem (382) 11:1521‐39 12 van der Giezen, M., et al. 2002. ʺConserved properties of hydrogenosomal and mitochondrial ADP/ATP carriers: a common origin for both organellesʺ EMBO J (21) 4:572‐9 13 Voncken, F., et al. 2002. ʺMultiple origins of hydrogenosomes: functional and phylogenetic evidence from the ADP/ATP carrier of the anaerobic chytrid Neocallimastix spʺ Mol Microbiol (44) 6:1441‐54 14 Martin, W. and Müller, M. 1998. ʺThe hydrogen hypothesis for the first eukaryoteʺ Nature (392) 6671:37‐41 15 Akhmanova, A., et al. 1998. ʺA hydrogenosome with a genomeʺ Nature (396) 6711:527‐8 16 Brunk, C. F., et al. 2003. ʺComplete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genesʺ Nucleic Acids Res (31) 6:1673‐82 17 Burger, G., Gray, M. W. and Lang, B. F. 2003. ʺMitochondrial genomes: anything goesʺ Trends Genet (19) 12:709‐16 18 Dyall, S. D., et al. 2004. ʺNon‐mitochondrial complex I proteins in a hydrogenosomal oxidoreductase complexʺ Nature (431) 7012:1103‐7 19 Hrdy, I., et al. 2004. ʺTrichomonas hydrogenosomes contain the NADH dehydrogenase module of mitochondrial complex Iʺ Nature (432) 7017:618‐22 20 van Hoek, A. H., et al. 2000. ʺMultiple acquisition of methanogenic archaeal symbionts by anaerobic ciliatesʺ Mol Biol Evol (17) 2:251‐8 21 Degli Esposti, M. 1998. ʺInhibitors of NADH‐ubiquinone reductase: an overviewʺ Biochim Biophys Acta (1364) 2:222‐35 22 Boxma, B., et al. 2004. ʺThe anaerobic chytridiomycete fungus Piromyces sp. E2 produces ethanol via pyruvate:formate lyase and an alcohol dehydrogenase Eʺ Mol Microbiol (51) 5:1389‐99 23 Akhmanova, A., et al. 1999. ʺA hydrogenosome with pyruvate formate‐lyase: anaerobic chytrid fungi use an alternative route for pyruvate catabolismʺ Mol Microbiol (32) 5:1103‐14 24 Van Hellemond, J. J., et al. 1995. ʺRhodoquinone and complex II of the electron transport chain in anaerobically functioning eukaryotesʺ J Biol Chem (270) 52:31065‐ 70 25 Sickmann, A., et al. 2003. ʺThe proteome of Saccharomyces cerevisiae mitochondriaʺ Proc Natl Acad Sci U S A (100) 23:13207‐12 26 Cotter, D., et al. 2004. ʺMitoProteome: mitochondrial protein sequence database and

177 Appendix 1

annotation systemʺ Nucleic Acids Res (32 Database issue) D463‐7 27 Voncken, F. G., et al. 2002. ʺA hydrogenosomal [Fe]‐hydrogenase from the anaerobic chytrid Neocallimastix sp. L2ʺ Gene (284) 1‐2:103‐12 28 Van Hoek, A. H., et al. 1999. ʺVoltage‐dependent reversal of anodic galvanotaxis in Nyctotherus ovalisʺ J Eukaryot Microbiol (46) 4:427‐33 29 Koopman, W. J., et al. 2001. ʺMembrane‐initiated Ca(2+) signals are reshaped during propagation to subcellular regionsʺ Biophys J (81) 1:57‐65 30 Curtis, E. A. and Landweber, L. F. 1999. ʺEvolution of gene scrambling in ciliate micronuclear genesʺ Ann N Y Acad Sci (870) 349‐50

178 Color figures and tables

Appendix 2

Color figures and tables

Figures from Chapters 2 and 3

179 Appendix 2

Chapter 2, Figure 4: Gene fusions within the Tryptophan synthesis pathway: A) L‐ Tryptophan synthesis pathway and enzymatic activities associated to each step: E (yellow): Anthranilate/para‐aminobenzoate component I. G (red): Anthranilate/para‐aminobenzoate component II. D (blue): Anthranilate phosphorybosil transferase. F (pink): Phosphorybosilanthranilate isomerase. C (green): Indole‐3‐glycerol phosphate synthase. A (cream): Tryptophan synthase alpha chain. B (brown): Tryptophan synthase beta chain. B) Gene fusions observed in several fully sequenced genomes as provided by STRING database [43], only one species for each fusion is chosen as an example. C) Functional network derived from the fusion events within genes of the Tryptophan synthesis pathways, enzymatic functions are linked if they are observed in the same polypeptidic chain in at least one genome.

180 Color figures and tables

Chapter 3, figure 1. An overview of metabolism and transport in the proto‐mitochondrion, as deduced from the orthologous groups present in its estimated proteome. Yellow boxes indicate the number of groups in that COG functional class. Boxes, arrows, and cylinders indicate pathways, enzymes, and transporters, respectively. Blue: proteins are mitochondrial in yeast or human. Orange: human and yeast orthologs have not been observed in mitochondria. Black: there is no human or yeast representative of this orthologous group.

181 Appendix 2

182 Color figures and tables

Figures from Chapters 4 and 5

183 Appendix 2

Chapter 4, Fig. 1. Schematic representation of the analyses: a) For every protein encoded in one of the six alpha‐proteobacterial genomes (blue rectangle), b) a Smith‐Waterman search was performed against the proteins encoded in a total of 75 genomes (rectangle) including the 6 alpha‐proteobacteria, 9 eukaryotes (in red) and other bacteria and archaea (in yellow), c) The selected homologs were aligned and d) a phylogenetic tree was derived. Repeating this procedure the phylomes of the six alpha‐proteobacteria were constructed and e) scanned for partitions indicating an alpha‐protomitochondrial origin of the eukaryotic proteins, f) alpha‐ proteobacterial and eukaryotic proteins of this partition constitute an orthologous group whose phylogenetic pattern is derived, g) a Hidden Markov Model is built from an alignment of the members of each orthologous group and used to, h) search against the genomes of the eukaryotic species absent in the profile to detect unpredicted genes and i) search against the eukaryotic proteomes and j) eventually correct the phylogenetic profile, k) this profile is mapped on the species tree to infer the pattern of gene loss events.

184 Color figures and tables

Chapter 5, Figure 1. ClustalX [31] view of the multiple sequence alignment of the NB5M/NUVM orthologous group. The alignment was performed with MUSCLE [32] and then refined manually. The numbers at the start and the end of the sequences indicate the first and last amino acid positions of the alignment. The bar plot at the bottom indicates the level of conservation of the positions in the alignment.

Chapter 5, Figure 5. Schematic structure of human/bovine Complex I, indicating which subunits were gained at every evolutionary stage. Circles and ovals represent different subunits, with their size being roughly proportional to their molecular weight. The position of the different subunits within the structure is in accordance with their presence/absence in the different elution fractions and sub‐complexes, to gene fusions and to other data described in the literature, see Methods for details. The color code indicates the evolutionary stage in which a certain subunit was gained. The figure illustrates that subunits have been gained in the evolutionary lineage leading to H.sapiens in all fractions of Complex I and in all stages.

185 Appendix 2

186 Color figures and tables

Figures from Chapters 6 and 8 (I)

187 Appendix 2

Chapter 6, Figure 5: Comparison of the reconstructed human and yeast mitochondrial metabolisms. Pathways shared by the two species are depicted in the middle region of the figure, pathways at the extremes of the dashed lines are exclusive for human (left) or yeast (right) mitochondria. Color codes indicate whether the pathway was likely present in the proto‐mitochondrion (blue) or has a different origin (red). Only pathways with two or more consecutive steps are depicted. Symbols as in figure 2.

188 Color figures and tables

Chapter 8, Figure 1 C: Pie chart showing the relative distribution of peroxisomal proteins according to their phylogenetic origin in yeast (left) and rat (right). Proteins that do have prokaryotic homologs but for which no reliable tree can de constructed, e.g. due to short stretches of homology, are considered “unresolved”.

Chapter 8, Figure 2. The various scenarios for the origin of peroxisomal proteins. Proteins are derived from the alpha‐proteobacterial ancestor of the mitochondria, including the transfer of their genes to the nucleus (scenario I). Proteins without a (detectable) alpha‐proteobacterial ancestry have been retargeted from the mitochondria (scenario II). Proteins have been retargeted from other compartments of the cell like the Endoplasmatic Reticulum (scenario III).

Chapter 8, Figure 3 ClustalX [28] view of the multiple sequence alignment of several fungal members of the Cit1/2p orthologous group. The alignment was performed with MUSCLE [29], only the N‐terminal region is shown. Amino acids around the signal‐peptide cleavage‐sites, as predicted by Mitoprot [22] are marked with a rectangle (white arrow indicates the position in the alignment) they correspond to a YS (YA in C. tropicalis) that is missing in Cit2p. No mitochondrial localization nor a cleavage‐site is predicted for Cit2p consistent with its peroxisomal location.

189 Appendix 2

190 Color figures and tables

Figure from Chapter 8 (II)

191 Appendix 2

Chapter 8, Figure 4. Evolution of the peroxisomal proteome. Biochemical pathways reconstructed according to KEGG and annotations of peroxisomal proteins. For details on the reconstruction of ancestral estates see supplemental material. Color code: yellow, eukaryotic origin; green, alpha‐proteobacterial origin; red, actinomycetales origin; blue, cyanobacterial origin; white, origin unresolved.

192 Color figures and tables

193 Acknowledgements

194 Acknowledgments

Acknowledgements

Although I am one of those who consider one should enjoy the traveling and not the destination, I cannot avoid feeling that the completion of my PhD constitutes an important hallmark, not only of my scientific career but also of my life. The chapters of this book do not only contain facts and hypotheses, for me each paragraph also tells a story between the lines. Stories of important events that have been contemporary with this four‐years travel towards the PhD. Looking back I realize I did not walk through this path alone, many persons helped me along the way. Here I would like to express my gratitude to all of them.

First of all I wish to thank Martijn for giving me the opportunity to work with him. Not only has he acted as a starter of my career in bioinformatics but also as a permanent catalyst, being always a very supportive and inspiring supervisor. I have enjoyed so much working with him that I would certainly be a happy post‐doc in his group if he would consider moving to sunnier latitudes ;‐). I really hope the distance will just be relative and, for the years to come, we will still be solving challenging puzzles together.

The “comics” (computational genomics) group was a very good environment for doing research. Instead of listing all their names, I have decided to dedicate to them the last tree made for this Thesis (the comics tree of life. figure 1). I will only mention here those who are direct contributors of some of the chapters: Berend, besides working together with me on the peroxisome evolution, he has always been closely following my research and providing important insights; The two Master students that I have supervised during this time, Daphne and Frank, have also contributed to fundamental parts of my research and working with them has been a great pleasure.

Researchers from other groups have provided valuable input to this thesis, either by directly collaborating or through discussions which have inspired my work. Jan Smeitink, Cristina Ugalde, Leo Nijtmans and other Complex I group members, for willing to test our multiple predictions; Erik Schon, for providing us with a list of mitochondrial proteins before large‐scale subcellular proteomics came to existence; Henk Tabak, for helping us elucidating the origin of the peroxisome; Johannes Hackstein, for his interesting conversations and contagious enthusiasm. Charles Kurland for critically reading some of my manuscripts and for forgiving me for the horizontal transfer to my thesis of the title of one of his papers. Finally, Thorsten Friedrich, William Martin, Victor Emelyanov, John F. Allen, Alfonso Valencia and David Liberles are prominent scientist in my field which I have been fortunate to meet personally on different circumstances, with them all I have had interesting conversations which have fuelled my interest for what I was doing.

195 Acknowledgements

Figure 1. comics tree of life. Extant (and extinct) species that belong (or belonged to) the comics group. Names were downloaded from the comics group website and edited manually. Tree was reconstructed using mL (minimal Likelihood) method because Neighbor Joining was clustering all species from Utrecht to the exclusion of those from Nijmegen. For once, bootstrap values are not shown. M. huynen is used as an out‐group, comics group is rooted at him. Other members are grouped according to the niche (office) they inhabit, except student species which clustered together probably due to their high‐ GC ([post]Graduate courses) content. The clustering of the hypertechnophile bacterium B. dutilh and B. snel is probably caused by a long‐leg attraction artifact.

My department, the CMBI, is full of bright and nice people from whom you can ask help at any time in any situation. I am thankful to all of them (you can see all their names here: www.cmbi.ru.nl), and especially those with whom I have interacted most: Our secretaries Esther and Barbara; Our system managers Wim, Wilmar and Maarten; My fellow PhD candidates Marc, Sander, Elmar, Robert, Jos, Simon, Michiel…; The post‐docs Manu, Chris, Frank... Profesors, students, visiting scientists and other staff Martin, Karin, Celia, Roland, Rolando… And especially Gert, director of CMBI, who always found the time to personally care about how I was doing in Nijmegen, not only regarding research but also life in general.

I want to thank NWO for providing the money for my salary. I want to thank as well The Netherlands and Dutch people in general for making it so easy for me to come, live and work here, for having an open scientific system which easily recognizes foreign diplomas and which dignifies the scientific career by providing PhD candidates with proper working contracts. I do not want to forget all developers of freely‐available/open‐source software, operative systems and programming languages.

196 Acknowledgments

So far the acknowledgements to those who have been (in)directly involved in this thesis. In the following lines I wish to thank friends and family, for they have been fundamental for me during these four years.

Nola, you have been a perfect colleague and a very good friend outside the office, thank you for that; René, thank you for sharing with us your friendship and endless enthusiasm. Manu, you have been my best friend here, I let you know in our own words: gniiiiiiiiiiiiaaaaaaaaauurrrr. Other work colleagues have also become good friends of mine: Tim, Marc, Wilco, Wim. Cuca, Cristina, Cristina, Daniel, Pili, Valeria, Arjan, Aurelie and all my friends in Nijmegen. These four years would have not been so nice without you, thank you all. All my housemates at Museumkamstraat, especially the “kitchen family”: Gerben, Marcus and Natalia, with you I have also shared great experiences. Verder wil ik ook alle genoten van de Volleyball ploeg ’t Orakel bedanken voor alle de mooie spelen. You are all invited to Valencia and I hope we see us soon and often.

Cuando hice la maleta me di cuenta que las cosas más importantes nunca podrían caber en ella, mi familia y amigos quedaban atrás. Durante estos cuatro años su constante apoyo y cariño han sido fundamentales para mí, gracias a todos. A los amigotes, por ser ese grumo imposible de disolver que espero que siempre seamos. A las amigotas, que no se quedan atrás. A tod@s, pero especialmente a los que vinieron a verme en alguna ocasión: Javi, Hugo, Miguel Ángel, Isabel, José María, Maite, Lucía, Antonio. A Elena, que no vino físicamente pero sus cartas me visitaron regularmente. A mis amigos y compañeros de Facultad, de Bioquímica, de Alemania, del colegio, a todos los que siempre confiaron en mi: Tani, Puri, Gema, Carmen, Espe, Sari, Nelo, Laura…; A los compañeros de Joves, eurodoc y Precarios; A mis amigos y a mi familia de Córdoba, en especial a Wilson y Concha, por hacerme sentir uno más entre ellos.

A mis padres, por su apoyo incondicional siempre. A Marta, por reír y llorar conmigo. A mis hermanos Alicia y Daniel, por estar siempre ahí, cuando dejamos de jugar para tomar decisiones importantes en la vida. A mí querida Yaya que ha seguido con enorme interés toda mi tesis, por estar tan orgullosa de sus nietos. A todos mis tíos y tías, primos y primas.

A los que solo viven en nuestros corazones y nunca olvido.

Toni, Nijmegen August 2005

197 Curriculum Vitae

198 Curriculum Vitae

Curriculum Vitae

Toni Gabaldón was born in Valencia (Spain) the eighth of November of 1973. He received his basic and intermediate education at the public school “Jaime Balmes” and the public high‐school “San Vicente Ferrer”, respectively. From 1991 he studied biology and biochemistry at the Johannes Gutemberg University in Mainz (Germany) and at the University of Valencia, where he graduated with an honors degree in 1996. From 1997 to 2000 he was involved in teaching assistance and research at the department of Biochemistry and Molecular Biology of the University of Valencia and at the Institute National des Sciences Apliquees in Toulouse (France). During this period he obtained an Advanced Studies Degree (DEA) in Biochemistry and Molecular Biology. From 2001 he has been working as a PhD candidate in the computational genomics group of the Center for Molecular and Biomolecular Informatics (CMBI) and the Nijmegen Center for Molecular Life Sciences (NCMLS), both at the Radboud University of Nijmegen (The Netherlands). He has now been awarded a two years EMBO post‐doctoral fellowship to work at the Bioinformatics department of the Principe Felipe Research Center in Valencia.

199 List of Publications

200 List of Publications

List of Publications

ƒ Gabaldón, T. & Huynen, M. A. (2003) Reconstruction of the protomitochondrial metabolism. Science. 2003 Aug 1;301(5633):609 ƒ Gabaldón, T. & Huynen, M. A. (2004) Prediction of protein function and pathways in the genome era. Cellular and Molecular Life Sciences. 2004 Apr 61;(7‐8):930‐44 ƒ Gabaldón T. & Huynen M. A. (2004) Shaping the mitochondrial proteome. Biochimica et Biophysica Acta – Bioenergetics. Vol 1659/2‐3 pp 212‐220 ƒ Huynen, M.A., Spronk C.A.E.M., Gabaldón, T. and Snel B (2005) Combining genomes, Y2H and 3D data indicates that BolA is a reductase interacting with a glutaredoxin. . FEBS‐Letters 31 January 2005 (Vol. 579, Issue 3, Pages 591‐596) ƒ Huynen M.A. Snel B and Gabaldón T. (2005) Reliable and specific protein function prediction by combining homology with genomic(s) context. In Discovering of Biomolecular mechanisms with theoretical data analyses. Landes Bioscience Ed. Frank Eisenhaber (ISBN TBA134) ƒ Boxma B, Graaf RM, van der Staay GWM, vanAlen TA, Ricard G, Gabaldón T, van Hoek AHAM, Moon‐van der Staay SY, Koopman WJH, van Hellemond JJ, Tielens AGM, Friedrich T, Veenhuis M , Huynen MA, Hackstein JHP (2005) An anaerobic mitochondrion that produces hydrogen. Nature. (2005) Mar 3;434(7029):74‐9. ƒ Huynen M.A. , Gabaldón T and Snel B. (2005) Variation and evolution of biomolecular systems: searching for functional relevance. FEBS‐ Letters 2005. 579(8):1839‐1845 ƒ Gabaldón, T, Rainey D and Huynen, M. A. (2005) Tracing the evolution of a large protein complex in the eukaryotes, NADH:ubiquinone oxidoreductase (Complex I) . Journal of Molecular Biology. 13;348(4):857‐ 70 ƒ Gabaldón T. & Huynen M.A. (2005) Gene loss following mitochondrial endosymbiosis and its applications for function prediction in the eukaryotes. Bioinformatics (In press). ƒ Gabaldón T, Snel B., vanZimmeren F., Hemrika W., Tabak H. and Huynen M.A.. The evolutionary origin of the peroxisome (Submitted). ƒ Gabaldón T. & Huynen M.A. (2005). Evolution of the mitochondrial metabolism Bioinformatics (In preparation).

201