Genòmica comparada a l’origen dels metazous

Comparative genomics at the origins of Metazoa

Alexandre De Mendoza Soler


WARNING. By consulting this thesis you accept the following conditions of use: Dissemination of this thesis through the TDX service (www.tdx.cat) and the UB Digital Repository (diposit.ub.edu) has been authorized by the holders of the intellectual property rights only for private uses within research and teaching activities. Reproduction for profit is not authorized, nor is its dissemination or availability from a site other than the TDX service or the UB Digital Repository. Presenting its content in a window or frame external to the TDX service or the UB Digital Repository (framing) is not authorized. These rights cover both the presentation summary of the thesis and its contents. When using or citing parts of the thesis, the name of the author must be indicated.


From left to right and top to bottom: electron microscopy photograph of Capsaspora owczarzaki by Arnau Sebé-Pedrós; photograph of a jellyfish by Alexandre de Mendoza; photograph of an Ediacaran fossil from the Natural History museum in Chicago by Alexandre de Mendoza; electron microscopy photograph of Creolimax fragrantissima.

Doctoral Programme in Genetics, Department of Genetics, Faculty of Biology, Universitat de Barcelona


Dissertation submitted by Alexandre de Mendoza Soler to qualify for the degree of Doctor from the Universitat de Barcelona

Dr. Iñaki Ruiz-Trillo (thesis director)
Dr. Jaume Baguñà Monjo (tutor)
Alexandre de Mendoza Soler (author)

Barcelona, September 2013

Table of Contents

INTRODUCTION
  TYPES OF MULTICELLULARITY, MULTIPLE INDEPENDENT ORIGINS AND TIMELINE
  ADAPTIVE VALUE AND STRUCTURAL CONSTRAINTS OF MULTICELLULARITY
  THE MULTICELLULAR TOOLKIT
  EARLY METAZOAN EVOLUTION: PHYLOGENY, MORPHOLOGY AND PALEONTOLOGY
  METAZOAN GENOMES AND THE DEVELOPMENTAL TOOLKIT
  THE PROTISTAN ANCESTRY OF METAZOANS: THE ...
  PRE-METAZOAN ORIGIN OF THE DEVELOPMENTAL TOOLKIT

OBJECTIVES

RESULTS
  THESIS DIRECTOR'S REPORT ON THE PUBLISHED ARTICLES
  RESULTS R1
  RESULTS R2
  RESULTS R3
  RESULTS R4
  RESULTS R5
  RESULTS R6

DISCUSSION
  GENOMIC EVOLUTIONARY PATTERNS IN THE PROTOZOAN/METAZOAN BORDER: GENE INNOVATION AND CO-OPTION
  MAGUKS AND THE POST-SYNAPTIC DENSITY COMPLEX ASSEMBLY
  THE EMERGENCE OF TYROSINE KINASES AND THE “WRITER-READER-ERASER” SYSTEM
  SIGNALING PATHWAYS AND BIMODAL EVOLUTION: CYTOPLASMIC CONSERVATION VERSUS EXTRACELLULAR DIVERGENCE
  TRANSCRIPTION FACTORS AND THE EMERGENCE OF GENE REGULATORY NETWORKS
  GENOMIC ARCHITECTURE AND EARLY METAZOAN EVOLUTION
  THE PROTISTAN PERSPECTIVE AND THE PRE-ADAPTED GENOME

CONCLUSIONS

BIBLIOGRAPHY

ANNEX
  ANNEX A1
  ANNEX A2

SUMMARY IN SPANISH

ACKNOWLEDGEMENTS

Figures

  FIGURE 1. TREE OF EUKARYOTES
  FIGURE 2. HISTORY OF LIFE TIMELINE
  FIGURE 3. EXPERIMENTAL EVOLUTION OF MULTICELLULARITY
  FIGURE 4. UNICELLULAR-MULTICELLULAR FUNCTIONS
  FIGURE 5. METAZOAN FOSSIL RECORD AND PHYLOGENY
  FIGURE 6. HOLOZOAN PHYLOGENY
  FIGURE 7. UNICELLULAR HOLOZOAN LIFE CYCLES
  FIGURE 8. DOMAIN-SHUFFLING
  FIGURE 9. ORTHOLOG EVOLUTION
  FIGURE 10. DOMAIN EVOLUTION
  FIGURE 11. NEURONAL NETWORK AND CO-EXPRESSION
  FIGURE 12. TYROSINE KINASE SYSTEM AND PHOSPHORYLATION
  FIGURE 13. GENOMIC HOURGLASS MODEL
  FIGURE 14. INTERGENIC REGIONS ANALYSIS
  FIGURE 15. MACROSYNTENY AND MICROSYNTENY


“A separation has arisen between enjoyment and work, means and end, effort and reward. Man, eternally chained to a small particular fragment of the whole, forms himself only as a fragment; with the monotonous noise of the wheel he turns eternally in his ear, he never develops the harmony of his being, and, instead of expressing humanity in his nature, he becomes a mere copy of his work.” Friedrich Schiller, Letters on the Aesthetic Education of Man


INTRODUCTION

Major evolutionary transitions share a common theme: lower-level units of organization assemble into higher-level entities (Maynard-Smith & Szathmáry 1995; Michod 2000). Biological systems are defined by hierarchically nested levels of organization, and the relationships between those levels help us to understand their evolutionary histories. Genes are parts of chromosomes, chromosomes of genomes, genomes of cells and cells of multicellular organisms. This Matryoshka-doll pattern is evident in the study of one of those major evolutionary transitions: the transition from single-celled organisms to multicellular animals, or metazoans. Information from multicellular organisms, cell types, genomes and genes must be combined to explain which mechanisms were involved. This is the topic of my Thesis: the origins of metazoan multicellularity from a comparative genomics perspective, aiming to connect genotypic evidence to phenotypic evolution.

Types of multicellularity, multiple independent origins and timeline

Multicellularity is a particular organization of cellular life in which different cells stick together to form a larger organism. This general definition does not account for the diversity of ways in which that organization can be reached. There are two main routes to a multicellular lifestyle: clonal division and aggregation (Bonner 2001). In clonal division, cells remain attached after dividing, forming a complex of genetically identical units. In aggregation, different cells gather to form a multicellular complex, and therefore may or may not share the same genetic material. Irrespective of how multicellularity is achieved, multicellular organisms show different levels of organization. In colonial multicellular forms all constituent cells are undifferentiated, so there is no division of labor, only a set of identical cells attached to each other. At the other end of the spectrum are complex multicellular forms. Complex multicellularity is defined by the presence of different cell types, some of which are not in direct contact with the external environment (Knoll 2011). A series of intermediate multicellular forms has also been characterized; thus, not all multicellularities are the same. Consequently, different multicellular lifestyles are scattered across the tree of life. Prokaryotes, including eubacteria and archaea, contain some multicellular groups: some are aggregative (e.g. myxobacteria) and others develop by clonal division (e.g. cyanobacteria) (Fisher et al. 2013). Eukaryotes, however, are richer in multicellular lineages than prokaryotes, which may be due to the flexibility of the eukaryotic cytoskeleton and to the bioenergetics of eukaryotes, boosted by their mitochondrial and plastid endosymbionts (Knoll 2011; Lane & Martin 2010). 
Under a loose definition, multicellularity evolved in at least 26 independent events within the eukaryotes, covering all major eukaryotic supergroups (Figure 1, Brown et al. 2012; Grosberg & Strathmann 2007). Under the more restrictive definition of complex multicellularity, there are 7 lineages with independent origins: metazoans, land plants, red algae (Bangiales and florideophytes), brown algae (phaeophytes) and fungi (some lineages within Ascomycota and Basidiomycota) (Figure 1, Niklas & Newman 2013). Among those, only plants and animals have a well-characterized embryonic development, although some phaeophytes also possess some sort of embryonic stages (Bouget et al. 1998). Overall, the independent origins show that the transition to multicellularity, simple or complex, did not happen only once in the history of life but was rather common, a “minor major transition” as stated by Grosberg & Strathmann (2007). In fact, the different transitions to complex multicellularity occurred at very disparate moments in the history of life (Figure 2). The first paleontological evidence of multicellular life is photosynthetic bacteria from about 2500 Ma (Tomitani et al. 2006). Not much later, in 2100 Ma sediments from Gabon, putative eukaryotic fossils show colonial multicellular organization, but the interpretation of those structures (even as biotic structures) remains controversial (El Albani et al. 2010; Knoll 2011). The first clear eukaryotic lineage found in the fossil record is the red alga Bangiomorpha, about 1200 Ma, which already presents cell differentiation and a simple developmental pattern (Butterfield 2000). Scarce and undetermined multicellular taxa are found in pre-Ediacaran deposits (Butterfield 2009), but major eukaryotic diversity does not appear until 800 Ma. Other fossils appear later in a more progressive way, with bona fide metazoans around 600 Ma, embryophytes around 400 Ma and fungi around 350 Ma (Knoll 2011). 
Some multicellular lineages do not have an easily interpretable fossil record, or their time of emergence is thought to be older than the current evidence indicates. Molecular clock techniques are therefore used to infer their origins. For example, macrophytic streptophyte algae are estimated to have emerged around 750 Mya and multicellular phaeophytes around 30 Mya (Knoll 2011). It is interesting to note that the first traces and the predicted origins of multicellular lineages usually appear much earlier than the later diversification events found in the rocks, suggesting that complex multicellularity needs a long time to assemble (Knoll 2011). The recurrent origins of multicellularity and their appearance at disparate times stress a major point: if multicellularity has arisen so many times, then once achieved it must have an undeniable adaptive value.

[Figure 1: tree of eukaryotes. Supergroups: Opisthokonta, Amoebozoa, Excavata, Archaeplastida, Rhizaria, Alveolata, Heterokonta and incertae sedis. Legend: aggregative; clonal; all species are multicellular; some species are multicellular.]

Figure 1. Eukaryotic Tree of Life. A consensus view of the phylogenetic relationships among eukaryotic groups. Yellow and orange dots indicate the presence of multicellular species in the group. Complex multicellular lineages are shown in bold. Backbone tree of eukaryotes courtesy of Javier del Campo.

[Figure: oxygen level (log scale) in the atmosphere and ocean chemistry (anoxic, sulphidic, oxic) against time (billions of years ago). Events marked: earliest stromatolites, Great Oxidation Event, Gabon fossils, Bangiomorpha (first multicellular eukaryotes), eukaryote diversification and first heterotrophic eukaryotes, Ediacara biota.]

Figure 3. Geochemistry and the evolution of eukaryotes. Several events in the evolution of life are depicted, including the first evidence for stromatolites, the dating of Gabon fossils of putative multicellular organisms, the first multicellular eukaryotes and the big eukaryote diversification; together with major geochemical changes in atmosphere and oceans. Modified from Donoghue and Antcliffe 2010.

Although fossils of putative macroscopic multicellular organisms have also been described from 2100 Ma rocks in Gabon (Figure 3) (El Albani et al. 2010), it is still not clear whether these fossils record true multicellularity or colonies, eukaryotes or bacteria, or even whether they are fossils or abiotic structures (Knoll 2011). The oldest unequivocal multicellular eukaryotes appeared around 1200 Ma (Figure 3): the Bangiomorpha red algae, which have differentiated holdfasts and reproductive cells (Butterfield 2009; Knoll 2011). However, the vast majority of multicellular eukaryote clades appeared much later, after the oxygen increase 800 Ma. These include macrophytic green algae (Charales, Coleochaetales, Zygnematales) 750 Ma (Becker 2012; Laurin-Lemay et al. 2012), metazoans c. 600 Ma (see below), embryophytes c. 450 Ma (Sanderson 2003), multicellular fungi c. 300 Ma (Stajich et al. 2009) and phaeophytes c. 130 Ma (Silberfeld et al. 2010) (Figure 4).

[Figure 2: timeline with marks at 4600, 3900, 2100, 1200, 800, 750, 635, 575, 542, 450, 300 and 130 Ma and the present. Events labeled: origin of Earth, origin of life, eukaryotic cell, heterotrophic eukaryotes, multicellular red algae, multicellular green algae, first evidences of Metazoa, Ediacaran fauna, Cambrian explosion, Embryophyta, multicellular fungi, multicellular brown algae.]

Figure 4. Timeline of the origins of multicellular eukaryotic clades.

Figure 2. Timeline of the history of earth and the origin of multicellular lineages. Adapted from de Mendoza et al. 2013.

Adaptive value and structural constraints of multicellularity

Cooperation between cells to form a multicellular entity dramatically changes the way an organism deals with its environment, and therefore how natural selection acts on its fitness. Evolutionary biology has long debated between adaptationist theories, which predict that any character is gained for adaptive functional reasons, and the opposing position, which invokes neutral evolution, constraints and spandrels to explain how characters can originate (Gould & Lewontin 1979; Nielsen 2009). As stated before, the recurrent emergence of multicellularity may mean that its appearance is positively selected, as it opens the door to new ecological niches and releases organisms from the functional constraints of a unicellular lifestyle (King 2004; Grosberg & Strathmann 2007). The opposite view states that multicellularity is a non-adaptive trait that emerges recurrently by neutral evolution, through small population sizes and drift, and that only in some cases became a selective advantage (Lynch 2007; Niklas & Newman 2013). All in all, some multicellular groups have radiated vastly and changed Earth's ecosystems; therefore, at some point multicellularity acquired an adaptive value (Carroll et al. 2005; Butterfield 2010). Here I review some of the adaptive hypotheses proposed so far to explain the benefits of multicellularity.

In a primitive world where most organisms were unicellular and phagocytosis was the only mechanism of predation, escaping predation may have been an important driver of multicellularity. Under the principle of “too big to be eaten”, a cluster of cells could avoid predation simply by increasing in size (King 2004). Supporting this hypothesis, the unicellular alga Chlorella vulgaris, when cultured with the predatory flagellate Ochromonas vallescia, evolved multicellularity in only 10-20 generations, thereby becoming resistant to predation (Boraas et al. 1998). Even after the predator was removed from the culture, multicellularity was retained in photosynthetic conditions. Nevertheless, an increase in size can also be achieved by multinucleated syncytia, as in slime molds or ulvophytes, or by multiple copies of the genome, as in the bacterium Thiomargarita namibiensis (Lane & Martin 2010).

In another line of thinking, the public goods theory also predicts that simple multicellularity is adaptive, as clusters of cells can use nutrients more efficiently than single cells by controlling their immediate environment and pooling their efforts. To test this, Koschwanez et al. designed an experiment in which yeast strains were grown in a medium with a very low sugar concentration (Figure 3). In theory there were three paths to overcome nutrient deprivation, some of which involved gene duplication, but all 10 yeast strains independently evolved undifferentiated multicellularity as a way to increase the efficiency of sucrose uptake (Koschwanez et al. 2013). Interestingly, when the evolved multicellular strains were grown together with the parental unicellular strain, they outcompeted their ancestors in the limiting medium. By deploying the usual properties of the cell within a cluster of cells, sucrose import was improved thanks to spatial organization.

Figure 3. Adaptive strategies for yeast to grow in low sucrose. A) Experimental design: starting from a wild-type (single-celled) population, iterative cycles of selection in low-sucrose media. B) One of the 10 resulting clonal lines that adopted multicellularity. In the image, two populations of the same evolved clonal line were independently transformed with different fluorescent protein labels and mixed again. Because each colony shows a single color rather than a mix of both, the clumps are shown to form not by aggregation but by incomplete cell division. Figures from Koschwanez et al. 2013.
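The inference drawn from the two-color labeling in panel B can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not the analysis of Koschwanez et al. 2013: the function names, the clump size of 20 cells, the number of clumps and the 1:1 mix of the two labels are all hypothetical choices made for illustration. Under clonal growth every clump inherits the label of its single founder cell, whereas under aggregation each cell is drawn independently from the mixed pool, so single-colored clumps become vanishingly rare.

```python
import random


def clonal_clumps(n_clumps, clump_size, rng):
    """Each clump grows from one founder cell, so all of its cells share the founder's label."""
    return [[rng.choice("RG")] * clump_size for _ in range(n_clumps)]


def aggregative_clumps(n_clumps, clump_size, rng):
    """Each clump is assembled from cells drawn independently from the mixed pool."""
    return [[rng.choice("RG") for _ in range(clump_size)] for _ in range(n_clumps)]


def single_color_fraction(clumps):
    """Fraction of clumps in which every cell carries the same fluorescent label."""
    return sum(len(set(clump)) == 1 for clump in clumps) / len(clumps)


# Hypothetical parameters: 1000 clumps of 20 cells, labels "R" and "G" mixed 1:1.
rng = random.Random(0)
clonal = single_color_fraction(clonal_clumps(1000, 20, rng))
aggregative = single_color_fraction(aggregative_clumps(1000, 20, rng))
print(f"single-colored clumps, clonal model: {clonal:.3f}")
print(f"single-colored clumps, aggregative model: {aggregative:.3f}")
```

In this sketch, observing only single-colored clumps is what the clonal model predicts and what the aggregative model makes extremely unlikely, which is the logic behind the interpretation of panel B.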


These are primary reasons to evolve a simple, undifferentiated multicellularity, but differentiated multicellularity has even more obvious benefits. A classic example is the structural trade-off between cilia and cell division. A single cell cannot divide while bearing cilia, because the microtubule-organizing center can support either the basal body or the mitotic spindle, but not both at the same time. Division therefore implies loss of motility. In a multicellular context some cells can be actively dividing while others beat their cilia, escaping the structural trade-off (Margulis 1981). Another classic example of how differentiation solves the structural constraints of a unicellular lifestyle comes from cyanobacteria. Cyanobacteria need low levels of oxygen to fix nitrogen; single-celled cyanobacteria therefore produce oxygen through photosynthesis by day and fix nitrogen by night. Multicellular cyanobacteria have evolved heterocysts, a differentiated cell type with impermeable cell walls that allows nitrogen fixation by day (Bonner 2001; Tomitani et al. 2006). They also have another cell type, the akinete, a resistant form produced when conditions are not suitable. Overall, differentiation allows nutrient intake, motility and reproduction to occur at the same time, increasing efficiency and opening the gates to new ecological niches.

Differentiation, however, also brings challenges that are not present in unicellular species. The differentiation of soma and germ-line is one of the basic features of most multicellular forms, and it highlights the inherent conflict among cells belonging to a multicellular entity (Michod 2000): only a few cells transmit their genes to the descendants, while the somatic cells do not. Cellular slime molds such as Dictyostelium discoideum are a classic system in which to study that conflict. D. discoideum lives as a bacterivorous single-celled amoeba in its vegetative state, but when nutrients are scarce the amoebae aggregate and form a multicellular slug. The slug migrates until it settles and develops into a fruiting body, which releases the spores from a certain height, improving the chances that the progeny will disperse (Bonner 2001). The fruiting body is composed of two cell types, the spores and the infertile stalk cells. Differentiation into one cell type or the other is normally random, but there are deceptive strains that never contribute to the stalk and only become reproductive spores. These deceptive strains cannot, in fact, form fruiting bodies when cultured in isolation (Bonner 2001; Grosberg & Strathmann 2007). As stated above, many eukaryotes form aggregative multicellular entities, but none of them has an obligate multicellular phase in its life cycle, and the structures they form are not as complex as those of some clonal developers (Fisher et al. 2013). To avoid deception, relatedness is key to sustaining aggregative multicellularity (Michod 2007; Fisher et al. 2013).

Clonal multicellularity generally overcomes the relatedness conflict because all cells in the organism share common descent by mitotic division; they therefore share the same genome, apart from some possible mutations. Most clonal multicellular organisms also have an obligate unicellular stage in the life cycle, which acts as a bottleneck that ensures genotypic homogeneity (Grosberg & Strathmann 2007). Additionally, sexual reproduction involves the fusion of two gametes, so a unicellular stage is required to benefit from it. To avoid mutational load in the cells that will found the next generation, some lineages separate the germ-line from the rest of the soma and actively protect it from transposable elements or foreign DNA, as in most metazoan phyla (Leclère et al. 2012). Protecting the germ-line from parasitic DNA is not the only specific requirement for a viable multicellular entity; several other functions are crucial to that goal.

The multicellular toolkit

To evolve multicellular forms, organisms must address a handful of needs. The basic set of molecular tools for multicellularity comprises adhesion, communication, differentiation and control of cell proliferation. Adhesion is a sine qua non of any type of multicellularity, as cells need to be attached to form a multicellular assemblage. Different eukaryotic lineages deal with that problem in different ways, as their cell architectures are very different. Among species with a cell wall, such as plants, algae and fungi, cell adhesion is relatively easy to evolve (Abedin & King 2010; Niklas & Newman 2013). Sugars and glycoproteins already coat the single-celled species; keeping cells attached is therefore only a matter of sharing the cell wall. Volvocales are a good example, as they include single-celled algae such as Chlamydomonas reinhardtii, multicellular colonies with cell differentiation such as Volvox carteri, and many intermediate states, forming a clear pattern of step-wise evolution towards multicellularity. In terms of gene gains, this transition did not involve many innovations (Prochnik et al. 2010), only the re-use of cell-wall components to enclose the cells of the colony (Abedin & King 2010). Interestingly, most cases of experimental evolution of multicellularity involve members of cell-walled lineages such as yeast or green algae (Boraas et al. 1998; Koschwanez et al. 2013; Ratcliff et al. 2012). Cells that lack a cell wall need specific proteins to bind to each other. Dictyostelids and metazoans have cells of this kind, and have evolved many types of homophilic and heterophilic proteins to deal with the problem. In D. discoideum, calcium-dependent adhesion molecules (DdCAD1) are the ones involved in adhesion. 
Metazoans, on the other hand, have a large repertoire that includes desmosomes, adherens junctions and gap junctions mediating cell-cell adhesion, and hemidesmosomes and focal adhesions that bind to the extracellular matrix (Abedin & King 2010). Not only cell architecture is important for the evolution of adhesion, but also the medium in which the organism lives. Land-living species, for example, need more rigid structures to keep cells together and to support heavy weights, as well as specialized transport structures (Bonner 2001; Knoll 2011). Finally, interaction between cells is important for the formation of multicellular patterns and structures; it is therefore not surprising that many of the components involved in cell adhesion are also important for mediating cell-cell communication. Signaling among cells is a rather common need for eukaryotes. Regardless of their lifestyle, cells have to communicate, to mate and to respond to environmental fluctuations in an organized manner (Crespi 2001).

[Figure 4: panels Adhesion, Signaling and Cell differentiation, each shown in a unicellular context and in a multicellular context.]

Figure 4. The molecular functions required for multicellularity and their putative previous roles in unicellular eukaryotes. The case of cadherins is taken from Abedin & King 2008, where cadherins seem to have a role in prey capture in choanoflagellates while they are key genes for cell adhesion in Metazoa. The second panel shows how signaling pathways can be used in different contexts, while in the third panel the role of transcription factors in unicellular cell cycles and multicellular patterning is represented by blue and white nuclei. Adapted from de Mendoza et al. 2013.

Although unicellular species also have signaling systems, coordinating all the cell types and tissues of a multicellular organism requires tighter regulation. Each multicellular lineage has its own machinery, highly dependent on the structural limitations it has to overcome. Fungi and plants, for instance, need mechanisms to bypass their cell walls, and have therefore evolved direct cell-cell cytoplasmic communication or hormones (Stajich et al. 2009; Sparks et al. 2013). Metazoans have to deal with long-range signals that must cross the extracellular matrix, as well as with close cell-cell interactions (Gerhart 1999; Pires-daSilva & Sommer 2003). Dictyostelids have to release signals into the environment to start the aggregation process (Bonner 2001). The more signals a cell receives, the more complex its transduction machinery must be to respond in different ways to the various inputs. Cell differentiation implies that only a specific subset of the genome is active, and that this subset combines to produce a specific phenotype, different from those produced by other possible combinations of the same genome. Cell differentiation depends on multiple layers of gene regulation, which link outside signaling and internal processes to the cell response. Control over gene transcription, mRNA translation and protein post-translational modification is common to both unicellular and multicellular organisms, but deploying differentiation mechanisms in a developmental program requires much more complex systems. For instance, the more cell types an organism has, the larger its repertoire of transcription factors must be (Levine & Tjian 2003). A special case of cell differentiation is the control of cell proliferation, as it regulates dividing and non-dividing cell fates. Any deregulation of cell proliferation within clonal multicellular forms could result in cancer-like diseases. 
Additionally, mechanisms controlling programmed cell death are also crucial to development and multicellularity, preventing damage from viral infection or guiding morphogenetic processes (Grosberg & Strathmann 2007). Unicellular organisms also have mechanisms of controlled cell death: individual cells self-destruct to help the community when attacked by viruses or exposed to other stresses (Ramsdale 2012). To avoid genetic conflicts between individuals and invasion by parasites, multicellular organisms have to deal with self-recognition and immunity. Multicellularity is an almost irreversible state, but a few cases of reversibility have been documented. Transmissible cancers in dogs and Tasmanian devils evade the immune response and escape proliferation control, becoming parasites of the multicellular species from which they originated (Murchison et al. 2012; Murgia et al. 2006). Finally, multicellularity also involves many other related functions, such as epigenetic memory to maintain differentiated cells, cell polarity to provide orientation within a tissue or epithelium, and transport of nutrients between different parts of the organism (Knoll 2011; Bonner 2001; Maynard-Smith & Szathmáry 1995; Grosberg & Strathmann 2007). All the functions linked to multicellular transitions had to evolve in each of the several independent origins; to understand their origins, they therefore have to be traced back to the ancestors of multicellular groups. There are two main methods to infer what those ancestors looked like: examining the fossil record and comparing extant species.

Early Metazoan evolution: phylogeny, morphology and paleontology

To figure out the nature of the ancestors of metazoans we must first have a solid phylogenetic framework in which to polarize characters and to distinguish synapomorphies from lineage-specific autapomorphies. Since the advent of molecular phylogeny, metazoans have been recovered as a monophyletic group, confirming a single origin of metazoan multicellularity. Nevertheless, the relationships between early-diverging metazoan phyla are still unclear (Edgecombe et al. 2011). Phylogenomics, the use of several genetic markers concatenated into a supermatrix, tells different stories depending on the taxa, the genes and the methods used (Philippe et al. 2011; Roure et al. 2013). Although the results are still inconclusive, here I give an overview of the current positions of the main metazoan lineages. Bilateria is the group that includes both deuterostomes and protostomes, which comprise most metazoan phyla and species diversity. Bilateral symmetry, three embryonic layers (endoderm, mesoderm and ectoderm), cephalization, a centralized nervous system and a gut with a mouth and an anus characterize the Bilateria morphologically (Nielsen 2008). All phylogenomic analyses recover bilaterians as a monophyletic group and as the crown group of metazoans (except Schierwater et al. 2009, but see Philippe et al. 2011; Roure et al. 2013). Diploblasts are the remaining, non-bilaterian metazoan phyla: cnidarians, placozoans, ctenophores and sponges. Cnidarians are the sister group to bilaterians, together forming the group known as the Eumetazoa. Cnidarians are characterized by radial symmetry (although this view has been challenged by molecular developmental data; Martindale et al. 2002), a digestive system with only one opening, a nerve net as nervous system and two embryonic layers (ectoderm and endomesoderm) (Nielsen 2008; Putnam et al. 2007). 
They constitute an undisputed monophyletic group that includes Anthozoa, Medusozoa and a very divergent class, the Myxozoa (secondarily adapted to parasitism) (Philippe et al. 2009; Pick et al. 2010; Hejnol et al. 2009; Kayal et al. 2013).

The three remaining non-bilaterian phyla have much more contentious phylogenetic positions. Placozoa is a monospecific phylum, most likely the sister group to the Eumetazoa (Pick et al. 2010; Torruella et al. 2012; Philippe et al. 2009; Nosenko et al. 2013; Srivastava et al. 2008). Its only known representative, Trichoplax adhaerens, is characterized by two layers of cells and only 5 cell types, with neither a nervous system nor musculature. It has no clear symmetry and its mode of development remains unknown to date (Eitel et al. 2011).

Ctenophores have usually been placed as the sister group to cnidarians, forming the super-phylum known as coelenterates (Philippe et al. 2009; Pick et al. 2010). However, some recent analyses have instead placed them as the earliest branching metazoan phylum (Hejnol et al. 2009; Dunn et al. 2008). The position of this group is important, because ctenophores are morphologically complex compared to placozoans and sponges; they have a nervous system, striated musculature and a closed digestive system similar to that of cnidarians. If ctenophores were the first metazoan lineage to diverge, it would mean either that sponges and placozoans are secondarily simplified, or that ctenophores reached this level of complexity convergently with the rest of metazoans. Nevertheless, a re-analysis of the multi-gene phylogenies attributed this position of ctenophores to a systematic error, long-branch attraction, produced by missing data (Philippe et al. 2011; Roure et al. 2013). Finally, sponges are the most likely earliest branching metazoan lineage.
The four sponge classes are not always recovered as a monophyletic group, but the most recent analyses, using richer taxon sampling and more complex evolutionary models, tend to discard the idea of sponge paraphyly (Pick et al. 2010; Philippe et al. 2009; Nosenko et al. 2013). Sponges have a very specific feeding cell type, the choanocyte, which traps food using a cilium and a collar of microvilli. Choanocyte chambers are connected to the exterior of the sponge by canals, in usually amorphous body plans. For some authors, the pinacoderm that covers the choanocyte chambers is homologous to the ectoderm of eumetazoans, implying that epithelia are a synapomorphy of metazoans (Adamska et al. 2011; Leys & Riesgo 2012). Sponge embryonic development is quite diverse across the classes and includes several larval types, but some authors argue that it is basically homologous to that of the rest of metazoans and that it includes a bona fide gastrulation (Maldonado 2004). Additional evidence supporting the basal position of sponges within metazoans comes from the fossil record.

Figure 5. Schematic phylogenetic tree of major metazoan phyla and their origins. Thick black lines indicate the known fossil record; the thick grey line indicates the biomarker fossil record. Molecular clock estimates usually precede the fossil record. Blue and yellow indicate the known fossils at phylum and class level, where most diversification appears after the Cambrian. Green indicates the Ediacaran genera, which expand and collapse in the late Ediacaran. Figure from Erwin et al. 2011.


The earliest metazoan traces identified to date are C30 sterol remains, 24-isopropylcholestanes, which are thought to be specific to demosponges and date to about 635 Ma (Love et al. 2009; Kodner et al. 2008). Sponge spicules are found in 554 Ma sediments (Brasier et al. 1997), while a recent study identifies a putative sponge fossil in Australian rocks of around 660 Ma (Maloof et al. 2010). Most of the other Precambrian fossils belong to a very divergent group of metazoans known as the Ediacaran fauna. They appear segmented and bilaterally symmetrical, but no clear phylogenetic assignment has been accepted to date; the most representative of them is the genus Dickinsonia. Some authors think Ediacaran taxa are stem or derived cnidarians, based on their movement traces (Menon et al. 2013). Even more dubious are the bilaterian embryos found in Doushantuo, nowadays reinterpreted as colonial protists (Yin et al. 2007; Huldtgren et al. 2011). Interestingly, not many species are observed in the Ediacaran fauna, while in the Cambrian most metazoan phyla emerge at once in what is known as the Cambrian explosion (Figure 5, Erwin et al. 2011). From then on metazoan fossils become more abundant and diverse, representing most extant phyla. As the fossil record is incomplete, molecular clocks are used to estimate the origin of metazoans. Most studies point towards 800 Ma, preceding the first fossils and supporting a long stem between the origin of metazoans and their modern radiation (Knoll 2011; Erwin et al. 2011).
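The length of the stem implied by these figures can be made explicit with simple arithmetic. The following minimal Python sketch uses only the approximate ages quoted above (800 Ma for the molecular clock estimate; 660, 635 and 554 Ma for the sponge evidence). The variable and function names are illustrative assumptions and do not come from any of the cited analyses.

```python
# Approximate ages in millions of years ago (Ma), as quoted in the text.
CLOCK_ORIGIN_MA = 800  # molecular clock estimate for the origin of metazoans

EVIDENCE_MA = {
    "putative sponge fossil": 660,
    "demosponge biomarkers": 635,
    "sponge spicules": 554,
}


def gap_to_clock(evidence, clock_ma):
    """Return, for each line of evidence, how many million years
    the clock estimate precedes it."""
    return {name: clock_ma - age for name, age in evidence.items()}


gaps = gap_to_clock(EVIDENCE_MA, CLOCK_ORIGIN_MA)
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: clock estimate is {gap} Myr older")
```

Even against the oldest putative fossil, the clock estimate is older by more than a hundred million years, which is the "long stem" referred to above.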

Metazoan genomes and the developmental toolkit

With the sequencing of the first metazoan genomes it became clear that an important part of their protein complement was conserved among metazoans, and that most of it was also largely conserved in other eukaryotes (Koonin et al. 2004). Developmental biology had already shown that developmental genes were profoundly conserved, not only in sequence but also in function. In a landmark study, the orthologous transcription factor Eyeless/PAX6 could be interchanged between mouse and fruit fly, separated by more than 500 million years of evolution, without any visible phenotypic defects (Halder et al. 1995). The disentangling of the molecular circuitry that allows morphogenesis and body plan patterning in metazoans opened the emerging field of Evo-Devo, focused on understanding how the deployment of developmental genes and differential cis-regulation explain morphological change (Carroll et al. 2005). Once the genes unique to metazoans with important developmental functions had been characterized, the next step was to look for those same genes in the newly sequenced genomes of diploblasts, organisms with theoretically simpler body plan organizations than bilaterians. Again, most of the developmental toolkit was already present, and in many cases the origin of a gene preceded the origin of the structure with which it was commonly associated (Putnam et al. 2007; Srivastava et al. 2010; Srivastava et al. 2008). For example, sponge and placozoan genomes showed that most of the genes involved in post-synaptic scaffolding and neurogenesis existed before any neuron did (Sakarya et al. 2007; Richards et al. 2008). Not only were the developmental genes present in non-bilaterians, they also had important roles in their development. Oral-aboral expression patterns of Wnt (Wingless-related) genes in the cnidarian Nematostella vectensis were very similar to those along the antero-posterior axis in bilaterians (Kusserow et al. 2005).
Similarly, Wnt and TGF-β are involved in axis determination in sponge larvae (Adamska et al. 2011). Moreover, transcription factors such as Brachyury were found to govern similar processes, such as gastrulation and blastopore formation in Cnidaria (Scholz & Technau 2003). Adhesion structures were also homologous between diploblasts and bilaterians, such as the epithelial basal lamina containing collagen IV in homoscleromorph sponges, or the adherens and septate junctions in cnidarians (Leys & Riesgo 2012; Magie & Martindale 2008). The genes specific to the pan-metazoan developmental toolkit included many genes involved in cell adhesion and cell-cell communication, and a whole repertoire of transcription factors (Rokas 2008). Thus metazoans share embryonic development and the basic molecules that govern it: a common morphogenetic logic handled by a common set of genes. To further understand the origins of that gene repertoire, the focus was put on the closest unicellular relatives of metazoans.

The protistan ancestry of Metazoans: the Holozoa

As early as the nineteenth century, James Clark proposed that the closest relatives of metazoans were a specific group of protozoans: the choanoflagellates (King 2004). The choanoflagellate feeding system, a flagellum surrounded by a collar of microvilli, is strikingly similar to that of sponge choanocytes, and ultrastructural evidence seems to support their homology (King 2004; Karpov & Leadbeater 1998). Recent molecular phylogenies clearly place choanoflagellates as the sister group to all metazoan phyla (Figure 6; Torruella et al. 2012; Shalchian-Tabrizi et al. 2008; Carr et al. 2008; Ruiz-Trillo et al. 2008), confirming previous hypotheses and discarding others, such as choanoflagellates being simplified sponges (Maldonado 2004). Choanoflagellates are a very diverse group of heterotrophic flagellates that live in many different environments. Moreover, there is considerable phenotypic plasticity within the group, not in the cell type but in covering structures such as the theca and lorica, as well as in the existence of colonial and solitary forms (Carr et al. 2008). Colony formation is widespread among choanoflagellates, and recent studies of the Salpingoeca rosetta life cycle have revealed interesting details of how colonies form (Figure 7). S. rosetta colonies are formed when the cells are cultured with a specific bacterial prey, Algoriphagus machipongonensis. In fact, the signal is mediated by a sulfonolipid secreted by the bacteria (Alegado et al. 2012). During colony development cells divide clonally and asynchronously, but remain attached by electron-dense cytoplasmic bridges (Fairclough et al. 2010; Fairclough et al. 2013). Recent genome sequencing and transcriptomic analyses have shown that gene expression profiles shift between the different stages of the S. rosetta life cycle, and suggest a role for septins in the incomplete cell division that promotes colony formation (Fairclough et al. 2013). Choanoflagellates, together with metazoans and fungi, form the eukaryotic supergroup known as Opisthokonta (Figure 6; Adl et al. 2005; Cavalier-Smith 2012). However, those lineages are just a small part of the hidden diversity that belongs to Opisthokonta. Molecular phylogenetics has placed additional protistan taxa within the Opisthokonta and, more specifically, within the Holozoa, the group that comprises all the lineages closely related to metazoans (Adl et al. 2005).
At least two other clades have been identified and characterized apart from choanoflagellates: the filastereans and the ichthyosporeans. Nevertheless, there are emerging views on a hidden diversity of uncultured environmental lineages among the Holozoa (del Campo & Ruiz-Trillo 2013). Filastereans are the sister group to metazoans and choanoflagellates, and only two genera have been described to date, Ministeria and Capsaspora (Torruella et al. 2012; Shalchian-Tabrizi et al. 2008). They are amoebae covered by long filopodia, actin-based structures that they use to feed (Shalchian-Tabrizi et al. 2008). Ministeria vibrans is a free-living protist that feeds on bacteria, while Capsaspora owczarzaki was found in the hemolymph of a snail, where it putatively feeds on Schistosoma mansoni cysts (Hertel et al. 2002; Paps & Ruiz-Trillo 2010; Shalchian-Tabrizi et al. 2008). The C. owczarzaki life cycle has been studied in culture conditions and, interestingly, it has an aggregative stage, in which cells stick to each other by secreting an extracellular matrix (Figure 7, Sebé-Pedrós et al. in preparation). The most basal branching group of holozoans are the ichthyosporeans, a rather diverse group of parasites of fishes and marine invertebrates (Glockling et al. 2013; Paps & Ruiz-Trillo 2010). Most of them share a life cycle that involves cyst formation and hypertrophic growth. During the development of the cysts there is sometimes palintomic cell division, or nuclear division within a coenocyte that later cellularizes. When mature, the colony bursts, releasing its progeny of either flagellated or amoeboid spores (Figure 7, Glockling et al. 2013; Sumathi et al. 2006). Most species are osmotrophs and can be cultured in axenic conditions. The free-living osmotrophic species Corallochytrium limacisporum, once thought to be an independent lineage of holozoans (Paps et al. 2013; Sumathi et al. 2006), seems to branch as a basal ichthyosporean lineage (Guifré Torruella, personal communication).

[Figure 6 tree labels. Clades: LECA, Bikonta, Unikonta/Amorphea, Amoebozoa, Obazoa, Breviates, Apusozoa, Opisthokonta, Holomycota, Fungi, Nucleariida, Holozoa, Ichthyosporea, Filozoa, Filasterea, Choanoflagellata, Metazoa, Porifera, Ctenophora, Placozoa, Eumetazoa, Cnidaria, Bilateria. Species: Thecamonas trahens, Fonticula alba, Nuclearia simplex, Corallochytrium limacisporum, Creolimax fragrantissima, Ministeria vibrans, Capsaspora owczarzaki, Salpingoeca rosetta, Monosiga brevicollis, Amphimedon queenslandica, Mnemiopsis leidyi, Trichoplax adhaerens, Nematostella vectensis. Image labels: Monosiga, Capsaspora, Sphaeroforma.]

Figure 6. Phylogenetic relationships of the Opisthokonta and their allies. Species with available genomic information used in this thesis are shown in italics; all the unicellular holozoan taxa are included in the UNICORN initiative or were sequenced in the lab. Groups and topology from Torruella et al. 2012 and Brown et al. 2013. Images of choanoflagellates from King et al. 2008. C. owczarzaki and Sphaeroforma arctica images courtesy of Arnau Sebé-Pedrós.


Figure 7. Life cycles of different unicellular holozoan species. Arrows indicate the direction of each life cycle transition. S. rosetta from Dayel et al. 2011, C. owczarzaki from Sebé-Pedrós et al. in preparation, and C. fragrantissima from Suga & Ruiz-Trillo 2013.

Pre-metazoan origin of the developmental toolkit

The genome sequence of the solitary choanoflagellate Monosiga brevicollis (King et al. 2008) and, more recently, that of the closely related colonial species S. rosetta (Fairclough et al. 2013) have reinforced the view that choanoflagellates are closely related to metazoans at the genomic level. Choanoflagellate genomes possess many important genes of the developmental toolkit, most notably cadherins, one of the basic adhesion molecules of metazoan adherens junctions (Nichols et al. 2012). Moreover, components of other signaling pathways were found in the choanoflagellate genomes, such as receptor tyrosine kinases and Hedgling, a protein containing the Hedgehog signaling domain (King et al. 2008; Manning et al. 2008). Nevertheless, orthology relationships to metazoan genes were not easy to trace; homology is conserved at the level of protein domains but not at the level of the whole protein domain architecture (Figure 8). For example, the protein domains that compose Notch genes were present in Monosiga brevicollis, but in separate proteins rather than combined as in the metazoan proteins (Gazave et al. 2009; King et al. 2008). Thus the process known as domain shuffling was proposed as the general evolutionary force that drove the transition from the unicellular ancestor to the metazoan last common ancestor, as well as many innovations in gene families such as transcription factors (King et al. 2008; Rokas 2008).

Figure 8. Domain-shuffling events involved in the origins of Notch and Hedgehog proteins, comparing the proteins from M. brevicollis, N. vectensis and H. sapiens. Panel a, Notch: domain labels Signal, EGF repeats, NL, NOD, TM and Ankyrin repeats; M. brevicollis proteins N1, N2 and N3; scale bar 250 amino acids. Panel b, Hedgehog: domain labels N-hh, VWA, EC and Hint; M. brevicollis proteins H1 and H2, N. vectensis Hh and Hedgling, H. sapiens Hh; scale bar 500 amino acids. From King et al. 2008.

An increasing amount of genomic data is revealing the profound impact of secondary gene loss along the tree of life (Zmasek & Godzik 2011; Wolf & Koonin 2013). This is why, in order to obtain a clear picture of gene content evolution, it is important to cover as many taxa as possible. Broadening taxon sampling around the origins of metazoan multicellularity was the main aim of the UNICORN initiative (Ruiz-Trillo et al. 2007). The sequencing of various protistan species closely related to metazoans, and also to fungi, allows a deep phylogenetic perspective on the emergence of the metazoan gene repertoire and its relationship to the fungal gene repertoire. Pilot projects using Expressed Sequence Tags (ESTs) had already provided some interesting insights into the gene content of non-choanoflagellate holozoan lineages. C. owczarzaki revealed a MAGI gene, a member of the MAGUK family involved in tight-junction formation, and fascin genes, involved in filopodia formation (Ruiz-Trillo et al. 2008). A study of M. vibrans ESTs also reported other interesting genes, such as the putative presence of integrins, which are absent from choanoflagellates, and tyrosine kinases (Shalchian-Tabrizi et al. 2008). Thus the more distant relatives of metazoans promised interesting genomic insights into the origins of multicellularity. Moreover, the ever-increasing genomic data from many previously unsampled positions of the eukaryotic tree of life help to place the holozoan genomic landscape in a broader framework.
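The distinction drawn in this section between conservation of individual protein domains and conservation of whole domain architectures can be illustrated with a small sketch. In the Python example below, a protein is represented as an ordered list of domain names. The domain names follow the labels of the Notch panel of Figure 8, but the way they are split among the three unicellular proteins is a hypothetical toy arrangement chosen for illustration, not the actual M. brevicollis annotation. The function names are likewise assumptions.

```python
from typing import Dict, List

Architecture = List[str]  # ordered domain names, N- to C-terminus

# Toy metazoan Notch-like architecture (labels as in Figure 8, panel a).
NOTCH_LIKE: Architecture = ["EGF", "NL", "NOD", "TM", "ANK"]

# HYPOTHETICAL unicellular proteome: the same domains, but spread
# over three separate proteins (illustrative only).
TOY_PROTIST: Dict[str, Architecture] = {
    "protein_1": ["EGF", "NL"],
    "protein_2": ["NOD", "TM"],
    "protein_3": ["ANK"],
}


def domains_present(target: Architecture,
                    proteome: Dict[str, Architecture]) -> bool:
    """True if every domain of the target occurs somewhere in the proteome."""
    available = {domain for arch in proteome.values() for domain in arch}
    return set(target) <= available


def architecture_present(target: Architecture,
                         proteome: Dict[str, Architecture]) -> bool:
    """True if a single protein carries the full ordered architecture."""
    return any(arch == target for arch in proteome.values())


print("domains conserved:", domains_present(NOTCH_LIKE, TOY_PROTIST))
print("architecture conserved:", architecture_present(NOTCH_LIKE, TOY_PROTIST))
```

In this toy case every domain is found but no single protein carries the complete arrangement, which is the pattern that a domain-shuffling scenario is meant to explain. A real analysis would start from domain annotations of complete proteomes and would also have to handle repeated domains and partial matches.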


In the study presented here we focus on the origins of the metazoan developmental machinery in the light of the newly available data. We have focused on the evolution of the Holozoa, mainly on the genomic content of C. owczarzaki and the ichthyosporeans, in order to obtain a clearer picture of the evolutionary mechanisms that drove the transition to multicellularity. Overall, we have found that co-option of pre-existing machinery into developmental processes is more common than expected, we reinforce the importance of domain shuffling, and we extend current knowledge of other properties characteristic of metazoans, such as interactome complexity and genomic architecture.


OBJECTIVES

The genomes of many newly sequenced taxa in key positions of the metazoan and holozoan phylogeny permit a new and detailed approach to the origins of multicellularity from a comparative genomics perspective. The main theoretical framework of my thesis has been the comparison of the genomic features of unicellular holozoans with the genomic content of metazoans. The resulting scenario of genotypic evolution allows a clearer phenotypic interpretation of that transition, thanks to the protozoan perspective on the origins of multicellularity. The specific objectives of my thesis are:

• To reconstruct the evolutionary history of the genetic toolkit involved in metazoan multicellularity and development, focusing on the evolution of several gene families involved in signaling and transcriptional regulation.

• To describe the genome content and genome architecture of the filasterean Capsaspora owczarzaki.


“The age of blastology has finished” Franz B Lang EMBO Conference of Comparative Genomics of Microbial Eukaryotes 2009


RESULTS

Thesis supervisor's report on the published articles

Supervisor: Dr. Iñaki Ruiz-Trillo

The articles that make up this doctoral thesis (6 in the results section and 2 in the annex) have been published, or are in the process of being published, in high-impact journals in the fields of evolutionary biology, genetics or biology, all of them indexed in the ISI Web of Knowledge bibliographic database. The candidate is first author of 6 of the articles (4 in the results and 2 in the annex), while in the other two he is a co-author with a prominent role in the preparation of the article. The impact factor and the position in the rankings of the respective disciplines are indicated. The articles were carried out in collaboration with other international groups and with colleagues from the laboratory. Article R5 has been included in the doctoral thesis of the other co-first author, Arnau Sebé-Pedrós, and is included in this thesis with his consent.

Article R1 De Mendoza A, Suga H, Ruiz-Trillo I. 2010. Evolution of the MAGUK protein gene family in premetazoan lineages. BMC evolutionary biology 10:93.

Impact factor (2010): 3.702. Rank within subject area (2010): Evolutionary Biology 15/45 (Q2); Genetics and Heredity 49/156 (Q2)

The candidate participated actively in the conception of the project, the discussion and the writing of the article. All the phylogenetic and protein domain composition analyses were carried out by the candidate. The analysis of the guanylate kinase family was done with the help of the second co-author, Hiroshi Suga.

Article R2


Suga H, Dacre M, de Mendoza A, Shalchian-Tabrizi K, Manning G, Ruiz-Trillo I. 2012. Genomic Survey of Premetazoans Shows Deep Conservation of Cytoplasmic Tyrosine Kinases and Multiple Radiations of Receptor Tyrosine Kinases. Science Signaling 5:ra35–ra35.

Impact factor (2012): 7.648. Rank within subject area (2012): Biochemistry & Molecular Biology 29/290 (Q1); Cell Biology 28/184 (Q1)

The candidate participated actively in obtaining and analysing the data, and in the discussion and interpretation of the results. The project was led by the first author, Hiroshi Suga, who wrote the manuscript together with the last two authors. The PCR data for Ministeria vibrans, and the complementary phylogenetic and gene architecture analyses of the tyrosine kinases of M. vibrans and C. owczarzaki, were carried out by the candidate with the help of the other co-authors.

Article R3 de Mendoza A, Sebé-Pedrós A, Ruiz-Trillo I. The evolution of the GPCR signalling system in eukaryotes: modularity, conservation and the transition to metazoan multicellularity.

In preparation.

The candidate participated actively in the conception of the project, most of the data analyses, the discussion and the writing of the article.

Article R4 Suga H, Chen Z, de Mendoza A, Sebé-Pedrós A, Brown MW, Kramer E, Carr M, Kerner P, Vervoort M, Sánchez-Pons N, Torruella G, Derelle R, Manning G, Lang FB, Russ C, Haas BJ, Roger AJ, Nusbaum C, Ruiz-Trillo I. 2013. The Capsaspora genome reveals a complex unicellular prehistory of animals. Nature Communications 4:1–9.

Impact factor (2012; 2013 not available): 10.015. Rank within subject area (2012): Multidisciplinary Sciences 3/56 (Q1)


The candidate participated actively in the conception of the project, the data analysis, and the discussion and interpretation of the data. He also contributed to the discussion of the main text and wrote parts of the supplementary material. As this was a genomic analysis, many authors took part in it, but the candidate contributed significantly to the quantitative analysis of domains, the analysis of intergenic regions, and the analysis of the gene families involved in signaling, the cell cycle, cilium formation and transcription factors.

Article R5 Sebé-Pedrós A, de Mendoza A, Lang BF, Degnan BM, Ruiz-Trillo I. 2011. Unexpected Repertoire of Metazoan Transcription Factors in the Unicellular Holozoan Capsaspora owczarzaki. Molecular biology and evolution 28:1241–1254.

Impact factor (2011): 5.510. Rank within subject area (2011): Biochemistry & Molecular Biology 49/286 (Q1); Evolutionary Biology 7/45 (Q1); Genetics & Heredity 20/156 (Q1)

The candidate participated actively in the conception of the project, of which he is co-first author. Together with the other co-first author he performed all the analyses included, and he participated in the discussion and the writing of the article.

Article R6 de Mendoza A, Sebé-Pedrós A, Šestak MS, Matejčić M, Torruella G, Ruiz-Trillo I. Transcription factor evolution in eukaryotes and the assembly of the regulatory toolkit in multicellular lineages.

In second revision at Proceedings of the National Academy of Sciences USA.

The candidate participated actively in the conception of the project and in most of the analyses, and is co-first author. Together with the other co-first author he carried out the genomic analyses of transcription factor content in eukaryotic genomes and the description of their evolutionary patterns. He participated in the interpretation of the results of the phylostratigraphic analyses and of the gene expression data from model organisms. He also participated actively in the writing of the article.

Annex:

Article A1 de Mendoza A, Ruiz-Trillo I. 2011. The mysterious evolutionary origin for the GNE gene and the root of Bilateria. Molecular Biology and Evolution 28:2987–2991.

Impact factor (2011): 5.510. Rank within subject area (2011): Biochemistry & Molecular Biology 49/286 (Q1); Evolutionary Biology 7/45 (Q1); Genetics & Heredity 20/156 (Q1)

The candidate participated actively in the conception of the project, performed all the analyses and wrote the article.

Article A2 Martín-Durán JM, de Mendoza A, Sebé-Pedrós A, Ruiz-Trillo I, Hejnol A. 2013. A broad genomic survey reveals multiple origins and frequent losses in the evolution of respiratory hemerythrins and hemocyanins. Genome biology and evolution 5:1435–1442.

Impact factor (2012; 2013 not available): 4.759. Rank within subject area (2012): Evolutionary Biology 10/47 (Q1); Genetics & Heredity 27/161 (Q1)

The candidate participated actively in the conception of the project, of which he is co-first author. Together with the other co-first authors, he performed all the analyses included: the analysis of the hemerythrin and arthropod-type hemocyanin gene families. He also participated in the discussion and the writing of the article.

Signed:


Results R1

Summary of article R1: Evolution of the MAGUK gene family in premetazoan lineages

Intercellular communication is a key process in multicellular organisms. In animals, proteins of the membrane-associated guanylate kinase (MAGUK) family are involved in the assembly of cell junctions and signaling structures. MAGUKs were believed to be exclusive to metazoans; however, a member of the MAGUK family was identified in Capsaspora owczarzaki, a unicellular organism closely related to metazoans. A broader search for this gene family shows that MAGUK proteins are not present only in metazoans. Four different types of MAGUK are found in the genomes of the unicellular holozoans Monosiga brevicollis and Capsaspora owczarzaki: DLG, MPP, CACNB and MAGI. In addition, M. brevicollis has undergone a diversification unique to its lineage. The MAGUK repertoire of placozoans and the remaining eumetazoans is very similar, and more complex than that of their unicellular relatives, whereas sponges have a simpler MAGUK repertoire. Finally, vertebrates have undergone several independent duplications and have two exclusive MAGUK families. The diversification of MAGUK proteins took place before the origin of multicellularity, when they were recruited for new functions and diversified again through duplication and the acquisition of new protein domains, giving rise to new members that adopted functions involved in cell adhesion.


de Mendoza et al. BMC Evolutionary Biology 2010, 10:93 http://www.biomedcentral.com/1471-2148/10/93

RESEARCH ARTICLE Open Access

Evolution of the MAGUK protein gene family in premetazoan lineages

Alex de Mendoza1, Hiroshi Suga1, Iñaki Ruiz-Trillo1,2*

Abstract

Background: Cell-to-cell communication is a key process in multicellular organisms. In multicellular animals, scaffolding proteins belonging to the family of membrane-associated guanylate kinases (MAGUK) are involved in the regulation and formation of cell junctions. These MAGUK proteins were believed to be exclusive to Metazoa. However, a MAGUK gene was recently identified in an EST survey of Capsaspora owczarzaki, a unicellular organism that branches off near the metazoan clade. To further investigate the evolutionary history of MAGUK, we have undertaken a broader search for this gene family using available genomic sequences of different taxa.

Results: Our survey and phylogenetic analyses show that MAGUK proteins are present not only in Metazoa, but also in the choanoflagellate Monosiga brevicollis and in the protist Capsaspora owczarzaki. However, MAGUKs are absent from fungi, amoebozoans or any other eukaryote. The repertoire of MAGUKs in Placozoa and eumetazoan taxa (Cnidaria + Bilateria) is quite similar, except for one class that is missing in Trichoplax, while Porifera have a simpler MAGUK repertoire. However, Vertebrata have undergone several independent duplications and exhibit two exclusive MAGUK classes. Three different MAGUK types are found in both M. brevicollis and C. owczarzaki: DLG, MPP and MAGI. Furthermore, M. brevicollis has undergone a lineage-specific diversification.

Conclusions: The diversification of the MAGUK protein gene family occurred, most probably, prior to the divergence between Metazoa+choanoflagellates and the Capsaspora+Ministeria clade. MAGI-like, DLG-like and MPP-like ancestral genes were already present in the unicellular ancestor of Metazoa, and new gene members have been incorporated through metazoan evolution within two major periods, one before the sponge-eumetazoan split and another within the vertebrate lineage. Moreover, choanoflagellates have undergone an independent MAGUK diversification.
This study highlights the importance of generating enough genome data from the broadest possible taxonomic sampling, in order to fully understand the evolutionary history of major protein gene families.

* Correspondence: [email protected]
1Departament de Genètica & Institut de Recerca en Biodiversitat (Irbio), Universitat de Barcelona, Barcelona, Spain

© 2010 de Mendoza et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

The emergence of multicellular animals from their protist ancestors brought evolutionary novelties together with some significant genetic challenges. For example, it is believed that the genes involved in cell-cell communication, cell adhesion and cell differentiation probably arose before or concomitantly with the origins of multicellularity [1]. One of the protein families involved in cell-to-cell communication in Metazoa is the family of scaffolding proteins known as membrane-associated guanylate kinases (MAGUKs), which organize protein complexes at cell or synaptic junctions (for a review see [2]). The MAGUKs have a wide variety of biological roles, such as regulating cell polarity [3], connecting transmembrane proteins (or actin filaments) with the cytoskeleton in tight junctions [4-6], and regulating synapse formation and plasticity [7-9]. Therefore, MAGUKs are of critical importance to the development of multicellular animals.

The MAGUK family has been divided into different classes or groups, according to phylogenetic position and protein domain architecture (see for example [2] and [10], and Figure 1 for our own MAGUK classification). These MAGUK classes are known as calcium/calmodulin-dependent protein kinase (CASK), palmitoylated membrane protein (MPP), zona occludens (ZO), caspase recruitment domain family (CARMA), Disc Large Homolog (DLG), Calcium channel β subunit (CACNB), and membrane-associated guanylate kinase with an inverted repeat (MAGI). All classes contain the following: one or several PDZ domains (except CACNB), a catalytically inactive guanylate kinase (GUK) domain with homology to yeast guanylate kinase and a Src Homology 3 (SH3) domain (except for MAGI) (Figure 1). Members of the MAGI class, on the other hand, have two WW (conserved Trp residue) domains instead of the SH3 domain. These WW domains are situated downstream of the GUK domain (unlike all other MAGUKs). All of these modular motifs in MAGUK mediate protein-protein interactions.

Figure 1 Domain architectures of the different MAGUK classes. Only the canonical form is shown for each class.

The MAGUK protein gene family had been considered to be exclusive to Metazoa [10,11] and, hence, a key gene family for determining metazoan origins.

However, a recent EST survey showed that a homolog of MAGI is present in the protist C. owczarzaki [12], which seems to be the sister-group to the Choanoflagellata+Metazoa clade. This finding led to new questions, such as whether other MAGUKs were already present in the ancestor of Metazoa, the time of divergence of this gene family, and whether choanoflagellates also had MAGUK homologs. To answer these questions we have undertaken a taxon-wide search of the MAGUK family in eukaryotes. Our search included the complete genome sequence of the choanoflagellate M. brevicollis and the genome trace sequence data of C. owczarzaki, from which we have completed the full gene annotation by RACE PCR. Our data reveal that the MAGUK protein gene family already diverged in premetazoan lineages.

Results
Types of MAGUKs
Previous genomic comparisons have used different names to classify the distinct MAGUK classes [2,10]. Our data reveal that there are at least ten different types of MAGUKs. In the interest of clarity we have classified the MAGUKs into 10 classes: MAGI, CACNB, ZO, CARMA, DLG5, DLG1-4, MPP1, MPP2-7, MPP5, and CASK. Those MAGUK classes with their corresponding protein domain architecture are shown in Figure 1. Although the protein domain architectures of MAGUKs are well conserved among the taxa within each class, some proteins have lost some of their domains, or their sequences are highly divergent. Figure 1 shows the canonical protein domain architecture; additional details and particularities are shown in Additional file 1.

Phylogenetic analyses of the Guanylate Kinase domain
Broad phylogenetic sampling of the guanylate kinase (GUK) domain was performed in order to check the monophyly of MAGUK within the larger GUK superfamily. Although MAGI and CACNB homologs are considered to be members of MAGUK, their GUK domains are very divergent and the alignment quality decreases considerably when both of them are included. Thus, phylogenetic trees have been inferred either with MAGI or with CACNB representatives. The Maximum Likelihood (ML) phylogenetic tree that includes CACNB homologs is shown in Figure 2 (see Additional file 2 for the complete tree), while the analysis including MAGI is

Figure 2 Unrooted phylogenetic tree of GUK domain sequences. The topology and branch lengths were obtained by maximum likelihood analysis performed in RAxML. The length of the schematic triangles has been calculated to represent the average branch length of the tree. Statistical support, obtained from 500 bootstrap replicates in RAxML and from Bayesian posterior probabilities, is shown for the main clades. The tree with all the taxa is shown in Additional file 2.

shown in Additional file 3. Statistical support for this last tree is very low, as it has a truncated GUK domain that leaves very few amino acid positions for phylogenetic analysis. In any case, both topologies suggest that the “core MAGUKs” group (which comprises all MAGUK classes except for the MAGI and the CACNB classes) is monophyletic (75% ML bootstrap support). The trees also suggest that either the CACNB or the MAGI group is the sister-group of the “core MAGUKs”. Which of those classes, MAGI or CACNB, is more closely related to the “core MAGUK” group remains elusive. In fact, whether any of them should really be considered MAGUKs is unclear, even though MAGI and CACNB share the PDZ domain and the SH3 domain, respectively, with the rest of MAGUKs (see Figure 1). Their domain architecture may well be a product of convergence.

Survey of MAGUKs and phylogenetic analyses
Our survey and protein domain analysis of MAGUKs show that they are only present in Metazoa, choanoflagellates and the protist C. owczarzaki. We did not find any MAGUKs in any of the available fungal, amoebozoan, and other eukaryotic genomes. Eumetazoan (i.e. Cnidaria and Bilateria) taxa have at least one homolog representative of seven of the ten main classes of MAGUK, namely MAGI, CACNB, DLG1-4, ZO, MPP2-7, MPP5, and CASK (see Figure 3). The CARMA and MPP1 groups appear to be exclusive to Vertebrata. The placozoan Trichoplax adhaerens has a MAGUK repertoire similar to that of Cnidaria: it only lacks the MPP5 class and has instead a DLG5 homolog, which is actually missing in Cnidaria. However, the poriferan Amphimedon queenslandica lacks homologs for MPP5, ZO, and CASK, and some of its proteins branch in unclear positions within the tree.

An ML tree was inferred from the SH3 + GUK domains of the “core MAGUK” proteins. The MAGI and the CACNB classes were not included in the analysis. MAGI were excluded as they do not have the SH3 domain and their GUK domain is truncated. Similarly, although CACNB have SH3 and GUK, their sequences are very divergent and leave few amino acid positions for phylogenetic analyses. We searched, among other eukaryotes, the genome sequences of M. brevicollis, C. owczarzaki, the cnidarian Nematostella vectensis, the poriferan A. queenslandica, the placozoan T. adhaerens, plus a few representative bilaterians. When homologs from taxa with a relevant phylogenetic position, such as Hydra magnipapillata, were found, they were also included in the phylogenetic analysis.

The topology of the ML tree shows two statistically supported main clades (see Figure 3). One clade, which we name the “DLG super class”, consists of all DLGs plus ZO and CARMA. The other, which we name the “MPP super class”, comprises all MPPs and CASK. The different MAGUK classes are supported in our phylogenetic tree. Thus, ZO, CARMA, DLG5, DLG1-4, MPP5, MPP1, and CASK all appear as monophyletic groups, with varying bootstrap support. MPP2-7 is the only group that appears as paraphyletic.

Within the DLG super class, ZO and CARMA group together and they branch as a sister-group to DLG5. An A. queenslandica protein with a domain architecture that is typical of DLG5 (it has the CARD domain, see Additional file 1) appears as the sister-group to the DLG5+ZO+CARMA clade. The DLGs of both C. owczarzaki and M. brevicollis group together as the sister-group to the whole DLG super class. Within the MPP super class, MPP1 groups within CASK and MPP5 forms a clear clade, while the remaining MPP genes do not appear to be monophyletic. Capsaspora-MPP branches as the sister-group of the entire MPP super class, whereas Monosiga-MPP appears within the metazoan MPP group.

Finally, three putative MAGUK-like genes from M. brevicollis branch in an intermediate position between the DLG and the MPP super classes.

To check the possibility that one or several lateral gene transfer (LGT) events may have occurred from metazoans to choanoflagellates and C. owczarzaki, we performed a neighbour-net analysis to see whether alternative trees may hint at a potential LGT event. The analysis clearly shows that the homologs of C. owczarzaki and M. brevicollis do not come from LGT from metazoans, since they group outside the major metazoan MAGUK types rather than within them, as we would expect if LGT between metazoans and those protists had taken place (Additional file 4). Instead their homologs branch deep into the root of the tree.

Protein domain architecture of MAGUKs in premetazoan taxa
Several MAGUKs are present in some non-metazoan taxa. Both the protist C. owczarzaki and the choanoflagellate M. brevicollis present putative MAGI, DLG and MPP homologs. The DLG and MPP proteins of C. owczarzaki and M. brevicollis cluster basal to the DLG and MPP super classes respectively (except for the Monosiga-MPP, see Figure 3; Additional file 5 contains the taxa data). M. brevicollis has three additional MAGUK-like proteins, which branch in intermediate positions between the DLG and MPP super classes. The protein domain organization of the M. brevicollis and C. owczarzaki MAGUKs is shown in Figure 4. The putative Monosiga-DLG, which branches as a sister-group to the Capsaspora-DLG, has the canonical DLG protein domain architecture, with an L27 domain, the three

Figure 3 Phylogeny of SH3+GUK domain sequences. The topology and branch lengths were obtained by maximum likelihood analysis performed in RAxML. The tree is rooted in the branch between the MPP and the DLG+DLG-like clades. Statistical support, obtained from 100 bootstrap replicates in RAxML, 100 bootstrap replicates in PhyML, and Bayesian posterior probabilities, is shown for the main clades. The canonical domain architecture is shown for each MAGUK class.

PDZ domains and an SH3 and a GUK domain. On the other hand, Capsaspora-DLG lacks the N-terminal L27 domain and the first two PDZ domains (Figure 4). With regard to the MPP homologs, both M. brevicollis and C. owczarzaki present the standard MPP protein domain architecture: two L27 domains, one PDZ, and the SH3 and GUK domains. We found that the Monosiga-MPP has an additional C4 Zinc Finger domain (C3HC4) at the C-terminal end. The C3HC4 domain is a RING finger that plays a key role in the ubiquitination pathway of Metazoa. Interestingly, the DLG1 of vertebrates, which is also known as PSD-95 [2], also has an N-terminal PEST domain between L27 and the first PDZ domain, which is also involved in polyubiquitination [13].
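The comparison between the canonical DLG architecture and the Capsaspora-DLG can be made explicit with a small sketch. The Python below is illustrative only: the domain lists are transcribed by hand from the description in the text (they are not parsed from PFAM or SMART output), and `missing_domains` is a hypothetical helper name, not part of any tool used in this study.

```python
from collections import Counter

# Domain architectures, N- to C-terminus, transcribed from the text.
# Illustrative data only; not parsed from PFAM/SMART output.
CANONICAL_DLG = ["L27", "PDZ", "PDZ", "PDZ", "SH3", "GUK"]
CAPSASPORA_DLG = ["PDZ", "SH3", "GUK"]  # lacks L27 and the first two PDZ domains


def missing_domains(reference, query):
    """Return domains (with copy numbers) present in reference but absent from query."""
    # Counter subtraction keeps only positive counts, i.e. copies the query lacks.
    diff = Counter(reference) - Counter(query)
    return dict(diff)


print(missing_domains(CANONICAL_DLG, CAPSASPORA_DLG))
# → {'L27': 1, 'PDZ': 2}
```

Counting copies rather than comparing sets matters here, because the two proteins differ in the number of PDZ domains and not only in which domain types are present. The sketch ignores domain order, so it cannot tell which PDZ copies are missing.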

Figure 4 Protein domain architectures for the Capsaspora and Monosiga MAGUK and MAGUK-like homologs, except for MAGI. E-values retrieved from PFAM are shown below the protein domains of the canonical MAGUK homologs.

The Capsaspora-MAGI has the canonical protein domain architecture of the MAGI class at the N-terminal end, with the PDZ, the GUK and the two WW domains. However, it lacks the PDZ domains at the C-terminal end, as previously shown in [12]. Two putative MAGI homologs from M. brevicollis (Monosiga-MAGI-like) were identified in the GUK phylogenetic analysis (Additional file 3). They present diverging protein domain architectures, lacking the two WW domains seen in the typical MAGI proteins. Moreover, we have also identified three MAGUK-like proteins in M. brevicollis with unique protein domain architectures, in which the core SH3 and GUK domains are flanked by several consecutive PDZ domains at both the N- and C-terminal ends (see Figure 4).

Discussion
Reconstruction of MAGUK diversity in the metazoan ancestor
Our survey of MAGUK proteins shows that three canonical MAGUKs are present in both the protist C. owczarzaki and the choanoflagellate M. brevicollis (Figure 5). Both organisms have homologs of the metazoan DLG and MPP super classes. Additionally, C. owczarzaki has a MAGI homolog, as previously stated [12], whereas M. brevicollis has two putative MAGI-like homologs, although with divergent and unique protein domain architectures (Additional file 1). Since C. owczarzaki is most likely the sister-group of choanoflagellates and Metazoa [12,14-16], our results suggest that the common ancestor of Metazoa, C. owczarzaki, and

choanoflagellates already had three types of MAGUK: a DLG-like, an MPP-like and a MAGI-like protein (Figure 5). A canonical MAGI was, thus, either lost or drastically diverged in the choanoflagellate lineage. Alternatively, the Capsaspora-MAGI may represent an independent acquisition of the PDZ-GUK-WW-WW domain architecture. Additional genomic data from other choanoflagellates will be needed to draw definitive conclusions. Although in theory one cannot rule out the possibility that one or several lateral gene transfer (LGT) events may have occurred from metazoans to choanoflagellates and C. owczarzaki, we favour the hypothesis that the MAGUK protein family appeared prior to the divergence between C. owczarzaki and choanoflagellates. In fact, the neighbour-net analysis shows that the homologs of C. owczarzaki and M. brevicollis do not show the clear relationship with any of the metazoan MAGUK types that we would expect if LGT between metazoans and those protists had taken place (see Additional file 4).

Interestingly, we have also identified three unique MAGUK-like proteins in M. brevicollis. Two of them present the canonical PDZ-SH3-GUK domains of MAGUK, but with several consecutive PDZ domains at both the N-terminal and C-terminal ends. One of those M. brevicollis MAGUK-like homologs has several GUK duplications, and the third one has an additional protein domain: a phosphatase kinase (see Figure 4). All these MAGUK-like proteins constitute novel domain architectures that had not been found in any other organism up to now. Novel protein domain arrangements that include several consecutive copies of a domain have already been found in M. brevicollis [17]. Whether these MAGUK-like protein domain arrangements are also present in other choanoflagellates or in other opisthokont protists remains unclear. In any case, our data show that choanoflagellates, or at least M. brevicollis, underwent an independent lineage-specific diversification of the MAGUK protein gene family. Functional analysis of these genes may clarify what roles the different genes are playing in M. brevicollis.

In fact, the role of MAGUK proteins in these premetazoan taxa remains an open question. In the sponge A. queenslandica both the DLG and MAGI homologs are specifically expressed only in epithelia-like tissue [18]. Moreover, it has been shown that A. queenslandica has and expresses several other components of the synaptic junction [18]. Interestingly, those components are expressed in the flask cells and the epidermis. It is therefore possible that the interaction between MAGUKs and the complex protein scaffolds found in eumetazoan post-synaptic scaffolds or in tight junctions was already present in sponges. Which components the M. brevicollis or C. owczarzaki MAGUKs interact with is unknown. Answering this is beyond the scope of this manuscript. Such information, however, may be crucial to decipher the role of MAGUKs in unicellular taxa. Moreover, additional data from colonial choanoflagellates or ichthyosporeans should yield important insights into this question, since they may harbor some scaffolding assemblages that enable the different cells in the colony to communicate.

Evolutionary history of MAGUK
Our data show that although the MAGUK family most probably originated within choanozoans (i.e. unicellular

Figure 5 Distribution of MAGUKs in . Black circles mean presence, whereas white circles mean that the homology assignment is not clear. The DLG and MPP columns indicate presence within that major clade (the super class) but without clear assignment to any of the inclusive classes (see text for further explanations).

lineages that are most closely related to Metazoa), it clearly expanded within metazoans. In fact, it seems that the expansion of MAGUKs in metazoans occurred in at least two distinct periods in the course of evolution, as observed in many other protein families [19]. The first expansion occurred very early in metazoan evolution, before the divergence between sponges (or placozoans) and the rest of metazoans. The second expansion is likely to have happened in the early evolution of vertebrates. Accordingly, both Porifera and Placozoa have several extra MAGUK classes, and Vertebrata has two exclusive classes, CARMA and MPP1, that were most probably generated by independent duplications in the vertebrate lineage.

The analysis of the protein domain organization mapped onto our phylogenetic analysis sheds some light on the evolutionary history of MAGUKs. For example, a duplication of a protein kinase domain followed by its recruitment in the N-terminal site of one MPP protein brought about a new type of MAGUK, the CASK class. This event probably took place either in the common ancestor of Placozoa and Eumetazoa, or at the origin of Metazoa followed by the loss of this MAGUK type in Porifera (Figure 5). Moreover, the vertebrate-specific MPP1 class seems to have derived from CASK, most probably by an additional duplication of the CASK homolog in vertebrates followed by the loss of the protein kinase domain and the two L27 domains. Vertebrates have the largest repertoire of MAGUKs. Not only do they have the vertebrate-specific MAGUK classes such as CARMA and MPP1, but they also have several genes that appear to have diverged in early vertebrate evolution (e.g. in the DLG1-4, MAGI and ZO classes, vertebrates have three or four such genes). Although the sponge A. queenslandica has a MAGUK with the typical CARMA domain architecture (a CARD domain, a PDZ domain, an SH3 domain and a GUK domain), it branches as the sister-group to the entire DLG5, CARMA and ZO group. We hypothesize that this A. queenslandica gene represents the ancestral form of this whole group. In fact, DLG5 and ZO are already present both in the placozoan T. adhaerens and in all other eumetazoans, except for the cnidarian N. vectensis, which does not have a DLG5 homolog. The class CASK is also present both in T. adhaerens and in other eumetazoans. This indicates that CASK, DLG5 and ZO arose prior to the divergence between placozoans and the rest of eumetazoans.

It is worth mentioning that the interpretation of these results needs additional data, since the phylogenetic position of the basal metazoans is still under debate (see [20-24]). If the placozoan T. adhaerens is indeed the sister-group of cnidarians and Bilateria [20,24], then CASK, DLG5, and ZO appeared in the common ancestor of both placozoans and eumetazoans, and the DLG5 homolog was lost in the cnidarian lineage. However, if placozoans are indeed the most basal metazoans or sister-group to sponges [21,22], then the lineage leading to sponges (or A. queenslandica) lost its ZO, CASK and DLG5 representatives. Finally, even though the topology of the MAGUK tree remains the same regardless of the method used, some of the nodes of our tree do not have high bootstrap values. Thus, some of the implications regarding the evolutionary history of MAGUK within metazoans may need to be revisited with additional data.

Conclusions
In this study we have identified several MAGUK and MAGUK-like homologs in premetazoan taxa. Overall, our data show that the MAGUK protein gene family is not exclusive to Metazoa. This gene family most probably diversified within the opisthokonts before the divergence between Capsaspora and the Choanoflagellata + Metazoa clade. Thus, DLG, MPP and MAGI were already present in the last common ancestor of the Metazoa, choanoflagellates and Capsaspora+Ministeria (Filasterea) clade. The family further diversified within metazoans, most probably in two major episodes (early metazoan and early vertebrate evolution), and up to ten different MAGUK types evolved with different roles. The choanoflagellate M. brevicollis has, independently of Metazoa, undergone a lineage-specific diversification of MAGUK-like proteins that have unique domain architectures. This represents a diversification independent from the one that occurred in Metazoa. Additional genomic data from other choanoflagellate taxa will help elucidate whether this diversification is specific to choanoflagellates or just to the M. brevicollis species. Moreover, M. brevicollis has different and unique MAGI-like homologs, not found in Metazoa. Functional analyses will be needed to better understand the roles of MAGUKs in the unicellular relatives of Metazoa.

Methods
Database searching
All potential MAGUK sequences were obtained by performing BLAST searches (blastp, tblastn and psi-blast) against the Protein, Genome, and EST databases at the NCBI (National Center for Biotechnology Information) and against the databases of completed or on-going genome projects at the JGI (Joint Genome Institute) and the Broad Institute. The amino acid sequences of the Homo sapiens orthologs were used as queries. The sequences retrieved that were not annotated as MAGUK were then searched against the NCBI CDD (Conserved Domain Database). Only those that retrieved a MAGUK protein domain architecture (a PDZ, SH3, GUK) were considered positives. Complete protein domain architectures

were inferred by searching the PFAM and SMART databases with the complete sequences. Moreover, the putative positives were individually incorporated into a basic alignment of annotated sequences. Only those sequences that could unambiguously be aligned were used in the phylogenetic analysis.

In order to study the early evolution of the guanylate kinase domain, we performed additional phylogenetic analyses using sequences from a broad range of taxa including the Eubacteria. The same databases as above, together with the GenBank amino acid database, were searched with the HMMER 3.0a2 program. In this case, the genes that did not show the typical MAGUK domain architecture were also included in the analyses. As the analyses were focused on the divergence between distinct subfamilies of the guanylate kinase family, the genes comprising each subfamily were properly removed in order to reduce the complexity of the tree.

Phylogenetic analyses
Alignments of the SH3 + GUK domains were constructed using the MUSCLE [25] plug-in of the Geneious software (Biomatters Ltd, Auckland, New Zealand), before they were manually inspected and edited. Only those positions that were unambiguously aligned were included in the final analysis, which resulted in a total of 272 amino acid positions. The final protein alignments can be downloaded from the webpage http://www.multicellgenome.com.

Maximum likelihood phylogenetic trees were estimated by RAxML [26] using a PROTGAMMAWAG model of evolution and with a gamma distribution (8 categories) (WAG+Γ). Statistical support was obtained from 100 bootstrap replicates using the PhyML program [27] following an LG+Γ+I model of evolution [28] with 4 rate categories, and from 100 bootstrap replicates in RAxML, using the PROTGAMMAWAG model of evolution and with a gamma distribution (4 categories). Bayesian trees were estimated with the MrBayes plug-in available in the Geneious software. We ran four different Markov chain Monte Carlo chains using a WAG+Γ model of evolution with 4 rate categories. A total of 1.1 million generations were calculated, with trees sampled every 200 generations and with a burn-in of 110,000.

The alignment of the guanylate kinase domain sequences alone from diverse taxa, including the Eubacteria, was generated manually, as the sequences are divergent. The phylogenetic tree was inferred by RAxML with 500 bootstrap replicates. We inferred two phylogenetic trees with different repertoires of subfamilies, as the inclusion of both the MAGI and CACNB subfamilies in a single dataset results in a short alignment that is insufficient for reliable tree inference. Both alignments can be downloaded from our webpage.

Amplification of Capsaspora MAGUKs and annotation of Monosiga MAGUKs
Capsaspora owczarzaki data were obtained from genome sequence scaffolds from NCBI, and the proteins were predicted both with AUGUSTUS [29] and GENSCAN [30]. Incomplete predictions were checked by rapid amplification of cDNA ends (RACE). Purified polyA+ messenger RNA and cDNA from C. owczarzaki were obtained as in [31]. The full sequences of the 5’ and 3’ ends of C. owczarzaki DLG and MPP were obtained by RACE, using a nested PCR and with primers designed from the original genome trace data. Primers used were as follows:
CaDLG-3’RACE-1: 5’ GACATGAAGGAGGAAAAGTTTATGG 3’
CaDLG-3’RACE-2: 5’ ACAAAGGTGTTTGACACTGCAC 3’
CaDLG-5’RACE-1: 5’ GTCGTTGACCTTCAAAATTTCATC 3’
CaDLG-5’RACE-2: 5’ GAATCTTTGACACGTAAATGTTGG 3’
CaMPP-3’RACE-1: 5’ AAGCTCCAAGAAGTCGTCAAGG 3’
CaMPP-3’RACE-2: 5’ CAACGTTGGCTTCTTCTTTGAC 3’
CaMPP-5’RACE-1: 5’ GTCTCGGTCAAAAAGTCCTTGAC 3’
CaMPP-5’RACE-2: 5’ GTAAAGTCCTTGTTGGCGATCTT 3’
Sequences were obtained and analyzed as in [31]. These two sequences have been deposited at GenBank under the accession numbers GQ290472-GQ290473.

Monosiga brevicollis data were obtained from the predicted protein database. They were also confirmed by the AUGUSTUS and GENSCAN predictions from genome sequence scaffolds downloaded from JGI.

Checking the network-like structure of MAGUK protein gene family
In order to check whether the ML tree follows a network-like or a tree-like structure, SplitsTree4 [32] was run to construct a neighbour-net using the alignment that includes the SH3 + GUK domains.

Additional file 1: Schematic MAGUK domain organization among Holozoa. Domain structures of the key taxa are shown here to reveal the differential conservation in this gene family.
Additional file 2: Unrooted phylogenetic tree of the GUK domain sequences, including the calcium channel β subunit. The topology and branch lengths were obtained by maximum likelihood analysis performed in RAxML. Statistical support was obtained from 500 bootstrap replicates in RAxML.
Additional file 3: Unrooted phylogenetic tree of the GUK domain sequences, including MAGI. The topology and branch lengths were obtained by maximum likelihood analysis performed in RAxML. Statistical support was obtained from 500 bootstrap replicates in RAxML.
Additional file 4: A neighbour-net of MAGUKs. A neighbour-net constructed from the MAGUK alignment that includes the SH3 + GUK domains. Major groupings and the homologs of C. owczarzaki and M. brevicollis are indicated.
Additional file 5: Taxa used in the phylogenetic analysis of Figure 3.
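The domain-based filter described under “Database searching” (a retrieved sequence counts as positive only if it shows a PDZ, an SH3 and a GUK domain) can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the hit identifiers and their domain lists are hypothetical, and in practice the domain annotations would come from a CDD, PFAM or SMART search.

```python
# Domains that, per the Methods, define a MAGUK protein domain architecture.
REQUIRED_DOMAINS = frozenset({"PDZ", "SH3", "GUK"})


def is_maguk_positive(domains):
    """True if the hit carries at least one PDZ, one SH3 and one GUK domain."""
    return REQUIRED_DOMAINS.issubset(domains)


def filter_hits(hits):
    """Return the identifiers of hits whose domains satisfy the MAGUK criterion."""
    return [hit_id for hit_id, domains in hits.items() if is_maguk_positive(domains)]


# Hypothetical hits: identifiers and domain annotations are made up for illustration.
example_hits = {
    "hit_A": ["L27", "PDZ", "PDZ", "PDZ", "SH3", "GUK"],  # DLG-like: kept
    "hit_B": ["GUK"],                                      # lone guanylate kinase: dropped
    "hit_C": ["PDZ", "GUK", "WW", "WW"],                   # MAGI-like, no SH3: dropped
}

print(filter_hits(example_hits))
# → ['hit_A']
```

A strict PDZ+SH3+GUK criterion like this one would miss MAGI-like proteins, which lack the SH3 domain. This seems consistent with the text treating MAGI and MAGI-like homologs through the separate GUK-domain analyses, although the exact handling is not spelled out in the Methods.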

Acknowledgements
Preliminary sequence data were obtained from the Department of Energy Joint Genome Institute, the Broad Institute, and NCBI websites. Genome sequencing of Capsaspora owczarzaki is being undertaken at the Broad Institute under the auspices of the National Human Genome Research Institute (NHGRI) and within the UNICORN initiative. We thank both the JGI and the Broad Institute for making data publicly available. We thank Bernard Degnan for access to the Amphimedon proteome data. We also thank Andrew J. Roger, Arnau Sebé, Jordi Paps, and Romain Derelle for helpful insights. This work was supported by an ICREA contract, an ERC Starting Grant (206883), and a grant (BFU2008-02839/BMC) from MICINN to IR-T. AdM’s salary was supported by a pre-graduate grant from MICINN, and HS’s salary by an ERC Starting Grant to IR-T and by a Marie Curie Intra European Fellowship to HS.

Author details
1Departament de Genètica & Institut de Recerca en Biodiversitat (Irbio), Universitat de Barcelona, Barcelona, Spain. 2Institució Catalana per a la Recerca i Estudis Avançats (ICREA); Parc Científic de Barcelona, Baldiri Reixach 15, 08028 Barcelona, Spain.

Authors’ contributions
AdM led on data collection, analysis, and data interpretation, and contributed to the writing. HS oversaw data collection and contributed to analysis and the writing. IR-T designed the experiments, oversaw the project and led on much of the writing. All authors read and approved the manuscript.

Received: 24 August 2009 Accepted: 1 April 2010 Published: 1 April 2010

References
1. Ruiz-Trillo I, Burger G, Holland PW, King N, Lang BF, Roger AJ, Gray MW: The origins of multicellularity: a multi-taxon genome initiative. Trends Genet 2007, 23:113-118.
2. Funke L, Dakoji S, Bredt DS: Membrane-associated guanylate kinases regulate adhesion and plasticity at cell junctions. Annual Review of Biochemistry 2005, 74:219-245.
3. Tanentzapf G, Tepass U: Interactions between the crumbs, lethal giant larvae and bazooka pathways in epithelial polarization. Nat Cell Biol 2003, 5:46-52.
4. Fanning AS, Jameson BJ, Jesaitis LA, Anderson JM: The tight junction protein ZO-1 establishes a link between the transmembrane protein occludin and the actin cytoskeleton. J Biol Chem 1998, 273:29745-29753.
5. Wittchen ES, Haskins J, Stevenson BR: NZO-3 expression causes global changes to actin cytoskeleton in Madin-Darby canine kidney cells: linking a tight junction protein to Rho GTPases. Mol Biol Cell 2003, 14:1757-1768.
6. Patrie KM, Drescher AJ, Goyal M, Wiggins RC, Margolis B: The membrane-associated guanylate kinase protein MAGI-1 binds megalin and is present in glomerular podocytes. J Am Soc Nephrol 2001, 12:667-677.
7. Lahey T, Gorczyca M, Jia XX, Budnik V: The Drosophila tumor suppressor gene dlg is required for normal synaptic bouton structure. Neuron 1994, 13:823-835.
8. Aoki C, Miko I, Oviedo H, Mikeladze-Dvali T, Alexandre L, Sweeney N, Bredt DS: Electron microscopic immunocytochemical detection of PSD-95, PSD-93, SAP-102, and SAP-97 at postsynaptic, presynaptic, and nonsynaptic sites of adult and neonatal rat visual cortex. Synapse 2001, 40:239-257.
9. Schnell E, Sizemore M, Karimzadegan S, Chen L, Bredt DS, Nicoll RA: Direct interactions between PSD-95 and stargazin control synaptic AMPA receptor number. Proc Natl Acad Sci USA 2002, 99:13902-13907.
10. Te Velthuis AJ, Admiraal JF, Bagowski CP: Molecular Evolution of the MAGUK Family in Metazoan Genomes. BMC Evol Biol 2007, 7:129.
11. Adell T, Gamulin V, Perovic-Ottstadt S, Wiens M, Korzhev M, Muller IM, Muller WE: Evolution of metazoan cell junction proteins: the scaffold protein MAGI and the transmembrane receptor tetraspanin in the demosponge Suberites domuncula. J Mol Evol 2004, 59:41-50.
12. Ruiz-Trillo I, Roger AJ, Burger G, Gray MW, Lang BF: A phylogenomic investigation into the origin of metazoa. Mol Biol Evol 2008, 25:664-672.
13. Xu W, Schluter OM, Steiner P, Czervionke BL, Sabatini B, Malenka RC: Molecular dissociation of the role of PSD-95 in regulating synaptic strength and LTD. Neuron 2008, 57:248-262.
14. Ruiz-Trillo I, Inagaki Y, Davis LA, Sperstad S, Landfald B, Roger AJ: Capsaspora owczarzaki is an independent opisthokont lineage. Curr Biol 2004, 14(22):R946-947.
15. Steenkamp ET, Wright J, Baldauf SL: The protistan origins of animals and fungi. Mol Biol Evol 2006, 23:93-106.
16. Shalchian-Tabrizi K, Minge MA, Espelund M, Orr R, Ruden T, Jakobsen KS, Cavalier-Smith T: Multigene phylogeny of choanozoa and the origin of animals. PLoS ONE 2008, 3:e2098.
17. Abedin M, King N: The premetazoan ancestry of cadherins. Science 2008, 319:946-948.
18. Sakarya O, Armstrong KA, Adamska M, Adamski M, Wang IF, Tidor B, Degnan BM, Oakley TH, Kosik KS: A post-synaptic scaffold at the origin of the animal kingdom. PLoS ONE 2007, 2:e506.
19. Miyata T, Suga H: Divergence pattern of animal gene families and relationship with the Cambrian explosion. Bioessays 2001, 23:1018-1027.
20. Srivastava M, Begovic E, Chapman J, Putnam NH, Hellsten U, Kawashima T, Kuo A, Mitros T, Salamov A, Carpenter ML, Signorovitch AY, Moreno MA, Kamm K, Grimwood J, Schmutz J, Shapiro H, Grigoriev IV, Buss LW, Schierwater B, Dellaporta SL, Rokhsar DS: The Trichoplax genome and the nature of placozoans. Nature 2008, 454:955-960.
21. Schierwater B, Eitel M, Jakob W, Osigus HJ, Hadrys H, Dellaporta SL, Kolokotronis SO, Desalle R: Concatenated analysis sheds light on early metazoan evolution and fuels a modern “urmetazoon” hypothesis. PLoS Biol 2009, 7:e20.
22. Dellaporta SL, Xu A, Sagasser S, Jakob W, Moreno MA, Buss LW, Schierwater B: Mitochondrial genome of Trichoplax adhaerens supports Placozoa as the basal lower metazoan phylum. Proc Natl Acad Sci USA 2006, 103:8751-8756.
23. Miller DJ, Ball EE: Animal evolution: trichoplax, trees, and taxonomic turmoil. Curr Biol 2008, 18:R1003-5.
24. Philippe H, Derelle R, Lopez P, Pick K, Borchiellini C, Boury-Esnault N, Vacelet J, Renard E, Houliston E, Queinnec E, Da Silva C, Wincker P, Le Guyader H, Leys S, Jackson DJ, Schreiber F, Erpenbeck D, Morgenstern B, Worheide G, Manuel M: Phylogenomics Revives Traditional Views on Deep Animal Relationships. Curr Biol 2009, 19(8):706-12.
25. Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5:113.
26. Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 2006, 22:2688-2690.
27. Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology 2003, 52:696-704.
28. Le SQ, Gascuel O: An improved general amino acid replacement matrix. Mol Biol Evol 2008, 25:1307-1320.
29. Stanke M, Morgenstern B: AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res 2005, 33:W465-7.
30. Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol 1997, 268:78-94.
31. Ruiz-Trillo I, Lane CE, Archibald JM, Roger AJ: Insights into the evolutionary origin and genome architecture of the unicellular opisthokonts Capsaspora owczarzaki and Sphaeroforma arctica. Journal of Eukaryotic Microbiology 2006, 53:1-6.
32. Huson DH, Bryant D: Application of phylogenetic networks in evolutionary studies. Mol Biol Evol 2006, 23:254-267.

doi:10.1186/1471-2148-10-93
Cite this article as: de Mendoza et al.: Evolution of the MAGUK protein gene family in premetazoan lineages. BMC Evolutionary Biology 2010 10:93.

Results R2

Summary of article R2: Genomic survey of premetazoans shows deep conservation of cytoplasmic tyrosine kinases and multiple radiations of receptor tyrosine kinases.

Tyrosine kinases are proteins that play an important role in cell communication, adhesion, and differentiation in metazoans. Understanding their origin and early evolution is crucial for understanding the origin of metazoans, given that this system is fundamental to animal multicellularity. Although tyrosine kinases have already been described in choanoflagellates, few data are available on their presence in other lineages that predate the metazoans. To determine the origin of tyrosine kinases, we carried out an exhaustive search of the genome and transcriptome of the two described filasterean species: Capsaspora owczarzaki and Ministeria vibrans. In this study we present the repertoire of 103 tyrosine kinases encoded in the genome of C. owczarzaki and 15 tyrosine kinases from M. vibrans obtained by PCR. Through phylogenetic analyses and comparison of protein domain organization, we show that the basic repertoire of metazoan cytoplasmic tyrosine kinases was established before the divergence of filastereans from the rest of the holozoans (metazoans and choanoflagellates). In contrast, receptor tyrosine kinases diversified independently in each lineage. This difference in divergence patterns between cytoplasmic and receptor tyrosine kinases suggests that the receptor type was used to receive environmental signals in a unicellular context and was later recruited as a tool for intercellular communication at the onset of animal multicellularity.


RESEARCH ARTICLE

EVOLUTION Genomic Survey of Premetazoans Shows Deep Conservation of Cytoplasmic Tyrosine Kinases and Multiple Radiations of Receptor Tyrosine Kinases

Hiroshi Suga,1,2* Michael Dacre,3 Alex de Mendoza,1,2 Kamran Shalchian-Tabrizi,4 Gerard Manning,3* Iñaki Ruiz-Trillo1,2,5*

The evolution of multicellular metazoans from a unicellular ancestor is one of the most important advances in the history of life. Protein tyrosine kinases play important roles in cell-to-cell communication, cell adhesion, and differentiation in metazoans; thus, elucidating their origins and early evolution is crucial for understanding the origin of metazoans. Although tyrosine kinases exist in choanoflagellates, few data are available about their existence in other premetazoan lineages. To unravel the origin of tyrosine kinases, we performed a genomic and polymerase chain reaction (PCR)–based survey of the genes that encode tyrosine kinases in the two described filasterean species, Capsaspora owczarzaki and Ministeria vibrans, the closest relatives to the Metazoa and Choanoflagellata clades. We present 103 tyrosine kinase–encoding genes identified in the whole genome sequence of C. owczarzaki and 15 tyrosine kinase–encoding genes cloned by PCR from M. vibrans. Through detailed phylogenetic analysis, comparison of the organizations of the protein domains, and resequencing and revision of tyrosine kinase sequences previously found in some whole genome sequences, we demonstrate that the basic repertoire of metazoan cytoplasmic tyrosine kinases was established before the divergence of filastereans from the Metazoa and Choanoflagellata clades. In contrast, the receptor tyrosine kinases diversified extensively in each of the filasterean, choanoflagellate, and metazoan clades. This difference in the divergence patterns between cytoplasmic tyrosine kinases and receptor tyrosine kinases suggests that receptor tyrosine kinases that had been used for receiving environmental cues were subsequently recruited as a communication tool between cells at the onset of metazoan multicellularity.

1Institut de Biologia Evolutiva (UPF-CSIC), Passeig Marítim de la Barceloneta 37-49, 08003 Barcelona, Spain. 2Departament de Genètica, Universitat de Barcelona, 08024 Barcelona, Spain. 3Razavi Newman Center for Bioinformatics, Salk Institute for Biological Studies, La Jolla, CA 92037, USA. 4Microbial Evolution Research Group, Department of Biology, University of Oslo, 0316 Oslo, Norway. 5Institució Catalana per a la Recerca i Estudis Avançats, 08010 Barcelona, Spain.
*To whom correspondence should be addressed. E-mail: hiroshi.suga@ibe.upf-csic.es (H.S.); [email protected] (G.M.); [email protected] (I.R.-T.)

INTRODUCTION

Multicellularity evolved several times in eukaryotes (1–3). Of particular interest is the multicellular system of metazoans, composed of highly specialized cells that perform coordinated interaction and communication, which leads to various highly complex and mobile forms. Although it has been hypothesized that the evolution of metazoan multicellularity was driven by the emergence of genes whose products were involved in cell adhesion, cell differentiation, and cell-to-cell communication, it is now evident that several of these key genetic modifications occurred well before the origin of animals (2, 4–8).

Many multicellular-specific functions, including cell-to-cell communication and control of cell proliferation and differentiation, are served by protein tyrosine kinases (TKs or PTKs) (9–11). TKs are divided into two types: receptor TKs (RTKs) and nonreceptor or cytoplasmic TKs (CTKs). RTKs mostly receive their specific ligands through their extracellular domains and initiate signal transduction cascades that are mediated by phosphorylated tyrosine residues, whereas CTKs act within cells and transmit the phosphotyrosine signals initiated by receptors (9, 10). One of the most remarkable features of TKs is their high degree of structural diversity. Most TKs are made up of multiple protein domains and motifs other than the catalytic (or kinase) domain (KD). Such divergent architectures are considered to have been generated by gene duplication and domain shuffling (9, 10, 12). The extracellular regions of RTKs, which are composed of various domains and motifs, are thought to bind to specific ligands directly, and their expression is mostly restricted to specific cell types in the organism (10). Moreover, KDs show an increased versatility (that is, they are frequently combined with other domains), especially in the Metazoa (13–15). Thus, the diversity of TKs, which consist of approximately 30 families [which correspond to subfamilies in some of our previous publications (16–19)] with different protein domain organizations, may reflect the complexity of the phosphotyrosine-based signaling system in metazoans.

TKs are members of the eukaryotic protein kinase (ePK) superfamily and are likely to have evolved from one of the ancestral ePK groups (20–22). TKs phosphorylate substrates only on tyrosine residues, whereas other ePKs phosphorylate mostly on serine and threonine residues and are thus classified as serine-threonine kinases (STKs) (21). Although some STKs can also phosphorylate tyrosine residues (23), the TK group is discriminated from other groups of ePKs by their overall sequence similarity and by the presence of a characteristic catalytic loop motif (24). The repertoire of CTK and RTK families is highly conserved across all metazoans, including sponges (17, 18, 25). In accordance with their substantial involvement in intercellular communication and the control of cell proliferation and differentiation, TKs had been thought to be exclusive to metazoans (21); however, this notion was dispelled by the discovery of an elaborate collection of TKs in the unicellular and colonial choanoflagellates (7, 19, 24, 26–28), which are the closest relatives to the Metazoa. Moreover, previous reports had suggested the presence of TKs in some

non-opisthokont lineages, including the amoebozoans Dictyostelium discoideum and Entamoeba histolytica, the green alga Chlamydomonas reinhardtii, and the oomycete Phytophthora infestans (Fig. 1) (29, 30), although they are far less numerous than those of metazoans and choanoflagellates. These findings prompted us to explore whether a TK-based signaling system is also found in the Filasterea, the sister group to the Metazoa and Choanoflagellata (MeCh) (Fig. 1) (4, 5). The Filasterea consist of only two known species: Capsaspora owczarzaki, a symbiotic amoeba that dwells in the snail Biomphalaria glabrata (31), and Ministeria vibrans, a free-living marine protist (4, 32). Although several putative TKs have been reported in expressed sequence tag (EST) data from M. vibrans (4), a comprehensive genomic analysis of filasterean TKs has not yet been conducted.

Here, we analyzed the whole genome sequence of C. owczarzaki and performed targeted polymerase chain reaction (PCR)–based cloning of M. vibrans complementary DNAs (cDNAs) to provide a picture of the earlier stage of TK evolution focused on these unicellular relatives of metazoans. Through comprehensive phylogenetic approaches including two filastereans and other premetazoans, together with a detailed comparison of their protein domain organizations, we have elucidated in detail the early evolution of TK diversity. Our data show very different divergence patterns between CTKs and RTKs. The basic repertoire of CTKs had already been established before the separation of filastereans and the MeCh, whereas the RTKs show an extensive and independent diversification in each of the filasterean, choanoflagellate, and metazoan clades.

www.SCIENCESIGNALING.org 1 May 2012 Vol 5 Issue 222 ra35

Fig. 1. Schematic phylogenetic tree of eukaryotes depicting the presence of group A and B TKs. A widely accepted consensus phylogeny (32, 55) based on phylogenomic studies is shown. The presence (circle) and absence (x) of the group A and B TKs are shown on the right. Representative genera used in this study are shown after the clade names. The holozoan TKs (group A) most likely originated from the group B TKs. The asterisk indicates that the genome sequencing of ichthyosporeans is ongoing (36).
Tree labels: Opisthokonta; Holozoa; MeCh; Metazoa (Homo, Drosophila, Amphimedon); Choanoflagellata (Monosiga); Filasterea (Capsaspora, Ministeria); Ichthyosporea*; Fungi (Allomyces, Spizellomyces); Apusozoa (Thecamonas); Amoebozoa (Dictyostelium, Entamoeba, Acanthamoeba); Plantae (Chlamydomonas); Oomycota (Phytophthora).

Fig. 2. Classification of C. owczarzaki and M. vibrans TKs. One hundred three C. owczarzaki TKs and 15 M. vibrans TKs were divided into 27 and 7 distinct families, respectively, by domain architecture and KD phylogeny. A typical domain organization is schematically displayed for each family, with the number of the genes that belong to each family shown in parentheses. The domain organizations of all the proteins can be found in fig. S1. The Pfam or SMART domain names are shown on the bottom. The sizes of the illustrations are proportional to the actual lengths of amino acid sequences. Scale bar, 200 amino acid residues.
C. owczarzaki, cytoplasmic: Src (2), Btk (1), Csk (2), Abl (1), Fak (1), CTK1 (2), CTK2 (1), CTK3 (1).
C. owczarzaki, receptor: RTK1 (5), RTK2 (4), RTK3 (1), RTK4 (1), RTK5 (5), RTK6 (1), RTK7 (1), RTK8 (7), RTK9 (1), RTK10 (1), RTK11 (40), RTK12 (1), RTK13 (5), RTK14 (4), RTK15 (1), RTK16 (11), RTK17 (1), RTK18 (1), RTK19 (1).
M. vibrans, cytoplasmic: Src (2), Csk (2), Abl (1), Fak (1), Fes (1), CTK1 (1).
M. vibrans, receptor: RTK1 (7).
Domain legend: TyrK (catalytic domain), signal peptide, transmembrane, SH2, SH3, FU, PH, BTK, B41, Focal_AT, Calxβ, DSL, DUF1793, Sushi, FN3, LRRNT, PbH1, LRR, N Cys-rich, MopE, EGF, Recep_L_domain.

RESULTS

Diversity of TKs in the Filasterea

To better understand the early evolution of TKs, we conducted a genomic and PCR-based survey for TK-encoding genes in the Filasterea. We identified 103 putative TK-encoding genes in the whole genome sequence of C. owczarzaki (Fig. 2 and fig. S1). Of these, 92 are predicted to be RTKs, which contain an intracellular KD, a transmembrane (TM) segment, a signal peptide, and known protein domains and motifs in the extracellular

region. The other 11 TKs lack a signal peptide and a TM and are thus classified as CTKs. We also isolated seven RTKs and eight CTKs by a PCR-based survey in M. vibrans (Fig. 2 and fig. S1).

According to their protein domain organization and the phylogenetic relationship between KDs, we classified C. owczarzaki TKs into 27 families (8 CTK and 19 RTK) and M. vibrans TKs into 7 families (6 CTK and 1 RTK) (see Materials and Methods for the classification criteria). The CTK repertoire of M. vibrans is similar to that of C. owczarzaki, but the only RTK family identified in M. vibrans is unique and does not share its domain architecture with any of the known RTK families, including those of C. owczarzaki. RTKs of C. owczarzaki show an extensive divergence in their architectures, containing 19 families with distinct organizations of protein domains. Among them, the RTK11 family is the largest with 40 genes, which encode 4 to 46 leucine-rich repeats (LRRs) (fig. S1). Many of the extracellular domains found in these filasterean RTKs, such as the epidermal growth factor (EGF)–like and fibronectin type III (FN3) domains, generally interact with other proteins, including extracellular ligands or other receptors.

Phylogenetic position of TKs in the ePK superfamily

To elucidate the origin of TKs, we first conducted a preliminary phylogenetic analysis of the ePK superfamily, including TKs and STKs. We included 23 diverse STK families, with a particular focus on close relatives of TKs (21, 22), as well as a partial TK repertoire from filastereans, metazoans, and the choanoflagellate Monosiga brevicollis, in addition to the previously reported TKs from pre-opisthokonts (29, 30). We also included 5 putative KD sequences of TKs that were found in the genome sequence of the apusozoan Thecamonas trahens (33), which is the putative sister group to the opisthokonts (Fig. 1) (34), and another 17 found in the genome of the amoebozoan Acanthamoeba castellanii. We found no TK-encoding genes in any of the available fungal genomes, including those of the two early-branching species Allomyces macrogynus and Spizellomyces punctatus (35), which have been sequenced under the UNICORN project (36). This suggests that TKs originated before the divergence between metazoans and fungi and were secondarily lost from fungi (30), if the consensus phylogeny is correct (Fig. 1).

The Bayesian phylogenetic tree inferred from the alignment of KD sequences provided a nearly maximum statistical support (a Bayesian posterior probability of 0.97) to a monophyletic clustering of holozoan [metazoan, choanoflagellate, and filasterean (Fig. 1)] TKs (group A TKs) (Fig. 3 and fig. S2). We deduced a similar tree with the maximum likelihood (ML) method (fig. S3A). Removal of TK sequences from the data set did not alter the overall topology of the ePK tree (fig. S4). The group A TKs branch within another type of TK (group B TKs), which includes all of the putative pre-opisthokont TKs except for some amoebozoan TKs and shows a diversification independent of that of group A. The monophyletic clustering of group A and B TKs is supported with a maximum Bayesian posterior probability of 1.0.

Fig. 3. Phylogenetic positions of TKs in the ePK superfamily. A Bayesian tree was inferred from 173 amino acid sites of the KD alignment. Twenty-three diverse STK families were chosen (21, 56). MLK, mixed lineage kinase; GC-PK, guanylate cyclase–coupled protein kinase; CTR, constitutive triple response; EDR, enhanced disease resistance; DPYK, D. discoideum protein tyrosine kinase; ILK, integrin-linked kinase; CaMK, calcium/calmodulin-dependent protein kinase; MLCK, myosin light chain kinase; βARK, β-adrenergic receptor kinase; LRRK, leucine-rich repeat kinase; Plant RLK, plant receptor–like kinase; TESK, testis-specific protein kinase. The Bayesian posterior probabilities are shown at two key branches. Predicted or experimentally proven TKs are shown by red lines. The asterisk indicates that the TK motif is not clear but was experimentally shown to have only TK activity (29). Holozoan group A TKs are highlighted by red shading. Abbreviated species names are as follows: A, A. castellanii; C, C. reinhardtii; D, D. discoideum; E, E. histolytica; T, T. trahens; and P, P. infestans. Further details can be found in fig. S2.
Tree labels: βARK, MLCK, PKA, CaMK, MAPKK, Mos, CDC2/28, TGFβR, Wee, MAPKKK, MLK, LIMK, LRRK, Raf, TESK, ILK, Plant RLK, GC-PK, CTR/EDR, Plant PK, DPYK1, DPYK2; group A holozoan TK; group B pre-opisthokont TK. Posterior probabilities shown: 0.97 and 1.0. Scale bar, 0.1.

To further corroborate this topology, we performed two additional tests. First, we challenged the position of group A by statistically evaluating the 144 alternative topologies that are different from each other by the group A position; all topologies that placed group A outside group B were statistically rejected (fig. S5). Second, we statistically assessed the diversification pattern between group A TKs, group B TKs, and the STK group also by bootstrapping: 2000 hypothetical ML trees generated by bootstrapping mostly support the relationship among these three groups suggested by the ML tree in Fig. 3 (fig. S6). It is thus likely that the canonical TKs of metazoans, choanoflagellates, and filastereans originated from pre-opisthokont TKs and independently diversified. The experimentally proven TKs from D. discoideum (29) and a few predicted TKs from A. castellanii do not branch within either group A or group B; instead, they are included in the ancestral STK class, which is reminiscent of multiple origins of TKs. However, ML trees optimized either under the constraint that all of the A. castellanii TKs are included within group A and B, or under the constraint that all the amoebozoan TKs (that is, those from both A. castellanii and D. discoideum) are included within group A and B, are not rejected by the one-tailed Kishino-Hasegawa test (fig. S3) (37). Therefore, we cannot confidently conclude whether TKs and their catalytic loop motifs evolved once or multiple times in eukaryote evolution.
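The bootstrap assessment described above amounts to asking in what fraction of replicate trees a given grouping is recovered. The following is a minimal sketch of that counting step only, not the authors' pipeline: each replicate tree is assumed to be already summarised as the set of clades (sets of leaf labels) it contains, and the function name `clade_support`, the leaf labels, and the four toy replicates are hypothetical illustrations.

```python
from typing import FrozenSet, Iterable, Set

Clade = FrozenSet[str]


def clade_support(replicates: Iterable[Set[Clade]], clade: Clade) -> float:
    """Return the fraction of replicate trees that contain `clade`."""
    trees = list(replicates)
    if not trees:
        raise ValueError("no replicate trees given")
    hits = sum(1 for tree in trees if clade in tree)
    return hits / len(trees)


# Hypothetical leaf labels standing in for whole groups of sequences.
GROUP_A = frozenset({"metazoan_TK", "choanoflagellate_TK", "filasterean_TK"})
GROUP_AB = GROUP_A | {"pre_opisthokont_TK"}

# Toy replicates (assumption): each tree is the set of clades it recovered.
replicates = [
    {GROUP_A, GROUP_AB},
    {GROUP_A, GROUP_AB},
    {GROUP_AB},  # group A is not recovered as a clade in this replicate
    {GROUP_A, GROUP_AB},
]

support_a = clade_support(replicates, GROUP_A)
support_ab = clade_support(replicates, GROUP_AB)
print(f"group A: {support_a:.2f}, group A+B: {support_ab:.2f}")
```

In a real analysis the replicate trees would come from a phylogenetics program and would first have to be parsed into clade sets; that parsing step is deliberately left out of this sketch.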


Fig. 4. Comprehensive phylogenetic tree of holozoan TKs. The ML tree was inferred from an alignment consisting of 190 holozoan TKs (group A), 12 pre-opisthokont TKs (chosen from the group B TKs in Fig. 3), and 7 ancestor STKs (chosen from Fig. 3 as an outgroup). All TKs found in the genomes except for highly similar paralogs were included. The genes belonging to clusters I (C. owczarzaki) and II (A. queenslandica) (red shading) are not fully included but were separately analyzed. The sequences of A. queenslandica, M. brevicollis, C. owczarzaki, and M. vibrans are in blue, green, red, and orange lines, respectively. Families are shaded in gray, and the names of 29 major metazoan families are indicated. Black square indicates a putative M. brevicollis ortholog of the Syk family, which is classified into an independent family as a result of a minor architectural difference (gray stripe). Jak family proteins share the same domain architecture with this protein. It can thus also be a Jak homolog whose KD had been replaced with that of a Syk-related family. Black circle indicates a possible A. queenslandica ortholog of the CCK4 family, as suggested by another ML analysis excluding biased or fast-evolving sequences (fig. S10). Asterisks indicate RTKs. The detailed tree with individual gene names, bootstrap values, and protein architectures is shown in fig. S7.
Tree labels: holozoan TKs (group A) from Amphimedon, Monosiga, Capsaspora, and Ministeria; families Src, Tec, Csk, Abl, Fak, Syk, Shark/HTK16, EGFR, Ack, STYK, PDGFR/VEGFR, FGFR, Ret, Tie, Ryk, HGFR/Met, Axl, Ros/Sev, Alk, LMTK, INSR, VKR, CCK4, Ror/Musk, NGFR, DDR, Eph, Jak, Fes; Ministeria RTKs; Capsaspora cluster I; Amphimedon cluster II; pre-opisthokont TKs (group B); outgroup (STKs). Scale bar, 0.1.

Evolution of architectural diversity of the holozoan TKs

To further explore the evolutionary history of holozoan TKs, we performed a more focused ML phylogenetic analysis on the TKs from C. owczarzaki and M. vibrans as well as those from the choanoflagellate M. brevicollis and some representative metazoans, including Homo sapiens, Drosophila melanogaster, and the sponge Amphimedon queenslandica (Fig. 4 and fig. S7). TKs showing identical domain architecture and highly similar KD sequences with others in a single organism are considered to be recently duplicated paralogs, which were removed to simplify the analysis. Two large monophyletic clusters of C. owczarzaki and A. queenslandica TKs (clusters I and II, respectively) were also trimmed and analyzed separately (Fig. 5 and figs. S8 and S9). Caution should be exercised in using published genome annotations, which were usually automatically performed, because long protein sequences such as those of TKs are often mispredicted. Thus, we manually revised the published M. brevicollis and A. queenslandica TK-encoding gene predictions (24, 25) and optimized the predictions of their domain organizations. To minimize phylogenetic artifacts, we also inferred an ML tree that excluded sequences with biased amino acid composition and rapidly evolving sequences, using as an outgroup the group B TKs, which are the closest relatives of group A TKs (fig. S10); the tree topology was mostly in agreement with the original tree including the complete data set (Fig. 4), except for the position of A. queenslandica immunoglobulin (Ig)7–RTK, which is suggested to be a colon carcinoma kinase 4 (CCK4) family homolog (fig. S10). Consistent with the preliminary analysis (Fig. 3), the tree showed that all holozoan TKs (group A) form a clade distinct from that of the group B pre-opisthokont TKs (Fig. 4).

As shown by previous PCR-based studies of a freshwater sponge (17, 18), the A. queenslandica genome strengthened the notion that the basic repertoire of the CTK and RTK families found in most eumetazoan lineages (cnidarians and bilaterians) had already been established before the divergence of eumetazoans and sponges (25); 15 (16 if the CCK4 family is included) of the 29 major eumetazoan TK families were identified in the genome of A. queenslandica (Fig. 4). Phylogenetic analyses suggest gene losses in sponges rather than innovations in eumetazoans, even for eumetazoan families that sponges lack (17, 18). Their protein domain

organizations are also well conserved (figs. S7 and S11) (17, 25). Moreover, this anciently established repertoire has not suffered a marked change in metazoan evolution (17, 18). Some metazoans, however, exceptionally diversified their TK repertoire independently in each lineage (22, 25) by creating TKs with diverse architectures, for example, in A. queenslandica (Fig. 4, cluster II, and fig. S9).

The TK repertoire of filastereans shows a divergence pattern markedly different from that of sponges; most CTKs are clearly orthologous to 6 of the 10 major metazoan CTK families (Src, Tec, Csk, Abl, Fak, and Fes) on the basis of KD sequence similarity and overall domain architecture, whereas none of the RTKs can be confidently assigned to any metazoan family by the same criteria (Figs. 2 and 4 and fig. S11). Similarly, the PCR-based sampling of the TK repertoire of M. vibrans identified eight CTKs, of which seven are assigned to five metazoan families, and seven RTKs, all of which belong to a unique family (Figs. 2 and 4 and fig. S11). A similar situation was also seen in choanoflagellates, which have orthologs of at least six metazoan CTK families (Fig. 4 and fig. S11), and a large RTK expansion that appears independent of both filastereans and metazoans (19, 24).

Fig. 5. Expansion of C. owczarzaki–specific TKs in cluster I. The tree was inferred by the ML method. Different C. owczarzaki TK families are indicated by distinct symbols and colors. A typical domain organization is shown for each family in the lower panel. The abbreviated names of the protein domains or motifs were taken from the SMART or Pfam databases, except for “Cys-rich,” which are not mapped to any of the known protein domains or motifs but just contain periodically arranged cysteines. The TKs with Cys-rich domains were classified by the number of cysteines (shown in the diagrams) usually seen in a repeating unit. C indicates CTKs. A cross indicates an RTK whose extracellular domain organization has been entirely altered by domain duplication, shuffling, and conversion that occurred relatively recently (more than 85% KD identity with its counterpart). Blue bars indicate the evolutionary points or branches where extracellular domain alterations are supposed to have occurred by parsimonious inference. Red and green arrowheads indicate CoPTK-207 and CoPTK-219, respectively, whose extracellular sequences are highly similar (see fig. S14). The detailed tree with the bootstrap supports and protein architectures is presented in fig. S8.
Lower-panel families (cysteines per Cys-rich repeat unit in brackets where shown): RTK1, RTK2, RTK3 [10], RTK4 [4], RTK5 [2], RTK6 [2], RTK7 [6], RTK8, RTK10 [8], RTK11, RTK12, RTK13, RTK14, RTK15, RTK16, CTK1, CTK2. Domain legend: TyrK, signal peptide, transmembrane, DSL, Sushi, LRRNT, LRR, N Cys-rich, EGF, Calxβ, DUF1793, FN3, PbH1, PH. Scale bars: tree, 0.1; domain diagrams, 200 aa.

Diversity and commonality of the holozoan TK repertoires

We summarized the numbers of unique and shared TKs among these five holozoan genomes, H. sapiens, D. melanogaster, A. queenslandica, M. brevicollis, and C. owczarzaki (Fig. 6 and fig. S11), highlighting the ancient establishment of CTK families and the rapid turnover of RTKs in the premetazoan stage. Of the 10 CTK families that are mostly in common among the three metazoans (fig. S11), 6 are also present in at least one filasterean. Another ML analysis focusing only on CTK families confirmed this finding (fig. S12). The basic Src homology 2 (SH2)–SH3–KD architecture of the Src-related families (Src, Tec, Csk, and Abl) was established before the divergence of filastereans and the MeCh and had already diversified into the four families. The presence of homologs of Fes and Fak in filastereans demonstrates that they were secondarily lost from M. brevicollis, although a putative Fes homolog is present in the choanoflagellate Codosiga gracilis (19). In addition, we identified a previously unreported Shark/HTK16 homolog in the M. brevicollis genome. This gene, together with the possible Syk ortholog or Jak homolog of this choanoflagellate (Fig. 4 and fig. S11), brings the number of metazoan CTK families with a premetazoan origin to eight. In addition to these common CTK families, each holozoan clade has also increased the repertoire independently by gene duplication and domain shuffling (fig. S11). In contrast, we did not observe a clear orthology between the RTKs of metazoans, choanoflagellates, and filastereans (Fig. 6), either by the phylogeny of KD sequences or by comparison of domain organization (Fig. 4 and figs. S7 and S11).

We then focused on the pattern of clade-specific RTK expansion within the Choanoflagellata. As described earlier, most metazoan-specific families


seem to have been generated within early metazoans before the divergence between sponges and the rest of the metazoans. To determine whether a similar pattern was seen in choanoflagellates, we compared the complete TK repertoire of M. brevicollis (family Codonosigidae) with partial ones of Monosiga ovata (family Salpingoecidae) (38), C. gracilis (family Codonosigidae), and Stephanoeca diplocostata (family Acanthoecidae), which were obtained by PCR surveys (fig. S13) (19). Five choanoflagellate-specific CTK families and three RTK families are shared between M. brevicollis and other choanoflagellate species (figs. S11 and S13). In contrast, three CTK and seven RTK families found in M. ovata, C. gracilis, and S. diplocostata are not present in the whole genome of M. brevicollis (fig. S13). Thus, unlike in metazoans, the generation of novel TK families early in choanoflagellate evolution continued to at least beyond the divergence of these three taxonomic families of the Choanoflagellata.

Fig. 6. Diversity and commonality of holozoan TK families. Numbers of TK families of five holozoans are summarized in Venn diagrams. The numbers were counted on the basis of the classification shown in fig. S11. Gene duplications that occurred within a family were not taken into account. An asterisk indicates that although the Syk family is shared only by H. sapiens and A. queenslandica, having the same domain architecture, a possible ortholog is present in M. brevicollis with a B41 domain added. This gene could also be a Jak homolog. See the legend to Fig. 4 for further details. The dagger indicates that ML analysis on the data set excluding biased or fast-evolving sequences (see fig. S10) suggests the A. queenslandica Ig7-RTK as a CCK4 family gene, although we do not include it here according to the ML analysis on the complete data set (Fig. 4 and fig. S7).
Panels: CTK and RTK Venn diagrams for human, D. melanogaster, A. queenslandica, M. brevicollis, and C. owczarzaki.

Extensive lineage-specific diversification of the C. owczarzaki RTKs

Of the 92 C. owczarzaki RTKs, 88 belong to a single cluster, cluster I (Figs. 4 and 5 and fig. S8), on the basis of KD sequence similarity. They display diverse domain organizations, which is indicative of frequent domain shuffling. Given that their common ancestor belonged to the RTK11 family (Fig. 5), which is most common in this cluster (Fig. 2), we parsimoniously estimated 22 extracellular architecture alterations in total (Fig. 5). Fourteen RTKs show particularly high KD sequence identities (more than 85%) to their close relatives that have distinct extracellular domain organizations, which suggest recent duplication and domain swapping (Fig. 5). It is thus likely that frequent gene duplication and domain shuffling that occurred in C. owczarzaki augmented the diversity of RTKs of this protist. The A. queenslandica RTKs in cluster II show a similar pattern of diversification (fig. S9), indicating that the rapid expansion of RTKs is not a phenomenon specific to unicellular protists but occurred also in the earliest branching metazoans with primitive multicellularity (39, 40). These mechanisms of quick expansion of RTKs are likely to have contributed to increasing the variety of protein tyrosine phosphatases (PTPs), which are antagonists of TK signaling (41). For example, the extracellular region of the C. owczarzaki RTK CoPTK-207 (Fig. 5) shows a high sequence identity not only to that of the other RTK CoPTK-219 (Fig. 5) but also to that of a PTP (CoPTP-14FN3) that has two PTP catalytic domains instead of a KD (fig. S14).

DISCUSSION

Our data show that holozoan TKs originated from ancestral pre-opisthokont TKs and diversified independently, and that fungi secondarily lost TKs. The two known filastereans, C. owczarzaki and M. vibrans, have TK repertoires that rival those of choanoflagellates and metazoans in size. This suggests that an elaborate TK repertoire emerged before the divergence of filastereans from choanoflagellates and metazoans. Moreover, our data show that the early evolution of RTKs is markedly different from that of CTKs. Of the 10 common metazoan CTK families, eight were already in place even before the divergence of the Filasterea from the MeCh. In contrast to the early maturation of CTKs, no orthology was found among RTKs in metazoans, choanoflagellates, and filastereans. This indicates that the last common ancestor of holozoans had (i) one or a few RTKs, from which each clade has expanded the repertoire by frequent gene duplication, and duplication and shuffling of the extracellular domains, or (ii) many RTKs, but extensive domain shuffling, gene loss, or KD conversion in each clade has obscured the orthology on the basis of the architectures of the extracellular regions and KD sequence homology. In all of these three lineages, RTKs substantially outnumber CTKs, and some families are highly expanded.

Within metazoans, we saw substantial RTK generation at the earliest stage of metazoan evolution before the split between sponges and eumetazoans but far fewer changes in domain architecture in more evolutionarily recent times. Although some metazoan lineages (for example, sponges) seem to have expanded their TK repertoire also recently (fig. S9), most recent large-scale changes are limited to simple duplications, including whole-genome duplications within vertebrates (12, 16, 42, 43). On the other hand, choanoflagellates seem to have continued the diversification of TKs even after the divergence of their three taxonomic families. However, more data are necessary to assess the continued TK diversification within choanoflagellates and filastereans.

We have no experimental evidence for the biological functions of filasterean TKs. Although a previous publication suggested that the phosphotyrosine-mediated signaling system of choanoflagellates might serve a function related to the cell cycle or cell survival (44), still nothing is known about the extracellular signals received by the RTKs. However, the clade-specific divergence pattern of RTKs is consistent with the hypothesis that they act to detect changes in the extracellular environment or

to recognize and catch prey (44), because each organism has to be adapted to its own environment and nutrient conditions. This might explain why the extracellular regions of RTKs in filastereans and choanoflagellates, which are usually exposed to the environment, are highly divergent in each lineage. The metazoan RTK repertoire, however, seems to be largely stable after the initial expansion, with a unique set of metazoan RTKs retained after the emergence of multicellularity. The recruitment of this initial set for functions such as intercellular communication and control of cell adhesion, instead of for receiving environmental cues, might be a key to the evolution of metazoan multicellularity. The sponge-specific RTK repertoire, which diversified independently of the common RTK set of metazoans, may reflect the constant exposure of their tissues to the environment (45), in contrast to eumetazoans, in which the RTKs are generally exposed to internal fluids. The more stable evolution of CTKs would then reflect a relatively stable intracellular environment; CTKs generally act downstream of RTKs or other receptors to transmit extracellular signals (46).

We have shown the extensive diversification of holozoan TKs after the split from fungi, as well as the differences in the patterns of expansion between RTKs and CTKs before and after the evolution of multicellularity. We hypothesize that the generation of a new RTK repertoire and its use to encode cell-to-cell communication tools has been one of the key genetic changes at the transition from the unicellular to the multicellular system. The genome sequencing of the other filasterean M. vibrans, as well as the remaining holozoan taxon, the Ichthyosporea (Fig. 1), and other pre-opisthokonts (36), together with functional assays on TKs, will further clarify the evolutionary origin and functional transition of TK signaling.

MATERIALS AND METHODS

Cultures of filastereans
Live cultures of C. owczarzaki and M. vibrans were purchased from the American Type Culture Collection (ATCC) and maintained at 23°C in ATCC medium 1034 and at 17°C in ATCC medium 1525.

Cloning of TK-encoding genes from C. owczarzaki and M. vibrans
M. vibrans cDNAs were obtained either by PCR with degenerate primers as described previously (17, 19) or by searching available ESTs (4). The 15 obtained cDNAs were extended to the 5′ and 3′ termini by rapid amplification of cDNA ends (RACE) (47). All of the putative TK-encoding sequences previously found in ESTs (4), except the clearly redundant ones, were included in this study. The uncertain exon-intron boundaries and 5′ and 3′ ends of five Capsaspora TKs (CoPTK-96, CoPTK-102, CoPTK-219, CoPTK-235, and CoPTK-238) were confirmed by PCR and RACE. Sequences were deposited in GenBank under accession numbers AB591048 to AB591054.

Data mining
The genome sequences and predicted protein collections of H. sapiens, D. melanogaster, M. brevicollis, and C. owczarzaki were obtained from http://www.ncbi.nlm.nih.gov/RefSeq/, http://flybase.org/, http://genome.jgi-psf.org/Monbr1/, and http://www.broadinstitute.org/annotation/genome/multicellularity_project/, respectively. The predicted TK-encoding sequences of M. brevicollis and A. queenslandica were obtained from http://kinase.com/. The genome sequence of A. queenslandica was provided by B. Degnan (University of Queensland). The predicted proteomes were searched for TKs by the BLAST and HMMER (48) programs. The highly characteristic motif in the catalytic loop [typically HRDLAARN/HRDLRAAN (24)] was also used to find TKs. Note that none of the published gene predictions automatically performed on whole genome sequences is perfect even though they are covered by theoretically sufficient raw reads, and it is still important to analyze directly on the DNA sequences and the raw reads even if official gene predictions are available. We therefore manually revised all the published prediction of TK-encoding sequences of A. queenslandica, M. brevicollis, and C. owczarzaki by searching the genomic DNA sequences with two ab initio gene prediction programs, GENSCAN (49) and Augustus (50). For the Augustus prediction, we optimized the model parameters specifically for each organism. Their genomic DNA assemblies were also revised and reconstructed from the raw data by the use of the wgs-assembler (http://sourceforge.net/apps/mediawiki/wgs-assembler/) and phrap (http://www.phrap.org) programs when necessary. We then checked all of the predictions manually by comparing with closely related sequences and by inspecting the independent predictions, including suboptimal ones, by GENSCAN and Augustus, paying special attention to obtaining complete sequences. Such reannotations were indispensable both to the classification of TKs based on their protein domain organization and to the reliable phylogeny by KD sequence alignment. The integrity of each protein domain and the presence of the signal peptide in the N terminus of RTK are good indicators of the correct prediction. Comparison of domain architectures of closely related proteins also improved the gene prediction and even the genome sequence assembly. The word “mod” or “new” is added to the name of a M. brevicollis gene if the previous prediction (24) is modified, or if the gene is newly discovered in the genome. We totally renamed the A. queenslandica genes that were originally predicted in the whole genome sequence (25) because most of them have been radically modified in this study. Not all A. queenslandica and M. brevicollis TKs were included for the analyses, but they were selected such that the representative sequences cover all the diversity of TK families. We also searched the whole genome sequences of five pre-opisthokonts, E. histolytica, T. trahens, A. castellanii, C. reinhardtii, and P. infestans, which were retrieved from http://pathema.jcvi.org/cgi-bin/Entamoeba/PathemaHomePage.cgi, ftp://ftp.ncbi.nlm.nih.gov/pub/TraceDB/amastigomonas_sp__atcc_50062, http://www.hgsc.bcm.tmc.edu/microbial-detail.xsp?project_id=163, http://genome.jgi-psf.org/chlamy/, and http://www.broadinstitute.org/annotation/genome/phytophthora_infestans/, respectively. Here, we completed, if possible, only the KD sequences, because the quality of assembly was not always satisfactory for obtaining the full-length sequences. We also went back to the raw data and assembled them if necessary. The whole genome sequences of two basal fungal species A. macrogynus and S. punctatus were retrieved from http://www.broadinstitute.org/annotation/genome/multicellularity_project. All of the sequences (re)annotated in this study are found in the Supplementary Materials.

Protein domain architecture
Protein domain organization was analyzed with HMMER software by searching the SMART (http://smart.embl-heidelberg.de) and Pfam (http://pfam.sanger.ac.uk/) databases. We confirmed the presence of these predicted domains by manually inspecting the alignments by HMMER. Cysteine-rich sequences that were not mapped to any known domain were classified by the number of the cysteines in a repeating unit, which was identified by aligning the repeats. The TM and signal peptides of RTKs were predicted by the TMHMM and SignalP programs (http://www.cbs.dtu.dk/services/), respectively (51).

Classification of TKs
We defined “cognate” TKs as those that had identical domain organization or differed by the loss of one or more copies of a single domain. Repeats of a domain were counted as one (for example, proteins with five and seven Ig-like repeats are cognate, as in the CCK4 family; Fig. 4 and

fig. S7). Noncognate TKs include those that differ by more than one deletion or by one or more domain insertions. Insertions were distinguished from deletions by phylogenetic analysis of their timing. Domain insertions that occurred specifically in bilaterian lineages after the separation from nonbilaterians (or protists) were not taken into account, because such domains might have been defined only by alignments of bilaterian sequences, which are generally much better investigated than others.

We classified TKs into families according to both their domain organization and, with some exceptions (discussed later), their phylogenetic relationship on the basis of the KD sequence similarity. Cognate TKs showing a close phylogenetic relationship were classified into a single family. Those that show a cognate domain organization but do not take the neighboring position in the phylogenetic tree were thus classified into distinct families because they were considered to be products of evolutionary convergence. However, cognate TKs of filasterean, choanoflagellate, or sponge lineages that were not mapped to any common metazoan family (thus, they are likely to have diversified specifically in each lineage) were classified into a single family even if they do not show the closest phylogenetic relationship to each other. As seen in Fig. 5 and figs. S8 and S9, frequent domain swapping and conversion have obscured the orthology of such TK genes during their diversification process. We therefore classified them only on the basis of the domain organizations to avoid overestimation of clade-specific expansion of the TK families of these lineages. By the same reasoning, we also classified TKs from a single species having no known domain but KD (or KD, TM, and signal peptide in the case of RTKs) into a single family, which can be subdivided into multiple families when new protein domains or motifs are described in the future.

Inference of phylogenetic trees
Phylogenetic trees were inferred on the basis of the comparison of KD sequences by the ML method with the four categories WAG-G model by the use of RAxML v7.0.4 software (52). For large trees with more than 90 operational taxonomic units, we refined the initial near-ML topology found by the RAxML program with the same program in a genetic algorithm (GA)–based heuristic approach, similar to a method previously described (53). The difference of our approach is that, after the normal GA-based optimization, we applied a final tuning of the discovered near-ML topology by the tree bisection and reconnection (TBR) and crossover (tree recombination) algorithms (53); we explored all possible topologies generated by TBR from the GA-optimized topology and all of the possible topologies generated by crossing-over all topologies of the GA population in every possible combination, and confirmed that there was no more improvement. With this approach, we could effectively avoid the local maxima of likelihood for large trees (53). The calculation was performed with a computer cluster composed of six iMacs (Apple) with two cores, a Mac Pro (Apple) with eight cores, and two PCs (Dell) with four cores. The bootstrap replicates were also performed by RAxML. One C. owczarzaki TK (CoPTK-243) was not included because its KD is highly divergent. Phylogenetic inference of the holozoan TK position in the protein kinase superfamily was performed also by the Bayesian method with MrBayes v3.1.2 (54) with the same model. We ran the program for 10,000,000 generations (average SD of split frequencies < 0.013). The confidence of clustering is represented by Bayesian posterior probability. The alignments are shown in figs. S15 and S16.

Statistical test on the group-level clustering
ML bootstrap values for the trees including TKs and STKs (fig. S3A) are relatively low, mostly as a result of the use of divergent sequences and the low number (173) of alignment sites. To test the robustness of higher-level topology (that is, the relationship between group A TKs, group B TKs, and members of the STK group), we developed a new method, which assesses the overall group-level diversification in each tree of bootstrap replicates. We first inferred 2000 bootstrap-replicated trees, each optimized by the ML method (one example is shown in fig. S6B). We then determined for each tree two putative boundary branches that most likely separate group A and B TK clusters from STKs and group A TK clusters from group B TKs, allowing some violations of included members. We tracked the tree branches by the minimum path starting from a group A TK-encoding gene (1 in fig. S6A) to an STK-encoding gene (84 in fig. S6A) and sought a branch (STK-TK boundary, blue bars in fig. S6, A and B) that maximized the number of group A and group B TKs below this boundary and the number of STKs on the other side of this boundary. We then sought another branch (group A–group B boundary, red bars in fig. S6, A and B) that maximized the number of group A TKs below this boundary and the number of group B TKs above this boundary but still below the STK-TK boundary. We defined the area below the group A–group B boundary as the red area, and that above the group A–group B boundary, but still below the STK-TK boundary, as the blue area. The numbers of the group A TKs (1 to 16), group B TKs (17 to 37), and TKs that belong to neither group A nor B (38 to 45) were counted in each of the red and blue areas and presented as histograms.

SUPPLEMENTARY MATERIALS
www.sciencesignaling.org/cgi/content/full/5/222/ra35/DC1
Fig. S1. Protein domain organizations of all of the TKs of C. owczarzaki and M. vibrans.
Fig. S2. Phylogenetic position of TKs in the ePK superfamily.
Fig. S3. ML analysis of the phylogenetic positions of TKs and a test of the hypothesis of the multiple origins of TKs.
Fig. S4. Bayesian and ML analyses of ePKs without TKs.
Fig. S5. Statistical tests of the group A position.
Fig. S6. Statistical test of the group-level clustering of TKs.
Fig. S7. Phylogenetic tree and domain architectures of holozoan TKs.
Fig. S8. ML tree and protein domain organizations of cluster I TKs of C. owczarzaki.
Fig. S9. Diversification of A. queenslandica RTKs in cluster II.
Fig. S10. An ML tree excluding biased sequences and rapidly evolving sequences.
Fig. S11. TK families of six holozoans.
Fig. S12. Phylogenetic tree of CTKs.
Fig. S13. Protein domain organizations of TKs from three choanoflagellates and their comparison with the M. brevicollis TK repertoire.
Fig. S14. Comparison of three C. owczarzaki receptor proteins with very similar extracellular regions.
Fig. S15. Alignment of ePKs.
Fig. S16. Alignment of holozoan TKs.
References
Sequence data

REFERENCES AND NOTES
1. J. T. Bonner, The origins of multicellularity. Integr. Biol. 1, 27–36 (1998).
2. N. King, The unicellular ancestry of animal development. Dev. Cell 7, 313–325 (2004).
3. A. Rokas, The molecular origins of multicellular transitions. Curr. Opin. Genet. Dev. 18, 472–478 (2008).
4. K. Shalchian-Tabrizi, M. A. Minge, M. Espelund, R. Orr, T. Ruden, K. S. Jakobsen, T. Cavalier-Smith, Multigene phylogeny of choanozoa and the origin of animals. PLoS One 3, e2098 (2008).
5. I. Ruiz-Trillo, A. J. Roger, G. Burger, M. W. Gray, B. F. Lang, A phylogenomic investigation into the origin of metazoa. Mol. Biol. Evol. 25, 664–672 (2008).
6. A. Sebé-Pedrós, A. J. Roger, F. B. Lang, N. King, I. Ruiz-Trillo, Ancient origin of the integrin-mediated adhesion and signaling machinery. Proc. Natl. Acad. Sci. U.S.A. 107, 10142–10147 (2010).
7. N. King, M. J. Westbrook, S. L. Young, A. Kuo, M. Abedin, J. Chapman, S. Fairclough, U. Hellsten, Y. Isogai, I. Letunic, M. Marr, D. Pincus, N. Putnam, A. Rokas, K. J. Wright, R. Zuzow, W. Dirks, M. Good, D. Goodstein, D. Lemons, W. Li, J. B. Lyons, A. Morris, S. Nichols, D. J. Richter, A. Salamov, J. Sequencing, P. Bork, W. A. Lim, G. Manning, W. T. Miller, W. McGinnis, H. Shapiro, R. Tjian, I. V. Grigoriev, D. Rokhsar, The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans. Nature 451, 783–788 (2008).


8. A. Sebé-Pedrós, A. de Mendoza, B. F. Lang, B. M. Degnan, I. Ruiz-Trillo, Unexpected repertoire of metazoan transcription factors in the unicellular holozoan Capsaspora owczarzaki. Mol. Biol. Evol. 28, 1241–1254 (2011).
9. W. J. Fantl, D. E. Johnson, L. T. Williams, Signalling by receptor tyrosine kinases. Annu. Rev. Biochem. 62, 453–481 (1993).
10. P. van der Geer, T. Hunter, R. A. Lindberg, Receptor protein-tyrosine kinases and their signal transduction pathways. Annu. Rev. Cell Biol. 10, 251–337 (1994).
11. J. Gerhart, 1998 Warkany lecture: Signaling pathways in development. Teratology 60, 226–239 (1999).
12. H. Suga, K. Kuma, N. Iwabe, N. Nikoh, K. Ono, M. Koyanagi, D. Hoshiyama, T. Miyata, Intermittent divergence of the protein tyrosine kinase family during animal evolution. FEBS Lett. 412, 540–546 (1997).
13. G. Apic, J. Gough, S. A. Teichmann, Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol. 310, 311–325 (2001).
14. E. V. Koonin, Y. I. Wolf, G. P. Karev, The structure of the protein universe and genome evolution. Nature 420, 218–223 (2002).
15. S. Wuchty, Scale-free behavior in protein domain networks. Mol. Biol. Evol. 18, 1694–1702 (2001).
16. T. Miyata, H. Suga, Divergence pattern of animal gene families and relationship with the Cambrian explosion. Bioessays 23, 1018–1027 (2001).
17. H. Suga, K. Katoh, T. Miyata, Sponge homologs of vertebrate protein tyrosine kinases and frequent domain shufflings in the early evolution of animals before the parazoan–eumetazoan split. Gene 280, 195–201 (2001).
18. H. Suga, M. Koyanagi, D. Hoshiyama, K. Ono, N. Iwabe, K. Kuma, T. Miyata, Extensive gene duplication in the early evolution of animals before the parazoan–eumetazoan split demonstrated by G proteins and protein tyrosine kinases from sponge and hydra. J. Mol. Evol. 48, 646–653 (1999).
19. H. Suga, G. Sasaki, K. Kuma, H. Nishiyori, N. Hirose, Z. H. Su, N. Iwabe, T. Miyata, Ancient divergence of animal protein tyrosine kinase genes demonstrated by a gene family tree including choanoflagellate genes. FEBS Lett. 582, 815–818 (2008).
20. S. K. Hanks, A. M. Quinn, T. Hunter, The protein kinase family: Conserved features and deduced phylogeny of the catalytic domains. Science 241, 42–52 (1988).
21. S. K. Hanks, T. Hunter, The eukaryotic protein kinase superfamily: Kinase (catalytic) domain structure and classification. FASEB J. 9, 576–596 (1995).
22. G. Manning, G. D. Plowman, T. Hunter, S. Sudarsanam, Evolution of protein kinase signaling from yeast to man. Trends Biochem. Sci. 27, 514–520 (2002).
23. R. A. Lindberg, A. M. Quinn, T. Hunter, Dual-specificity protein kinases: Will any hydroxyl do? Trends Biochem. Sci. 17, 114–119 (1992).
24. G. Manning, S. L. Young, W. T. Miller, Y. Zhai, The protist, Monosiga brevicollis, has a tyrosine kinase signaling network more elaborate and diverse than found in any known metazoan. Proc. Natl. Acad. Sci. U.S.A. 105, 9674–9679 (2008).
25. M. Srivastava, O. Simakov, J. Chapman, B. Fahey, M. E. Gauthier, T. Mitros, G. S. Richards, C. Conaco, M. Dacre, U. Hellsten, C. Larroux, N. H. Putnam, M. Stanke, M. Adamska, A. Darling, S. M. Degnan, T. H. Oakley, D. C. Plachetzki, Y. Zhai, M. Adamski, A. Calcino, S. F. Cummins, D. M. Goodstein, C. Harris, D. J. Jackson, S. P. Leys, S. Shu, B. J. Woodcroft, M. Vervoort, K. S. Kosik, G. Manning, B. M. Degnan, D. S. Rokhsar, The Amphimedon queenslandica genome and the evolution of animal complexity. Nature 466, 720–726 (2010).
26. N. King, S. B. Carroll, A receptor tyrosine kinase from choanoflagellates: Molecular insights into early animal evolution. Proc. Natl. Acad. Sci. U.S.A. 98, 15032–15037 (2001).
27. Y. Segawa, H. Suga, N. Iwabe, C. Oneyama, T. Akagi, T. Miyata, M. Okada, Functional development of Src tyrosine kinases during evolution from a unicellular ancestor to multicellular animals. Proc. Natl. Acad. Sci. U.S.A. 103, 12021–12026 (2006).
28. W. Li, S. L. Young, N. King, W. T. Miller, Signaling properties of a non-metazoan Src kinase and the evolutionary history of Src negative regulation. J. Biol. Chem. 283, 15491–15501 (2008).
29. J. L. Tan, J. A. Spudich, Developmentally regulated protein-tyrosine kinase genes in Dictyostelium discoideum. Mol. Cell. Biol. 10, 3578–3583 (1990).
30. S. H. Shiu, W. H. Li, Origins, lineage-specific expansions, and multiple losses of tyrosine kinases in eukaryotes. Mol. Biol. Evol. 21, 828–840 (2004).
31. L. A. Hertel, C. J. Bayne, E. S. Loker, The symbiont Capsaspora owczarzaki, nov. gen. nov. sp., isolated from three strains of the pulmonate snail Biomphalaria glabrata is related to members of the . Int. J. Parasitol. 32, 1183–1191 (2002).
32. J. Paps, I. Ruiz-Trillo, Animals and their unicellular ancestors, in Encyclopedia of Life Sciences (John Wiley and Sons, Chichester, UK, 2010), pp. 1–8.
33. T. Cavalier-Smith, E. E. Chao, Phylogeny and evolution of Apusomonadida (Protozoa: Apusozoa): New genera and species. Protist 161, 549–576 (2010).
34. G. Torruella, R. Derelle, J. Paps, B. F. Lang, A. J. Roger, K. Shalchian-Tabrizi, I. Ruiz-Trillo, Phylogenetic relationships within the Opisthokonta based on phylogenomic analyses of conserved single copy protein domains. Mol. Biol. Evol. 29, 531–544 (2012).
35. T. Y. James, P. M. Letcher, J. E. Longcore, S. E. Mozley-Standridge, D. Porter, M. J. Powell, G. W. Griffith, R. Vilgalys, A molecular phylogeny of the flagellated fungi (Chytridiomycota) and description of a new phylum (Blastocladiomycota). Mycologia 98, 860–871 (2006).
36. I. Ruiz-Trillo, G. Burger, P. W. Holland, N. King, B. F. Lang, A. J. Roger, M. W. Gray, The origins of multicellularity: A multi-taxon genome initiative. Trends Genet. 23, 113–118 (2007).
37. H. Kishino, M. Hasegawa, Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29, 170–179 (1989).
38. M. Carr, B. S. Leadbeater, R. Hassan, M. Nelson, S. L. Baldauf, Molecular phylogeny of choanoflagellates, the sister group to Metazoa. Proc. Natl. Acad. Sci. U.S.A. 105, 16641–16646 (2008).
39. P. R. Bergquist, in Invertebrate Zoology, D. T. Anderson, Ed. (Oxford Univ. Press, Victoria, 2001), pp. 10–27.
40. L. Margulis, K. V. Schwartz, Five Kingdoms: An Illustrated Guide to the Phyla of Life on Earth (W. H. Freeman and Company, New York, 1998), pp. 246–249.
41. N. K. Tonks, Protein tyrosine phosphatases: From genes, to function, to disease. Nat. Rev. Mol. Cell Biol. 7, 833–846 (2006).
42. H. Suga, D. Hoshiyama, S. Kuraku, K. Katoh, K. Kubokawa, T. Miyata, Protein tyrosine kinase cDNAs from amphioxus, hagfish, and lamprey: Isoform duplications around the divergence of cyclostomes and gnathostomes. J. Mol. Evol. 49, 601–608 (1999).
43. P. Dehal, J. L. Boore, Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol. 3, e314 (2005).
44. N. King, C. T. Hittinger, S. B. Carroll, Evolution of key cell signaling and adhesion protein families predates animal origins. Science 301, 361–363 (2003).
45. T. L. Simpson, The Cell Biology of Sponges (Springer-Verlag, New York, 1984), pp. 255–340.
46. T. Mustelin, P. Burn, Regulation of src family tyrosine kinases in lymphocytes. Trends Biochem. Sci. 18, 215–220 (1993).
47. J. Sambrook, D. W. Russell, Molecular Cloning: A Laboratory Manual (Cold Spring Harbor Laboratory Press, New York, 2001).
48. S. R. Eddy, A new generation of homology search tools based on probabilistic inference. Genome Inform. 23, 205–211 (2009).
49. C. Burge, S. Karlin, Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).
50. M. Stanke, S. Waack, Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19 (suppl. 2), ii215–ii225 (2003).
51. J. D. Bendtsen, H. Nielsen, G. von Heijne, S. Brunak, Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol. 340, 783–795 (2004).
52. A. Stamatakis, RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688–2690 (2006).
53. K. Katoh, K. Kuma, T. Miyata, Genetic algorithm-based maximum-likelihood analysis for molecular phylogeny. J. Mol. Evol. 53, 477–484 (2001).
54. F. Ronquist, J. P. Huelsenbeck, MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19, 1572–1574 (2003).
55. A. J. Roger, A. G. Simpson, Evolution: Revisiting the root of the eukaryote tree. Curr. Biol. 19, R165–R167 (2009).
56. T. Hunter, Protein kinase classification. Methods Enzymol. 200, 3–37 (1991).

Acknowledgments: The genome sequences of C. owczarzaki, A. macrogynus, S. punctatus, and T. trahens are being sequenced by the Broad Institute, Massachusetts Institute of Technology and Harvard University, under the auspices of the National Human Genome Research Institute. We thank B. Degnan (University of Queensland) for access to the A. queenslandica genome sequence database; K. Worley and her colleagues in the Human Genome Sequencing Center of the Baylor College of Medicine for allowing us to analyze the A. castellanii genome sequence; and F. Lang (University of Montreal) for access to the unpublished genome data of T. trahens, A. macrogynus, and S. punctatus. Funding: H.S. is supported by the Marie Curie Intra-European Fellowship within the 7th European Community Framework Programme. A.d.M. is supported by the grant Formación del Personal Investigador from Ministerio de Ciencia e Innovación awarded to I.R.-T. K.S.-T. is supported by starting grants from the University of Oslo. This study was financially supported by the European Research Council Starting Grant ERC-2007-StG-206883 (to I.R.-T.), grant BFU2008-02839/BMC from Ministerio de Ciencia e Innovación (to I.R.-T.), and NIH grant HG004164 (to G.M.). Author contributions: H.S. and A.d.M. performed the experiments; H.S., M.D., G.M., and A.d.M. performed computational work; H.S. and I.R.-T. designed the experiments; H.S., I.R.-T., and G.M. designed computational analyses; and H.S., I.R.-T., G.M., M.D., and K.S.-T. wrote the paper. Competing interests: The authors declare that they have no competing interests. Data and materials availability: Sequences have been deposited in GenBank under accession numbers AB591048 to AB591054.

Submitted 28 November 2011
Accepted 13 April 2012
Final Publication 1 May 2012
10.1126/scisignal.2002733
Citation: H. Suga, M. Dacre, A. de Mendoza, K. Shalchian-Tabrizi, G. Manning, I. Ruiz-Trillo, Genomic survey of premetazoans shows deep conservation of cytoplasmic tyrosine kinases and multiple radiations of receptor tyrosine kinases. Sci. Signal. 5, ra35 (2012).
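
As an illustration of the motif-based screen mentioned under Data mining (the catalytic-loop motif, typically HRDLAARN/HRDLRAAN), the following minimal Python sketch scans a set of protein sequences for the two quoted motifs. It is not the authors' pipeline: the function name, the dictionary input, and the toy sequences are assumptions made for this example, and exact string matching is a simplification, since the text describes the motifs only as typical and real sequences may deviate from them.

```python
import re

# The two catalytic-loop motifs quoted in the Data mining section.
# Exact matching is an assumption of this sketch; real TKs may vary.
CATALYTIC_LOOP_MOTIFS = ("HRDLAARN", "HRDLRAAN")
MOTIF_PATTERN = re.compile("|".join(CATALYTIC_LOOP_MOTIFS))


def find_catalytic_loop_motifs(proteins):
    """Return {protein_id: [(motif, 0-based start), ...]} for proteins with a hit.

    `proteins` is a hypothetical mapping of identifier -> amino acid sequence.
    """
    hits = {}
    for protein_id, sequence in proteins.items():
        matches = [
            (match.group(), match.start())
            for match in MOTIF_PATTERN.finditer(sequence.upper())
        ]
        if matches:
            hits[protein_id] = matches
    return hits


if __name__ == "__main__":
    # Toy sequences invented for illustration only.
    toy_proteome = {
        "toy_tk": "MKTAYHRDLAARNCLVGE",
        "toy_other": "MKTAYLLLGGG",
    }
    print(find_catalytic_loop_motifs(toy_proteome))
```

Such a screen would only produce candidates; in the study, candidates were further examined with BLAST, HMMER, and manual revision of the gene predictions.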

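
The "cognate" criterion given under Classification of TKs (identical domain organization, or a difference consisting of the loss of one or more copies of a single domain, with repeats of a domain counted as one) can be sketched in Python as below. This is a hedged illustration and not the authors' implementation: architectures are represented as hypothetical ordered lists of domain names, only tandem repeats are collapsed, and the check is symmetric, whereas the study distinguished insertions from deletions by phylogenetic analysis of their timing and also required phylogenetic proximity before assigning cognate TKs to one family.

```python
from itertools import combinations, groupby


def collapse_repeats(domains):
    """Count tandem repeats of a domain as one, e.g. Ig-Ig-Ig becomes Ig."""
    return tuple(name for name, _ in groupby(domains))


def is_cognate(arch_a, arch_b):
    """True if two architectures are identical after collapsing repeats, or
    differ only by the loss of one or more copies of a single domain type."""
    a, b = collapse_repeats(arch_a), collapse_repeats(arch_b)
    if a == b:
        return True
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)
    # Try removing every non-empty subset of copies of one domain type.
    for domain in set(longer):
        positions = [i for i, name in enumerate(longer) if name == domain]
        for k in range(1, len(positions) + 1):
            for dropped in combinations(positions, k):
                kept = [n for i, n in enumerate(longer) if i not in dropped]
                if collapse_repeats(kept) == shorter:
                    return True
    return False


if __name__ == "__main__":
    # Five versus seven Ig-like repeats: cognate, as in the text's example.
    print(is_cognate(["Ig"] * 5 + ["TM", "KD"], ["Ig"] * 7 + ["TM", "KD"]))
    # Two different domain types lost: not cognate under this sketch.
    print(is_cognate(["SH3", "SH2", "KD"], ["KD"]))
```

The exhaustive subset search is exponential in the number of copies of a domain, which is acceptable here only because domain architectures are short.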


Results R3

Summary of article R3: The evolution of the GPCR signalling system in eukaryotes: modularity, conservation and the transition to metazoan multicellularity.

The G protein-coupled receptor (GPCR) signalling system is one of the main signalling pathways in eukaryotes. Here we analyse the evolutionary history of all its components, from receptors to regulators, to obtain a general picture of its system-level evolution. Using data from eukaryotic genomes covering the greatest diversity available, we find that the various components have evolved independently, highlighting the modular nature of the GPCR signalling pathway. Our data show that some families of GPCRs, G proteins and regulators of G proteins (RGS) have undergone lineage-specific diversifications, recurrently using the same protein domains. In addition, we find that most of the gene families involved in the GPCR signalling system were already present in the last common ancestor of all eukaryotes (LECA). We also show that the unicellular ancestor of metazoans already had most of the cytoplasmic components of the GPCR signalling system, including all the G protein alpha subunit families typical of metazoans. Therefore, the transition to multicellularity involved the conservation of the signal transduction machinery, while there was a burst in the number of receptors, probably to cope with the new needs of the multicellular lifestyle.



Title: The evolution of the GPCR signalling system in eukaryotes: modularity, conservation and the transition to metazoan multicellularity

Alex de Mendoza (a,b), Arnau Sebé-Pedrós (a,b), Iñaki Ruiz-Trillo (1,a,b,c)

a. Institut de Biologia Evolutiva (CSIC-Universitat Pompeu Fabra) Passeig Marítim de la Barceloneta, 37–49 08003 Barcelona, Spain

b. Departament de Genètica, Universitat de Barcelona Av. Diagonal 645 08028 Barcelona, Spain

c. Institució Catalana de Recerca i Estudis Avançats (ICREA) Passeig Lluís Companys, 23 08010 Barcelona, Spain

1Corresponding author:

Iñaki Ruiz-Trillo

Abstract

The G Protein Coupled Receptor (GPCR) signalling system is one of the main signalling pathways in eukaryotes. Here we analyse the evolutionary history of all its components, from receptors to regulators, to gain a broad picture of its system-level evolution. Using eukaryotic genomes covering most lineages sampled to date, we find that the various components evolved independently, highlighting the modular nature of the GPCR signalling pathway. Our data show that some GPCR families, G proteins and Regulators of G proteins (RGS) underwent lineage-specific diversifications and recurrent domain shuffling. Moreover, most of the gene families involved in the GPCR signalling system were already present in the Last Eukaryotic Common Ancestor (LECA). Furthermore, we show that the unicellular ancestor of Metazoa already had most of the cytoplasmic components of the GPCR signalling system, including, remarkably, all of the G protein alpha subunit families typical of metazoans. Thus, the transition to multicellularity involved conservation of the signal transduction machinery, together with a burst of receptor diversification to cope with the new needs of multicellular life.

Introduction

A molecular system to receive and transduce signals from the environment or from other cells of the same organism is key to the organization of a complex multicellular organism (Gerhart 1999; Pires-daSilva and Sommer 2003), although molecular signalling pathways are not only required within a multicellular context. Unicellular species face similar signalling needs to multicellular organisms, dealing with a changing environment and, in some cases, coordinating different cells (e.g. density sensing) (Crespi 2001; King 2004; Rokas 2008).

Both animals (metazoans) and plants have evolved complex signalling pathways to govern their embryonic development, and, according to current genomic data, some of these pathways appear to be specific to either metazoans or plants. This is the case of the metazoan-specific WNT and Hedgehog signalling pathways (Ingham et al. 2011; Niehrs 2012) and the land plant-specific Auxin and Cytokinin (Rensing et al. 2008). Other signalling pathways, such as the metazoan Notch pathway, have instead been assembled from various, more ancient components by domain-shuffling (Gazave et al. 2009). Others were already present in the unicellular ancestors and were subsequently co-opted for multicellular functions. A good example is the receptor tyrosine kinases, which emerged and expanded in unicellular holozoans (i.e., choanoflagellates and filastereans), and were later recruited for developmental control in metazoans (King et al. 2008; Manning et al. 2008; Suga et al. 2012). The re-use of previously assembled signalling systems is indeed an important mechanism of signalling pathway co-option in multicellular lineages (King et al. 2008).

One of the major eukaryotic signalling pathways is that of the G Protein Coupled Receptors (GPCRs) and their associated signalling modules (Fritz-Laylin et al. 2010; Anantharaman et al. 2011; Krishnan et al. 2012), which are conserved from excavates to animals. GPCRs are involved in many processes apart from developmental control, such as cell growth, migration, density sensing or neurotransmission (Bockaert and Pin 1999; Pierce et al. 2002; Rosenbaum et al. 2009). GPCRs are able to sense a wide diversity of signals, including proteins, nucleotides, ions and photons. Structurally, GPCRs have a 7 transmembrane domain (they are also known as 7TM receptors), which forms a ligand-binding pocket in the extracellular region, and a cytoplasmic G-protein-interacting domain (Pierce et al. 2002; Lagerström and Schiöth 2008), which binds to G proteins to mediate intracellular signalling. G proteins form a heterotrimeric complex that is disassembled when activated by the GPCR, and transduce the signal into downstream effectors (Oldham and Hamm 2008). The G protein heterotrimeric complex has three different subunits of distinct evolutionary origin: alpha, beta and gamma. G protein heterotrimeric signalling is, in turn, regulated by various protein families, including RGS and GoLoco-motif-containing proteins (Pierce et al. 2002; Siderovski and Willard 2005; Wilkie and Kinch 2005). The combination of GPCRs, G proteins and their regulators results in many diverse signalling outputs.

Besides the classic GPCR-G protein signalling system described above, there are alternative upstream and downstream molecules (Figure 1). For instance, monomeric G protein alpha activation by Ric8 (Resistance to inhibitors of cholinesterase 8) is GPCR-independent (Wilkie and Kinch 2005; Hinrichs et al. 2012), and beta-gamma heterodimers are regulated via Phosducins (Willardson and Howlett 2007). Complementarily, GPCRs can perform downstream signalling independently of G proteins through G protein-coupled Receptor Kinases (GRK) and Arrestins (Gurevich and Gurevich 2006; Reiter and Lefkowitz 2006; DeWire et al. 2007; Liggett 2011; Shenoy and Lefkowitz 2011).

Figure 1. Distribution and abundance of GPCR signalling components in 78 eukaryotic genomes. Domain numbers and abundance are depicted according to the colour legend in the upper-left. Black boxes indicate absence of the domain in a given taxon. The various domains are grouped into functional modules, as shown in the schema at the bottom-right.

Most of the proteins involved in the GPCR signalling pathway have previously been analysed as single units in various phylogenetic contexts (Blaauw et al. 2003; Fredriksson and Schiöth 2005; Alvarez 2008; Oka et al. 2009; Anantharaman et al. 2011; Krishnan et al. 2012; Mushegian et al. 2012). However, not much attention has been paid to the system-level evolution of the entire pathway, and given the modularity of the system, it is important to investigate its evolution from a global point of view. In this paper, we provide an update on the evolutionary histories of all components of the GPCR signalling system using a genomic survey that includes representatives of all eukaryote supergroups. We analyse the modular structure of the signalling pathway and show how different parts of the system co-evolved in complementary or independent patterns. We also reconstruct the GPCR signalling system in the Last Common Ancestor of Eukaryotes (LECA) and track its evolution in various lineages. Finally, we analyse the evolution of the system in the transition from unicellular ancestors to metazoans. We observe strong conservation in the pathway components associated with cytoplasmic signal transduction, while receptors radiated extensively in metazoans, becoming one of the largest gene families in metazoan genomes (Fredriksson and Schiöth 2005). The dissimilarity between the pattern of evolution in pre-adapted signal transduction machinery and active diversification of receptors provides clues on how key innovations in metazoan complexity could have evolved from pre-existing machineries.

Results

GPCR families: ancient origins and architecture diversifications

A widely-accepted classification of the metazoan GPCR complement is the GRAFS system, which is based on both phylogeny and structural similarity (Fredriksson et al. 2003; Fredriksson and Schiöth 2005; Lagerström and Schiöth 2008; but see Pierce et al. for an alternative classification). The GRAFS system divides GPCRs into 5 different families: Glutamate (also known as Class C), Rhodopsin (Class A), Adhesion (Class B), Secretin (Class B), and Frizzled (Class F). This system can be extended to GPCR types described in non-metazoans, including the cAMP (Class E), ITR-like and GPR-108-like families, as well as several lineage-specific receptor families such as insect odorant receptors, nematode chemoreceptors or vertebrate vomeronasal receptors (Nordström et al. 2011). Fungi also have well defined GPCR families such as Ste2 and Ste3 (both included in Class D), and Git3 and plant Abscisic acid receptors are also thought to be GPCRs (Plakidou-Dymock et al. 1998; Tuteja 2009; Krishnan et al. 2012). Most GPCR families are associated with a characteristic PFAM domain (Fredriksson et al. 2003; Fredriksson and Schiöth 2005; Lagerström and Schiöth 2008).

First, we assessed the presence and abundance of GPCR family domains in diverse eukaryotic genomes (see Figure 1 for a complete taxon sampling). Our data show that the distribution of GPCR families in eukaryotes follows two distinct evolutionary patterns. Some families are pan-eukaryotic, while others are biased towards unikonts. For instance, GRAFS are more abundant in unikonts (or amorpheans), especially in metazoans, although some (Glutamate, Adhesion/Secretin and Rhodopsin) are also observed in some bikonts. Other families, such as cAMP receptors, Git3, ITR-like, GPR-108-like and Abscisic acid receptors, are found in similar abundance among eukaryotes. Interestingly, non-GRAFS GPCR families are never expanded in any species (<10 members in all genomes). We also surveyed the taxonomically restricted metazoan families, and, although we found chemosensory receptors (7tm_7) and the Serpentine type chemoreceptors Srw and Srx in some previously unreported metazoan genomes, none were observed in non-metazoan eukaryotes (supplementary figure S1), with the exception of OA1 (Ocular Albinism receptor), which is specific to metazoans and Capsaspora owczarzaki. These results indicate that most GPCR families have ancient origins in the last eukaryotic common ancestor.

Diversification of ancient GPCR families is usually accompanied by architectural diversification of the N-terminal protein domain (Lagerström and Schiöth 2008). Thus we analysed the architectural diversity of each GPCR family in each genome (supplementary figure S2), and observed two types of GPCRs in terms of N-terminal domain diversity. Some,

[Figure 2 graphic: presence/absence matrix of domain architectures. Column groups: G Protein Coupled Receptors (Rhodopsin, Adhesion/Secretin, Glutamate, Frizzled) and G Protein Regulators (Regulators of G protein Signaling, GoLoco). Rows: Metazoa, Choanoflagellata, Filasterea, Ichthyosporea, Fungi, Amoebozoa, Embryophyta, Chlorophyta, Naegleria, Trichomonas, Heterokonta, Bigelowiella, Alveolata, Emiliania. Percentage row, as listed: 95.6, 2.7, 1.4, 34.9, 22.3, 9.3, 5.1, 32.3, 33.6, 22, 2.9, 2.9, 75, 69, 5.6, 3.9, 2.7, 2.6, 1.6, 36.9, 23.7. Total number of genes containing the domain, as listed: 4362, 592, 682, 84, 1108, 38.]

Figure 2. Conservation of the domain architecture of different GPCR signalling components across eukaryotic genomes. A black dot indicates the presence of a given domain architecture. A white dot refers to a similar domain architecture, with Tyrosine Kinase instead of Serine/Threonine kinase in the case of choanoflagellate GRK-like genes. For simplicity, only the most common architectures are shown. The percentage of genes found with a given architecture within a family is indicated at the bottom part of the table, as well as the total number of genes within the family. The complete domain architectures of the GPCR signalling system components are found in supplementary figures 3, 4 and 6.

such as Glutamate, Adhesion/Secretin, and, to a lesser extent, Rhodopsin are susceptible to the recruitment of new domains in the N-terminal region, especially in Metazoa, while others, such as cAMP, Git3, OA1, Abscisic acid receptors, GPR108-like and ITR-like, have substantially lower diversity of protein domains at the N-terminal. This result suggests that some GPCR families have functional constraints while others are prone to diversify through recruitment of concurrent domains.

To gain further insights into domain diversification, we searched for evolutionary conservation of specific protein domain architectures (Figure 2), and found that some architectures are highly conserved across lineages. For example, Glutamate receptors (7tm_3) have protein domain configurations that are conserved in distant eukaryotic lineages, including those with Venus Flytrap module (ANF_receptor), OpuAC or Bmp domains (Figure 2). Additionally, several non-metazoan species have diversified their own species-specific configurations of glutamate receptors (supplementary figure S3). The Adhesion family is also quite structurally diverse, especially in metazoans and, to a lesser extent, unicellular holozoans (supplementary figure S3). Similarly, the Rhodopsin family is architecturally diversified, mainly in metazoans. Finally, Fz-Frizzled, RpkA (cAMP-PIP5K domain architecture) and Git3-Git3_C protein domain architectures could be identified in several eukaryotic genomes (supplementary figure S4), extending the previously known distribution of those architectures to the LECA or to the root of unikonts. Remarkably, most of the complex GPCR architectures belong to GRAFS families and are mostly diversified and conserved within metazoans.

Heterotrimeric G protein Complex

GPCRs typically signal through G proteins. In an inactive state, the three G protein subunits (alpha, beta and gamma) form a heterotrimeric complex (Pierce et al. 2002; Oldham and Hamm 2008) (Figure 1). When a ligand activates a GPCR, the receptor acts as a Guanine nucleotide Exchange Factor (GEF), promoting GDP to GTP exchange in the G alpha subunit. This exchange alters the conformation of the G alpha subunit and promotes the dissociation of the heterotrimeric complex. The active G alpha subunit and an active dimer of beta and gamma subunits mediate further downstream signalling through various effectors (Milligan and Kostenis 2006; Oldham and Hamm 2008). G alpha is a low-efficiency GTPase, while G beta has various WD-40 repeats (PF00400) and G gamma is a small protein containing a conserved domain (Milligan and Kostenis 2006; Anantharaman et al. 2011).

Using the signature domains of each G protein, we surveyed our dataset to find their general distribution patterns, and found that the abundance of each subunit varies markedly across eukaryotes, and that some taxa have lost these three subunits entirely (Anantharaman et al. 2011). G protein alpha is the most susceptible to diversification, and, interestingly, beta and gamma subunits have multiple copies in G alpha rich species. Although the combination of the three elements is important for signalling plasticity, G alpha is the most evolutionarily dynamic of the three G proteins.

To gain further insights into the evolution of G alpha proteins, we performed phylogenetic analyses using our eukaryotic dataset (Figure 3), and the resulting tree shows that several groups have lineage-specific diversifications, such as those in Naegleria gruberi, Bigelowiella natans, and Emiliania huxleyi. The opisthokonts have a diverse but conserved repertoire of G proteins. Fungi have four distinct paralogs (GPA-1 to 4) present in most fungal lineages (families reviewed in Li et al. 2007), which were therefore most likely present in the fungal ancestor. Holozoa also have four ancient paralogs, Gαs, Gαq/12/13, Gαi/o and Gαv (described for Metazoa in Oka et al. 2009). It is worth mentioning that all of the metazoan G alpha families are conserved in the unicellular relatives of Metazoa, indicating that they originated prior to the divergence of metazoans from the rest of holozoans.

[Figure 3 graphic: G protein alpha subunit tree. Clade labels: GPA-1, GPA-2, GPA-3, GPA-4, Gαi, Gαo, Gαi/o, Gαq, Gα12/13, Gαq/12/13, Gαs, Gαv, Opisthokonta group, Excavata group, Amoebozoa groups, Embryophyta groups, Eukaryotic group. Colour legend: Metazoa, Unicellular Holozoa, Fungi, Amoebozoa, Excavata, Heterokonta, Rhizaria, Haptophyta, Embryophyta. Scale bar: 0.5.]

Figure 3. Maximum likelihood (ML) phylogenetic tree of the G protein alpha subunit. Different eukaryotic lineages are represented by a colour code depicted in the legend. Within the gene family clades, the specific taxonomic groups which comprise eukaryotic lineages represented in that clade (e.g. eumetazoans, placozoans) are shown on the right. Nodal supports indicate 100-replicate ML bootstrap support and Bayesian Posterior Probability (BPP). Supports are only shown for nodes recovered by both ML and Bayesian inference, with BPP>0.9.

We also identified a new and divergent family of holozoan G alpha subunits that branches out from the Opisthokonta clade, comprising Nematostella vectensis, Lottia gigantea and other holozoans (Figure 3). Additionally, we observed a cluster of conserved G alpha subunits in several distant eukaryotic lineages: Ichthyosporea, Allomyces macrogynus and dictyostelids within the Unikonta, and B. natans and Ectocarpus siliculosus within the Bikonta. It is likely that this particular family originated in the LECA and was lost many times during eukaryotic evolution.

We also performed a phylogenetic analysis of eukaryotic beta subunits, in order to compare the evolutionary histories of alpha and beta (supplementary figure S5). Our tree shows that holozoans have a particular ancient duplication, Gβ1-4 and Gβ5, with the more derived Gβ5 known to interact with G gamma-like subunits, such as RGS7 (Sondek and Siderovski 2001; Anderson et al. 2009), a multi-domain protein that contains a G gamma domain. We identified RGS7 in both chytrid fungi and holozoans (Figure 2 and supplementary figure 6), and therefore the ancient duplication of G protein beta, as well as its partner RGS7, are ancient features of holozoans.

Regulatory proteins: RGS and GoLoco

Regulation of G proteins is a key step in GPCR signalling that involves two main protein families, RGS (Regulators of G protein Signalling) and GoLoco motif-containing proteins (Siderovski and Willard 2005; Wilkie and Kinch 2005). RGS proteins act as GTPase-accelerating proteins (GAP), accelerating the hydrolysis of GTP to GDP and thereby promoting the formation of the G protein heterotrimeric complex and completing G alpha signalling (Siderovski and Willard 2005). Nevertheless, not all RGS domains act as GAP proteins in G protein signalling, and some have lost their GAP activity and have developed scaffolding functions (Anantharaman et al. 2011). GoLoco-motif-containing proteins (also known as G Protein Regulators) act as guanine nucleotide dissociation inhibitors, inhibiting the dissociation of the heterotrimeric complex by binding to G alpha-GDP and blocking downstream signal transduction (Siderovski and Willard 2005).

We traced the distribution and abundance of RGS and GoLoco motif proteins in eukaryotes, and found that RGS is present in many different eukaryotes, mainly coinciding with the presence of heterotrimeric subunits (Figure 1). The number of RGS varies from one single copy in some taxa to numerous copies in other lineages. For example, some eukaryotes such as Naegleria gruberi (229), Bigelowiella natans (39), Ectocarpus siliculosus (47) or the ichthyosporeans (22 to 119) have more RGS proteins than Homo sapiens (34), while other multicellular lineages such as plants possess only one copy. In contrast, the GoLoco motif appears to be exclusive to metazoans and choanoflagellates (Figure 1), and while its copy number may vary, it is less abundant than RGS. Therefore, our data show that the eukaryotic RGS system underwent independent radiations in many lineages, while GoLoco is a later development that originated prior to the divergence of choanoflagellates and metazoans.

We then examined the architectural configurations of RGS proteins, since they are known to combine with many other protein domains (Siderovski and Willard 2005; Anantharaman et al. 2011). Our survey shows that distant lineages evolved their own architectural repertoires, and generally have unique configurations that are not found elsewhere (supplementary figure S6). Moreover, many configurations evolved independently, recruiting the same domain in different configurations. For example, DEP, cNMP binding, Kinases, Rho GTPase, Leucine Rich Repeat (LRR), START, and Ankyrin repeats are all present in various combinations in RGS genes from divergent taxa. However, some domain architectures are evolutionarily conserved (Figure 2 and supplementary figure S6), such as in opisthokonts, which share some common RGS architectures, mainly Sorting Nexins (SNX13/14/25) and the previously mentioned RGS7. Additionally, the RGS-like domain, typical of PDZ-RhoGEF, is an innovation of Holozoa (Figure 1), while RGS12 and Axin are metazoan innovations (supplementary figure S6). Our results emphasize that metazoans and their unicellular relatives have conserved elements of the RGS complement, which is quite susceptible to diversification through domain rearrangements.

Of specific interest are RGS proteins with transmembrane (TM) domains (Anantharaman et al. 2011; Urano et al. 2012), as they localize to the cell membrane next to heterotrimeric G proteins. We found that in most lineages RGS is fused to at least one TM domain (supplementary figure S7). In plants and other eukaryotes RGS domains have been observed together with 7TM organizations, somehow resembling a GPCR but with the opposite effect on G proteins (Urano et al. 2012). We found that chytrid fungi, filastereans and ichthyosporeans have this type of receptor, while metazoans do not, suggesting that metazoans dispensed with GAP transmembrane signalling and focused on typical GPCR signalling.

GoLoco motif-containing proteins are also part of multi-domain proteins. Our results show that choanoflagellates have a unique configuration (SH2-GoLoco) and that metazoans have some conserved architectures, such as RGS12 and RGS14 (Figure 2 and supplementary figure S6).

Upstream alternative regulators: Ric8 and Phosducin

Ric8 is a long domain that acts as a GEF (Guanine nucleotide Exchange Factor), activating G alpha subunits in the absence of GPCR signalling, or as a chaperone to stabilize G alpha (Hinrichs et al. 2012; Chan et al. 2013). Ric8-mediated activation of monomeric G alpha is involved in development and signalling in metazoans, fungi and Dictyostelium (Hinrichs et al. 2012; Kataria et al. 2013). While we found Ric8 in almost all unikonts, suggesting it was secondarily lost in some species (Figure 1), it is rare in bikonts, and found only in a small number of Heterokonta. The presence of Ric8 in only a few heterokonts could be explained by horizontal gene transfer, although our phylogenetic analysis does not support this hypothesis (supplementary figure S8), but suggests instead that Ric8 was present in the LECA, and secondarily lost in many eukaryotic lineages.

Phosducins belong to a small and ancient gene family, Phosducin-like (Blaauw et al. 2003; Willardson and Howlett 2007), and act as co-chaperones of the G beta/gamma dimers, allowing normal dimer configuration and transiently inhibiting their junction with G alpha (Willardson and Howlett 2007). We performed a phylogenetic analysis of Phosducin-like proteins, and the resulting tree shows three major clades: Phosducin I, Phosducin II/III, and orphan phosducin (supplementary figure S9). The only one known to interact with G protein beta subunits is the Phosducin-I or Phosducin/PhLP1 clade (Blaauw et al. 2003), and this is further reinforced by the fact that most species that have Phosducin I proteins also possess the heterotrimeric beta subunit. Conversely, the Phosducin II/III clade includes chlorophyte sequences, a group that lacks G protein signalling. This suggests that proteins belonging to the Phosducin II/III clade have substrates other than G proteins (Willardson and Howlett 2007).

Alternative signalling inputs: GRK and Arrestins

GPCRs can also signal independently of G proteins, which is mainly achieved through interactions with G protein coupled Receptor Kinases (GRKs) and Arrestins, where Arrestins can either antagonize G protein signalling or connect GPCRs to other signalling modules (Gurevich and Gurevich 2006; Reiter and Lefkowitz 2006; DeWire et al. 2007; Shenoy and Lefkowitz 2011). GRKs have an active kinase domain and an inactive RGS domain, which allows them to scaffold with GPCRs. Like other kinases (e.g. PKC and PKA), GRKs phosphorylate activated GPCRs in a process called desensitization, inhibiting the GPCR and allowing Arrestin binding. Arrestin binding promotes receptor internalization by endocytosis, which can result in ubiquitination or recycling of the GPCR (Pierce et al. 2002; Gurevich and Gurevich 2006; DeWire et al. 2007). Additionally, Arrestins can also act as adaptors for other signal transduction pathways such as MAPK or Akt (DeWire et al. 2007). Thus, understanding the evolutionary dynamics of Arrestin/GRK signalling is key to building a complete picture of GPCR signalling.

We found that GRK proteins are present in a reduced subset of eukaryotes, including Holozoa, Dictyostelida, Heterokonta and Haptophyta (Mushegian et al. 2012) (supplementary figure S10). Our phylogenetic analysis supports the duplication of GRKa and GRKb paralog groups at the root of Holozoa, as some sequences belonging to filastereans and ichthyosporeans branch within the GRKa clade (supplementary figure S10). Nevertheless, some RGS-kinase architectures may be convergent, such as the case where choanoflagellate RGS is fused to a Tyrosine Kinase domain. While the absence of GRK in many GPCR rich genomes is not surprising, since other kinases can replicate this function, holozoans retained two paralogs of this specialized kinase.

While GRKs are rather scarce in eukaryotes, Arrestins are broadly distributed, and our survey shows that most eukaryotes have a variable number of Arrestins (Figure 1). To gain insights into the evolutionary history of Arrestins, we performed a phylogenetic analysis and identified three major clades, though with low nodal support (supplementary figure S11). One clade includes metazoan beta Arrestins, as well as several sequences from unicellular holozoans. The tree also shows a large lineage-specific expansion in ciliophorans, some fungal clades, and the metazoans Caenorhabditis elegans, Drosophila melanogaster and Trichoplax adhaerens. Interestingly, Arrestin sequences from all three clades are known to interact with GPCRs (Alvarez 2008), and therefore their presence and expansion suggest a complementary system to G protein signalling.

GPCR signalling system

After addressing the evolutionary histories of the various components of GPCRs and their signalling modules, we analysed them at system level by reducing the diversity of molecules into the main functional categories and analysing their co-evolution (Figure 4A). Our data show that some lineages have most of the components of the GPCR signalling system, while others, such as Giardia lamblia and the microsporidians, are completely reduced. Other lineages have retained only a subset of the components involved in GPCR signalling, which challenges general views on the basic mechanics of the system. First, Abscisic acid receptors (PF12430) and GPR-108-like (PF06814) are present in genomes where most of the GPCR signalling system has been lost (such as Cyanidioschyzon merolae and Leishmania major, see Figure 1), which implies that their role as GPCRs is doubtful, as previously suggested (Maeda et al. 2008; Anantharaman et al. 2011).
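The reduction of domain counts into functional categories can be illustrated with a small sketch. This is not the procedure used in the study, which is not described at this level of detail: the module names, the grouping of domain labels, the three-way labels (loosely modelled on the legend of Figure 4A) and the example counts are all assumptions made for illustration.

```python
# Hypothetical grouping of domain labels into functional modules.
# The labels are illustrative placeholders, not the identifiers used in the study.
MODULES = {
    "receptor": ["7tm_1", "7tm_2", "7tm_3", "Frizzled", "cAMP_receptor"],
    "g_alpha": ["G_alpha"],
    "g_beta": ["G_beta"],
    "g_gamma": ["G_gamma"],
    "regulator": ["RGS", "GoLoco"],
    "arrestin": ["Arrestin"],
}


def module_profile(domain_counts):
    """Collapse per-domain gene counts into presence/absence per module."""
    return {
        module: any(domain_counts.get(domain, 0) > 0 for domain in domains)
        for module, domains in MODULES.items()
    }


def classify(profile):
    """Label a genome by how many functional modules it retains."""
    present = sum(profile.values())
    if present == len(profile):
        return "complete"
    if present == 0:
        return "reduced"
    return "simplified"


# Invented counts for a toy genome with receptors and Arrestins
# but no heterotrimeric G protein subunits.
toy_genome = {"cAMP_receptor": 4, "7tm_2": 1, "Arrestin": 2}
print(classify(module_profile(toy_genome)))  # prints "simplified"
```

A presence/absence profile of this kind makes it easy to spot genomes that keep receptors but lack G proteins, or the reverse, which is the sort of pattern the modularity argument relies on.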

Figure 4. A) Schematic representation of the functional modules in eukaryotic lineages that were analysed in the study. Green boxes indicate presence, black boxes absence, and yellow boxes presence with some simplification or uncertain affiliation. Black dots in the Arrestin and Phosducin rows indicate the presence of orthologs of a subfamily (beta-arrestins and Phosducin-I clade), as discussed in the main text. In the upper part of the table, green dots indicate a complete GPCR signalling system, red dots indicate full reduction, and yellow dots indicate severe simplification but with some conserved functional modules. B) Principal component analysis showing the clustering of eukaryotic genomes according to GPCR signalling components. The two principal components displayed account for 26.4% (PC1) and 12.65% (PC2) of variation. Colour coding of dots according to taxonomic grouping is represented in the colour legend on the right.
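The percentages quoted for PC1 and PC2 are fractions of total variance explained by each component. A minimal sketch of how such a fraction arises is given below. The text does not state how the PCA was computed or whether counts were scaled beforehand, so the toy matrix, the function names and the use of power iteration in place of a full eigendecomposition are all assumptions.

```python
import math


def center(rows):
    """Subtract the column mean from every entry."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    return [[row[j] - means[j] for j in range(m)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centred data."""
    n, m = len(centered), len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(m)]
        for i in range(m)
    ]


def first_component(cov, iterations=200):
    """Approximate the leading eigenvector and eigenvalue by power iteration."""
    m = len(cov)
    vector = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0:  # no variance at all
            break
        vector = [x / norm for x in product]
    eigenvalue = sum(
        vector[i] * sum(cov[i][j] * vector[j] for j in range(m)) for i in range(m)
    )
    return vector, eigenvalue


def explained_fraction(cov, eigenvalue):
    """Share of total variance (trace of the covariance) held by one component."""
    return eigenvalue / sum(cov[i][i] for i in range(len(cov)))


# Toy matrix: rows are genomes, columns are counts of two hypothetical components.
toy_counts = [[1, 2], [2, 4], [3, 6], [4, 8]]
cov = covariance(center(toy_counts))
pc1, eigenvalue = first_component(cov)
fraction = explained_fraction(cov, eigenvalue)
```

In this toy matrix the two columns are perfectly proportional, so a single component carries essentially all of the variance. Real count data would spread the variance over more components, as in the figure.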

Furthermore, there are other taxa in which some GPCRs are present, even though the heterotrimeric complex is absent (or partially absent). For example, the apusozoan Thecamonas trahens, which lacks heterotrimeric subunits, has four cAMP receptors and one Adhesion receptor, all of which are canonical GPCRs. Similarly, ciliophorans, which only have the G protein subunit beta, have members of Rhodopsin, Adhesion, cAMP and ITR-like receptors. Interestingly both T. trahens and ciliophorans have Arrestins, in high numbers in the latter group, suggesting that Arrestins might provide an alternative link between GPCRs and other signal transduction pathways in those lineages. This is not the case in Guillardia theta, however, which has cAMP and ITR-like GPCRs but neither G proteins nor Arrestins. All of these data suggest that GPCRs might be connected to alternative signalling modules other than G proteins.

The modularity of the GPCR signalling system is further supported by the fact that various G protein subunits can be found independently of the other subunits. For example, the G alpha subunit, but not the G beta and gamma subunits, is present in Trichomonas vaginalis and Cyanophora paradoxa. The former has

[Figure 5 graphic: cladogram of eukaryotes. Tips: Metazoa (Bilateria, Cnidaria, Placozoa, Ctenophora, Porifera), Choanoflagellata, Filasterea, Ichthyosporea, Fungi (Dikarya, other Fungi), Apusozoa, Amoebozoa, Embryophyta, Chlorophyta, Rhodophyta, Glaucophyta, Cryptophyta, Excavata, Haptophyta, Heterokonta, Alveolata, Rhizaria. Named groups: Holozoa, Opisthokonta, Unikonta, Viridiplantae, Archaeplastida, SAR, Bikonta, LECA. Component legend: G protein Alpha subunit, G protein Beta subunit, G protein Gamma subunit, RGS, Ric8, Phosducin, Arrestin. GPCR type legend: 7tm_1/Rhodopsin, 7tm_2/Adhesion/Secretin, 7tm_3/Glutamate, cAMP receptor, Git3, Abscisic acid (ABA) GPCR, GpcrRhopsn4/ITR/GPR180, Lung_7-TM_R/GPR108-like.]

Figure 5. Cladogram representing the major patterns of evolution of GPCR signalling components in a eukaryotic phylogeny. Coloured boxes indicate specific components defined by a domain. Dashed coloured boxes refer to secondary loss of the given domain. Family names next to coloured boxes refer to specific gene family acquisitions within a domain. Green and red boxes depict gain and loss of GPCR types. Blue boxes depict significant enrichments of the component shown, according to a Wilcoxon rank-sum test with a p-value threshold of <0.01. Additionally, a selected set of conserved GPCR architectures is placed where they must have appeared according to Dollo parsimony.

RGS proteins, which, in the absence of GPCRs and two of the components of the heterotrimeric complex, may be interacting with other signalling pathways. Ciliophorans only have the G beta subunit, but have several Phosducin-like genes, which may also imply that ciliophorans have co-opted Phosducin and G protein beta into a distinct function. Additionally, T. trahens has an RGS protein with no obvious function due to the absence of G alpha subunits. Thus, the evolutionary conservation of some components in simplified genomes underpins the modular plasticity of the GPCR signalling system.

We also performed a Principal Components Analysis of our eukaryote dataset with the aim of elucidating different evolutionary tendencies (Figure 4B). We observed at least three clusters among eukaryotes that illustrate different patterns of evolution: expansion, simplification and conservation of the GPCR signalling system. Principal Component 1 (PC1) is principally loaded by the core functional categories of the GPCR signalling system, clustering the most simplified taxa together, including strict parasites such as microsporidians, G. lamblia, trypanosomatids, Perkinsus marinus or apicomplexans. Interestingly, many autotrophic lineages, such as Archaeplastida and Cryptophyta, also have a considerably reduced complement of GPCRs. On the other hand, PC2 differentiates between the two kinds of diversification of the GPCR signalling system. In a cluster characterized by the loading of G alpha and beta subunits, RGS, and cAMP receptors we find some ichthyosporeans, N. gruberi, B. natans and Allomyces macrogynus. Metazoans are differentiated in PC2 by the presence of 7tm1, 7tm2, GoLoco and Frizzled. Therefore our data indicate that the composition of the GPCR signalling system evolved repeatedly towards a more complex pathway in various eukaryotic lineages. In particular, metazoans developed a more complex system through the expansion of GPCR signalling components.

Reconstruction of GPCR signalling components in LECA

We reconstructed the evolutionary histories of the various modules throughout the eukaryotic branch of the tree of life (Figure 5) using the unikont-bikont root for eukaryotes (Derelle and Lang 2012) and taking into account the topology from the most recent phylogenomic studies (Brown et al. 2012; Burki et al. 2012; Torruella et al. 2012). Our data show that most GPCR families are ancient, and that some of the specific architectures of each family can be traced back to the eukaryotic ancestor. Therefore, the LECA already had a complex GPCR signalling system, as well as many other diversified gene families (Derelle et al. 2007; Fritz-Laylin et al. 2010; Wickstead et al. 2010; Grau-Bové et al. 2013).

Discussion

Our genomic survey and evolutionary reconstruction show that the LECA had a complex repertoire of GPCRs (Figure 5). Independent expansions of the GPCR signalling system occurred in some eukaryotic lineages, and, interestingly, most of the species that have these expansions are unicellular or colonial, such as B. natans, N. gruberi and ichthyosporeans (Figure 4). This supports the view that unicellular lifestyles also require complex signalling machineries (Crespi 2001). In fact, multicellular fungi such as the basidiomycete Coprinus cinereus and the ascomycete Tuber melanosporum have simpler complements of GPCRs than other fungal lineages, including chytrids and Mucoromycotina. Similarly, embryophytes possess a reduced GPCR signalling system. Of course, other signalling pathways are also present in eukaryotes, such as Histidine kinases, Serine/Threonine kinases or Tyrosine Kinases (Anantharaman et al. 2007; Schaller et al. 2011; Suga et al. 2012), and these can have more important roles in the taxa where GPCR signalling is simplified.

An important conclusion from our work is the modularity of the system. We find that some species have GPCRs without G proteins and vice versa, and we also show how different parts of the GPCR signalling system evolved independently so that different functional categories involved in the pathway can become simplified without altering the others, as has been hinted at in other studies (Wilkie and Kinch 2005; Anantharaman et al. 2011). In addition, some parts of the pathway have diversified, both in terms of gene number and domain architecture, while other elements remain conserved. All of this evidence suggests that the system is plastic, and that drastic rearrangements can occur without complete loss of functionality. This robustness of eukaryotic signalling systems has been compared to the simpler and more direct signalling systems of prokaryotes (Anantharaman et al. 2007), and indeed modularity is a key feature of eukaryotic signalling pathways, which show great diversity of signalling machineries across different lineages (Anantharaman et al. 2007; Schaller et al. 2011).

Modularity is not only observed in how the various elements of the GPCR signalling pathway evolve, but also at the level of protein domain architectures. Overall, our results on domain architectures clearly show that domain shuffling is a major mechanism of signalling system evolution. Indeed, pervasive convergent evolution of domain arrangements is a major feature of both GPCRs and RGS proteins (Nordström et al. 2009; Anantharaman et al. 2011; Krishnan et al. 2012). However, since not all GPCR families are equally susceptible to acquiring new domains, functional constraints might also exist that prevent this evolutionary mechanism of innovation.

Figure 6. Chart plot depicting the median number of GPCR signalling components in opisthokont lineages. Total numbers of G protein alpha, beta and gamma subunits are included in the Heterotrimeric G proteins category, RGS and GoLoco-motif-containing proteins are included in the Regulators of G proteins category, and GPCR types presented in Figure 1 are included in the GPCR category.

Regarding the origin of metazoans, our results show a bimodal pattern of evolution of the elements of the GPCR signalling system. Cytoplasmic transduction elements, such as G proteins, Ric8, GoLoco motif, beta-arrestins and RGS families, are largely conserved between unicellular holozoans and metazoans, both in terms of gene families and protein domain architectures (Figure 6). In contrast, receptors underwent a dramatic expansion in metazoans compared to their closest unicellular relatives, and a similar pattern has also been observed for tyrosine kinases, Hippo signalling and Notch signalling elements (Gazave et al. 2009; Sebé-Pedrós et al. 2012; Suga et al. 2012). The signalling output of GPCRs depends on the combination of heterotrimeric G proteins and their regulators, and, remarkably, the combination that originated in ancient holozoans was already sufficient for the [...] GPCR, complex environmental sensing, from light sensing to odour and taste. We suggest that the shift from a universal eukaryotic signalling system to a dramatic expansion and refinement in metazoans played a key role in the acquisition of complex multicellularity.

Materials and Methods

Taxon sampling, data gathering

The 75 publicly available genomes used in this study were downloaded from databases at NCBI, The Joint Genome Institute, and The Broad Institute. Data from some unicellular holozoan species come from RNAseq sequenced in-house (Pirum gemmata, Abeoforma whisleri and Corallochytrium limacisporum) or from The Broad Institute “Origin of Multicellularity Database” (Ministeria vibrans and Amoebidium parasiticum). The RNAseq data were translated in all six reading frames. All components of the GPCR signalling machinery were selected from the literature and the PFAM database (Punta et al. 2012).
All transducing the huge amount of GPCR proteomes were scanned using PfamScan signalling inputs present in metazoans. The gathering threshold. This is important in the expansion and evolutionary success of case of GPCRs because it helps to receptors is probably driven by metazoans’ disambiguate between different GPCR families multicellularity, which co-opted the GPCR by selecting the most significant hit. signalling system for many new functions, such Additionally, PfamScan gathering threshold as cell-cell communication, developmental avoids the spurious partial hits typical of control, and most importantly in the case of transmembrane proteins and is a conservative approach to minimize false positives (Punta et uncovers novel structural and functional al. 2012). General distribution patterns were features of the heterotrimeric GTPase obtained by counting domain presence in the signaling system. Gene 475:63–78. PfamScan proteomic outputs. The same files Anantharaman V, Iyer LM, Aravind L. 2007. were used to obtain multi domain architectures, Comparative genomics of protists: new insights into the evolution of eukaryotic signal with the exception of the Transmembrane transduction and gene regulation. Annual domains analysed in RGS proteins, which were review of microbiology 61:453–475. obtained using the TMHMM software (Krogh Anderson GR, Posokhova E, Martemyanov KA. et al. 2001). 2009. The R7 RGS protein family: multi- Heatmaps, PCA and parsimony subunit regulators of neuronal G protein reconstruction signaling. Cell biochemistry and biophysics Heatmaps were built using R heatmap.2 54:33–46. function, from the gplots package. Principal Blaauw M, Knol JC, Kortholt A, Roelofs J, component analysis (PCA) was carried out Ruchira, Postma M, Visser AJWG, van using the built-in R prcomp function, with Haastert PJM. 2003. Phosducin-like proteins scaling and a covariance matrix, and were in Dictyostelium discoideum: implications for the phosducin family of proteins. 
The EMBO plotted using the R bpca package. We assumed journal 22:5047–5057. Dollo parsimony to infer ancestral gains and Bockaert J, Pin J. 1999. Molecular tinkering of G secondary reconstructions in Figure 5 using protein-coupled receptors : an evolutionary Mesquite (Maddison and Maddison 2011). success. The EMBO journal 18:1723–1729. Brown MW, Kolisko M, Silberman JD, Roger AJ. Phylogenetic analyses 2012. Aggregative Multicellularity Evolved Arrestins, Ric8, G alpha subunit, G beta Independently in the Eukaryotic Supergroup subunit, Phosducin and RGS domains were Rhizaria. Current Biology 22:1–5. used for phylogenetic analyses. The alignments Burki F, Okamoto N, Pombert J-F, Keeling PJ. were obtained using MAFFT with the L-INS-i 2012. The evolutionary history of haptophytes option (Katoh and Standley 2013), and these and cryptophytes: phylogenomic evidence for separate origins. Proceedings of the Royal alignments were manually trimmed to avoid Society B: Biological Sciences 279:2246– ambiguous regions. Seed alignments are 2254. disposable upon request and deposited at Dryad Chan P, Thomas CJ, Sprang SR, Tall GG. 2013. repository. The aminoacidic model of evolution Molecular chaperoning function of Ric-8 is to used for phylogenetic inference was LG, with a fold nascent heterotrimeric G protein. discrete gamma distribution of among-site Proceedings of the National Academy of variation rates (four categories) and a Sciences of the United States of America. proportion of invariable sites. Crespi BJ. 2001. The evolution of social behavior Maximum Likelihood analyses were performed in microorganisms. Trends in ecology & using RAxML version 7.2.6. (Stamatakis evolution 16:178–183. 2006). The best-tree topology depicted in the Derelle R, Lang FB. 2012. Rooting the eukaryotic tree with mitochondrial and bacterial proteins. figures was obtained by selecting the best tree Molecular biology and evolution 29:1277– out of 100 replicates. Bootstrap support was 1289. 
obtained using 100 bootstrap replicates of the Derelle R, Lopez P, Le Guyader H, Manuel M. same alignment. Bayesian inference trees were 2007. Homeodomain proteins belong to the inferred using PhyloBayes v3.3 (Lartillot et al. ancestral molecular toolkit of eukaryotes. 2009). The resulting tree and posterior Evolution and development 9:212–219. probabilities were obtained when two parallel DeWire SM, Ahn S, Lefkowitz RJ, Shenoy SK. runs converged (tracecomp standard values), 2007. β-Arrestins and Cell Signaling. Annual after surpassing at least 500.000 generations. Review of Physiology 69:483–510. The runs were sampled every 100 generations, Fredriksson R, Lagerström MC, Lundin L-G, and the burn-in was established using a Schiöth HB. 2003. The G-protein-coupled receptors in the human genome form five bpcomp maxdiff < 0.3. main families. Phylogenetic analysis, paralogon groups, and fingerprints. Molecular References pharmacology 63:1256–1272. Alvarez CE. 2008. On the origins of arrestin and Fredriksson R, Schiöth HB. 2005. The repertoire of rhodopsin. BMC evolutionary biology 8:222. G-protein–coupled receptors in fully Anantharaman V, Abhiman S, de Souza RF, Aravind L. 2011. Comparative genomics sequenced genomes. Molecular pharmacology application to complete genomes. Journal of 67:1414. Molecular Biology 305:567–580. Fritz-Laylin LK, Prochnik SE, Ginger ML, et al. Lagerström MC, Schiöth HB. 2008. Structural 2010. The genome of Naegleria gruberi diversity of G protein-coupled receptors and illuminates early eukaryotic versatility. Cell significance for drug discovery. Nature 140:631–642. reviews. Drug discovery 7:339–357. Gazave E, Lapébie P, Richards GS, Brunet F, Lartillot N, Lepage T, Blanquart S. 2009. Ereskovsky A V, Degnan BM, Borchiellini C, PhyloBayes 3: a Bayesian software package Vervoort M, Renard E. 2009. Origin and for phylogenetic reconstruction and molecular evolution of the Notch signalling pathway: an dating. Bioinformatics 25 :2286–2288. 
overview from eukaryotic genomes. BMC Li L, Wright SJ, Krystofova S, Park G, Borkovich Evolutionary Biology 9:249. K a. 2007. Heterotrimeric G protein signaling Gerhart J. 1999. 1998 Warkany lecture: signaling in filamentous fungi. Annual review of pathways in development. Teratology 60:226– microbiology 61:423–452. 239. Liggett SB. 2011. Phosphorylation barcoding as a Grau-Bové X, Sebé-Pedrós A, Ruiz-Trillo I. 2013. mechanism of directing GPCR signaling. A genomic survey of HECT ubiquitin ligases Science signaling 4:pe36. in eukaryotes reveals independent expansions Maddison WP, Maddison DR. 2011. Mesquite: a of the HECT system in several lineages. modular system for evolutionary analysis. Genome Biology and Evolution. Version 2.75 [Internet]. Available from: Gurevich V V, Gurevich E V. 2006. The structural http://mesquiteproject.org basis of arrestin-mediated regulation of G- Maeda Y, Ide T, Koike M, Uchiyama Y, Kinoshita protein-coupled receptors. Pharmacology & T. 2008. GPHR is a novel anion channel therapeutics 110:465–502. critical for acidification and functions of the Hinrichs M V, Torrejón M, Montecino M, Olate J. Golgi apparatus. Nature Cell Biology 2012. Ric-8: different cellular roles for a 10:1135–1145. heterotrimeric G-protein GEF. Journal of Manning G, Young SL, Miller WT, Zhai Y. 2008. cellular biochemistry 113:2797–2805. The protist, Monosiga brevicollis, has a Ingham PW, Nakano Y, Seger C. 2011. tyrosine kinase signaling network more Mechanisms and functions of Hedgehog elaborate and diverse than found in any signalling across the metazoa. Nature Reviews known metazoan. Proceedings of the National Genetics 12:393–406. Academy of Sciences of the United States of Kataria R, Xu X, Fusetti F, Keizer-Gunnink I, Jin America 105:9674–9679. T, van Haastert PJM, Kortholt A. 2013. Milligan G, Kostenis E. 2006. Heterotrimeric G- Dictyostelium Ric8 is a nonreceptor guanine proteins: a short history. 
British journal of exchange factor for heterotrimeric G proteins pharmacology 147 Suppl:S46–55. and is important for development and Mushegian A, Gurevich V V, Gurevich E V. 2012. chemotaxis. Proceedings of the National The origin and evolution of G protein-coupled Academy of Sciences of the United States of receptor kinases. Plos ONE 7:e33806. America 110:6424–6429. Niehrs C. 2012. The complex world of WNT Katoh K, Standley DM. 2013. MAFFT Multiple receptor signalling. Nature reviews. Molecular Sequence Alignment Software Version 7: cell biology 13:767–779. Improvements in Performance and Usability. Nordström KJ V, Almén MS, Edstam MM, Molecular Biology and Evolution 30 :772– Fredriksson R, Schiöth HB. 2011. 780. Independent HHsearch, Needleman-Wunsch- King N, Westbrook MJ, Young SL, et al. 2008. The based and motif analyses reveals the overall genome of the choanoflagellate Monosiga hierarchy for most of the G protein-coupled brevicollis and the origin of metazoans. receptor families. Molecular biology and Nature 451:783–788. evolution 28:2471–2480. King N. 2004. The unicellular ancestry of animal Nordström KJ V, Lagerström MC, Wallér LMJ, development. Developmental Cell 7:313–325. Fredriksson R, Schiöth HB. 2009. The Krishnan A, Almén MS, Fredriksson R, Schiöth Secretin GPCRs descended from the family of HB. 2012. The Origin of GPCRs: Adhesion GPCRs. Molecular biology and Identification of Mammalian like Rhodopsin, evolution 26:71–84. Adhesion, Glutamate and Frizzled GPCRs in Oka Y, Saraiva LR, Kwan YY, Korsching SI. 2009. Fungi.Xue C, editor. Plos ONE 7:e29817. The fifth class of Gα proteins. Proceedings of Krogh A, Larsson B, von Heijne G, Sonnhammer the National Academy of Sciences of the ELL. 2001. Predicting transmembrane protein United States of America. topology with a hidden markov model: Oldham WM, Hamm HE. 2008. Heterotrimeric G protein activation by G-protein-coupled receptors. Nature reviews. Molecular cell thousands of taxa and mixed models. 
biology 9:60–71. Bioinformatics 22 :2688–2690. Pierce KL, Premont RT, Lefkowitz RJ. 2002. Suga H, Dacre M, de Mendoza A, Shalchian- Seven-transmembrane receptors. Nature Tabrizi K, Manning G, Ruiz-Trillo I. 2012. reviews. Molecular cell biology 3:639–650. Genomic Survey of Premetazoans Shows Pires-daSilva A, Sommer RJ. 2003. The evolution Deep Conservation of Cytoplasmic Tyrosine of signalling pathways in animal development. Kinases and Multiple Radiations of Receptor Nature Reviews Genetics 4:39–49. Tyrosine Kinases. Science Signaling 5:ra35– Plakidou-Dymock S, Dymock D, Hooley R. 1998. ra35. A higher plant seven-transmembrane receptor Torruella G, Derelle R, Paps J, Lang BF, Roger AJ, that influences sensitivity to cytokinins. Shalchian-Tabrizi K, Ruiz-Trillo I. 2012. Current biology 8:315–324. Phylogenetic relationships within the Punta M, Coggill PC, Eberhardt RY, et al. 2012. Opisthokonta based on phylogenomic The Pfam protein families database. Nucleic analyses of conserved single-copy protein Acids Research 40 :D290–D301. domains. Molecular biology and evolution Reiter E, Lefkowitz RJ. 2006. GRKs and beta- 29:531–544. arrestins: roles in receptor silencing, Tuteja N. 2009. Signaling through G protein trafficking and signaling. Trends in coupled receptors. Plant signaling & behavior endocrinology and metabolism: TEM 17:159– 4:942–947. 165. Urano D, Jones JC, Wang H, Matthews M, Rensing S a, Lang D, Zimmer AD, et al. 2008. The Bradford W, Bennetzen JL, Jones AM. 2012. Physcomitrella genome reveals evolutionary G protein activation without a GEF in the insights into the conquest of land by plants. plant kingdom. PLoS genetics 8:e1002756. Science 319:64–69. Wickstead B, Gull K, Richards T a. 2010. Patterns Rokas A. 2008. The origins of multicellularity and of kinesin evolution reveal a complex the early history of the genetic toolkit for ancestral eukaryote with a multifunctional animal development. Annual Review of cytoskeleton. 
BMC evolutionary biology Genetics 42:235–251. 10:110. Rosenbaum DM, Rasmussen SGF, Kobilka BK. Wilkie TM, Kinch L. 2005. New roles for Galpha 2009. The structure and function of G-protein- and RGS proteins: communication continues coupled receptors. Nature 459:356–363. despite pulling sisters apart. Current biology Schaller GE, Shiu S-H, Armitage JP. 2011. Two- 15:R843–54. component systems and their co-option for Willardson B, Howlett A. 2007. Function of eukaryotic signal transduction. Current phosducin-like proteins in G protein signaling biology 21:R320–30. and chaperone-assisted protein folding. Sebé-Pedrós A, Zheng Y, Ruiz-Trillo I, Pan D. Cellular signalling 19:2417–2427. 2012. Premetazoan Origin of the Hippo Signaling Pathway. Cell Reports 1:1–8. Shenoy S, Lefkowitz R. 2011. β-arrestin-mediated receptor trafficking and signal transduction. Trends in pharmacological sciences 32:521– 533. Acknowledgments Siderovski DP, Willard FS. 2005. The GAPs, This work was supported by a contract from the GEFs, and GDIs of heterotrimeric G-protein Institució Catalana de Recerca i Estudis alpha subunits. International journal of Avançats, a European Research Council biological sciences 1:51–66. Starting Grant (ERC-2007-StG- 206883), and a Sondek J, Siderovski DP. 2001. Ggamma-like grant (BFU2011-23434) from Ministerio de (GGL) domains: new frontiers in G-protein Economía y Competitividad (MINECO) to I. signaling and beta-propeller scaffolding. R.-T. A.dM and A.S-P. are supported by a Biochemical pharmacology 61:1329–1337. pregraduate Formación del Personal Stamatakis A. 2006. RAxML-VI-HPC: maximum Investigador and a pregraduate Formación likelihood-based phylogenetic analyses with Profesorado Universitario grant from MINECO.
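The Dollo parsimony assumption named in Materials and Methods for the Figure 5 reconstruction can be illustrated with a minimal Python sketch. Under Dollo parsimony a character is gained only once, at the most recent common ancestor of all taxa that carry it, and every absence below that node is explained by loss. The sketch is not the Mesquite implementation used in the study. The toy tree, its internal node labels (`Root`, `Holozoa_n`, `ChoanoMetazoa_n`) and the presence sets are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

# Rooted tree as parent -> children; leaves have no entry.
# Hypothetical toy topology, not the tree used for Figure 5.
TOY_TREE: Dict[str, List[str]] = {
    "Root": ["Opisthokonta_n", "Amoebozoa"],
    "Opisthokonta_n": ["Holozoa_n", "Fungi"],
    "Holozoa_n": ["Filasterea", "ChoanoMetazoa_n"],
    "ChoanoMetazoa_n": ["Choanoflagellata", "Metazoa"],
}


def leaves_under(tree: Dict[str, List[str]], node: str) -> Set[str]:
    """Return the set of leaf names descending from `node`."""
    children = tree.get(node, [])
    if not children:
        return {node}
    found: Set[str] = set()
    for child in children:
        found |= leaves_under(tree, child)
    return found


def dollo(tree: Dict[str, List[str]], root: str,
          present: Set[str]) -> Tuple[Optional[str], List[str]]:
    """Infer the single gain and the implied losses for one character.

    `present` is the set of leaves where the character is observed.
    Returns (gain_node, sorted list of nodes where a loss is placed).
    """
    if not present:
        return None, []

    # Gain: walk down from the root while one child still contains
    # every leaf that has the character.
    node = root
    while True:
        covering = [c for c in tree.get(node, [])
                    if present <= leaves_under(tree, c)]
        if not covering:
            break
        node = covering[0]
    gain = node

    # Losses: the topmost branches below the gain with no carrier leaf.
    losses: List[str] = []

    def walk(current: str) -> None:
        for child in tree.get(current, []):
            if leaves_under(tree, child) & present:
                walk(child)
            else:
                losses.append(child)

    walk(gain)
    return gain, sorted(losses)


# Hypothetical presence pattern: gained at the root, lost twice.
gain_a, losses_a = dollo(TOY_TREE, "Root",
                         {"Metazoa", "Filasterea", "Amoebozoa"})
# Hypothetical presence pattern: a later gain with no losses.
gain_b, losses_b = dollo(TOY_TREE, "Root", {"Metazoa", "Choanoflagellata"})
```

In the first toy pattern the character is placed at the root and the absences in Fungi and Choanoflagellata are read as secondary losses, which is the kind of inference the text draws for proteins missing from M. brevicollis but present in other lineages.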

Supplementary Figures

Supplementary figure 1. Distribution and abundance of metazoa-specific GPCR types. Domain numbers and abundance are depicted according to the colour code legend. Black boxes indicate absence of the domain in a given species' genome. No hits were found in non-metazoan genomes.

Domains shown (rows): 7TM_GPCR_Sra (PF02117), 7TM_GPCR_Srb (PF02175), 7tm_6 (PF02949, insect odorant receptors), 7TM_GPCR_Srab (PF10292), 7TM_GPCR_Srbc (PF10316), 7TM_GPCR_Srd (PF10317), 7TM_GPCR_Srh (PF10318), 7TM_GPCR_Srj (PF10319), 7TM_GPCR_Srsx (PF10320), 7TM_GPCR_Srt (PF10321), 7TM_GPCR_Sru (PF10322), 7TM_GPCR_Srv (PF10323), 7TM_GPCR_Srz (PF10325), 7TM_GPCR_Str (PF10326), 7TM_GPCR_Sri (PF10327), 7TM_GPCR_Srx (PF10328), 7tm_7 (PF08395, chemosensory receptor), 7tm_4 (PF13853, olfactory receptors), 7TM_GPCR_Srw (PF10324, serpentine type chemoreceptor Srw).

Species shown (columns): Homo sapiens, Ciona intestinalis, Saccoglossus kowalevskii, Drosophila melanogaster, Caenorhabditis elegans, Lottia gigantea, Capitella teleta, Nematostella vectensis, Acropora digitifera, Hydra magnipapillata, Trichoplax adhaerens, Mnemiopsis leidyi, Oscarella carmela, Amphimedon queenslandica.

Supplementary figure 2. Abundance of diverse domain architectures of a given domain in Eukaryotic genomes. Pale green indicates a single domain architecture, i.e. the core-domain. Black indicates absence of the domain. Among GPCRs, 7tm_1, 7tm_2 and 7tm_3 are the richest in terms of domain architecture, and RGS also very often has diverse domain arrangements.

Rows, G Protein Coupled Receptors: 7tm1 (Rhodopsin/Class A), 7tm2 (Adhesion/Secretin/Class B), 7tm3 (Glutamate/Class C), Ocular_alb (OA1), ITR-like, GPR-108-like, Dicty_CAR (cAMP/Class E), Frizzled (Class F), Git3, STE2 and STE3 (Fungal Pheromone Receptor/Class D), ABA_GPCR (Abscisic acid receptors), 7tm4, 7tm_7, 7tm_6 (Odorant receptor) and the 7TM_GPCR_Sr* families. Rows, G Protein regulators: RGS, RGS-like, GoLoco.

Columns: eukaryotic genomes grouped by lineage (Metazoa, Unicellular Holozoa, Fungi, Amoebozoa, Embryophyta, Archaeplastida, Heterokonta, Alveolata, Excavata, Rhizaria, Cryptophyta, Haptophyta).

Supplementary figure 3. Distribution of specific protein domain architectures of Rhodopsin (7tm_1), Adhesion/Secretin (7tm_2) and Glutamate (7tm_3) GPCR families across eukaryotic genomes. Domain architectures are named according to PFAM nomenclature (Punta et al. 2012).

Domain architectures shown (rows): /Dicty_CAR/, /Dicty_CAR//PIP5K/, /Git3/, /Git3//Git3_C/, /Fz//Frizzled/, /Frizzled/. Columns: eukaryotic genomes grouped by lineage (Metazoa, Unicellular Holozoa, Fungi, Amoebozoa, Embryophyta, Archaeplastida, Heterokonta, Alveolata, Excavata, Rhizaria, Cryptophyta, Haptophyta).

Supplementary figure 4. Distribution of specific protein domain architectures of cAMP (Dicty_CAR), Git3 and Frizzled GPCR families across eukaryotic genomes. Domain architectures are named according to PFAM nomenclature (Punta et al. 2012).

Supplementary figure 5. ML phylogenetic tree of G protein beta subunit. Different eukaryotic lineages are represented by the colour code depicted in the legend. Nodal supports indicate 100-replicate ML bootstrap support and Bayesian Posterior Probability (BPP). Supports <50 are only shown for nodes recovered by both ML and Bayesian inference and with BPP>0.9. The tree labels two holozoan clades, Gβ1-4 and Gβ5.

Supplementary figure 6. Distribution of domain architectures of RGS and GoLoco families across eukaryotic genomes. Domain architectures are named according to PFAM nomenclature (Punta et al. 2012). Red text indicates domain architectures that use the same domains but have convergent origins. The name of the gene family to which the domain architecture belongs is shown in bold.

Supplementary figure 7. Table showing the number of transmembrane domains present in RGS proteins. The percentage of RGS proteins that have an associated transmembrane domain is shown on the right. The acronym of each organism consists of the first letter of the generic name and the first 3 letters of the species name (e.g., Hsap = Homo sapiens).

Clades shown in the Ric8 tree: Eumetazoa, Placozoa, Porifera, Ichthyosporea, Filasterea, Choanoflagellata, Fungi, Amoebozoa, Heterokonta.

Supplementary figure 8. ML phylogenetic tree of the domain Ric8. The various eukaryotic lineages are represented by the colour code depicted in the legend. Nodal supports indicate 100-replicate ML bootstrap support.

Supplementary figure 9. ML phylogenetic tree of Phosducin domain proteins. Different eukaryotic lineages are represented by a colour code depicted in the legend. Nodal supports indicate 100-replicate ML bootstrap support. Only the bootstrap supports >50 are shown.

Supplementary figure 10. ML phylogenetic tree of RGS domain including SNX13/14/25 protein as the closest out-group to GRK. The various eukaryotic lineages are represented by the colour code depicted in the legend. Best tree obtained from 100 independent runs in RAxML.

Supplementary figure 11. ML phylogenetic tree of Arrestin proteins (Arrestin_N and Arrestin_C domains used to construct the alignment). The various eukaryotic lineages are represented by the colour code depicted in the legend. Nodal supports indicate 100-replicate ML bootstrap support. Only the bootstrap supports >50 are shown.


Results R4

Summary of article R4: The Capsaspora genome reveals that the unicellular prehistory of animals was complex.

To reconstruct the evolutionary origin of animals, we need the genome sequences of their unicellular ancestors. To infer what the unicellular ancestor of metazoans was like, we need to compare animal genomes with those of their living unicellular relatives. Until now, however, only the genome of the choanoflagellate Monosiga brevicollis was available. In this work we analyse the complete genome sequence of the filasterean Capsaspora owczarzaki, which belongs to the group closest to metazoans together with the choanoflagellates. The results change our view of the molecular complexity of the unicellular ancestors of metazoans, since C. owczarzaki has a rich repertoire of proteins involved in cell adhesion and transcriptional regulation that were not found in the genome of M. brevicollis. Many of these proteins were therefore secondarily lost in choanoflagellates. In contrast, most of the intercellular signalling systems involved in embryonic development evolved later, concomitantly with the appearance of the first metazoans. We propose that the acquisition of these metazoan-specific developmental systems and the co-option of pre-existing genes drove the evolutionary transition from unicellular protists to metazoans.


ARTICLE

Received 18 Jun 2013 | Accepted 18 Jul 2013 | Published 14 Aug 2013
DOI: 10.1038/ncomms3325 | OPEN

The Capsaspora genome reveals a complex unicellular prehistory of animals

Hiroshi Suga1,*, Zehua Chen2,*, Alex de Mendoza1, Arnau Sebé-Pedrós1, Matthew W. Brown3, Eric Kramer4, Martin Carr5, Pierre Kerner6, Michel Vervoort6, Núria Sánchez-Pons1, Guifré Torruella1, Romain Derelle7, Gerard Manning4, B. Franz Lang8, Carsten Russ2, Brian J. Haas2, Andrew J. Roger3, Chad Nusbaum2 & Iñaki Ruiz-Trillo1,9,10

To reconstruct the evolutionary origin of multicellular animals from their unicellular ancestors, the genome sequences of diverse unicellular relatives are essential. However, only the genome of the choanoflagellate Monosiga brevicollis has been reported to date. Here we completely sequence the genome of the filasterean Capsaspora owczarzaki, the closest known unicellular relative of metazoans besides choanoflagellates. Analyses of this genome alter our understanding of the molecular complexity of metazoans' unicellular ancestors, showing that they had a richer repertoire of proteins involved in cell adhesion and transcriptional regulation than previously inferred only with the choanoflagellate genome. Some of these proteins were secondarily lost in choanoflagellates. In contrast, most intercellular signalling systems controlling development evolved later, concomitant with the emergence of the first metazoans. We propose that the acquisition of these metazoan-specific developmental systems and the co-option of pre-existing genes drove the evolutionary transition from unicellular protists to metazoans.

1 Institut de Biologia Evolutiva (CSIC-Universitat Pompeu Fabra), Passeig Marítim de la Barceloneta 37-49, 08003 Barcelona, Spain. 2 Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, Massachusetts 02142, USA. 3 Department of Biochemistry and Molecular Biology, Faculty of Medicine, Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia, Canada B3H 1X5. 4 Razavi Newman Center for Bioinformatics, Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, California 92037, USA. 5 School of Applied Sciences, University of Huddersfield, Huddersfield HD1 3DH, UK. 6 Institut Jacques Monod, CNRS, UMR 7592, Univ Paris Diderot, Sorbonne Paris Cité, F-75205 Paris, France. 7 Centre for Genomic Regulation (CRG), Dr Aiguader 88, 08003 Barcelona, Spain. 8 Département de Biochimie, Centre Robert-Cedergren, Université de Montréal, 2900 Boulevard Edouard Montpetit, Montréal (Québec), Canada H3C 3J7. 9 Departament de Genètica, Facultat de Biologia, Universitat de Barcelona, Avinguda Diagonal 643, 08028 Barcelona, Spain. 10 Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig Lluís Companys, 23, 08010 Barcelona, Spain. * These authors contributed equally to this work. Correspondence and requests for materials should be addressed to I.R.-T. (email: [email protected]).

NATURE COMMUNICATIONS | 4:2325 | DOI: 10.1038/ncomms3325 | www.nature.com/naturecommunications
© 2013 Macmillan Publishers Limited. All rights reserved.

How multicellular animals (metazoans) evolved from a single-celled ancestor remains a long-standing evolutionary question. To unravel the molecular mechanisms and genetic changes specifically involved in this transition, we need to reconstruct the genomes of both the most recent unicellular ancestor of metazoans and the last common ancestor of multicellular animals. To date, most studies have focused on the latter, obtaining the genome sequences of several early-branching metazoans, which provided significant insights into early animal evolution1–4. However, available genome sequences of close unicellular relatives of metazoans have been insufficient to investigate their unicellular prehistory.

Recent phylogenomic analyses have shown that metazoans are closely related to three distinct unicellular lineages, choanoflagellates, filastereans and ichthyosporeans, which together with metazoans form the holozoan clade5–8. Until recently, only the genome of the choanoflagellate Monosiga brevicollis had been sequenced9. This genome provided us with the first glimpse into the unicellular prehistory of animals, showing that the unicellular ancestor of Metazoa had a variety of cell adhesion and receptor-type signalling molecules, such as cadherins and protein tyrosine kinases (TKs)9–11. However, many transcription factors involved in animal development, as well as some cell adhesion and the majority of intercellular signalling pathways were not found. They were therefore assumed to be both specific to metazoans and largely responsible for development of their complex multicellular body plans9,12. This view was further reinforced with the recent genome sequence of another choanoflagellate, the colonial Salpingoeca rosetta13. However, inferences based on only a few sampled lineages are notoriously problematic, especially in light of the high frequency of gene loss reported in eukaryotic lineages14. Clearly, genome sequences from earlier-branching holozoan lineages are needed in order to robustly infer the order and timing of genomic innovations that occurred along the lineage leading to the Metazoa.

Here we present the first complete genome sequence of a filasterean, Capsaspora owczarzaki, an endosymbiont amoeba of the pulmonate snail Biomphalaria glabrata15 and the sister group to metazoans and choanoflagellates7,8. Recent analyses identified some proteins in Capsaspora crucial to metazoan multicellularity including cell adhesion molecules such as integrins and cadherins, development-related transcription factors, receptor TKs and organ growth control components16–21. However, the whole suite of molecules involved in these pathways and other important systems has not to date been systematically analysed. By comparing the Capsaspora genome with those of choanoflagellates and metazoans, we develop a comprehensive picture of the evolutionary path from the ancestral holozoans to the last common ancestor of metazoans.

Results

The genome of Capsaspora. We sequenced genomic DNA from an axenic culture of Capsaspora owczarzaki (Fig. 1) and assembled the raw reads of approximately 8× coverage into 84 scaffolds, which span 28 Mb in total. The N50 contig and scaffold sizes are 123 kb and 1.6 Mb, respectively. We predicted 8,657 protein-coding genes, which comprise 58.7% of the genome. Transposable elements make up at least 9.0% of the genome (Supplementary Figs S1 and S2, Supplementary Table S1 and Supplementary Note 1), a much larger fraction than in M. brevicollis (1%)22 or the yeast Saccharomyces cerevisiae (3.1%)23. The Capsaspora genome has a more compact structure than that of M. brevicollis or metazoans, containing 309.5 genes per Mb (Table 1). Genes have an average of 3.8 introns with a mean intron length of 166 bp. The mean distance between protein-coding genes is 724 bp. Interestingly, genes involved in receptor activity, transcriptional regulation and signalling processes have particularly large upstream intergenic regions compared with other genes (Supplementary Figs S3–S5, Supplementary Note 1). This pattern is seen across most of the eukaryotic taxa we analysed. In contrast to its compact nuclear genome, Capsaspora has a 196.9 kb mitochondrial genome, which is approximately

[Figure 1c tree, tip labels by clade. Metazoa: Homo sapiens, Strongylocentrotus purpuratus, Capitella teleta, Lottia gigantea, Caenorhabditis elegans, Drosophila melanogaster, Nematostella vectensis, Hydra magnipapillata, Trichoplax adhaerens, Amphimedon queenslandica. Other Holozoa: Monosiga brevicollis, Capsaspora owczarzaki. Fungi: Laccaria bicolor, Coprinopsis cinerea, Neurospora crassa, Schizosaccharomyces pombe, Rhizopus oryzae, Phycomyces blakesleeanus. Holozoa and Fungi together form the Opisthokonta. Amoebozoa: Dictyostelium discoideum, Dictyostelium purpureum, Dictyostelium fasciculatum, Polysphondylium pallidum, Acanthamoeba castellanii. Scale bar, 0.2 substitutions per site.]

Figure 1 | The filasterean Capsaspora owczarzaki. (a,b) Differential interference contrast microscopy (a) and scanning electron microscopy (b) images of C. owczarzaki. Scale bar, 10 μm (a) and 1 μm (b). (c) Phylogenetic position of C. owczarzaki. Four different analyses on the basis of two independent data sets and two different methods indicate an identical topology, except for the clustering of all non-sponge metazoans (white circle). Details are in Supplementary Note 2. Gray and black circles indicate ≥90% (0.90) and ≥99% (0.99) of bootstrap values and Bayesian posterior probabilities, respectively, for all four analyses.


12 and 2.6 times larger than the average metazoan mtDNAs (~16 kb) and that of M. brevicollis (76.6 kb), respectively (Supplementary Fig. S6, Supplementary Tables S2 and S3 and Supplementary Note 1). Our multi-gene phylogenetic analyses with several data sets corroborate that Capsaspora is the sister group to choanoflagellates and metazoans7,8 (Fig. 1, Supplementary Figs S7–S10 and Supplementary Note 2).

The origins of metazoan protein domains. Utilizing all available genome sequences from early-branching metazoans and the two unicellular relatives of the Metazoa (Capsaspora and M. brevicollis), we inferred the protein domain evolution along the eukaryotic tree14 (Fig. 2, Supplementary Fig. S11, Supplementary Tables S4–S7 and Supplementary Note 3). We observed a continuous emergence of new protein domains (domains without statistically significant homologies to any proteomes in the outgroup taxa) in the lineage leading to the Metazoa, but also substantial domain loss in fungi, Capsaspora and M. brevicollis. Protein domains acquired by the last common ancestor of filastereans, choanoflagellates and metazoans were enriched in ontology terms associated with signal transduction and transcriptional regulation (Fig. 2b, Supplementary Table S5). Interestingly, such domains include those composing proteins that are involved in metazoan multicellularity and development; for example the cell adhesion molecule integrin-β, and the transcription factors p53 and RUNX (Fig. 2b, Supplementary Table S4). Several domains involved in transcriptional regulation were secondarily lost in M. brevicollis (Fig. 2c, Supplementary Table S6)17. Domains involved in extracellular functions have been frequently lost in both Capsaspora and M. brevicollis. Our data indicate that 235 new domains emerged after the divergence of filastereans and choanoflagellates from the lineage leading to the Metazoa. These ‘metazoan-specific

Table 1 | Genome statistics of Capsaspora owczarzaki and other eukaryotes.

Statistic                   H. sa     N. vec    A. que    M. bre    C. owc    N. cra    S. cer    D. dis
Genome size (Mb)            3,101.8   357.0     167.1     41.6      28.0      41.0      12.1      34.1
% GC                        40.9      40.6      31.1      54.9      53.8      48.2      38.3      22.4
Number of genes             22,128    27,273    30,327    9,171     8,657     9,730     5,863     12,474
Gene density (per Mb)       7.1       76.4      181.4     220.3     309.5     237.3     485.7     365.5
CDS % genome                1.2       7.6       21.4      39.7      58.7      36.3      72.4      61.8
Mean intron # per gene      8.8       4.3       4.7       6.6       3.8       1.7       0.1       1.5
Mean intron size (bp)       5,645     799       251       171       166       115       203       139
Mean inter-CDS size (bp)    99,262    6,707     3,205     1,490     724       2,440     571       804

A. que, Amphimedon queenslandica; C. owc, Capsaspora owczarzaki; D. dis, Dictyostelium discoideum; H. sa, Homo sapiens; M. bre, Monosiga brevicollis; N. cra, Neurospora crassa; N. vec, Nematostella vectensis; S. cer, Saccharomyces cerevisiae.
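The gene-density row of Table 1 can be cross-checked against the genome-size and gene-number rows. The sketch below is only an illustration, not part of the original analysis: it assumes gene density is simply the number of genes divided by the genome size in Mb, and the function and variable names are hypothetical. Small differences from the printed values are expected because the genome sizes in the table are rounded.

```python
# Values copied from Table 1: (genome size in Mb, number of genes, reported genes per Mb).
TABLE1 = {
    "H. sa": (3101.8, 22128, 7.1),
    "N. vec": (357.0, 27273, 76.4),
    "A. que": (167.1, 30327, 181.4),
    "M. bre": (41.6, 9171, 220.3),
    "C. owc": (28.0, 8657, 309.5),
    "N. cra": (41.0, 9730, 237.3),
    "S. cer": (12.1, 5863, 485.7),
    "D. dis": (34.1, 12474, 365.5),
}


def gene_density(genome_mb, n_genes):
    """Genes per Mb, assuming density = gene count / genome size."""
    return n_genes / genome_mb


def relative_gap(computed, reported):
    """Relative difference between a recomputed and a reported value."""
    return abs(computed - reported) / reported


for species, (size_mb, n_genes, reported) in TABLE1.items():
    computed = gene_density(size_mb, n_genes)
    print(f"{species}: computed {computed:.1f}, reported {reported}, "
          f"gap {100 * relative_gap(computed, reported):.2f}%")
```

For every species the recomputed density agrees with the tabulated one to within 1%, consistent with rounding of the published genome sizes.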

[Figure 2 panels.
(a) Number of Pfam domains at each node or tip, with gains (+) and losses (−) on the edge leading to it: Opisthokonta 5,029; Holozoa 4,969 (+100, −160); Metazoa + choanoflagellates 4,978 (+61, −52); Metazoa 5,157 (+235, −56); M. brevicollis 3,014 (+19, −1,983); C. owczarzaki 3,287 (+6, −1,688); Fungi 3,941 (+55, −1,143).
(b) Gain, GO term (N domains). Holozoa: Signal transducer activity (8); Transcription regulatory region DNA binding (7). Metazoa: Extracellular region (9); Transcription regulatory region DNA binding (8).
(c) Loss, GO term (N domains), shown at the Holozoa, Fungi, M. brevicollis and C. owczarzaki edges: Hydrolase activity (O-glycosyl bonds) (7); Multi-organism process (10); Negative regulation of hydrolase activity (7); Transcription regulatory region DNA binding (16); Extracellular region (29); Regulation of transcription, DNA-dependent (62); Nucleoplasm (22); Developmental process (11); Cellular cell wall organization or biogenesis (7); Racemase and epimerase activity (8); Extracellular region (26); Receptor binding (12); Reproduction (12); Multicellular organismal process (13); Cell wall organization biogenesis (11); Pathogenesis (8); Viral reproduction (9); External encapsulating structure (7).]

Figure 2 | Gain and loss of protein domains within the Opisthokonta. (a) The number of Pfam protein domains that were gained or lost at each evolutionary period was inferred by Dollo parsimony, which does not consider multiple independent evolution of a domain. Total number of protein domains, and the inferred numbers of domain gain (+) and loss (−) events are depicted at the tree edges. The full list of domains is in Supplementary Table S4. (b,c) GO terms that were enriched by the evolution of protein domains (b) or depleted by the loss of protein domains (c) were sought via the topology-weighted algorithm. The significant GO terms (P < 1.0e−3) are shown at the tree edges together with the number of included Pfam domains. Terms including fewer than seven gained or lost domains are not shown. The list of domains included in each GO is in Supplementary Tables S5 and S6.
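The Dollo parsimony criterion named in the caption allows each domain to be gained only once and explains every other absence by loss. The following is a minimal sketch of that idea and not the authors' implementation: the nested-tuple tree encoding, the function names and the example presence sets are all assumptions made for illustration. The toy tree mirrors the topology of Fig. 2a.

```python
# Toy tree with the Fig. 2a topology. A leaf is a string; an internal node is
# a (name, [children]) tuple. This encoding is a hypothetical choice.
TREE = ("root", [
    "Outgroup",
    ("Opisthokonta", [
        "Fungi",
        ("Holozoa", [
            "C. owczarzaki",
            ("Metazoa+choano", ["Metazoa", "M. brevicollis"]),
        ]),
    ]),
])


def node_name(node):
    return node if isinstance(node, str) else node[0]


def leaf_set(node):
    """All leaf names below (and including) a node."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node[1]:
        found |= leaf_set(child)
    return found


def dollo(tree, present):
    """Return (gain_node, [loss_nodes]) for one domain under Dollo parsimony.

    The single gain is placed on the most recent common ancestor of all
    leaves carrying the domain; each maximal subtree below that ancestor
    with no carrier counts as one loss.
    """
    if not (present & leaf_set(tree)):
        raise ValueError("domain is absent from every leaf")
    node = tree
    # Descend while exactly one child lineage contains carriers.
    while not isinstance(node, str):
        holders = [c for c in node[1] if leaf_set(c) & present]
        if len(holders) != 1:
            break
        node = holders[0]
    losses = []

    def walk(sub):
        if not (leaf_set(sub) & present):
            losses.append(node_name(sub))   # whole subtree lacks the domain
        elif not isinstance(sub, str):
            for child in sub[1]:
                walk(child)

    walk(node)
    return node_name(node), losses


# Hypothetical domain found in Metazoa and C. owczarzaki but not M. brevicollis.
print(dollo(TREE, {"Metazoa", "C. owczarzaki"}))
```

For a domain with that presence pattern, the sketch places the gain at the Holozoa node and a single loss on the M. brevicollis edge, which is the kind of secondary loss described in the text for integrins.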

innovations’, narrowed down from 299 to 235 by the use of the Capsaspora genome, include those that are part of extracellular ligands and their associated components and are involved in metazoan development, such as Noggin, Wnt and transforming growth factor β (Supplementary Table S4). At the root of the Metazoa, we observed significant gains in ontology terms associated with transcriptional regulation and extracellular domains. This ‘metazoan-origin’ domain set, which is much better delineated through comparative analysis using both the Capsaspora and M. brevicollis genomes, likely comprises the key innovations relevant to the evolution of complex multicellular development.

Enrichment of domains in Holozoa. Gene duplication is an important evolutionary driving force that increases the functional capacity of proteomes24. We thus examined not only the origin of domains involved in metazoan multicellularity but also the abundance of these domains in the genomes of different eukaryotic lineages. We chose 106 InterPro25 protein domains that are most significantly overrepresented in metazoan genomes compared with the non-holozoan genomes, and counted the number of genes encoding these domains (Fig. 3, Supplementary Figs S12 and S13 and Supplementary Note 4). Our data show that these domains are, in metazoans, mainly involved in cell adhesion, intercellular communication, signalling, transcriptional regulation and apoptosis, which are relevant to multicellularity and development of metazoans. Most of these domains show clear enrichment exclusively in metazoans. However, the abundance of some of these domains is also increased in the genome of Capsaspora. Those that are particularly enriched include the laminin-type epidermal growth factor-like, Integrin-β4, Sushi, protein tyrosine kinase, Pleckstrin homology, Src homology 3, p53-like transcription factor DNA binding and Band4.1 domain and leucine-rich repeat. These domains are not always similarly enriched in the M. brevicollis genome, as seen, for example, in the Integrin-β4 domain and LRR. Overall, our analyses show that protein domains involved in cellular signal transduction and, to a certain extent, cell adhesion and extracellular regions were already abundant in the common ancestor of the Holozoa, whereas those in other categories such as channels and transporters expanded much later, during metazoan evolution.

Gene repertoire of Capsaspora. To further investigate the evolutionary origin of the molecular components required for multicellularity, we performed homology searches and, in most cases, phylogenetic analyses of genes involved in cell adhesion, transcriptional regulation, cell signalling, and nervous system function (Supplementary Note 5). Additionally, to better understand the basic biology of Capsaspora, we analysed gene families encoding proteins involved in meiosis, cell cycle regulation, flagellum formation, post-transcriptional regulation and small RNA synthesis and functioning. Figure 4 schematically summarizes our main findings, depicting the cellular structures and pathways present in Capsaspora and metazoans. We note that none of the analyses provided any evidence of lateral gene transfer events from metazoans to Capsaspora.

The unicellular common ancestor of metazoans and Capsaspora appears to have been well equipped with some type of cell adhesion mechanism (Fig. 4, Supplementary Fig. S14, Supplementary Note 5). For example, the main components of the integrin adhesion machinery, which in metazoans is used for the attachment of cells to the extracellular matrix (ECM), are present in Capsaspora16. However, M. brevicollis lacks integrins and thus choanoflagellates may have secondarily lost them. Even though Capsaspora has integrins, it lacks homologues of metazoan ECM proteins such as fibronectins and laminins. Nevertheless, several protein domains found in these ECM proteins are present as components of other proteins, raising the possibility of unknown ECM molecules secreted by Capsaspora that could interact with its integrin machinery. In contrast to Capsaspora, M. brevicollis, which lacks integrins, has some ECM proteins (Supplementary Fig. S14, Supplementary Note 5). Capsaspora also has several components of the dystrophin-associated glycoprotein complex, another cell–ECM adhesion system. Both Capsaspora and choanoflagellates have cadherin domain-containing proteins, but M. brevicollis has a much larger repertoire (23 proteins)9 than Capsaspora, which has only one21 (Supplementary Fig. S15). Both immunoglobulin-like cell adhesion molecules and C-type lectins, which are lacking in Capsaspora, were present in the unicellular common ancestor of metazoans and choanoflagellates, as they are encoded by the M. brevicollis genome.

Several transcription factors arose and diversified in metazoans (for example, those involved primarily in developmental patterning and cell differentiation such as group A basic helix–loop–helix, ANTP-class homeodomains, POU-class homeodomains, Six, LIM, Pax and group I Fox). However, many other transcription factors, including some previously thought to be metazoan-specific, for example, NFκ, RUNX and Brachyury, were already present in the ancestral unicellular holozoans17 (Supplementary Figs S16–S18, Supplementary Table S8, Supplementary Note 5). Interestingly, some transcription factors that act downstream of some signalling pathways in metazoans, such as CSL (Notch–Delta pathway) and STAT (Jak–STAT pathway), are present in Capsaspora, whereas their upstream proteins are missing.

Our data reveal the contrasting evolutionary histories of extracellular (or membrane-bound) components versus cytoplasmic components of signalling pathways involved in metazoan multicellularity and development. Most metazoan receptors and diffusible ligands are either ancestral metazoan innovations or have independently diversified in metazoans, whereas the majority of their intracellular components were already present in the unicellular ancestors of metazoans (Fig. 4). Both Capsaspora and M. brevicollis lack receptors and ligands in several systems involved in cell communication and development in metazoans, for example, those in the Hedgehog, Rhodopsin family G-protein-coupled receptors, Wnt, transforming growth factor-β and nuclear receptor signalling pathways (Fig. 4). Notch signalling also seems to be a metazoan innovation, although Capsaspora has several receptor proteins that resemble the metazoan Notch and Delta proteins in their domain architecture, which may represent the ancestral components of this system (Supplementary Figs S19–S21, Supplementary Note 5). Both Capsaspora and M. brevicollis have large numbers of TKs (92 and 128, respectively)20 (Supplementary Figs S22 and S23, Supplementary Table S9). Again, the receptor-type TKs independently diversified in Capsaspora, M. brevicollis and metazoans, whereas the cytoplasmic TKs are mostly homologous among these three lineages, highlighting the animal-specific adaptation of the receptor-ligand system in the Metazoa20. The mitogen-activated protein kinase pathway, a downstream cytoplasmic signalling system of the TK pathway, is also present in Capsaspora in the diversified form that we see now in metazoans (Supplementary Figs S24 and S25, Supplementary Note 5). The diverse members of the G-protein α-subunit family and the regulator of G-protein-signalling family, which together coordinate signal transduction from the 7TM receptors to their specific effectors, are also present in the Capsaspora genome, indicating that the diversity of these components has been secondarily lost, to some extent, in the lineage leading to M. brevicollis (Supplementary Figs S26 and S27, Supplementary Note 5).


[Figure 3 heatmap. Rows: InterPro domains grouped into the functional categories Cell adhesion and extracellular components; Extracellular region of receptors and ligands; Other receptors of extracellular signals; Channels and transporters; Signal transduction in cell; Transcriptional control; Protein degradation/modification and apoptosis; Other/diverse function. Columns: Esil, Pfal, Ptet, Crei, Lbic, Ehis, Ddis, Cele, Atha, Ncra, Lmaj, Mbre, Nvec, Tadh, Dmel, Hsap, Aque, Cowc, Hmag, with a phylogeny labelled Phaeophyceae, Apicomplexa, Viridiplantae, Amoebozoa, Fungi, Filasterea, Choanoflagellata and Metazoa (animals), the last three forming Holozoa. Colour scale: normalized gene count, 0 to 1.]

Figure 3 | Capsaspora owczarzaki enrichment of metazoan-biased protein domains. The number of genes encoding proteins that contain each of the selected InterPro domains were analysed (the InterPro short names are shown on the right) for 19 eukaryote genomes. We chose 106 domains that are significantly (Fisher’s exact test; P < 1.0e−20) enriched in the metazoan genomes compared with the genomes of non-holozoan lineages. Redundant domains are not exhaustively shown. Domains present only in a single taxon are not shown (available in Supplementary Fig. S12). Values were normalized by the number of all protein-coding genes in the genomes, and relative values to the maximum were calculated. Numbers were manually entered for the protein tyrosine kinase catalytic domain (Tyr_kinase_cat_dom) to exclude mispredicted serine/threonine kinase domains (see Supplementary Note 5). Protein domains were manually classified into 12 functional categories, shown on the right. In this figure, the categories ‘Zinc-fingers’, ‘cytoskeleton and its control’, ‘Functions on DNA or RNA molecules’, ‘Virus and transposons’ and ‘Other/diverse functions’ are collapsed (only leucine-rich repeats are shown; full figure available in the Supplementary Fig. S13). Domains with high relative gene counts (>0.65) in Capsaspora are depicted in red. Hsap, H. sapiens; Dmel, D. melanogaster; Cele, C. elegans; Hmag, H. magnipapillata; Nvec, N. vectensis; Tadh, T. adhaerens; Aque, A. queenslandica; Mbre, M. brevicollis; Cowc, C. owczarzaki; Ncra, N. crassa; Lbic, L. bicolor; Ddis, D. discoideum; Ehis, E. histolytica; Atha, A. thaliana; Crei, C. reinhardtii; Pfal, P. falciparum; Lmaj, L. major; Ptet, P. tetraurelia; Esil, E. siliculosus. A widely-accepted phylogeny among species is depicted on the bottom.
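The two numerical steps in this caption, the Fisher's exact test used to select domains and the normalization of gene counts relative to the maximum, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' pipeline: the source does not say whether the test was one- or two-sided, so a one-sided (enrichment) test is assumed; how the 2×2 table was built from pooled genomes is also an assumption; and all counts in the example are hypothetical. Only the Python standard library is used.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical):
      a = metazoan genes with the domain,      b = metazoan genes without it
      c = non-holozoan genes with the domain,  d = non-holozoan genes without it
    Returns P(X >= a) under the hypergeometric null (fixed margins).
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c
    total = comb(n, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / total


def relative_to_max(domain_counts, gene_totals):
    """Normalize per-genome gene counts by genome gene number, then scale
    so that the largest normalized value across genomes equals 1."""
    frac = {sp: domain_counts[sp] / gene_totals[sp] for sp in domain_counts}
    top = max(frac.values())
    return {sp: (value / top if top else 0.0) for sp, value in frac.items()}


# Hypothetical counts for one domain; gene totals are taken from Table 1.
counts = {"H. sa": 300, "M. bre": 40, "C. owc": 90}
totals = {"H. sa": 22128, "M. bre": 9171, "C. owc": 8657}
print(relative_to_max(counts, totals))
print(fisher_exact_greater(120, 880, 5, 995))
```

Scaling to the per-domain maximum puts every row of the heatmap on the same 0 to 1 range, which is what makes a fixed threshold on relative gene count (such as the one used to highlight Capsaspora domains) comparable across domains.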


[Figure 4 schematic. Labelled components include cell adhesion (cadherin, integrins, the dystrophin-associated glycoprotein complex, IgCAM, C-lectin, extracellular matrix), pTyr signalling (RTKs, CTKs, RPTP, NRPTPs), MAPK cascade, Hedgehog, Notch, TGFβ, Wnt, JAK–STAT, nuclear receptor, 7TM receptor and G-protein, TOR and Hippo (growth regulation) pathways, and transcription factors (p53, Runt, NFκB, Grh, Myc, Brachyury, Jun, Fos, Sox, Ets, SMAD, homeodomains).]

Figure 4 | Schematic representation of the putative Capsaspora owczarzaki cell. Protein components of major metazoan cell adhesion complexes (green background) and various signalling pathways including receptors (yellow background) are depicted. Components with red and blue backgrounds indicate those found both in C. owczarzaki and M. brevicollis and those found in C. owczarzaki but not in M. brevicollis, respectively. Dotted components are absent in the C. owczarzaki genome; greyed when M. brevicollis has them. A grey striped line represents an actin filament, to which the cell–ECM–adhesion complexes bind. See Supplementary Note 5 for details. *, two domains hedge and hog are found in different proteins in M. brevicollis. **, receptor-type proteins with domain architectures similar to Notch and Delta proteins are present in C. owczarzaki. ***, proteins with similar domain architectures are present in M. brevicollis, but not confidently mapped to those metazoan families by phylogenetic analyses. ****, the repertoires of RTKs are totally different between C. owczarzaki, M. brevicollis and metazoans, and thus likely to have diversified independently in each lineage. 
4E-BP1, eukaryotic initiation factor 4E-binding protein 1; CK1, Casein kinase 1; C-Lectin, C-type lectin; CSL, CBF1/RBP-Jκ/suppressor of hairless/LAG-1; CTKs, cytoplasmic tyrosine kinases; DB, Dystrobrevin; DG, Dystroglycan; DP, Dystrophin; Dsh, Dishevelled; Ets, E-twenty six; FOX, Forkhead box; Gα, G-protein α-subunit; GluR, glutamate receptor; GPR108, G-protein-coupled receptor 108; GSK3, glycogen synthetase kinase 3; Grh, Grainy head; HD, homeodomain; IgCAM, Immunoglobulin-like cell adhesion molecule; ILK, Integrin-linked kinase; ITR, intimal thickness-related receptor; JNK, c-Jun N-terminal kinase; NRPTPs, non-receptor protein tyrosine phosphatases; OA1, Ocular albinism 1-like; PDE, phosphodiesterase; RGS, regulator of G-protein signalling; RTKs, receptor tyrosine kinases; RPTP, receptor protein tyrosine phosphatase; S6K p70, 70 kDa ribosomal protein S6 kinase; SAV, Salvador; Sd, Scalloped; SG, Sarcoglycan; SSPN, Sarcospan; SYN, Syntrophin; TALE, three amino-acid loop extension-class homeodomain; TGFβ, Transforming growth factor β; TOR, Target of rapamycin; YAP, Yes-associated protein.

Neither sexual reproduction nor meiosis has been reported in Capsaspora. Nonetheless, we identified in its genome a rich repertoire of proteins known to be involved in sex and meiosis in metazoans (Supplementary Fig. S28, Supplementary Note 5), suggesting the presence of a full sexual reproductive cycle in this organism. Capsaspora also has a rich repertoire of genes involved in cell cycle regulation (Supplementary Fig. S29), including some genes not present in M. brevicollis, such as cyclin E. We also found, as expected, that Capsaspora, which lacks flagellum or cilia, retains only a minor fraction (29 out of 117 genes) of the gene set encoding flagellar components (Supplementary Fig. S30, Supplementary Note 5). Moreover, all motor protein kinesins, which are involved in various basic cellular functions such as mitosis and transport in many cellular structures, are conserved between Capsaspora and H. sapiens, except for a few families including kinesins 2, 9, 13 and 17, which are thought to be flagellum components26. We also identified several RNA-binding proteins (Supplementary Figs S31 and S32, Supplementary Note 5), some of which are homologous to those involved in stem cell or germ-line cell development, such as bruno, daz, pl10 and pumilio. Although we identified putative homologues of some RNA-binding proteins involved in synthesis and functioning of the non-coding RNA in metazoans (for example, armitage, exportin-5 and Tudor-SN), many other key players (piwi, argonaute, dicer, drosha and pasha) are absent, suggesting either that the non-coding RNA system is non-functional in Capsaspora, or that the silencing mechanism of this filasterean is highly divergent. The Capsaspora genome also possesses, similar to the M. brevicollis genome, a large number of proteins homologous to those involved in neurosecretion and pre- and post-synapse formation and function (Supplementary Figs S33–S36, Supplementary Note 5).

Discussion

We have reported the first whole genome sequence of a filasterean, a close relative of metazoans. We show that the genome of Capsaspora encodes many proteins that are involved in cell adhesion, signalling and development in metazoans. Previously, the absence of a number of these proteins in the choanoflagellate M. brevicollis and in any sequenced fungi had misled inferences that they were metazoan-specific12,27,28, underscoring the importance of taxonomic sampling in comparative genomics. By adding the whole genome information of the filasterean Capsaspora, the sister group of choanoflagellates and metazoans, we have reconstructed a more robust picture of the unicellular ancestry of metazoans. This evolutionary scenario will be increasingly clarified as genome data from additional holozoan taxa (for example, ichthyosporeans) become available.

Our data show that the unicellular common ancestor of metazoans, choanoflagellates and filastereans already possessed a wide variety of gene families that, in metazoans, are involved in multicellularity and development. This early genetic complexity

raises at least two possibilities with regard to the ancestral roles of the encoded proteins. First, these proteins may have been already fulfilling functions similar to their roles in extant multicellular animals, such as communication between individual cells and cell-type differentiation. Alternatively, these proteins had different functions such as environmental sensing and later were co-opted for different functions in the multicellular context during metazoan evolution. As cell-cell communication and clear spatial differentiation have not been reported in Capsaspora, the latter possibility seems more plausible.

Our analyses of the Capsaspora genome have also more precisely defined the set of proteins and domains that evolved immediately after the divergence of metazoan lineages from filastereans and choanoflagellates. Among those, the evolution of protein components that are involved in intercellular communication represents an especially important step for the innovation of multicellularity. We propose that the acquisition of these new ‘metazoan-specific’ genes with novel functions and the co-option of pre-existing genes that evolved earlier in the unicellular holozoan lineage together represent key innovations that led to the emergence of metazoans. The genome of Capsaspora also opens the door to new research avenues, namely the analysis of the ancestral functions of these genes, which will provide further insights into the molecular mechanisms that allowed unicellular protists to evolve into multicellular animals.

Methods

Cell culture and nucleic acid extraction and sequencing. Live cultures of Capsaspora owczarzaki (ATCC30864) and Ministeria vibrans (ATCC5019; used only for mtDNA sequencing) were maintained at 23 °C in the ATCC 803 M7 medium, and 17 °C in the ATCC 1525 medium, respectively. Genomic DNA and total RNA were extracted using standard methods.

Mitochondrial genome. MtDNA was sequenced from a random clone library29 and gaps were filled by sequencing of respective PCR-amplified regions. Gene annotation of the mitochondrial genome was performed with MFannot (http://megasun.bch.umontreal.ca/cgi-bin/mfannot/mfannotInterface.pl), followed by manual inspection and addition of missing gene features.

Genome sequencing and assembly. Genomic DNA was sheared and cloned into plasmid (4 kb pOT and 10 kb pJAN) and fosmid (40 kb EpiFOS) vectors by standard methods. Resulting whole genome shotgun libraries were sequenced by Sanger chemistry, generating approximately eightfold paired-end raw reads: sixfold from the 4 kb library, 1.6-fold from the 10 kb library and 0.8-fold from the 40 kb library. Raw read sequences were submitted to NCBI’s Trace Archive and can be retrieved with the search parameters CENTER_NAME = ‘BI’ and CENTER_PROJECT = ‘G941’.

Sequencing reads were assembled by the Arachne assembler30 using the default parameters. After assembly, the AAImprover module (part of the Arachne assembler package) was run to improve assembly accuracy and contiguity. Finally, portions of the genome, which appeared to be misassembled, were manually broken to create the final assembly. The assembly was submitted to NCBI with accession number ACFS01000000, BioProject ID PRJNA20341.

Phylogenetic analysis. We analysed two independent data sets based on whole genome sequences: the mutual best hit (fMBH) data set used for assessing the phylogenetic position of the sponge A. queenslandica3 and the data set containing 145 putatively orthologous proteins (145POP data set), which were chosen by OrthoMCL2 software39. The collected protein sequences were aligned using the MAFFT program40, manually inspected and trimmed by the use of Gblocks program41 with the default parameters. We inferred the maximum likelihood trees by using RAxML 7.2.8 (ref. 42) with the LG + Γ model. A nonparametric bootstrap test with 100 replicates for each topology was performed. We further tested topologies by the Bayesian inference using PhyloBayes 3.2 (ref. 43) with the CAT + Γ evolutionary model44. The Monte Carlo Markov Chain sampler was run for 10,000 generations, and then burned-in the last 8,000 saving every 10 generations.

Protein domain gain and loss analysis. We ran the Hmmscan program from HMMER 3.0 package45 against the Pfam-A version 25 database using protein sets from 35 species: Amphimedon queenslandica, Arabidopsis thaliana, Aspergillus oryzae, Branchiostoma floridae, Brugia malayi, Caenorhabditis elegans, Capitella teleta, Capsaspora owczarzaki, Chlamydomonas reinhardtii, Coprinopsis cinerea, Cryptococcus neoformans, Daphnia pulex, Dictyostelium discoideum, Drosophila melanogaster, Homo sapiens, Hydra magnipapillata, Laccaria bicolor, Lottia gigantea, Monosiga brevicollis, Naegleria gruberi, Nematostella vectensis, Neurospora crassa, Physcomitrella patens, Phytophthora sojae, Rhizopus oryzae, Schizosaccharomyces pombe, Strongylocentrotus purpuratus, Tetrahymena thermophila, Thalassiosira pseudonana, Tribolium castaneum, Trichoplax adhaerens, Trypanosoma brucei, Tuber melanosporum, Ustilago maydis and Volvox carteri. Hits with the scores above the gathering threshold values were considered significant. Dollo parsimony criterion was used to infer the Pfam domains gained and lost along the branches of the phylogenetic tree. The Pfam domains were mapped to GO terms by the use of the Pfam2GO mapping (July 2011). The Ontologizer 2.0 program46 was used for the GO term enrichment analysis. We evaluated whether a GO functional category evolved in a certain evolutionary position using a P-value calculated by the topology-weighted algorithm47.

Domain enrichment analysis. Protein sets for 12 genomes (H. sapiens, D. melanogaster, C. elegans, H. magnipapillata, N. vectensis, T. adhaerens, A. queenslandica, M. brevicollis, C. owczarzaki, N. crassa, L. bicolor and D. discoideum) were first filtered by removing short proteins less than 30 amino acids. For genes that have multiple alternatively spliced isoforms, only the longest protein product was retained for each gene. Protein domain search was performed by the use of InterProScan48 against InterPro database25. The InterProScan results on the complete proteomes of other eukaryotes (E. histolytica, A. thaliana, C. reinhardtii, P. falciparum, L. major, P. tetraurelia, and E. siliculosus) were retrieved from the Uniprot (http://www.uniprot.org/) database. Protein domains that are enriched in metazoans compared with all the other non-metazoans except C. owczarzaki and M. brevicollis were selected by the use of Fisher’s exact test (P < 1.0e−20). The number of genes containing such domains, but not the number of domains themselves, was considered. Values were normalized by the numbers of the protein-coding genes in the whole genome. The results were depicted in a heatmap by the R and its Bioconductor package49.

Intergenic distance analysis. We approximated the intergenic distance by calculating the distance between two protein-coding sequences. We then ran two-sided t-tests on these distances at upstream (or downstream) regions of genes in each functional category against all other genes in the same genome. Genes were classified by Gene Ontology (GO) annotations50, which were generated by the use of Blast2GO51 and InterPro2GO52 pipelines.

Gene family analysis. We chose several gene families that are particularly interesting in the context of the evolution of multicellularity. For each gene family, we inferred the presence and absence of the gene or protein domains in chosen taxa using the HMMER45 package, mutual Blast and phylogenetic analyses based on maximum likelihood trees inferred by RAxML42.
Analysed taxa include three RNA sequencing. Total RNA was isolated from two differently-staged C. owczarzaki bilaterians (Homo sapiens, Strongylocentrotus purpuratus and Drosophila cultures with Trizol (Life Technologies). Libraries were sequenced using GAII and melanogaster), three non-bilaterian metazoans (Nematostella vectensis, Trichoplax HiSeq 2000 instruments (Illumina), which generated 76 base paired-end reads. adhaerens and Amphimedon queenslandica), the choanoflagellate M. brevicollis,the The RNA-seq data were used for the protein prediction. filasterean C. owczarzaki, three fungi (Rhizopus oryzae, Laccaria bicolor and Neurospora crassa), and the amoebozoan Dictyostelium discoideum. We also searched, if necessary, further basal eukaryotes whose genomes have been Gene prediction. An initial protein-coding gene set was called with Evidence- sequenced, in order to know the origin of gene families that could predate the split 31 Modeler by the combination with three ab initio predictions by GeneMark.hmm- between amoebozoans and opisthokonts. ES32, Augustus33, GlimmerHMM34, two sequence-homology-based predictions by Blast and GeneWise35 and transcript structures built from ESTs by PASA package36. The initial gene set was further improved by an incorporation of RNA- References seq data using PASA36 and Inchworm37 pipelines to obtain a final gene set. 1. Putnam, N. et al. Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science 317, 86–94 (2007). 2. Srivastava, M. et al. The Trichoplax genome and the nature of Placozoans. Synteny. We performed a synteny conservation analysis between C. owczarzaki Nature 454, 955–960 (2008). and M. brevicollis, A. queenslandica and N. vectensis using DAGchainer38 with 3. Srivastava, M. et al. The Amphimedon queenslandica genome and the evolution default parameters. of animal complexity. Nature 466, 720–726 (2010).


4. Chapman, J. et al. The dynamic genome of Hydra. Nature 464, 592–596 (2010).
5. Lang, B. F., O'Kelly, C., Nerad, T., Gray, M. W. & Burger, G. The closest unicellular relatives of animals. Curr. Biol. 12, 1773–1778 (2002).
6. Ruiz-Trillo, I., Roger, A., Burger, G., Gray, M. & Lang, B. A phylogenomic investigation into the origin of metazoa. Mol. Biol. Evol. 25, 664–672 (2008).
7. Shalchian-Tabrizi, K. et al. Multigene phylogeny of choanozoa and the origin of animals. PLoS One 3, e2098 (2008).
8. Torruella, G. et al. Phylogenetic relationships within the Opisthokonta based on phylogenomic analyses of conserved single-copy protein domains. Mol. Biol. Evol. 29, 531–544 (2012).
9. King, N. et al. The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans. Nature 451, 783–788 (2008).
10. Manning, G., Young, S., Miller, W. & Zhai, Y. The protist, Monosiga brevicollis, has a tyrosine kinase signaling network more elaborate and diverse than found in any known metazoan. Proc. Natl Acad. Sci. USA 105, 9674–9679 (2008).
11. Pincus, D., Letunic, I., Bork, P. & Lim, W. Evolution of the phospho-tyrosine signaling machinery in premetazoan lineages. Proc. Natl Acad. Sci. USA 105, 9680–9684 (2008).
12. Rokas, A. The origins of multicellularity and the early history of the genetic toolkit for animal development. Annu. Rev. Genet. 42, 235–251 (2008).
13. Fairclough, S. R. et al. Premetazoan genome evolution and the regulation of cell differentiation in the choanoflagellate Salpingoeca rosetta. Genome Biol. 14, R15 (2013).
14. Zmasek, C. & Godzik, A. Strong functional patterns in the evolution of eukaryotic genomes revealed by the reconstruction of ancestral protein domain repertoires. Genome Biol. 12, R4 (2011).
15. Hertel, L., Bayne, C. & Loker, E. The symbiont Capsaspora owczarzaki, nov. gen. nov. sp., isolated from three strains of the pulmonate snail Biomphalaria glabrata is related to members of the Mesomycetozoea. Int. J. Parasitol. 32, 1183–1191 (2002).
16. Sebe-Pedros, A., Roger, A., Lang, F., King, N. & Ruiz-Trillo, I. Ancient origin of the integrin-mediated adhesion and signaling machinery. Proc. Natl Acad. Sci. USA 107, 10142–10147 (2010).
17. Sebe-Pedros, A., de Mendoza, A., Lang, B., Degnan, B. & Ruiz-Trillo, I. Unexpected repertoire of metazoan transcription factors in the unicellular holozoan Capsaspora owczarzaki. Mol. Biol. Evol. 28, 1241–1254 (2011).
18. Young, S. et al. Premetazoan ancestry of the Myc-Max network. Mol. Biol. Evol. 28, 2961–2971 (2011).
19. Sebe-Pedros, A., Zheng, Y., Ruiz-Trillo, I. & Pan, D. Premetazoan origin of the hippo signaling pathway. Cell Rep. 1, 13–20 (2012).
20. Suga, H. et al. Genomic survey of premetazoans shows deep conservation of cytoplasmic tyrosine kinases and multiple radiations of receptor tyrosine kinases. Sci. Signal. 5, ra35 (2012).
21. Nichols, S. A., Roberts, B. W., Richter, D. J., Fairclough, S. R. & King, N. Origin of metazoan cadherin diversity and the antiquity of the classical cadherin/beta-catenin complex. Proc. Natl Acad. Sci. USA 109, 13046–13051 (2012).
22. Carr, M., Nelson, M., Leadbeater, B. & Baldauf, S. Three families of LTR retrotransposons are present in the genome of the choanoflagellate Monosiga brevicollis. Protist 159, 579–590 (2008).
23. Kim, J., Vanguri, S., Boeke, J., Gabriel, A. & Voytas, D. Transposable elements and genome organization: a comprehensive survey of retrotransposons revealed by the complete Saccharomyces cerevisiae genome sequence. Genome Res. 8, 464–478 (1998).
24. Ohno, S. Evolution by Gene Duplication 160 (Springer, 1970).
25. Hunter, S. et al. InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 40, D306–D312 (2012).
26. Wickstead, B. & Gull, K. The evolution of the cytoskeleton. J. Cell Biol. 194, 513–525 (2011).
27. Nichols, S. A., Dirks, W., Pearse, J. S. & King, N. Early evolution of animal cell signaling and adhesion genes. Proc. Natl Acad. Sci. USA 103, 1251–1256 (2006).
28. Degnan, B., Vervoort, M., Larroux, C. & Richards, G. Early evolution of metazoan transcription factors. Curr. Opin. Genet. Dev. 19, 591–599 (2009).
29. Burger, G., Lavrov, D., Forget, L. & Lang, B. Sequencing complete mitochondrial and plastid genomes. Nat. Protoc. 2, 603–614 (2007).
30. Jaffe, D. et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91–96 (2003).
31. Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, R7 (2008).
32. Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y. O. & Borodovsky, M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res. 33, 6494–6506 (2005).
33. Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19(Suppl 2), ii215–ii225 (2003).
34. Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004).
35. Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res. 14, 988–995 (2004).
36. Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).
37. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011).
38. Haas, B. J., Delcher, A. L., Wortman, J. R. & Salzberg, S. L. DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics 20, 3643–3646 (2004).
39. Li, L., Stoeckert, Jr. C. J. & Roos, D. S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189 (2003).
40. Katoh, K., Kuma, K., Toh, H. & Miyata, T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33, 511–518 (2005).
41. Castresana, J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17, 540–552 (2000).
42. Stamatakis, A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688–2690 (2006).
43. Lartillot, N., Lepage, T. & Blanquart, S. PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics 25, 2286–2288 (2009).
44. Lartillot, N. & Philippe, H. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. 21, 1095–1109 (2004).
45. Eddy, S. R. A new generation of homology search tools based on probabilistic inference. Genome Inform. 23, 205–211 (2009).
46. Bauer, S., Grossmann, S., Vingron, M. & Robinson, P. N. Ontologizer 2.0--a multifunctional tool for GO term enrichment analysis and data exploration. Bioinformatics 24, 1650–1651 (2008).
47. Alexa, A., Rahnenfuhrer, J. & Lengauer, T. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22, 1600–1607 (2006).
48. Quevillon, E. et al. InterProScan: protein domains identifier. Nucleic Acids Res. 33, W116–W120 (2005).
49. Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004).
50. Blake, J. A. & Harris, M. A. The Gene Ontology (GO) project: structured vocabularies for molecular biology and their application to genome and expression analysis. Curr. Protoc. Bioinformatics Chapter 7, Unit 7.2 (2008).
51. Conesa, A. et al. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21, 3674–3676 (2005).
52. Burge, S. et al. Manual GO annotation of predictive protein signatures: the InterPro approach to GO curation. Database (Oxford) 2012, bar068 (2012).

Acknowledgements

We thank Lora Lindsey for providing the differential interference contrast microscopic picture of C. owczarzaki. We thank Joshua Levin and Lin Fan for generating RNA-Seq libraries and the Broad Institute Genomics Platform for Sanger and Illumina sequencing. We thank Terrance Shea for genome assembly, Qiandong Zeng for genome annotation and Lucia Alvarado-Balderrama, Narmada Shenoy and James Bochicchio for data release and NCBI submission. We thank Nicole King for providing useful insights to the manuscript. H.S. was supported by the Marie Curie Intra-European Fellowship within the 7th European Community Framework Programme. Genome sequencing, assembly and some supporting analysis was supported by grants from the National Human Genome Research Institute (HG003067-05 through HG003067-10), as were C.N., C.R., B.H. and Z.C. B.F.L. and A.J.R. acknowledge financial support through the Canadian Research Chair program. This study was supported by an ICREA contract, a European Research Council Starting Grant (ERC-2007-StG-206883) and a grant (BFU2011-23434) from the Spanish Ministry of the Economy and Competitiveness (MINECO) awarded to I.R.-T. M.V. was supported by CNRS, the Agence Nationale de la Recherche (ANR grant BLAN-0294) and the Institut Universitaire de France.

Author contributions

H.S. performed bioinformatic analyses, analysed the data, and wrote the paper; Z.C. performed bioinformatic analyses and analysed the data; A.dM. performed bioinformatic analyses, analysed the data and was involved in study design; A.S.-P. performed bioinformatic analyses, analysed the data and performed the RNA extraction; M.W.B., E.K., M.C., P.K., M.V., N.S.-P., G.T., R.D. and G.M. performed bioinformatic analyses and analysed the data; B.F.L. extracted DNA and analysed the mitochondrial genome; C.R., B.J.H., A.J.R. and C.N. analysed the data and designed the sequencing strategy; I.R.-T. designed the study, analysed the data and wrote the paper. All authors discussed the results and commented on the manuscript.

Additional information

Accession Codes: The whole genome sequence and annotated protein sequences of C. owczarzaki are deposited in NCBI with accession number ACFS01000000, BioProject ID PRJNA20341. The RNA-seq raw read sequences were submitted to NCBI's Short Read Archive with the accession numbers SRX096928, SRX096921, SRX155797, SRX155796, SRX155795, SRX155794, SRX155793, SRX155792, SRX155791, SRX155790 and SRX155789.

Supplementary Information accompanies this paper at http://www.nature.com/naturecommunications

Competing financial interests: The authors declare no competing financial interests.

Reprints and permission information is available online at http://npg.nature.com/reprintsandpermissions/

How to cite this article: Suga, H. et al. The Capsaspora genome reveals a complex unicellular prehistory of animals. Nat. Commun. 4:2325 doi: 10.1038/ncomms3325 (2013).

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/


Results R5

Summary of article R5: Unexpected repertoire of typical metazoan transcription factors in the unicellular holozoan Capsaspora owczarzaki.

Transcriptional regulation is crucial for animal development; deciphering the early evolution of the transcription factors (TFs) associated with development is therefore fundamental to understanding the origin of metazoans. In this study we describe the repertoire of 17 typically animal TFs in the amoeba Capsaspora owczarzaki, a species belonging to a unicellular lineage closely related to choanoflagellates and metazoans. Drawing on the available genomic data, we carried out phylogenetic analyses of the gene families and subsequent comparative genomics, which allowed us to formulate new hypotheses on the origin and evolution of the TFs involved in animal development. The TF repertoire of C. owczarzaki is surprisingly complex, placing the origin of some TF families even further back than previously thought. These include the T-box, RUNX, NF-kappaB and p53 genes. Nevertheless, certain TF families that already existed before the origin of the animal kingdom, such as the helix-loop-helix (HLH) or homeodomain families, underwent significant expansion and diversification in the lineage leading to animals, probably as an adaptation to the new requirements of cell differentiation and morphogenesis in a multicellular context.


Unexpected Repertoire of Metazoan Transcription Factors in the Unicellular Holozoan Capsaspora owczarzaki

Arnau Sebé-Pedrós,†,1 Alex de Mendoza,†,1 B. Franz Lang,2 Bernard M. Degnan,3 and Iñaki Ruiz-Trillo*,1,4
1 Departament de Genètica & Institut de Recerca en Biodiversitat (Irbio), Universitat de Barcelona, Barcelona, Spain
2 Department of Biochemistry, Université de Montréal, Montréal, Canada
3 School of Biological Sciences, The University of Queensland, Brisbane, Queensland, Australia
4 Institució Catalana per a la Recerca i Estudis Avançats (ICREA), Passeig Lluís Companys, Barcelona, Spain
† These authors contributed equally to this work.
*Corresponding author: E-mail: [email protected].
Associate editor: Billie Swalla
Research article

Abstract

How animals (metazoans) originated from their single-celled ancestors remains a major question in biology. As transcriptional regulation is crucial to animal development, deciphering the early evolution of associated transcription factors (TFs) is critical to understanding metazoan origins. In this study, we uncovered the repertoire of 17 metazoan TFs in the amoeboid holozoan Capsaspora owczarzaki, a representative of a unicellular lineage that is closely related to choanoflagellates and metazoans. Phylogenetic and comparative genomic analyses with the broadest possible taxonomic sampling allowed us to formulate new hypotheses regarding the origin and evolution of developmental metazoan TFs. We show that the complexity of the TF repertoire in C. owczarzaki is strikingly high, pushing back further the origin of some TFs formerly thought to be metazoan specific, such as T-box or Runx. Nonetheless, TF families whose beginnings antedate the origin of the animal kingdom, such as homeodomain or basic helix-loop-helix, underwent significant expansion and diversification along metazoan and eumetazoan stems.

Key words: multicellularity, T-box, homeodomain, brachyury, origin Metazoa, choanoflagellates.

Introduction

What genomic changes took place at the dawn of the Metazoa remains a major biological question. Transcriptional regulation appears to be one of the most crucial aspects of animal development. Thus, understanding the early evolution of the transcriptional regulatory machinery is critical for drawing a complete picture of metazoan origins. Transcription factors (TFs) act as regulators of cell fate, cell cycle, patterning, proliferation, development, and differentiation in metazoans (Larroux et al. 2008). Previous studies have shown that most TFs that play important roles in bilaterian development originated before the divergence of extant animal phyla (Larroux et al. 2006, 2008; King et al. 2008; Degnan et al. 2009; Srivastava et al. 2010). However, the complexity of most TF families appears to have increased during early eumetazoan evolution, with cnidarians having a TF gene repertoire typically two to three times larger than that of sponges and placozoans (Putnam et al. 2007; Degnan et al. 2009; Srivastava et al. 2010). Based on comparative analyses, it has been hypothesized that the metazoan TF "toolkit" included members of the basic helix-loop-helix (bHLH), myocyte enhancer factor 2 (Mef2), Fox, Sox, T-box, Ets, nuclear receptor (NR), Rel/nuclear factor-kappaB (NF-kappaB), basic-region leucine zipper (bZIP), and Smad families and a range of homeobox-containing classes, including ANTP, Prd-like, Pax, POU, LIM-HD, Six, and three-amino acid-loop extension (TALE) (for a review, see Degnan et al. 2009).

Comparative analyses including the holozoan choanoflagellate Monosiga brevicollis, the putative sister-group to metazoans, are greatly improving our understanding of metazoan TF evolution. The genome of M. brevicollis contains the standard set of TFs observed across eukaryotes but lacks most of the well-known metazoan TFs, except p53, Myc, and a putative Sox (King et al. 2008; Degnan et al. 2009). Under this scenario, metazoan-specific TFs appear to include ANTP, Prd-like, POU, LIM-HD, and Six homeobox genes, group I Fox, most bHLH groups (except B), some bZIP families, Ets, Runx, Mef2, and NR families (Degnan et al. 2009).

To gain further insight into the evolution of TFs leading to the metazoan lineage, we characterized and analyzed all the TFs that supposedly constitute the metazoan TF toolkit in another close unicellular relative of animals, the amoeboid holozoan Capsaspora owczarzaki, putatively the sister-group to metazoans and choanoflagellates (Ruiz-Trillo et al. 2004, 2008; Shalchian-Tabrizi et al. 2008; Brown et al. 2009; see fig. 1). The complete genome sequence of C. owczarzaki (hereafter "Capsaspora") has recently been obtained under the "UNICORN project" at the Broad Institute (Ruiz-Trillo et al. 2007). In addition to the TFs outlined above, our survey of the Capsaspora genome in this study also included other TFs known to be important to animal development,

© The Author 2010. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]

Mol. Biol. Evol. 28(3):1241–1254. 2011 doi:10.1093/molbev/msq309 Advance Access publication November 17, 2010

[Figure 1: table of TF domain counts per taxon. Columns: bHLH, bZIP, HMG box, MADS-box, Homeobox, Forkhead, CP2, Churchill, T-box, Runt, p53 DNA binding, STAT DNA binding, RHD, MH1+MH2 (Smad), Ets, Hormone receptor (NR). Rows: Bilateria, Nematostella vectensis, Trichoplax adhaerens, Amphimedon queenslandica, Monosiga brevicollis, Capsaspora owczarzaki, Dikarya, Allomyces macrogynus, Spizellomyces punctatus, Thecamonas trahens, Amoebozoa. Clade labels on the tree: Metazoa, Holozoa, Fungi, Opisthokonts, Apusozoa, Unikonts.]

FIG. 1. Table of domain presence and number across unikonts. Columns represent all the PFAM domains analyzed in this study. The number of genes in each TF family was inferred from each organism's proteome by PfamScan using the PfamScan default parameters. For Monosiga brevicollis and Capsaspora owczarzaki, the analyses were performed by HMMER 3.0 searches. For Smad proteins, containing one MH1 and one MH2 domain, the number shown is the minimal number of either MH1 or MH2. For Bilateria, Dikarya, and Amoebozoa the average number is shown. Bilateria includes Homo sapiens, Ciona intestinalis, Drosophila melanogaster, Anopheles gambiae, Caenorhabditis elegans, Helobdella robusta, and Lottia gigantea. Dikarya includes Saccharomyces cerevisiae, Schizosaccharomyces pombe, Cryptococcus neoformans, Yarrowia lipolytica, Ustilago maydis, Aspergillus niger, Neurospora crassa, and Phanerochaete chrysosporium. Amoebozoa includes Dictyostelium discoideum, Dictyostelium purpureum, Entamoeba histolytica, Entamoeba dispar, and Acanthamoeba castellanii. The phylogenetic relationships are based on several recent phylogenomic studies (Burki et al. 2008; Ruiz-Trillo et al. 2008; Brown et al. 2009; Liu et al. 2009; Minge et al. 2009).

including Churchill, p53, Stat, and LSF/Grainyhead (GRH) (fig. 1). Comparative genomic analyses were performed on holozoan genomes, and in some cases, other recently sequenced opisthokont and apusozoan genomes, namely Allomyces macrogynus, Spizellomyces punctatus, and Thecamonas trahens (see Materials and Methods, fig. 1). These results show that the complexity of TFs in Capsaspora is very high, indicating that some TFs thought to be metazoan specific evolved prior to the metazoan and choanoflagellate divergence and were subsequently lost in the choanoflagellate lineage.

Materials and Methods

Taxonomic Sampling

We surveyed, and characterized, a list of metazoan TFs in Capsaspora. In some cases, we extended our searches to the widest possible set of eukaryotic taxa. This was the case for those TF families with specific and unique domains: T-box (T-box DNA-binding domain), Runx (Runt DNA-binding domain), NF-kappaB (Rel homology domain [RHD]), Mef2 (MADS box + Mef2 domain), p53 (p53 DNA-binding domain), Stat (Stat DNA-binding domain), Churchill (Churchill domain), Smad (MH1 + MH2 domains), Ets (Ets domain), and NR. Our extended searches included published and publicly available eukaryotic genomes, and other UNICORN taxa, such as the basal fungi A. macrogynus, S. punctatus, and the apusozoan T. trahens (see http://www.broadinstitute.org/annotation/genome/multicellularity_project/MultiHome.html). For the remaining TF families (i.e., bZIP [bZIP domain], Fox [forkhead domain], Sox [HMG box], homeobox [homeodomain], bHLH [bHLH domain], and LSF/GRH [CP2 domain]), we classified those Capsaspora genes with homology to metazoan genes. To this end, we used published fungal, metazoan, and choanoflagellate homologs. We also characterized bZIPs and Mef2 in M. brevicollis.

Gene Searches

A primary search was performed using the basic local alignment search tool (BLAST: BlastP and TBlastN) using Homo sapiens proteins as queries against Protein and Genome databases with the default BLAST parameters and an e value threshold of 10 × 10^-5 at the National Center for Biotechnology Information (NCBI) and against completed or on-going genome project databases at the Joint Genome Institute (JGI), the Broad Institute, as well as the A. queenslandica genome database (www.metazome.net/amphimedon). In the case of T. trahens, A. macrogynus, and Acanthamoeba castellanii, we assembled the trace data using the WGS assembler (http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Main_Page). We then annotated the genes of interest using both Genscan (Burge and Karlin 1997) and Augustus (Stanke and Morgenstern 2005) and performed local BLAST searches. When the BLAST searches of the genome data described above returned significant


"hits," the sequences obtained were then reciprocally searched against the NCBI protein database by BLAST in order to confirm the validity of the sequences retrieved with the initial search. HMMER searches using HMMER3.0b2 (Eddy 1998) were also performed, with standard PFAM profiles in the case of widespread domains or with home-made profiles in the case of specific domains.

Protein Domain Arrangements
For all proteins, the presence of specific protein domains was further checked by searching the Pfam (http://pfam.sanger.ac.uk/search) and SMART (http://smart.embl-heidelberg.de/) databases.

Polymerase Chain Reaction confirmation of C. owczarzaki T-box, Runx, and NF-kappaB Genes
We confirmed the presence of the three TFs that were formerly considered to be metazoan specific and are now identified in Capsaspora (Runx, T-box, and NF-kappaB), using gene-specific oligonucleotide primers. The mRNA was extracted using a Dynabeads mRNA purification kit (Invitrogen, Carlsbad, CA) and subsequent reverse transcriptase-polymerase chain reaction (RT-PCR) was performed using a Superscript III First Strand Synthesis kit (Invitrogen). The full sequences of the 5' and 3' ends of the cited Capsaspora TF cDNAs were obtained by RACE, using a nested PCR with specific oligonucleotide primers designed from the original genome data. Both coding and noncoding strands were sequenced using an ABI PRISM BigDye Termination Cycle Sequencing Kit (Applied Biosystems, Foster City, CA). New sequences were deposited in GenBank under the following accession numbers: GU985459 (Capsaspora Bra-like), GU985460 (Capsaspora double-tbox), GU985461 (Capsaspora Tbox3), GU985462 (Capsaspora Runx1), GU985463 (Capsaspora Runx2), and GU985464 (Capsaspora NF-kappaB).

Phylogenetic Analyses
Alignments were constructed for the following gene families and classes: T-box, homeobox, Fox, Sox, bHLH, bZIP, LAG, signal transducer and activator of transcription (STAT), Mef2, p53, NF-kappaB, Churchill, HMG box, GRH/LSF, and Runx. Alignments were obtained using the MAFFT v.6 online server (Katoh, Kuma, Miyata, and Toh 2005; Katoh, Kuma, Toh, and Miyata 2005) and then manually inspected and edited in Geneious. Only those species and those positions that were unambiguously aligned were included in the final analyses. Maximum likelihood (ML) phylogenetic trees were estimated by RAxML (Stamatakis 2006) using the PROTGAMMAWAGI model, which uses the Whelan and Goldman (WAG) amino acid exchangeabilities and accounts for among-site rate variation with a four-category discrete gamma approximation and a proportion of invariable sites (WAG + Γ + I). Statistical support for bipartitions was estimated by performing 100 bootstrap replicates using RAxML with the same model. Bayesian analyses were performed with MrBayes 3.1 (Ronquist and Huelsenbeck 2003), using the WAG + Γ + I model of evolution, with four chains, a subsampling frequency of 100, and two parallel runs. Runs were stopped when the average standard deviation of split frequencies of the two parallel runs was <0.01, usually at around 1,000,000 generations. The two LnL graphs were checked and an appropriate burn-in length established; stationarity of the chain typically occurred after ~15% of the generations. Bayesian posterior probabilities (BPP) were used to assess the confidence values of each bipartition.

Homeodomain Gene Assignment
An alignment with members of the ANTP, Paired-like, POU, and LIM homeodomain classes was constructed using published data from Amphimedon, Drosophila, and Nematostella and other already classified sequences (Larroux et al. 2008). A RAxML best tree resulting from this phylogeny was produced to obtain the fixed topology, which recovered monophyly for all four classes. From this tree, we manually created constrained topologies that represented all the possible positions of Capsaspora non-TALE homeodomains. Site-wise log-likelihoods were calculated for all the generated topologies with RAxML. Best-scoring ML trees were chosen using the likelihood-based approximately unbiased (AU) test as implemented in CONSEL (Shimodaira and Hasegawa 2001). The positions of Capsaspora homeodomain genes that could not be statistically excluded (P > 0.05) were taken into account. Whenever the significant positions fell in the branches that connect the different classes of homeodomains (POU, LIM, ...), the homeodomain was not classified. When Capsaspora hits fell inside just one cluster (e.g., Capsaspora6 inside paired-like), they were classified accordingly.

Quantitative TF analyses in Unikont Taxa
To quantify the number of genes in each TF family, we used PfamScan with its default parameters. The predicted proteomes used for PfamScan analysis were Amphimedon queenslandica (JGI), Trichoplax adhaerens (JGI), Nematostella vectensis (JGI), H. sapiens (NCBI), Ciona intestinalis (JGI), Drosophila melanogaster (NCBI), Anopheles gambiae (NCBI), Caenorhabditis elegans (NCBI), Helobdella robusta (JGI), Lottia gigantea (JGI), Saccharomyces cerevisiae (NCBI), Schizosaccharomyces pombe (NCBI), Cryptococcus neoformans (JGI), Yarrowia lipolytica (NCBI), Ustilago maydis (JGI), Aspergillus niger (JGI), Neurospora crassa (NCBI), Phanerochaete chrysosporium (JGI), A. macrogynus (Broad Institute), S. punctatus (Broad Institute), Dictyostelium discoideum (NCBI), Dictyostelium purpureum (JGI), Entamoeba histolytica (NCBI), Entamoeba dispar (NCBI), and A. castellanii (home-made prediction). For M. brevicollis and Capsaspora, the analyses were performed by HMMER 3.0 searches. For Smad proteins, which contain one MH1 and one MH2 domain, the number was inferred by taking the minimal number of either MH1 or MH2 present in the proteomes.
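The Smad counting rule described above can be illustrated with a minimal sketch. It assumes domain hits are available as (protein, domain) pairs and counts distinct proteins per domain; the hit list, the protein identifiers, and the function names are hypothetical and are not taken from the study.

```python
from collections import defaultdict

def proteins_per_domain(hits):
    """Map each domain name to the set of proteins that carry it."""
    by_domain = defaultdict(set)
    for protein_id, domain in hits:
        by_domain[domain].add(protein_id)
    return by_domain

def estimate_smad_count(hits):
    """Estimate the number of Smad genes as the smaller of the MH1 and MH2 counts.

    Counting distinct proteins (rather than raw hits) is an assumption made
    for this illustration.
    """
    by_domain = proteins_per_domain(hits)
    return min(len(by_domain["MH1"]), len(by_domain["MH2"]))

# Invented toy hits for one hypothetical proteome.
toy_hits = [
    ("p1", "MH1"), ("p1", "MH2"),
    ("p2", "MH1"),
    ("p3", "MH2"), ("p4", "MH2"),
    ("p5", "T-box"),
]
smad_estimate = estimate_smad_count(toy_hits)
```

Taking the minimum is a conservative choice: a protein is only counted as a Smad if both domains could in principle be paired, so isolated MH1 or MH2 hits do not inflate the estimate.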

1243 Sebe´-Pedro´s et al. · doi:10.1093/molbev/msq309 MBE

Results and Discussion

Rel/NF-kappaB
The RHD is a conserved DNA-binding and dimerization domain that is present in the N-terminal region of two protein families: nuclear factor of activated T-cells (NFAT) and Rel/NF-kappaB. NFAT and Rel/NF-kappaB are involved in immune system processes in metazoans (Macian 2005). Rel/NF-kappaB also plays different roles in development and cell differentiation, receiving inputs from several signaling pathways (Hayden and Ghosh 2004). Until now, the RHD domain had not been identified outside metazoans and was thus considered a metazoan innovation (Gauthier and Degnan 2008).

However, we identified a single RHD domain in Capsaspora but failed to recover RHD from any other sequenced nonmetazoan taxa (fig. 1). Our phylogenetic analysis of the RHD domain shows the Capsaspora homolog branching as the sister group of all metazoan Rel/NF-kappaB (supplementary fig. S1, Supplementary Material online). Furthermore, the Capsaspora RHD-domain-containing protein shares several key features with metazoan Rel and NF-kappaB homologs, such as 1) a highly conserved and specific recognition loop located within the RHD domain, which is involved in dimerization; 2) an IPTG or RHD2 domain, which confers binding specificity; 3) a basic nuclear-localization sequence; 4) a glycine-serine-rich region; and 5) several ankyrin repeats, which are exclusive to metazoan NF-kappaB proteins (supplementary fig. S2, Supplementary Material online).

Thus, our data show that the RHD domain is not exclusive to metazoans as previously thought but rather originated prior to the divergence of Capsaspora from choanoflagellates and metazoans. This implies that the RHD domain was subsequently lost in the choanoflagellate lineage.

Runx
The Runt DNA-binding domain defines a family of metazoan TFs (Runx) with essential roles in animal development (Coffman 2003; Robertson et al. 2009). They can act as transcriptional activators or repressors, in the latter case usually via corepressors of the Groucho/TLE family (Wheeler et al. 2000). Runx genes encode the Runt DNA-binding and heterodimerization domain and a C-terminal WRPY motif that interacts with the Groucho/TLE corepressor (Coffman 2003), except in the demosponge A. queenslandica and some bilaterian paralogs (specifically one of the two leech and planarian paralogs), which all lack the C-terminal WRPY motif (Robertson et al. 2009). A single Runx gene is present in A. queenslandica, N. vectensis, and T. adhaerens, although most bilaterians have several copies as a result of independent duplications (Rennert et al. 2003). Runx was previously considered to be metazoan specific (Robertson et al. 2009).

We failed to recover Runx genes from any sequenced nonmetazoan genome except Capsaspora, which has two genes (fig. 1). Both Capsaspora Runx proteins possess key DNA-binding amino acids in the Runt motif (Wheeler et al. 2000; Sullivan et al. 2008), although only one of the paralogs (Co_Runx1) has the two Cys residues involved in redox regulation (Akamatsu et al. 1997) (supplementary fig. S3, Supplementary Material online). Interestingly, as in A. queenslandica and one of the two leech and planarian paralogs, both Capsaspora Runx proteins lack the specific C-terminal WRPY Groucho-interacting motif. In contrast to A. queenslandica, however, Capsaspora does not encode Groucho in its genome. Neither does Capsaspora encode CBFβ, the heterodimeric-binding partner of the Runt domain that enhances its DNA affinity (Sullivan et al. 2008). This suggests that the Runt domain acts independently from CBFβ in Capsaspora. Our results show that Runx originated prior to the divergence of Capsaspora from choanoflagellates and metazoans, being secondarily lost in the choanoflagellate lineage. We hypothesize that Runx originally functioned independently of Groucho and CBFβ proteins and that the WRPY Groucho-interacting motif appeared in the eumetazoan lineage, as previously suggested (Robertson et al. 2009).

T-box
T-box TFs are characterized by an evolutionarily conserved DNA-binding motif of 180-200 amino acids, the T-box domain (Smith 1999). They are key regulators of metazoan development (Muller and Herrmann 1997). The most well-known type of T-box is Brachyury, which has a key role in mesoderm specification (Marcellini et al. 2003), although its ancestral function may have been blastopore determination and gastrulation (Scholz and Technau 2003). T-box genes were previously generally considered to be metazoan specific (King et al. 2008; Larroux et al. 2008; Rokas 2008). Here, we report the discovery of T-box genes in two nonmetazoan species. Three T-box genes are present in Capsaspora (one containing two consecutive T-box domains), and one gene exists in the basal chytrid fungus S. punctatus (fig. 1). Our searches, however, failed to recover T-box homologs from any other fungi (including the chytrids A. macrogynus and Batrachochytrium dendrobatidis) or other eukaryotes (including the choanoflagellate M. brevicollis). Remarkably, all the T-box homologs from both Capsaspora and S. punctatus contain most of the key DNA-binding and dimerization amino acids of the metazoan T-box (Muller and Herrmann 1997; Bielen et al. 2007) (supplementary fig. S4, Supplementary Material online). The phylogenetic analysis of T-box domains (fig. 2) places one Capsaspora homolog (Co-Bra) inside the Brachyury family (bootstrap value [BV] = 50%). The two T-box domains in the Capsaspora "double-tbox" (Co-Dtbx1 and Co-Dtbx2) and the S. punctatus T-box clearly cluster together adjacent to the Brachyury family. The third Capsaspora homolog (Co-Tbx3) clusters within a group of unclassified T-box genes from the sponge A. queenslandica that may represent an independent and novel class of T-box genes. Our general topology supports the hypothesis that Brachyury is probably the ancestral class within the T-box family (Adell et al. 2003; Adell and Muller 2005; Larroux et al. 2008).

1244 Evolution of Metazoan Transcription Factors · doi:10.1093/molbev/msq309 MBE

[Figure 2: tree image. Color legend: Fungi, Filasterea, Porifera, Cnidaria & Ctenophora, Placozoa, Bilateria. Labeled families include Brachyury, Tbx1/15/20, Tbx2/3, and Tbx4/5. Scale bar: 0.4.]

FIG. 2. ML tree of T-box domains showing the different T-box families. The tree is rooted using the midpoint-rooted tree option. Statistical support was obtained by RAxML with 1,000 bootstrap replicates (BV) and BPP. Both values are shown on key branches. Colors show different taxonomic assignments. Aq (Amphimedon queenslandica), Av (Axinella verrucosa), Co (Capsaspora owczarzaki), Dm (Drosophila melanogaster), He (Hydractinia echinata), Hs (Homo sapiens), Hv (Hydra vulgaris), Ml (Mnemiopsis leidyi), Nv (Nematostella vectensis), Om (Oopsacas minuta), Pc (Podocoryne carnea), Pp (Pleurobrachia pileus), Sd (Suberites domuncula), Sp (Spizellomyces punctatus), Sr (Sycon raphanus), Ta (Trichoplax adhaerens). Co-Dtbx1 and Co-Dtbx2 are the two T-box domains of the same Capsaspora T-box gene (for further details, see main text).

Moreover, our findings imply that T-box genes appeared not in metazoans but in the common ancestor of opisthokonts and were subsequently lost in most fungi and in choanoflagellates.

Churchill
Churchill is a zinc-finger TF that is involved in cell movement and cell fate determination (Londin et al. 2007). In Xenopus and chick, Churchill appears to regulate the T-box gene brachyury (Sheng et al. 2003). We have found orthologs of Churchill in Capsaspora, T. trahens, and, interestingly, also in the amoebozoan A. castellanii (fig. 1 and supplementary fig. S5, Supplementary Material online). This finding indicates a deeper origin of this gene than previously thought, probably in the common ancestor of unikonts. This suggests that Churchill was secondarily lost in fungi and choanoflagellates as well as in other amoebozoans. What role the Churchill orthologs play in Capsaspora, T. trahens, or A. castellanii, and whether, in Capsaspora, it is related at all to its T-box genes, is unknown.

p53
The p53 tumor suppressor protein is a multifaceted TF that is involved in different cellular responses to DNA damage, such as DNA repair, cell cycle arrest, senescence, and apoptosis (Coutts and La Thangue 2005; Espinosa 2008). The p53 family includes p53, p63, and p73, the last two being more closely related to each other than to p53. The three p53 members have some differences in function and in protein domain architecture. The p63 and p73 proteins share an additional C-terminal sterile alpha motif (SAM) domain, whereas all three share a transcriptional activation domain, a DNA-binding domain, and a C-terminal tetramerization domain (Nedelcu and Tan 2007). Choanoflagellates have both p53 and p63/73 classes (Nedelcu and Tan 2007). Here, we characterize a unique member of the p53 gene family in Capsaspora (fig. 1 and supplementary fig. S6, Supplementary Material online); the gene encodes a SAM domain. The phylogenetic analysis places Capsaspora-p53/63/73 close to the choanoflagellate group (supplementary fig. S7, Supplementary Material online). The tree topology implies that the last common ancestor of holozoans had a single p53/63/73 gene, which followed independent divergences in vertebrates and choanoflagellates. In the absence of DNA damage, p53 appears to be downregulated by ubiquitination, which in vertebrates is carried out by the vertebrate-exclusive Mdm2 protein. However, other mechanisms of regulation have been proposed, such as ubiquitin

ligases or CREB-binding protein (CBP)/p300 (Shi et al. 2009). Interestingly, we identified CBP/p300 both in Capsaspora and M. brevicollis (see below), although whether CBP/p300 downregulates p53 in these holozoans remains unknown.

Stat
STAT proteins are TFs that, in response to a wide variety of extracellular signaling proteins, regulate the action of several genes that are involved in cell growth and homeostasis (Bromberg 2002; Levy and Darnell 2002). Structurally, STAT proteins have an N-terminal interacting domain, a STAT alpha domain with a coiled-coil structure involved in protein-protein interactions (e.g., it recruits HATs, especially CBP/p300), a STAT DNA-binding domain, an SH2 domain, and a C-terminal transactivation domain (Levy and Darnell 2002) (supplementary fig. S8, Supplementary Material online). The activation of STAT is mediated by the phosphorylation of a key tyrosine residue located after the SH2 domain (Levy and Darnell 2002). Our searches identified well-conserved STAT proteins in Capsaspora, M. brevicollis, and the apusozoan T. trahens (fig. 1). The STAT proteins from the latter two taxa appear, however, to be slightly truncated at the 5' end (see supplementary fig. S8, Supplementary Material online). STAT proteins had previously been identified in amoebozoans (Kawata et al. 1997; Lee et al. 2008; Araki et al. 2010), but the protein domain analysis clearly showed that amoebozoan STATs are quite different from metazoan STAT proteins (supplementary fig. S8, Supplementary Material online). In contrast, the homologs from M. brevicollis, Capsaspora, and T. trahens are very similar to metazoan STATs. Moreover, a phylogenetic analysis of STATs using amoebozoan CudA proteins as outgroup (Yamada et al. 2008) showed amoebozoan-specific STATs as a sister group to the holozoan + apusozoan clade (BV = 92%) (supplementary fig. S9, Supplementary Material online). As STAT proteins are present in extant apusozoans, these are likely to have been lost early in the fungal lineage.

Metazoan STAT proteins form part of the JAK signaling pathway, which is absent in nonmetazoan lineages (King et al. 2008). However, STAT proteins can interact with other receptor and nonreceptor tyrosine kinases (Kawata et al. 1997; Levy and Darnell 2002). Indeed, the distribution of STATs coincides with the distribution of tyrosine kinases among eukaryotes, being present in amoebozoans (Kawata et al. 1997; Goldberg et al. 2006), apusozoans and Capsaspora (Ruiz-Trillo I, unpublished data), choanoflagellates (King et al. 2008; Manning et al. 2008; Suga et al. 2008), and metazoans (Mayer 2008), all of which also have tyrosine kinases.

bZIP
bZIP TFs are named after the highly conserved structure containing a basic region and a leucine zipper (Hurst 1994). The bZIP proteins are ubiquitous among eukaryotes and are involved in several processes, such as environmental sensing and development (Deppmann et al. 2006). We have identified 25 and 15 bZIP proteins in Capsaspora and M. brevicollis, respectively. Amphimedon queenslandica has 20, and the average bilaterian, 38 (fig. 1). Interestingly, the chytrid fungus A. macrogynus has 43 bZIP genes, whereas most Dikarya have approximately 13 (fig. 1). We could only classify unambiguously seven and six of the bZIP proteins present in Capsaspora and M. brevicollis, respectively. A phylogenetic analysis including only the classified proteins showed that Capsaspora bZIPs correspond to PAR, C/EBP, Atf2, Oasis, Atf6, and CREB families, whereas the Monosiga homologs correspond to Atf4/5, Atf2, Oasis, and Atf6 families (fig. 3, see supplementary fig. S10, Supplementary Material online for a tree with all Capsaspora genes). Based on these analyses, we hypothesize that most, if not all, current metazoan bZIP families were present in the holozoan ancestor, with some of them subsequently being lost in choanoflagellates and Capsaspora. Some families, such as CREB, most likely underwent a protein domain rearrangement within metazoans, similarly to that described in other gene families (King et al. 2008; de Mendoza et al. 2010). Interestingly, all bZIP proteins that we identified in the unicellular relatives of metazoans belong to families that act strictly (Atf6, PAR, CREB, Oasis) or facultatively (Atf4/5, Atf2, C/EBP) as homodimers. This suggests that bZIP proteins in unicellular organisms may work mostly as homodimers, as already seen in yeast bZIP interactions (Deppmann et al. 2006). Our data suggest that although bZIP originated before the dawn of the Metazoa, their connectivity and combinatorial interactions may have increased in animals. For example, the Capsaspora homolog of CREB does not have the kinase-inducible activation domain that allows its interaction with p300/CBP (Giebler et al. 2000), even though p300/CBP is present in the Capsaspora genome.

bHLH
bHLH is a domain that is present in a large superfamily of TFs that are widespread among eukaryotes. In metazoans, they regulate critical developmental processes, such as neurogenesis, sex determination, myogenesis, and hematopoiesis (Jones 2004). This family of TFs has a DNA-binding basic region followed by two alpha helices separated by a variable loop region. Many bHLH proteins also include other domains that are involved in protein-protein interactions (Simionato et al. 2007). The bHLH proteins can act as homodimers or heterodimers to regulate gene expression. Metazoan bHLH have been grouped into six different higher order clades (A to F) (Simionato et al. 2007; Degnan et al. 2009) (for a general overview, see fig. 4). Group A, which includes genes such as MyoD and neurogenin, has only a bHLH domain and is exclusive to metazoans. Group B, which includes Myc or SREBP, has a leucine zipper 3' to the bHLH domain and is found throughout the eukaryotes. Group C, which includes Clock and ARNT, has two PAS domains (PAS and PAS3) 3' to the bHLH domain and is also thought to be exclusive to metazoans. Group D, which is metazoan specific, lacks the DNA-binding basic region, and hence, its members are unable to bind to DNA, acting as antagonists to Group A members. Most group


[Figure 3: tree image. Labeled bZIP families, each shown with its DNA-recognition signature: Oasis, Fos, XBP1, Atf6, Atf4/5, CREB, Atf3, C/EBP, MAF, PAR, BACH, Nfe2, B-ATF, Jun, Atf2. Scale bar: 0.3.]

FIG. 3. ML tree of bZIP genes including the unambiguously assigned Capsaspora bZIP homologs. The tree is rooted using the midpoint-rooted tree option. Statistical support was obtained by RAxML with 1,000 bootstrap replicates (BV) and BPP. Both values are shown on key branches. Aq (Amphimedon queenslandica), Co (Capsaspora), Mb (Monosiga brevicollis). Metazoan branches depicted in red and fungal branches in green. For each family, the signature sequence for DNA recognition is indicated and only proteins with this conserved motif are included in the family (Fujii et al. 2000). A tree including all Capsaspora bZIP genes is shown in supplementary figure S10 (Supplementary Material online).
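The caption states that a protein is placed in a family only if it carries that family's DNA-recognition signature. As a minimal sketch of how such a signature check could be expressed, the code below converts a signature string of the kind shown in the figure (for example NxxAAxxCR, where x is taken to mean any residue and a parenthesized group such as (A/S) is taken to mean alternatives) into a regular expression. The function names and the test sequences are hypothetical, and this is not presented as the procedure the authors used.

```python
import re

def signature_to_regex(signature):
    """Compile a signature such as 'NxxAAxx(A/S)R' into a regular expression.

    Assumed notation: lowercase 'x' is any residue; '(A/S)' is A or S.
    """
    # Turn "(A/S)" into the character class "[AS]" first.
    pattern = re.sub(
        r"\(([A-Z](?:/[A-Z])+)\)",
        lambda m: "[" + m.group(1).replace("/", "") + "]",
        signature,
    )
    # Then turn every wildcard position into ".".
    pattern = pattern.replace("x", ".")
    return re.compile(pattern)

def has_signature(sequence, signature):
    """Return True if the signature occurs anywhere in the protein sequence."""
    return signature_to_regex(signature).search(sequence) is not None

# Invented toy sequences, used only to exercise the functions.
toy_match = has_signature("KRNREAASRCRQ", "NxxAAxxCR")
toy_miss = has_signature("KRNREAASRSRQ", "NxxAAxxCR")
```

A check of this kind only tests for the presence of the motif; it says nothing about where in the basic region the motif lies, which a real classification would also need to consider.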

E proteins include a metazoan-specific orange domain and a WRPW peptide in their carboxyl terminal part. Finally, Group F lacks the DNA-binding basic region but possesses a COE domain, which is involved both in dimerization and binding (Simionato et al. 2007).

We identified 31 bHLH proteins in Capsaspora, including orthologs of Myc, Mad, Max, SREBP, Mlx/TF4, MITF, and USF group B families, nonspecific homologs (ARNT-like) of group C, and some unclassifiable proteins (see figs. 1 and 4, supplementary figs. S11-S13, Supplementary Material online). This is more than double the bHLH genes found in Monosiga (11) and most fungi (around 10 in Dikarya). Compared with Metazoa, Capsaspora has a wider bHLH repertoire than the sponge A. queenslandica (18), but half of what is present in cnidarians (72 in N. vectensis) or bilaterians (average of 69) (fig. 1). The choanoflagellate M. brevicollis contains Max, SREBP, Myc, and a lineage-specific group of bHLH genes. Thus, homologs of group C of


[Figure 4: (A) tree image showing bHLH groups A to F with their families; (B) presence/absence table of bHLH groups and families in Capsaspora owczarzaki, Monosiga brevicollis, Amphimedon queenslandica, Nematostella vectensis, and Bilateria. Scale bar: 0.3.]

FIG. 4. (A) ML tree of the bHLH domain including the unambiguously assigned Capsaspora bHLH homologs. The tree is rooted using the midpoint-rooted tree option. Statistical support was obtained by RAxML with 100 bootstrap replicates (BV) and BPP. Both values are shown on key branches. Groups and families are defined by the classification of Simionato et al. (2007). The taxa used were Homo sapiens, Nematostella vectensis, Amphimedon queenslandica, Monosiga brevicollis (Mb), and Capsaspora (Co). For the sake of clarity, only the last two are specifically shown. (B) Table of the presence/absence of bHLH groups and families in some key taxa. Amphimedon queenslandica, N. vectensis, and Bilateria data obtained from Simionato et al. (2007). A tree including all Capsaspora bHLH genes is shown in supplementary figure S11 (Supplementary Material online).

bHLH were already present in the common ancestor of Capsaspora, choanoflagellates, and metazoans. Our data reveal that a basic Myc, MAX, Mxd/Mnt network of bHLH TFs was already present in the common ancestor of metazoans and Capsaspora and became more complex in multicellular lifestyles, incorporating, for example, Mnt and Mga. Interestingly, Capsaspora bHLH proteins are all homologs of those implicated in cell cycle and metabolism, and none of those are involved in differentiation. From our survey, we can also corroborate the two expansion periods (of bHLH groups and classes) in bHLH evolution previously inferred by Simionato et al. (2007) and later revised by Degnan et al. (2009), one before the divergence between Capsaspora and choanoflagellates + metazoans and another early in eumetazoan evolution (fig. 1). In contrast to bZIP TFs, there are some putative heterodimeric TF interactions among Capsaspora bHLH. For example, Myc, Mad, MAX, and Mlx can act as heterodimers in metazoans.

Mef2
Mef2 are the metazoan representatives of Type II MADS box genes. They are characterized by the presence of a Mef2 domain following the N-terminal MADS domain (Alvarez-Buylla et al. 2000). Mef2 genes play important roles in metazoan development, especially in the mesoderm (Potthoff and Olson 2007). Some authors considered Mef2 to be metazoan specific (Larroux et al. 2006; Degnan et al. 2009), although other authors had proposed Saccharomyces Smp1 and Rlm1 genes to be fungal homologs of metazoan Mef2 (Dodou and Treisman 1997). We identified canonical metazoan-type Mef2 in Capsaspora and in the chytrid fungi S. punctatus and A. macrogynus (fig. 1). The protein structure of Capsaspora and S. punctatus Mef2 closely resembles the canonical metazoan Mef2 (supplementary fig. S14, Supplementary Material online). We also identified a putative Mef2 homolog in M. brevicollis and in the amoebozoans D. discoideum, D. purpureum, E. histolytica, and A. castellanii as well as in the oomycetes Phytophthora sojae, P. ramorum, P. infestans, and P. capsici, although their sequences are divergent and have little similarity to the canonical metazoan Mef2 (supplementary fig. S14, Supplementary Material online). The fact that Phytophthora species encode a Mef2 homolog may be explained by a lateral gene transfer (LGT) event because

they are the only analyzed eukaryotes outside opisthokonts and amoebozoans to have a mef2 gene. In fact, it has already been shown that some Phytophthora genes have a close relationship with amoebozoan genes (Tyler et al. 2006; Torruella et al. 2009). A phylogenetic analysis using fungal sequences as outgroup yields a clade that comprises all the taxa with a canonical metazoan-type Mef2 domain, that is, all metazoans plus Capsaspora, A. macrogynus, and S. punctatus (supplementary fig. S15, Supplementary Material online). Our data show that the canonical metazoan Mef2 domain has a deeper origin than previously thought, with a conserved Mef2 domain present at least in the common ancestor of opisthokonts.

Fox
Fox TFs are characterized by the presence of a DNA-binding domain known as Forkhead box. Fox genes play important roles as regulators of both development and metabolism, and they seem to be specific to opisthokonts (Tuteja and Kaestner 2007a, 2007b; Shimeld et al. 2009). We identified four Fox genes in Capsaspora, none of them being part of the metazoan-specific class I but rather present in the supposedly opisthokont-specific class II that also includes fungi (Larroux et al. 2008) (fig. 1 and supplementary fig. S16, Supplementary Material online). Interestingly, we identified three putative Fox genes in the amoebozoan A. castellanii (fig. 1), although their sequences are divergent compared with opisthokont ones. Thus, our results show that Fox genes are not specific to opisthokonts and were already present before the divergence of amoebozoans and opisthokonts.

HMG Box Genes
HMG box-containing genes are TFs that are involved in genome stability, chromatin structure, and gene regulation (Stros et al. 2007). Metazoan-specific families are Sox and Tcf/Lef (Larroux et al. 2008). We characterized nine HMG box-containing proteins in Capsaspora (fig. 1 and supplementary fig. S17, Supplementary Material online), a similar number to those found in Monosiga (12) and Amoebozoa (average of 10) and significantly fewer than those found in Bilateria (average of 45). Two of the Capsaspora HMG box genes have strong similarities to MATalpha box, typical sex-determinant genes that are present in Ascomycota (Fraser and Heitman 2003; Fraser et al. 2004). Capsaspora also encodes an HMG-B, an SSRP-1, and a SWI/SNF homolog, plus some HMG box-containing genes that cannot confidently be assigned to any HMG box class.

Homeobox Genes
Homeobox genes encode a helix-turn-helix DNA-binding motif known as the homeodomain. Homeobox genes are known to have key roles in animal, plant, fungal, and amoebozoan development, such as regional patterning, regulation of cell proliferation, differentiation, adhesion, and migration (Gehring et al. 1994; Derelle et al. 2007). There are two large superfamilies, the canonical (non-TALE) class with a 60-amino-acid homeodomain and the TALE superclass characterized by an insertion of three amino acids between helix 1 and 2 of the homeodomain (Mukherjee and Burglin 2007). Both TALE and non-TALE superclasses were already present in the ancestor of eukaryotes (Derelle et al. 2007). The two homeobox genes of the choanoflagellate M. brevicollis have already been characterized, both of them belonging to the TALE superclass, although they cannot confidently be assigned to any major metazoan homeobox family (King et al. 2008; Larroux et al. 2008). We identified nine homeodomain-containing genes in Capsaspora: three TALE and six non-TALE (fig. 1 and supplementary figs. S18-S22, Supplementary Material online). A phylogenetic analysis of these genes including members of all major families of homeodomains from metazoans, amoebozoans, and fungi failed to confidently assign Capsaspora homeobox genes to any of the major metazoan classes, except for one clear ortholog to the longevity assurance homolog (LAG-1) class. To further improve the resolution and classify the remaining Capsaspora homeobox genes, we performed phylogenetic analyses specific for TALE or non-TALE genes. This allowed us to assign one Capsaspora TALE homeobox gene to the PBC family, although it lacks the PBC N-terminal domain, and support is not very high. The remaining two Capsaspora TALE genes have an unclear phylogenetic relationship to other TALEs, although they appear to be closely related to the two M. brevicollis homeobox genes (supplementary fig. S19, Supplementary Material online). Interestingly, the sponge A. queenslandica appears not to have a homolog of PBC (supplementary fig. S19, Supplementary Material online) (Larroux et al. 2008). Capsaspora non-TALE genes appeared in unclear phylogenetic positions even with a restricted non-TALE-only data set, although there is a potential homolog of LIM and two potential homologs of POU (supplementary fig. S20, Supplementary Material online). Thus, in order to classify them, we constructed different phylogenetic trees in which Capsaspora genes were forced to be members of a specific family and then we compared the likelihood values among all possible trees (for further details, see Material and Methods). Four Capsaspora non-TALE homologs appear to be at the root of the tree. Another one (Capsaspora-6) falls within the paired-like (Prd-like) clade with significant statistical support; this gene product possesses five of the six diagnostic amino acids of Prd-like genes (Galliot et al. 1999) (supplementary fig. S21, Supplementary Material online). However, it does not have the typical Q or K amino acid at position 50, and its intron is not located in the typical position (between codons 46 and 47), as consistently observed in metazoans. A specific phylogeny of ANTP, prd-like, LIM, and POU also supports this assignment but is not statistically significant (supplementary fig. S21, Supplementary Material online). The last Capsaspora non-TALE homeobox gene has a C-terminal TRAM LAG1 CLN8 (TLC) domain and a transmembrane domain, the characteristic domain architecture of the lass (longevity assurance homologs of yeast [Lag-1]) genes, which are considered to be homologs to fungal Lag genes. Interestingly,

phylogenetic analysis of the TLC domain showed that LAG genes with homeodomain are exclusive to metazoans and Capsaspora, whereas genes with the TLC domains and TRAM1 domain are found in amoebozoans, fungi, and metazoans (supplementary fig. S22, Supplementary Material online). Lass genes, however, are implicated in ceramide synthesis, the function of their homeodomain being unclear and their specific TF activity unknown (Teufel et al. 2009). Our data show that the repertoire of homeobox genes in metazoan unicellular relatives is larger than previously thought (see fig. 1); however, some specific homeobox gene classes, such as ANTP, appear to be exclusive to the Metazoa. Genome data from additional unicellular relatives of metazoans will be needed to corroborate this.

CBP/p300
The CBP/p300 is a ubiquitous metazoan transcriptional coactivator that interacts with several TFs, acts as an acetyltransferase (Coutts and La Thangue 2005), and is involved in cell growth and development (Goodman and Smolik 2000). Specifically, CBP/p300 interacts with TFs such as NF-kappaB (Perkins et al. 1997), Stat (Levy and Darnell 2002; Wojciak et al. 2009), Runx (Jin et al. 2004; Makita et al. 2008), p53 (Grossman 2001), CREB (Manna et al. 2009), and C/EBP (Manna et al. 2009). For example, CBP/p300 acetylates Runx genes (Jin et al. 2004) and ubiquitinates p53 (Shi et al. 2009). We have identified CBP/p300 homologs in both Capsaspora and M. brevicollis. This implies that CBP/p300 originated prior to the divergence of Capsaspora from choanoflagellates and metazoans. It is worth mentioning that this multifunctional cofactor seems to have evolved concomitantly with the emergence of several holozoan TFs, such as Runx and NF-kappaB. This suggests that a relatively high level of regulatory complexity was already emerging early in the holozoan lineage, well before the divergence of metazoan and choanoflagellate lineages.

LSF/GRH
The LSF/GRH family of TFs is characterized by the CP2 domain, and its members play important roles in bilaterians, being involved in vertebrate organogenesis, cell cycle progression, and cell survival and differentiation (Bray and Kafatos 1991; Uv et al. 1997; Veljkovic and Hansen 2004; Traylor-Knowles et al. 2010). LSF/GRH can be divided into two groups, the LSF/CP2 and the GRH subfamilies (Shirra and Hansen 1998; Traylor-Knowles et al. 2010). Members of the LSF/CP2 subfamily act as tetramers and possess an extra SAM domain C-terminal to the specific CP2 DNA-binding domain. Members of the GRH subfamily do not have the SAM domain and act as dimers.

The CP2 domain is present throughout the opisthokonts, including choanoflagellates (Traylor-Knowles et al. 2010), and seems to be a synapomorphy of this group of eukaryotes. Interestingly, it has been hypothesized that [...] GRH genes in Capsaspora, an LSF-like and a GRH-like (fig. 1 and supplementary fig. S23, Supplementary Material online), although the Capsaspora LSF-like gene lacks the characteristic C-terminal SAM domain found in metazoan LSF proteins. This domain may have been gained by domain shuffling before the split between metazoans and choanoflagellates, although the loss of this domain in Capsaspora cannot be ruled out. Our findings imply that the duplication of the LSF/GRH gene occurred before Capsaspora diverged from choanoflagellates and metazoans, and that GRH was lost in choanoflagellates. Thus, the presence of a GRH gene antedates the origin of the metazoan epithelium.

NRs, Smad, and Ets
Our data show that these three TF families remain, at this time, metazoan specific because we did not identify any homologs in nonmetazoan taxa. The complete genome sequences of additional nonmetazoan taxa are needed to corroborate this hypothesis.

Origin and Early Evolution of Metazoan TFs
The repertoire of TFs in the holozoan C. owczarzaki reported here, and its comparison with metazoan, fungal, and choanoflagellate TFs, provides important insights into the origin and evolution of TFs that are essential for metazoan multicellularity. This allows us to propose a new hypothesis regarding the origin of key metazoan TFs (see fig. 5). Some metazoan TF domains have deep origins, being widespread in eukaryotes, such as HMG box, homeodomain (both TALE and non-TALE), bHLH, bZIP, or Mef2-like (see also Degnan et al. 2009). However, major diversifications of genes encoding some of these domains took place along metazoan and fungal stems (see fig. 1), generating lineage-specific classes and subfamilies. With regard to metazoan-specific TF gene families, there appear to have been two major expansions (fig. 5): one prior to the divergence of Capsaspora, choanoflagellates, and the Metazoa (e.g., in bZIP and bHLH) and another within the metazoan lineage (such as Sox, homeodomains, and further diversification of bZIP and bHLH). Several other TF domains, such as Churchill, STAT, and, most likely, Fox, were already present in the common ancestor of unikonts (i.e., amoebozoans, apusozoans, and opisthokonts). This finding changes previous views in which Churchill was considered exclusive to metazoans and Fox exclusive to opisthokonts (although the assignment of A. castellanii hits to Fox remains contentious). Although STAT domains were present in the common ancestor of unikonts, our data show that canonical metazoan-type STAT seem to be exclusive to apusozoans (T. trahens) and opisthokonts. A major challenge to previous proposals that T-box genes are metazoan innovations is the discovery of T-box genes in S. punctatus and Capsaspora. This means that T-box genes appeared before the divergence of fungi and holozoans.
What role are these T-box genes playing the GRH subfamily originated by duplication of an ances- in these nonmetazoan lineages remains to be studied. tral LSF/GRH-like gene at the origin of the Metazoa and Interestingly, some TFs appear to have evolved prior to was coopted to epidermal determination in metazoans the divergence of Capsaspora from choanoflagellates and (Traylor-Knowles et al. 2010). We identified two LSF/ metazoans, such as p53, Runx, and NF-kappaB; the latter

1250 Evolution of Metazoan Transcription Factors · doi:10.1093/molbev/msq309 MBE

FIG. 5. Cladogram representing TF evolution among the analyzed taxa. Colors are unique for each domain class. A colored dot means the hypothetical origin of the domain. A black-circled dot indicates where a specific protein family appears in our taxon sampling. A cross means the loss of the domain or specific protein family in a lineage. Metazoan apomorphies are shown as black dots. The phylogenetic relationships are based on several recent studies (Burki et al. 2008; Ruiz-Trillo et al. 2008; Brown et al. 2009; Liu et al. 2009; Minge et al. 2009). two previously being considered metazoan specific. This HMG box and bZIP families. Specific domain expansions pattern of gene families that are relevant to metazoan mul- have already been reported in the Viridiplantae for bHLH ticellularity evolving prior to the emergence of the meta- and homeodomain proteins (Mukherjee et al. 2009; Pires zoan stem lineage is not new and has been observed in and Dolan 2010). There are several theories about the cor- other cases, such as tyrosine kinases, cadherins, MAGUKs, relation of these expansions with the transition to multi- and integrins (Abedin and King 2008; King et al. 2008; cellularity (Derelle et al. 2007; Pires and Dolan 2010). On the Manning et al. 2008; Suga et al. 2008; Degnan et al. other hand, some TF domains, such as CP2, Runt, MADS- 2009; de Mendoza et al. 2010; Sebe-Pedros et al. 2010). Fi- box, Churchill, p53, and STAT, have similar number of nally, there are some TF domains that, under the current members in unicellular and multicellular holozoans. Cap- taxon sampling, appear to be metazoan innovations. These saspora TF complexity is quite high, with a wider range are ETS, Smad, and NRs. Moreover, some specific homeo- of bHLH and bZIP domain-containing proteins than in box genes (ANTP, LIM, POU, Irx, Meis, Tgif, Six), bZIP clas- some early-branching metazoans such as A. queenslandica ses (e.g., Jun, Fos), bHLH classes (A, D F groups), and HMG or T. adhaerens. 
Because Capsaspora has a complex (and box classes (Sox, TCL/lef) appear alsoÀ to be metazoan spe- not fully understood) life cycle, in which there is a symbiotic cific (fig. 5), although we cannot rule out the possibility that stage within the mollusc Biomphalaria glabrata, one may some of these may have a more ancient origin and second- wonder whether the complexity of TFs identified in Cap- arily lost in nonmetazoan lineages. This new evolutionary saspora is due to LGT from the host or even from the trem- scenario implies that significant lineage-specific TFs losses atode flatworm S. mansoni, a metazoan parasite of occurred within the choanoflagellate lineage. For example, B. glabrata. Based on our phylogenetic analyses, we do Runx, T-box, RHD domain, GRH-like, and Churchill appear not favor this hypothesis. None of the phylogenetic trees to have been lost in M. brevicollis. Whether this is specific to shown (all including bilaterians; some even B. glabrata ho- one choanoflagellate lineage (that of M. brevicollis) or to mologs) show the Capsaspora homolog grouping closer to choanoflagellates in general remains unknown. Only geno- bilaterians than to other metazoans. Instead, we hypothe- mic data from additional choanoflagellate taxa will resolve size that the common ancestor of Capsaspora, choanofla- this issue. A similar pattern of lineage-specific loss in choa- gellates, and metazoans had a richer TF repertoire than noflagellates has recently been shown for the integrin- previously believed and that some TFs were subse- mediated adhesion machinery (Sebe-Pedros et al. 2010). quently lost in the choanoflagellate lineage (or at least A quantitative analysis (fig. 1) of TFs evolution suggests in M. brevicollis). that several expansions occurred in Eumetazoa, such as In summary, our results show that the evolution of bHLH and homeobox gene families and to a lesser degree metazoan TFs includes the acquisition of new genes (some

of them via domain shuffling), gene cooption, and the diversification of ancestral domains increasing the combinatorial complexity. How these metazoan developmental TFs are functioning in unicellular organisms and how they were exapted into new functions in multicellular animals remains to be answered.

Supplementary Material

See supplementary material file 1 for figures S1–S7; file 2 for figures S8–S17; and file 3 for figures S18–S23. Supplementary material file 4 includes the annotation of the Capsaspora sequences included in this study. They are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments

The genome sequences of C. owczarzaki, A. macrogynus, S. punctatus, and T. trahens are being determined by the Broad Institute of MIT/Harvard University under the auspices of the National Human Genome Research Institute and within the UNICORN initiative. We thank JGI, BI, and BCM for making data publicly available. We thank Kim C. Worley and the team of the A. castellanii genome project for accession to the genome data. We thank Peter W. H. Holland for help in the assignment of homeodomain-containing proteins and Romain Derelle, Jordi Paps, and Hiroshi Suga for helpful insights. This work was supported by an ICREA contract, a European Research Council Starting Grant (ERC-2007-StG-206883), and a grant (BFU2008-02839/BMC) from Ministerio de Ciencia e Innovación (MICINN) to I.R.-T. A.S.'s salary was supported by a pregraduate FPU grant and A.d.M.'s salary from a FPI grant, both from MICINN. B.F.L.'s and B.M.D.'s contributions were supported by the Canadian Research Chair Program and the Australian Research Council, respectively.

References

Abedin M, King N. 2008. The premetazoan ancestry of cadherins. Science 319:946–948.

Adell T, Grebenjuk VA, Wiens M, Muller WE. 2003. Isolation and characterization of two T-box genes from sponges, the phylogenetically oldest metazoan taxon. Dev Genes Evol. 213:421–434.

Adell T, Muller WE. 2005. Expression pattern of the Brachyury and Tbx2 homologues from the sponge Suberites domuncula. Biol Cell. 97:641–650.

Akamatsu Y, Ohno T, Hirota K, Kagoshima H, Yodoi J, Shigesada K. 1997. Redox regulation of the DNA binding activity in transcription factor PEBP2. The roles of two conserved cysteine residues. J Biol Chem. 272:14497–14500.

Alvarez-Buylla ER, Pelaz S, Liljegren SJ, Gold SE, Burgeff C, Ditta GS, Ribas de Pouplana L, Martinez-Castilla L, Yanofsky MF. 2000. An ancestral MADS-box gene duplication occurred before the divergence of plants and animals. Proc Natl Acad Sci U S A. 97:5328–5333.

Araki T, van Egmond WN, van Haastert PJ, Williams JG. 2010. Dual regulation of a Dictyostelium STAT by cGMP and Ca2+ signalling. J Cell Sci. 123:837–841.

Bielen H, Oberleitner S, Marcellini S, Gee L, Lemaire P, Bode HR, Rupp R, Technau U. 2007. Divergent functions of two ancient Hydra Brachyury paralogues suggest specific roles for their C-terminal domains in tissue fate induction. Development 134:4187–4197.

Bray SJ, Kafatos FC. 1991. Developmental function of Elf-1: an essential transcription factor during embryogenesis in Drosophila. Genes Dev. 5:1672–1683.

Bromberg J. 2002. Stat proteins and oncogenesis. J Clin Invest. 109:1139–1142.

Brown MW, Spiegel FW, Silberman JD. 2009. Phylogeny of the "forgotten" cellular slime mold, Fonticula alba, reveals a key evolutionary branch within Opisthokonta. Mol Biol Evol. 26:2699–2709.

Burge C, Karlin S. 1997. Prediction of complete gene structures in human genomic DNA. J Mol Biol. 268:78–94.

Burki F, Shalchian-Tabrizi K, Pawlowski J. 2008. Phylogenomics reveals a new 'megagroup' including most photosynthetic eukaryotes. Biol Lett. 4:366–369.

Coffman JA. 2003. Runx transcription factors and the developmental balance between cell proliferation and differentiation. Cell Biol Int. 27:315–324.

Coutts AS, La Thangue NB. 2005. The p53 response: emerging levels of co-factor complexity. Biochem Biophys Res Commun. 331:778–785.

de Mendoza A, Suga H, Ruiz-Trillo I. 2010. Evolution of the MAGUK protein gene family in premetazoan lineages. BMC Evol Biol. 10:93.

Degnan BM, Vervoort M, Larroux C, Richards GS. 2009. Early evolution of metazoan transcription factors. Curr Opin Genet Dev. 19:591–599.

Deppmann CD, Alvania RS, Taparowsky EJ. 2006. Cross-species annotation of basic leucine zipper factor interactions: insight into the evolution of closed interaction networks. Mol Biol Evol. 23:1480–1492.

Derelle R, Lopez P, Guyader HL, Manuel M. 2007. Homeodomain proteins belong to the ancestral molecular toolkit of eukaryotes. Evol Dev. 9:212–219.

Dodou E, Treisman R. 1997. The Saccharomyces cerevisiae MADS-box transcription factor Rlm1 is a target for the Mpk1 mitogen-activated protein kinase pathway. Mol Cell Biol. 17:1848–1859.

Eddy SR. 1998. Profile hidden Markov models. Bioinformatics 14:755–763.

Espinosa JM. 2008. Mechanisms of regulatory diversity within the p53 transcriptional network. Oncogene 27:4013–4023.

Fraser JA, Diezmann S, Subaran RL, Allen A, Lengeler KB, Dietrich FS, Heitman J. 2004. Convergent evolution of chromosomal sex-determining regions in the animal and fungal kingdoms. PLoS Biol. 2:e384.

Fraser JA, Heitman J. 2003. Fungal mating-type loci. Curr Biol. 13:R792–R795.

Fujii Y, Shimizu T, Toda T, Yanagida M, Hakoshima T. 2000. Structural basis for the diversity of DNA recognition by bZIP transcription factors. Nat Struct Biol. 7:889–893.

Galliot B, de Vargas C, Miller D. 1999. Evolution of homeobox genes: Q50 paired-like genes founded the paired class. Dev Genes Evol. 209:186–197.

Gauthier M, Degnan BM. 2008. The transcription factor NF-kappaB in the demosponge Amphimedon queenslandica: insights on the evolutionary origin of the Rel homology domain. Dev Genes Evol. 218:23–32.

Gehring WJ, Affolter M, Burglin T. 1994. Homeodomain proteins. Annu Rev Biochem. 63:487–526.

Giebler HA, Lemasson I, Nyborg JK. 2000. p53 recruitment of CREB binding protein mediated through phosphorylated CREB: a novel

pathway of tumor suppressor regulation. Mol Cell Biol. 20:4849–4858.

Goldberg JM, Manning G, Liu A, Fey P, Pilcher KE, Xu Y, Smith JL. 2006. The dictyostelium kinome: analysis of the protein kinases from a simple model organism. PLoS Genet. 2:e38.

Goodman RH, Smolik S. 2000. CBP/p300 in cell growth, transformation, and development. Genes Dev. 14:1553–1577.

Grossman SR. 2001. p300/CBP/p53 interaction and regulation of the p53 response. Eur J Biochem. 268:2773–2778.

Hayden MS, Ghosh S. 2004. Signaling to NF-kappaB. Genes Dev. 18:2195–2224.

Hurst HC. 1994. Transcription factors. 1: bZIP proteins. Protein Profile. 1:123–168.

Jin YH, Jeon EJ, Li QL, Lee YH, Choi JK, Kim WJ, Lee KY, Bae SC. 2004. Transforming growth factor-beta stimulates p300-dependent RUNX3 acetylation, which inhibits ubiquitination-mediated degradation. J Biol Chem. 279:29409–29417.

Jones S. 2004. An overview of the basic helix-loop-helix proteins. Genome Biol. 5:226.

Katoh K, Kuma K, Miyata T, Toh H. 2005. Improvement in the accuracy of multiple sequence alignment program MAFFT. Genome Inform. 16:22–33.

Katoh K, Kuma K, Toh H, Miyata T. 2005. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33:511–518.

Kawata T, Shevchenko A, Fukuzawa M, Jermyn KA, Totty NF, Zhukovskaya NV, Sterling AE, Mann M, Williams JG. 1997. SH2 signaling in a lower eukaryote: a STAT protein that regulates stalk cell differentiation in dictyostelium. Cell 89:909–916.

King N, Westbrook MJ, Young SL, et al. (37 co-authors). 2008. The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans. Nature 451:783–788.

Larroux C, Luke GN, Koopman P, Rokhsar DS, Shimeld SM, Degnan BM. 2008. Genesis and expansion of metazoan transcription factor gene classes. Mol Biol Evol. 25:980–996.

Larroux C, Fahey B, Liubicich D, Hinman V, Gauthier MF, Gongora M, Green K, Wörheide G, Leys S, Degnan BP. 2006. Developmental expression of transcription factor genes in a demosponge: insights into the origin of metazoan multicellularity. Evol Dev. 8:150–173.

Lee NSM, Rodriguez M, Kim B, Kim L. 2008. Dictyostelium kinase DPYK3 negatively regulates STATc signaling in cell fate decision. Dev Growth Differ. 50:607–613.

Levy DE, Darnell JEJ. 2002. Stats: transcriptional control and biological impact. Nat Rev Mol Cell Biol. 3:651–662.

Liu Y, Steenkamp ET, Brinkmann H, Forget L, Philippe H, Lang BF. 2009. Phylogenomic analyses predict sistergroup relationship of nucleariids and fungi and paraphyly of zygomycetes with significant support. BMC Evol Biol. 9:272.

Londin ER, Mentzer L, Sirotkin HI. 2007. Churchill regulates cell movement and mesoderm specification by repressing Nodal signaling. BMC Dev Biol. 7:120.

Macian F. 2005. NFAT proteins: key regulators of T-cell development and function. Nat Rev Immunol. 5:472–484.

Makita N, Suzuki M, Asami S, Takahata R, Kohzaki D, Kobayashi S, Hakamazuka T, Hozumi N. 2008. Two of four alternatively spliced isoforms of RUNX2 control osteocalcin gene expression in human osteoblast cells. Gene 413:8–17.

Manna PR, Dyson MT, Stocco DM. 2009. Role of basic leucine zipper proteins in transcriptional regulation of the steroidogenic acute regulatory protein gene. Mol Cell Endocrinol. 302:1–11.

Manning G, Young SL, Miller WT, Zhai Y. 2008. The protist, Monosiga brevicollis, has a tyrosine kinase signaling network more elaborate and diverse than found in any known metazoan. Proc Natl Acad Sci U S A. 105:9674–9679.

Marcellini S, Technau U, Smith JC, Lemaire P. 2003. Evolution of Brachyury proteins: identification of a novel regulatory domain conserved within Bilateria. Dev Biol. 260:352–361.

Mayer BJ. 2008. Clues to the evolution of complex signaling machinery. PNAS 105:9453–9454.

Minge MA, Silberman JD, Orr RJ, Cavalier-Smith T, Shalchian-Tabrizi K, Burki F, Skjaeveland A, Jakobsen KS. 2009. Evolutionary position of breviate amoebae and the primary eukaryote divergence. Proc Biol Sci. 276:597–604.

Mukherjee K, Brocchieri L, Burglin TR. 2009. A comprehensive classification and evolutionary analysis of plant homeobox genes. Mol Biol Evol. 26:2775–2794.

Mukherjee K, Burglin TR. 2007. Comprehensive analysis of animal TALE homeobox genes: new conserved motifs and cases of accelerated evolution. J Mol Evol. 65:137–153.

Muller CW, Herrmann BG. 1997. Crystallographic structure of the T domain-DNA complex of the Brachyury transcription factor. Nature 389:884–888.

Nedelcu AM, Tan C. 2007. Early diversification and complex evolutionary history of the p53 tumor suppressor gene family. Dev Genes Evol. 217:801–806.

Perkins ND, Felzien LK, Betts JC, Leung K, Beach DH, Nabel GJ. 1997. Regulation of NF-kappaB by cyclin-dependent kinases associated with the p300 coactivator. Science 275:523–527.

Pires N, Dolan L. 2010. Origin and diversification of basic-helix-loop-helix proteins in plants. Mol Biol Evol. 27:862–874.

Potthoff MJ, Olson EN. 2007. MEF2: a central regulator of diverse developmental programs. Development 134:4131–4140.

Putnam NH, Srivastava M, Hellsten U, et al. (20 co-authors). 2007. Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science 317:86–94.

Rennert J, Coffman JA, Mushegian AR, Robertson AJ. 2003. The evolution of Runx genes I. A comparative study of sequences from phylogenetically diverse model organisms. BMC Evol Biol. 3:4.

Robertson AJ, Larroux C, Degnan BM, Coffman JA. 2009. The evolution of Runx genes II. The C-terminal Groucho recruitment motif is present in both eumetazoans and homoscleromorphs but absent in a haplosclerid demosponge. BMC Res Notes. 2:59.

Rokas A. 2008. The origins of multicellularity and the early history of the genetic toolkit for animal development. Annu Rev Genet. 42:235–251.

Ronquist F, Huelsenbeck JP. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics (Oxford, England). 19:1572–1574.

Ruiz-Trillo I, Burger G, Holland PW, King N, Lang BF, Roger AJ, Gray MW. 2007. The origins of multicellularity: a multi-taxon genome initiative. Trends Genet. 23:113–118.

Ruiz-Trillo I, Inagaki Y, Davis LA, Sperstad S, Landfald B, Roger AJ. 2004. Capsaspora owczarzaki is an independent opisthokont lineage. Curr Biol. 14(22):R946–R947.

Ruiz-Trillo I, Roger AJ, Burger G, Gray MW, Lang BF. 2008. A phylogenomic investigation into the origin of metazoa. Mol Biol Evol. 25:664–672.

Scholz CB, Technau U. 2003. The ancestral role of Brachyury: expression of NemBra1 in the basal cnidarian Nematostella vectensis (Anthozoa). Dev Genes Evol. 212:563–570.

Sebe-Pedros A, Roger AJ, Lang FB, King N, Ruiz-Trillo I. 2010. Ancient origin of the integrin-mediated adhesion and signaling machinery. Proc Natl Acad Sci U S A. 107:10142–10147.

Shalchian-Tabrizi K, Minge MA, Espelund M, Orr R, Ruden T, Jakobsen KS, Cavalier-Smith T. 2008. Multigene phylogeny of choanozoa and the origin of animals. PLoS One. 3:e2098.

Sheng G, dos Reis M, Stern CD. 2003. Churchill, a zinc finger transcriptional activator, regulates the transition between gastrulation and neurulation. Cell 115:603–613.

Shi D, Pop MS, Kulikov R, Love IM, Kung AL, Grossman SR. 2009. CBP and p300 are cytoplasmic E4 polyubiquitin ligases for p53. Proc Natl Acad Sci U S A. 106:16275–16280.

Shimeld SM, Degnan B, Luke GN. 2010. Evolutionary genomics of the Fox genes: origin of gene families and the ancestry of gene clusters. Genomics. 95(5):256–260.

Shimodaira H, Hasegawa M. 2001. CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics 17:1246–1247.

Shirra MK, Hansen U. 1998. LSF and NTF-1 share a conserved DNA recognition motif yet require different oligomerization states to form a stable protein-DNA complex. J Biol Chem. 273:19260–19268.

Simionato E, Ledent V, Richards G, Thomas-Chollier M, Kerner P, Coornaert D, Degnan BM, Vervoort M. 2007. Origin and diversification of the basic helix-loop-helix gene family in metazoans: insights from comparative genomics. BMC Evol Biol. 7:33.

Smith J. 1999. T-box genes: what they do and how they do it. Trends Genet. 15:154–158.

Srivastava M, Simakov O, Chapman J, et al. (33 co-authors). 2010. The Amphimedon queenslandica genome and the evolution of animal complexity. Nature 466:720–726.

Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688–2690.

Stanke M, Morgenstern B. 2005. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res. 33:W465–W467.

Stros M, Launholt D, Grasser KD. 2007. The HMG-box: a versatile protein domain occurring in a wide variety of DNA-binding proteins. Cell Mol Life Sci. 64:2590–2606.

Suga H, Sasaki G, Kuma K, Nishiyori H, Hirose N, Su ZH, Iwabe N, Miyata T. 2008. Ancient divergence of animal protein tyrosine kinase genes demonstrated by a gene family tree including choanoflagellate genes. FEBS Lett. 582:815–818.

Sullivan JC, Sher D, Eisenstein M, Shigesada K, Reitzel AM, Marlow H, Levanon D, Groner Y, Finnerty JR, Gat U. 2008. The evolutionary origin of the Runx/CBFbeta transcription factors: studies of the most basal metazoans. BMC Evol Biol. 8:228.

Teufel A, Maass T, Galle PR, Malik N. 2009. The longevity assurance homologue of yeast lag1 (Lass) gene family (review). Int J Mol Med. 23:135–140.

Torruella G, Suga H, Riutort M, Pereto J, Ruiz-Trillo I. 2009. The evolutionary history of lysine biosynthesis pathways within eukaryotes. J Mol Evol. 69:240–248.

Traylor-Knowles N, Hansen U, Dubuc TQ, Martindale MQ, Kaufman L, Finnerty JR. 2010. The evolutionary diversification of LSF and Grainyhead transcription factors preceded the radiation of basal animal lineages. BMC Evol Biol. 10:101.

Tuteja G, Kaestner KH. 2007a. Forkhead transcription factors II. Cell 131:192.

Tuteja G, Kaestner KH. 2007b. SnapShot: forkhead transcription factors I. Cell 130:1160.

Tyler BM, Tripathy S, Zhang X, et al. (54 co-authors). 2006. Phytophthora genome sequences uncover evolutionary origins and mechanisms of pathogenesis. Science 313:1261–1266.

Uv AE, Harrison EJ, Bray SJ. 1997. Tissue-specific splicing and functions of the Drosophila transcription factor Grainyhead. Mol Cell Biol. 17:6727–6735.

Veljkovic J, Hansen U. 2004. Lineage-specific and ubiquitous biological roles of the mammalian transcription factor LSF. Gene 343:23–40.

Wheeler JC, Shigesada K, Gergen JP, Ito Y. 2000. Mechanisms of transcriptional regulation by Runt domain proteins. Semin Cell Dev Biol. 11:369–375.

Wojciak JM, Martinez-Yamout MA, Dyson HJ, Wright PE. 2009. Structural basis for recruitment of CBP/p300 coactivators by STAT1 and STAT2 transactivation domains. EMBO J. 28:948–958.

Yamada Y, Wang HY, Fukuzawa M, Barton GJ, Williams JG. 2008. A new family of transcription factors. Development 135:3093–3101.

Results R6

Summary of article R6: Transcription factor evolution in eukaryotes and the assembly of the transcriptional regulation system in multicellular lineages.

Transcription factors (TFs) are the main players in eukaryotic transcriptional regulation, but it is unclear what role TFs played in the origin of the different multicellular lineages. In this article, we explore the origin and diversity of eukaryotic TFs and focus on their evolutionary patterns in the different multicellular lineages. Using the most current phylogenetic framework and recently available genomic data, we traced the origin, expansion, and structural diversification of all known TFs, inferring the ancestral genomic content of the different eukaryotic lineages. Our results show that the most complex multicellular lineages (that is, those with embryonic development, metazoans and plants) have the most complex TF repertoires. Furthermore, we find that the plant and animal repertoires evolved in two bursts: a wave of diversification in the unicellular ancestors, followed by another at the base of the respective groups. Because TFs are very important in embryonic development, we analysed the expression patterns of all TFs during the ontogeny of both animal (Danio rerio and Drosophila melanogaster) and plant (Arabidopsis thaliana) model organisms. The expression patterns in both groups recapitulate those of the whole transcriptome, but reveal some important differences. Animals have determinate development, in which TF activity in the adult is lower than during development, whereas plants show indeterminate development, in which the adult expresses more TFs because it never stops producing new structures. Our results, based on comparative genomics and expression data, change the view of how TFs contributed to eukaryotic evolution and reveal the importance of TFs in the multiple origins of multicellularity and embryonic development.


Transcription factor evolution in eukaryotes and the assembly of the regulatory toolkit in multicellular lineages

Alex de Mendoza 1,a,b, Arnau Sebé-Pedrós 1,a,b, Martin Sebastijan Šestak c, Marija Matejčić c, Guifré Torruella a,b, Tomislav Domazet-Lošo c,d, Iñaki Ruiz-Trillo 2,a,b,e

a. Institut de Biologia Evolutiva (CSIC-Universitat Pompeu Fabra), Passeig Marítim de la Barceloneta, 37–49, 08003 Barcelona, Spain
b. Departament de Genètica, Universitat de Barcelona, Av. Diagonal 645, 08028 Barcelona, Spain
c. Laboratory of Evolutionary Genetics, Ruđer Bošković Institute, Bijenička cesta 54, HR-10000 Zagreb, Croatia
d. Catholic University of Croatia, Ilica 242, HR-10000 Zagreb, Croatia
e. Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig Lluís Companys, 23, 08010 Barcelona, Spain

Submitted to Proceedings of the National Academy of Sciences of the United States of America

Transcription Factors (TFs) are the main players in transcriptional regulation in eukaryotes. However, it remains unclear what role TFs played in the origin of all the different eukaryotic multicellular lineages. In this paper, we explore how the origin of TF repertoires shaped eukaryotic evolution and, in particular, their role in the emergence of multicellular lineages. We traced the origin and expansion of all known TFs through the eukaryotic tree of life, using the broadest possible taxon sampling and an updated phylogenetic background. Our results show that the most complex multicellular lineages (i.e., those with embryonic development, Metazoa and Embryophyta) have the most complex TF repertoires, and that these repertoires were assembled in a step-wise manner. We also show that a significant part of the metazoan and embryophyte TF toolkits evolved earlier, in their respective unicellular ancestors. To gain insights into the role of TFs in the development of both embryophytes and metazoans, we analysed TF expression patterns throughout their ontogeny. The expression patterns observed in both groups recapitulate those of the whole transcriptome, but reveal some important differences. Our comparative genomics and expression data reshape our view on how TFs contributed to eukaryotic evolution and reveal the importance of TFs to the origins of multicellularity and embryonic development.

Metazoa | Embryophyta | Phylotypic stage | Origins of Multicellularity | eukaryotes

Significance

Independent transitions to multicellularity in eukaryotes involved the evolution of complex transcriptional regulation toolkits in order to control cell differentiation. By using comparative genomics we show that plants and animals required richer transcriptional machineries compared to other eukaryotic multicellular lineages. We suggest this is due to their orchestrated embryonic development. Moreover, our analysis of transcription factor (TF) expression patterns during the development of both animals and plants reveals links between TF evolution, species ontogeny and the phylotypic stage.

Introduction

Transcription factors (TFs) are proteins that bind to DNA in a sequence-specific manner (1) and enhance or repress gene expression (2–4). In response to a broad range of stimuli, TFs coordinate many important biological processes, from cell cycle progression and physiological responses, to cell differentiation and development (5, 6). Thus, TFs have a central role in the transcriptional regulation of all cellular organisms, being present in all branches of the tree of life (bacteria, archaea, and eukaryotes). There appears to be a correlation between elaborate regulation of gene expression and the complexity of organisms (7), such that the amount (as a proportion of an organism's total gene content) and diversity of TF proteins are expected to be directly correlated with this complexity (8). Indeed, TFs play a crucial role in multicellular eukaryotes. For example, TFs are the master regulators of embryonic development in embryophytes and metazoans (9), and analyses of their embryonic transcriptional profiles support the presence of a phylotypic stage in both lineages (10–14). These studies have also shown that evolutionarily younger genes tend to be expressed at earlier and later stages of development, while the transcriptomes of the middle stages (the phylotypic stage) are dominated by ancient genes (10, 13). It remains to be investigated how the evolutionary age and the expression patterns of the different TFs shift throughout the ontogeny of these lineages and whether TF expression profiles correlate with the general transcriptome profiles.

Previous studies have analysed the evolutionary history and phylogenetic distribution of TFs in various branches of the tree of life (6, 15–22). However, it is not yet clear whether the evolutionary scenarios previously proposed are robust to the incorporation of new genome data from key phylogenetic taxa that were previously unavailable.

In this paper, we present an updated analysis of TF diversity and evolution in different eukaryote supergroups, focusing on various unicellular-to-multicellular transitions. We report new genome and/or transcriptome data from several unicellular relatives of Metazoa and Fungi (including one filasterean, five ichthyosporeans, a nucleariid and the incertae sedis Corallochytrium limacisporum), and use published data from key, but previously unsampled, eukaryotic lineages, such as Glaucophyta, Haptophyta, Rhizaria and Cryptophyta. We show that an important fraction of the metazoan TF toolkit is not novel but rather appeared in the root of Opisthokonta and/or Holozoa. Similarly, we show that many plant TFs are present in unicellular chlorophytes, and many fungal TFs are present in microsporidians and nucleariids. Finally, we analyze the TFome (i.e., the general TF repertoire) expression throughout embryonic development in embryophytes and metazoans and show that each phylostratigraphic layer of TFs differentially contributes to successive stages of embryonic development, linking the evolutionary history of TFs to organismal ontogeny.

Fig. 1. Presence and abundance of Transcription Factors (TF) in eukaryotes. The heatmap depicts absolute TF counts according to the color scale. TFs/DBDs (rows) are clustered according to abundance and distribution, and species (columns) are grouped according to phylogenetic affinity. Major eukaryotic lineages are indicated (top). Raw data in Table S1. Further taxonomic information in Table S2.

Results

Lineage-specific and paneukaryotic transcription factors

Fig. 2. Phylogenetic patterns of TFome composition across eukaryotes. For the sake of clarity and simplicity, the TF types that represent less than 2% of the corresponding TFome are not considered. For the same reason, some TF types are summarized in higher-level categories according to structural similarities. This is the case of 1) the Homeobox supergroup, which comprises Homeobox and Homeobox KN/TALE; 2) the bZIP supergroup, which comprises bZIP_1, bZIP_2 and bZIP_Maf; and 3) the p53-like supergroup, which comprises p53, STAT, Runx, NDT80, LAG1 and RHD. To the left, the total number of TF types present in each taxon and the relative abundance of each DBD type (reduced as specified in Methods section 2) in the TFome of every species are depicted using the colour code in the legend of DBDs. To the right, a bar graph indicates the total number of TFs in each species, and dots indicate the percentage of TFs relative to the total number of proteins. Black asterisks indicate species with Whole Genome Duplications. Red asterisks indicate strict parasites. Raw data in Table S1. Further taxonomic information in Table S2.
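The TFome profile summarized in this caption can be illustrated with a minimal Python sketch. It is not the authors' pipeline (the Methods mention custom Perl and R scripts). The function name `tfome_profile`, the toy counts, and the choice to merge supergroups before applying the 2% cut-off are assumptions made for illustration only.

```python
from collections import Counter

# Supergroup membership as listed in the Fig. 2 caption.
SUPERGROUPS = {
    "Homeobox": "Homeobox", "Homeobox KN/TALE": "Homeobox",
    "bZIP_1": "bZIP", "bZIP_2": "bZIP", "bZIP_Maf": "bZIP",
    "p53": "p53-like", "STAT": "p53-like", "Runx": "p53-like",
    "NDT80": "p53-like", "LAG1": "p53-like", "RHD": "p53-like",
}


def tfome_profile(tf_counts, min_fraction=0.02):
    """Return the fraction of the TFome contributed by each TF type.

    tf_counts maps a DBD/TF type to the number of TFs of that type in one
    species. Types are first merged into supergroups (assumed order), then
    types below min_fraction of the TFome are dropped, as in the caption.
    """
    merged = Counter()
    for dbd, count in tf_counts.items():
        merged[SUPERGROUPS.get(dbd, dbd)] += count
    total = sum(merged.values())
    if total == 0:
        return {}
    return {dbd: count / total
            for dbd, count in merged.items()
            if count / total >= min_fraction}


# Hypothetical counts for a single species (not data from Table S1).
toy_counts = {"Homeobox": 40, "Homeobox KN/TALE": 10, "zf-C2H2": 45,
              "bZIP_1": 4, "Tub": 1}
profile = tfome_profile(toy_counts)
```

In this toy case the two homeobox classes are pooled into one category, and `Tub`, at 1% of the TFome, falls below the threshold and is left out of the profile.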

DNA-binding domains (DBDs) are univocal signatures of the presence of a particular type of TF (8, 16). Therefore, we analysed the DBDs across eukaryotes to survey the presence, abundance, and relative contribution of each TF class (71 in total) throughout various eukaryotic lineages (Figure 1). We used a wide taxon sampling strategy optimized to have the largest possible diversity, especially around multicellular transitions.

Our data divide eukaryotic TFs into two main groups according to their phylogenetic distribution: paneukaryotic and lineage-specific TFs (Figure 1). Paneukaryotic TFs are widely distributed and were already present in the Last Eukaryotic Common Ancestor (LECA). These paneukaryotic TFs include HLH, GATA, SRF-TF, bZIP, Homeobox, HMGbox and zf-C2H2 (Figure 1). Although common to most eukaryotes, the abundance of paneukaryotic TFs tends to be highly variable, with independent expansions in different lineages. A good example is the expansion of homeobox TFs in both metazoans and embryophytes (Figure 1). A few paneukaryotic TFs, such as CBFB_NFYA, YL1 or TBP, are less prone to diversification and are often lost secondarily (Figure 1). Regarding the lineage-specific TFs, we could define six taxonomically restricted clusters of TFs: 1) TFs exclusive or mostly present in unikonts; 2) TFs that are specific to, or very uncommon outside, holozoans; 3) metazoan-specific TFs; 4) TFs mostly present in Archaeplastida; 5) Embryophyta-specific TFs; and 6) fungal-specific TFs or TFs that are predominant in fungal taxa (Figure 1).

Protein domain architecture provides an additional layer of TFome complexity (8, 17). For example, it is known that during their evolution homeobox TFs acquired extra domains that provided new binding targets and specificity (23). Therefore, we analysed the complexity of the gene architecture of TFs across eukaryotes. Our data show that the complexity of the protein domain architecture of some paneukaryotic TFs, including Myb, zf-C2H2, Homeobox and bHLH, is significantly increased in both metazoans and embryophytes (Figure S1 and Figure S2).

Fig. 3. TF gains/losses and quantitative enrichments in eukaryote evolution, based on our 77-species taxonomic sampling. The reconstruction of evolutionary gains/losses of given TF types at each node, and of ancestral states, was inferred using Dollo parsimony. The phylogenetic framework was taken from up-to-date eukaryote phylogenomic studies (70–74). DBD gains are shown in green and lineage-specific secondary DBD losses are shown in red boxes. See Table S3 for complete lists of TFs indicated as numbers. DBDs of unclear origin (e.g. also present in very distant, unrelated species) are shown in yellow boxes. Quantitative DBD enrichments are shown in blue boxes, and were obtained using Wilcoxon rank-sum tests, with a p-value threshold of < 0.01.

Fig. 4. A) Average expression level of the TFome during development in Danio rerio; the area shaded in red is the proposed phylotypic stage (10). B) Transcriptome Age Index (TAI) of the TFome during development in Danio rerio. The higher the TAI, the younger the TFs that are expressed, and vice versa. The grey area around TAI values corresponds to ± one standard error of the mean. C) Relative expression of transcription factors sorted by significantly enriched phylostrata during development in Danio rerio. For easier comparison, relative expression is shown in relation to the highest (1) and lowest (0) expression values across developmental stages (see Methods). Bl, blastula; Cl, cleavage; G, gastrula; H, hatching; Juv, juvenile; Ph, pharyngula; Se, segmentation; Z, zygote.

Phylogenetic patterns of TF numbers

The total number of TFs varies markedly between taxa (Figure 1), raising the possibility that these variations are correlated with some specific features, such as organismal complexity. Our data show that complex developing multicellular taxa (i.e. Embryophyta and Metazoa) present a dramatic increase in TF numbers compared to other eukaryotes. Moreover, morphologically simpler forms within these lineages, such as sponges within metazoans and mosses within embryophytes, have fewer TFs than morphologically more complex groups (Figure 2). In contrast, irrespective of their phylogenetic position, parasitic eukaryotes have significantly fewer TFs (Figure 2). Notably, some species described as parasitic or endosymbiotic, such as the different Ichthyosporea or Capsaspora owczarzaki (24, 25), have relatively rich TF repertoires, perhaps revealing a more complex life cycle or an undescribed free-living stage. Finally, Whole Genome Duplication (WGD) also partly explains the existence of some particularly rich TF repertoires in groups such as the vertebrates and the embryophytes, or Allomyces macrogynus, the Mucoromycotina, and Paramecium (Figure 2) (26, 27). Species with or without WGD that branch within the same group, such as the deuterostomes Homo sapiens and Ciona intestinalis or the fungi Allomyces macrogynus and the chytrids, show similar TFome profiles in terms of the proportions of each TF class, suggesting that this tendency is independent of TF type. This is consistent with the finding that TFs are one of the gene classes that are most resilient to loss after WGD (28, 29).

As for the total number of TFs, the proportion of TFs in a genome (as a fraction of the total number of proteins) also varies considerably. This proportion is high in Embryophyta and to a lesser extent in Metazoa, similar to the pattern observed for the total number of TFs. In contrast, Fungi, which have low numbers of TFs, have a high proportion of TFs in their genomes. Notably, these measures are strongly affected by the quality of genome sequencing and annotation, which differs markedly between the taxa studied, and which can result in under- or overestimation of the total number of genes in a genome.

TFome profiles in eukaryotic groups

We next evaluated the contribution of different TF types in each organism's TFome, in terms of the abundance of each TF type as a proportion of the total number of TFs in that particular genome (the TFome profile; Figure 2). Our analysis shows a clear phylogenetic pattern of TF diversity that is recovered by clustering based on TF content, with a few exceptions (Figure S3).

Metazoans have a very distinctive TFome profile, with Homeobox, zf-C2H2 and bHLH representing the largest fraction, especially in Bilateria, in which these three TF classes represent more than 50% of the total number of TFs (Figure 2). These are the same TFs that are significantly enriched in protein domain architecture in metazoans (Figure S2). As suggested elsewhere, the main role of these types of TF is in patterning and cell-type differentiation (19, 20). Conversely, p53-like TFs (including p53, runx, T-box, NDT80 and STAT) and bZIP seem to represent a larger fraction of the TFome in early-branching metazoans, but a minor fraction in bilaterians. Sponges, which are possibly the earliest branching metazoans (30–33), have some rather uncommon TFs expanded in comparison to other metazoans. For example, HTH_psq represents more than 9% of the TFome of Amphimedon queenslandica (Figure 2). In any case, most metazoans have similar TFome profiles, with a few DBDs representing a large percentage of their TF diversity, with some exceptions such as zf-C4 (hormone receptors) in C. elegans or MADF in Drosophila melanogaster (Figure 2).

The TFome profiles of unicellular holozoans such as choanoflagellates, filastereans, ichthyosporeans and Corallochytrium limacisporum are different to those of metazoans and fungi, but similar to each other, with a small deviation in ichthyosporeans (from Sphaeroforma arctica to Amoebidium parasiticum in Figure 2), which have a higher proportion of GATA and zf-TAZ than others. Despite having the smallest TF repertoire of all holozoans, due mainly to the reduction or loss of some TF classes (19), choanoflagellates have a very similar TFome profile compared to other holozoans.

Fungi have a very particular TFome profile including some fungus-specific TFs (Figure 2), some of which were further expanded in Dikarya. For example, this is the case for ZnClus (also known as zf-Z2-C6) and Fungal transcription factor, which become more abundant during fungal evolution (Figure 1). In parallel, Dikarya presents important TF losses (Figure 3), such as that of E2F-TDP, zf-TAZ, CSD and Tub, and represents an interesting case of simplification and divergence. Indeed, most early-branching Fungi, like chytrids, blastocladiomycetes and zygomycetes, show distinct TF diversity, in some cases resembling that of the unicellular holozoans. Microsporidians show a drastic reduction in their TF repertoire but conserve STE-like, one of the fungus-specific TFs (Figure 1). Nuclearia sp., the unicellular sister group to Fungi (24), has typical fungal TFs such as STE-like and Fungal transcription factor 1. Interestingly, Nuclearia sp. also has STAT and NFkappaB, both secondarily lost in fungi and previously thought to be exclusive to holozoans (19).

A common feature of all opisthokonts is the relative abundance of Forkhead TFs compared to other eukaryotes. In contrast, the four amoebozoans analysed have quite different TFome profiles, although this is not unexpected since genome data from Amoebozoa is still scarce. Entamoeba histolytica is a strict parasite and therefore most likely derived, while A. castellanii, the only non-protostelid amoebozoan sampled to date, shows a striking increase in the number of RFX TFs, and it also has Forkhead and TEA/Sd TFs (34).

The observed TF content of the group Archaeplastida, which comprises Embryophyta and their unicellular relatives (Chlorophyta, Rhodophyta and Glaucophyta), is consistent with previous analyses (17). Our data show that Viridiplantae (Embryophyta + Chlorophyta) share a unique set of TFs, which contrasts with the TFome profile of the glaucophyte C. paradoxa (35), while the rhodophyte C. merolae contains only one of those Viridiplantae-specific TFs (PLATZ). Similar to what is observed in metazoans and fungi, embryophytes are particularly enriched in some TF families, especially SBP, AP2, B3, GRAS, zf-Dof and SRF/MADS. This supports the finding that the TF profiles of independent multicellular lineages are a combination of enriching particular paneukaryotic TF families and evolving new TF families.

Despite the poor taxon sampling and large evolutionary distances between the heterokont species analyzed here, there is a phylogenetic pattern in their TF profiles. The Heat Shock Factors (HSF) and the zf-TAZ are significantly enriched in all heterokont genomes (see also Figure 3). The multicellular brown alga Ectocarpus siliculosus (36) and its unicellular relative Nannochloropsis gaditana (37) are enriched not only for HSF but also for Zn cluster TFs, which are also abundant in fungi (see above). Although brown algae have complex multicellularity, the total number of their TFs is not significantly increased, in contrast to what is found in metazoans and embryophytes. This may be explained by the presence of heterokont- or brown-algae-specific TFs that have not yet been described due to the paucity of functional studies in this group (38, 39). Alternatively, the fact that Ectocarpus siliculosus has modular growth instead of stereotypical embryonic development (38) may account for the lack of complexity in its TFome. Data from multicellular brown algae with embryonic development, such as Fucus spiralis (40), will be key to answering this question.

Reconstruction of TF gain/loss and expansion across eukaryotes

We reconstructed the evolutionary history of the TFs in eukaryotes by mapping gains and losses using Dollo parsimony. We used Wilcoxon rank-sum tests to detect significant lineage-specific enrichments (Figure 3). In terms of new gene gains, a strikingly similar pattern is found in embryophytes and metazoans, and to a lesser extent in Fungi. The emergence of these lineages with complex multicellularity is thought to be linked to a burst in their respective multicellularity toolkits, including TFs (41, 42). Our data, however, suggest an increase in TFome complexity in two steps. First, some innovation already took place in the ancestors of complex multicellular organisms, given that their closest extant unicellular relatives (i.e., in holozoans and in the branch leading to Embryophyta+Chlorophyta) already have complex TFomes. A second innovation step took place at the origin of metazoans and embryophytes. Similarly, the enrichment of some TF families follows the same two-step process, such as the enrichment of T-box and CSD TFs in holozoans. Later on, at the origin of Metazoa, Forkhead, RHD, homeobox and T-box TFs were significantly enriched. On the other hand, GATA, ARID, HLH, AP2 and zf-GRF were enriched in Archaeplastida, while SRF-TFs, homeobox, and TBPs were enriched at the onset of Embryophyta. This means that a large percentage of the metazoan and embryophyte regulatory toolkits were already in place before their origin and divergence. However, TFs that are enriched in metazoans fall into ancient TF classes, whereas new domains account for more than 50% of TF diversity in Archaeplastida. The Fungi also gained several specific TF families, such as STE-like or Copper-fist TFs, although some such as STAT or zf-TAZ were also lost, in contrast to what is observed in metazoans and embryophytes. Moreover, ZnClus and NDT80 were significantly enriched in Fungi (Figure 3). Finally, the only significant case of depletion observed is that of Myb TFs at the stem of Opisthokonta+Apusozoa. It is worth mentioning that this reconstruction remains tentative and is based on our current genomic taxonomic sampling. The sequencing of new genomes from additional taxa, especially in cases where a lineage is covered by only a single taxon, will help improve the resolution of the evolutionary scenario presented here.

Phylostratigraphic analysis of TFome expression in multicellular ontogenies

Our results show that complex multicellular organisms with embryonic development (i.e., embryophytes and metazoans) have the richest TFomes, which evolved through two bursts of innovation from the ancestral eukaryotic repertoire: one at the stem shared with their unicellular relatives and the other during the transition to multicellularity. To gain further insights into the similarities and differences between embryophytes and metazoans, we analysed the deployment of the TFome during the development of three model organisms, zebrafish (Danio rerio), fruit fly (Drosophila melanogaster) and mouse-ear cress (Arabidopsis thaliana).

We defined the phylostratigraphy of the TFome of these three species as previously described (10) (Figure S4). Consistent with the domain-based analysis, we recovered a similar evolutionary history of the TFome: two major steps of diversification for both embryophytes and metazoans (Figure S4). Metazoans show overrepresented gains in TFs in phylostrata ps2, ps5, ps6 and ps8 (Eukaryota, Opisthokonta, Holozoa, and Metazoa, respectively) and embryophytes in phylostrata ps2, ps3, ps4, ps5 and ps6 (Eukaryota, Bikonta, Archaeplastida, Viridiplantae, Embryophyta). Moreover, founder-gene analysis suggests that the evolution of the TFome was marked by periods of novel protein emergence as well as phases of extensive duplication (Figure S4).

Next, we used previously published gene expression datasets covering a series of developmental stages for these three model species (10, 43–45) to describe the average expression of the whole TFome during development (Figure 4 and Figures S5 and S6). However, the data used for D. melanogaster and A. thaliana have less staging density than the D. rerio data, increasing the possible noise. D. rerio has two peaks of TF activity during embryogenesis, one during gastrulation and a less pronounced one in the pharyngula stage, known as the vertebrate phylotypic stage (Figure 4A). D. melanogaster has a similar TFome profile during development to that observed in D. rerio, with a peak of TF expression during gastrulation and a subsequent peak during metamorphosis (Figure S5A). In both these metazoan taxa, TF expression increases after early development and decreases in adult stages, when most structures have already developed. As the A. thaliana datasets were much poorer in stages, we used two different expression datasets (44, 45). Both datasets show an increase in TF expression in the mature stage, in contrast to what is observed in metazoans (Figure S6A). This can be explained by the fact that embryophytes have an indeterminate development (46), in which organogenesis and the formation of new structures take place later in development, even during the adult stage. Nevertheless, variability in the A. thaliana datasets is still high, and there is a lack of data for later developmental stages, so conclusions from A. thaliana should be taken with caution.

To evaluate the contribution of the phylostrata that showed statistically significant signals in the overrepresentation analyses (Figure S4), we decomposed the data and calculated the relative expression of each phylostratum along development (10). D. rerio and D. melanogaster show similar expression patterns in early development, with a peak of opisthokont genes (ps5) before gastrulation and a peak of metazoan genes (ps9) during gastrulation (Figure 4C and Figure S5C), a defining developmental process that is essential for metazoan embryogenesis. The relative expression of genes from these two phylostrata (opisthokont and metazoan) again shows an increase in expression in later stages of development. Despite D. melanogaster being poorer in staging density, the patterns seem to be consistent with those in D. rerio. In contrast to metazoans, A. thaliana shows a very different pattern, in which eukaryotic TFs (ps2) predominate in early development, followed by bikont TFs (ps3), and finally plant-specific TFs in the mature stage (ps4-6) (Figure S6C). To evaluate general trends of TF age along development we used the Transcriptome Age Index (TAI), from which we identified patterns complementary to the contribution of specific phylostrata, with newer genes being more predominant in later development of all three species (Figure 4B and Figures S5 and S6). The younger genes in later development of D. rerio may have been influenced by the presence of the reproductive tract in the samples, as numerous TF innovations have been described in the reproductive tract (47).

Discussion

Phylogenetic inertia, lifestyle and genome structure influence the TF repertoire of eukaryotic taxa

Our study reconstructs with unprecedented detail the phylogenetic distribution of TF classes during eukaryote evolution, based on the presence or absence of their specific DBDs. We provide evidence of a paneukaryotic TF complement that is present in almost all eukaryotic groups analysed, and which was further expanded in some specific lineages, such as metazoans and embryophytes. These expansions were mainly due to gene duplications and, in some lineages, to diversification in protein domain architecture. In contrast, other TF classes present sharp phylogenetic boundaries, to the point that some well-defined eukaryotic lineages can be defined by their shared and specific TFome profiles. This suggests an important role for phylogenetic inertia in defining the TFome composition of a given clade (48).

Our results suggest a relationship between TFome content, lifestyle, genome dynamics and multicellularity. For example, strict parasitism, which has been shown to be linked to genome reduction and extensive gene loss (49), is associated with a marked reduction of TFs. The opposite, however, is not always true, in that free-living species do not always have rich TFomes. Instead, Whole Genome Duplications are associated with rich TFomes. Moreover, the relative abundance of each TF class as a proportion of the overall TFome of each species further highlights a strong phylogenetic patterning, probably as a result of the combined effects of phylogenetic inertia and system-level adaptation. In fact, each particular clade has a characteristic TFome profile, although a few lineages, such as Metazoa, Embryophyta and Dikarya, show drastic shifts in their TFome profiles compared to their ancestors. Unicellular holozoans (Choanoflagellata, Filasterea, and Ichthyosporea) and early-branching fungi (Mucoromycota, Blastocladiomycota and Chytridiomycota) have rather homogeneous TF profiles despite being paraphyletic. Conversely, the crown groups of these lineages, metazoans and Dikarya, respectively, changed their ancestral TF profile, suggesting that a genome-wide re-shaping of the TF cellular function occurred at the boundaries of key evolutionary transitions, followed by a subsequent stasis.

Metazoa and Embryophyta have the richest TFomes among eukaryotes

Notably, our data show that embryophytes and metazoans have the most complex TF repertoires of all eukaryotes. This contrasts with the relatively less complex TF repertoire of other multicellular lineages, such as Fungi, the brown alga E. siliculosus, or even the red alga Chondrus crispus (50). There exists the possibility that our analysis has missed as yet undescribed TFs that are unique to those lineages, as TFs that are restricted to small clades may be more prevalent than expected (51). In any case, fungi, and brown and red algae did not expand their repertoire of paneukaryotic TFs (Homeoboxes, bHLH and bZIPs) nor their protein domain architectures, in contrast to embryophytes or metazoans. We suggest that this difference is due to the fact that both embryophytes and metazoans go through a complex embryonic development, in contrast to the modular development of fungi, brown and red algae. The orchestrated embryonic development of both metazoans and embryophytes may need a more complex regulation, and thus a more complete transcriptional regulation molecular toolkit.

Pre-adaptive expansion of the TF repertoire in the unicellular ancestors of both Metazoa and Embryophyta

Moreover, our data show that the TF toolkit of metazoans and embryophytes was assembled in a step-wise manner from the ancestral eukaryotic TFome, with bursts of TF innovation 1) in their unicellular relatives, and 2) at the origin of the metazoan or embryophyte lineages. Notably, both unicellular holozoans and chlorophytes already have complex TFomes, both with regard to the number of genes and to the complexity of the protein domain architectures. Further expansions, however, occurred at the origin of both metazoans and embryophytes.

Analysis of TF expression patterns highlights differences in the ontogeny of both Metazoa and Embryophyta

The role of the different TFs in the ontogeny of the organisms analysed varies according to the evolutionary origin of the different TFs in the tree of life (their phylostrata). Thus, the relative importance of TFs corresponding to different phylostrata changes between developmental stages. For example, gastrulation shows remarkably high expression of TFs of metazoan origin. In general, both in D. rerio and D. melanogaster, evolutionarily younger TFs seem to be more important in later stages of development, somehow being added to terminal specifications, while evolutionarily older genes are prevalent in the earlier stages of development. Indeed, TF expression by itself marks the phylotypic stage of metazoans.

Conclusions

To sum up, we show that the evolution of the TFome is characterised by important similarities between the independent transitions to multicellularity of embryophytes and metazoans. Not only do they share a common pattern of TF origin and expansion, which is richer than in other eukaryotes, but they also express their TFs differentially during development, depending on their evolutionary age. This suggests that common evolutionary forces drove the unicellular-to-multicellular transition in two phylogenetically distinct lineages. We hypothesize that this is due to their complex embryonic development rather than just their multicellular lifestyle or number of cell types. The success of metazoans and embryophytes in producing extensive morphological diversification is due to the specific adaptations of their genomes, with the TFome being a key aspect in the acquisition of complex multicellularity and phenotypic plasticity.

Methods

TF identification

We obtained data on complete proteomes from publicly available databases. We also used RNAseq data for some taxa sequenced by the Broad Institute (http://www.broadinstitute.org/annotation/genome/multicellularity_project/MultiHome.html). To have a better representation of unicellular taxa close to Metazoa, we also screened some taxa whose RNAseq data were sequenced in-house. These include the nucleariid Nuclearia sp. (sister group of Fungi) and several unicellular holozoans (Ministeria vibrans, Pirum gemmata, Abeoforma whisleri, Amoebidium parasiticum and Corallochytrium limacisporum). In these cases, non-filtered reads were assembled using Trinity software (52) and a six-frame translation of the assembly was generated. A PfamScan was performed on all proteomes and transcriptomes using Pfam-A version 26 and selecting the gathering threshold option as a conservative approach to minimize false positives (53).

Transcription factors were selected from two database resources (Transcriptionfactor.org, PfamTOGo website) as well as previous studies on this topic (6, 16, 54). In all cases we defined a univocal one-to-one relationship between TF class and DBD class. DBDs that appeared only in combination with other DBDs were neglected in order to avoid an overestimation of TF numbers in some genomes.

We counted the number of genes containing a given DBD, and the number of different domain architectures associated with each DBD in each species, using custom Perl scripts (Figure 1, Figure S1). In cases where two or more DBDs were found in the same gene, the evolutionarily older DBD, or the larger one when ages were identical, was considered to define the TF type. To avoid overestimating architecture numbers due to problems with detecting small repeated domains, consecutively repeated domains were counted only once. Therefore, domain architectures were defined on the basis of the number and order of their domains, but not by the number of repeats of each domain.

Ancestral genome reconstruction and enrichment

Statistical analyses were performed using custom R scripts. We tested for enrichment of TF numbers using the Wilcoxon rank-sum test, with a significance threshold of p<0.01. Cluster analysis was performed using the pvclust R package, using the absolute value of sample correlation, complete hierarchical clustering, and 100,000 replications.

Analysis of TF gains and losses and reconstruction of ancestral states was performed using Mesquite (55), and the most parsimonious assumption was taken, except in the cases highlighted in Figure 3. In cases where a secondary DBD loss was inferred based on its absence in a single species (lineages represented by a single taxon are Rhizaria, Haptophyta, Cryptophyta, Glaucophyta and Rhodophyta), we confirmed its absence by performing tblastn searches against the Expressed Sequence Tags (EST) and nucleotide collection (nr/nt) databases at the NCBI website, which include other taxa from those lineages. The results are indicated in Table S3.

Phylostratigraphic analyses

The theoretical basis of genomic phylostratigraphy and detailed phylostratigraphic procedures have been described previously (10, 13, 56–58). Protein coding sequences for Danio rerio (25,638 genes), Drosophila melanogaster (13,413 genes), and Arabidopsis thaliana (27,148 genes) were retrieved from the Ensembl database (version 69) (59). We compared these protein sequences against the non-redundant (nr) database from NCBI using the BLASTP algorithm (BLAST program) with an E-value cut-off of 10^-3 (60). This database contains the most exhaustive set of known proteins across all organisms and is therefore the most suitable dataset for phylostratigraphic analysis. Prior to performing sequence similarity searches, we excluded all sequences of viral or unknown taxonomic origin, as well as those from metazoan taxa with a currently unreliable phylogenetic position (Myxozoa, Mesozoa, Chaetognatha, and Placozoa). Following this clean-up procedure, we complemented the nr database with sequenced genomes that were not present in the database but were otherwise available in other public repositories. The final database contained 8,393,531 protein sequences. Using the BLAST output obtained above, we mapped the TFs obtained from the PfamScan analysis for each species separately onto the consensus phylogeny. We used the phylogenetically most distant BLAST match, using an E-value cut-off of 10^-3, as the criterion for assigning the evolutionary origin of a gene. This is quite a conservative method for sorting genes that aims to detect novelty in the protein sequence space (10, 56, 61). The number and choice of internodes in the phylogeny is a result of balancing the robustness of these internodes in phylogenetic studies (32, 62–67), the availability of sequence data for the sequence similarity searches, and the importance of evolutionary transitions (Table S4). Our consensus phylogenies span 16 evolutionary levels (phylostrata) for Arabidopsis thaliana, 18 for Danio rerio, and 20 for Drosophila melanogaster, starting from the origin of cellular organisms (ps1).

We also calculated the number of founder TFs in each phylostratum by self-comparing sequences within a phylostratum by BLAST analysis (E-value cut-off 10^-3). For instance, we compared all of the TFs in the first phylostratum (ps1) to TF genes from phylostratum 1 (ps1). The number of founder TFs in each phylostratum, G_f, was calculated from the number of hits (H) obtained for every TF using Eq1:

[Eq1]

Overrepresentation analysis and statistics

We analysed overrepresentation of transcription factors both at the level of the total genome and at the level of founder genes. In both cases, and for every group of transcription factors of interest, we performed overrepresentation analyses by comparing the frequency of transcription factors in each phylostratum to that of all genes in that phylostratum (expected frequency) (56, 57). The deviations obtained were expressed as log-odds ratios, and their significance was tested using a two-tailed hypergeometric test (68), corrected for multiple comparisons using a false discovery rate (FDR) of 0.05 (69) (Figure S4).

Transcriptome age index (TAI)

TAI is a measure that reflects the evolutionary age of a transcriptome at a given ontogenetic stage (10). TAI is the weighted mean of phylogenetic ranks (phylostrata) and is calculated for every ontogenetic stage s as follows (Eq2):

$$\mathrm{TAI}_s = \frac{\sum_{i=1}^{n} ps_i \, e_i}{\sum_{i=1}^{n} e_i}$$

where ps_i is an integer representing the phylostratum of gene i (e.g. 1, the oldest; 18, the youngest for D. rerio), e_i is the microarray signal intensity value of gene i, which acts as a weighting factor, and n is the total number of genes analysed. Note that TAI=1 indicates that all genes expressed in a specific stage come from ps1 (origin of cellular organisms), and TAI=18 (zebrafish) indicates that a given stage expresses only D. rerio-specific genes. Therefore, lower TAI values correspond to a phylogenetically older transcriptome.

TAI statistical analysis

To test for significant differences in TAI between stages we used repeated measures ANOVA. This approach is suitable for this type of data because the expression levels of the same set of probes are measured at every stage, leading to interdependence between measurements. We multiplied expression frequencies by the total number of probes analysed (a constant, n) before comparing the means of these products between stages using ANOVA. This transformation does not influence the ANOVA and its sole purpose is to make the means of gene expression frequencies compared in the ANOVA equal to their corresponding TAI values. Since the assumption of sphericity was violated in data sets analysed by repeated measures ANOVA, we applied the Greenhouse–Geisser correction. The grey area around TAI values in figures corresponds to ± one standard error of the mean.

Acknowledgements. We thank Joint Genome Institute and Broad Institute for making these data publicly available. We thank Andy Baxevanis for sharing M. leidyi sequences, and Scott A. Nichols for sharing O. carmela sequences. This where G represents the number of TFs in the phylostratum and 1 ื H ื G. The work was supported byacontract from the Instituci ó Catalana de Recerca lowest possible value of Gf is 1, indicating that all genes in the phylostratum i Estudis Avan çats, a European Research Council Starting Grant (ERC-2007- are related, and the highest possible value is G, indicating that all genes in StG-206883), and a grant (BFU2011-23434) from Ministerio de Economía y the phylostratum are founders. However, gene founder analysis encompass Competitividad (MINECO) to I. R.-T.. AdM was supported byaFormaci ón full sequence length not only domains that are signatures of TFs, providing Personal Investigador grant from MINECO. A.S-P. was supported byapre- a complementary view on the DBD approach. graduate Formaci ón Profesorado Universitario grant from MINECO.
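As a rough illustration of the phylostratigraphic procedure described in the Methods, the following Python sketch assigns a gene to the phylostratum of its phylogenetically most distant BLAST match that passes the E-value cut-off, and estimates the number of founder genes from per-gene hit counts. The function names, the input layout (lists of taxon and E-value pairs) and the `taxon_to_ps` lookup are hypothetical and are not taken from the original pipeline. The founder estimate assumes that Eq1 is the sum of 1/H over the TFs of a phylostratum, which matches the stated bounds (1 when all genes are related, G when all are founders).

```python
from typing import Dict, Iterable, List, Tuple

E_VALUE_CUTOFF = 1e-3  # cut-off quoted in the Methods


def assign_phylostratum(
    hits: Iterable[Tuple[str, float]],
    taxon_to_ps: Dict[str, int],
    youngest_ps: int,
    cutoff: float = E_VALUE_CUTOFF,
) -> int:
    """Return the oldest phylostratum supported by a significant hit.

    hits        -- (taxon, E-value) pairs for one query protein (hypothetical layout)
    taxon_to_ps -- phylostratum at which each hit taxon joins the focal lineage
                   (1 = origin of cellular organisms)
    youngest_ps -- rank of the focal species itself, used when nothing older matches
    """
    strata = [taxon_to_ps[t] for t, e in hits if e <= cutoff and t in taxon_to_ps]
    return min(strata, default=youngest_ps)


def founder_count(hit_counts: List[int]) -> float:
    """Estimate founder genes as sum(1/H).

    H is the number of hits a TF obtains within its own phylostratum,
    self-hit included, so 1 <= H <= G (assumed form of Eq1).
    """
    g = len(hit_counts)
    if any(h < 1 or h > g for h in hit_counts):
        raise ValueError("each H must satisfy 1 <= H <= G")
    return sum(1.0 / h for h in hit_counts)


# Toy example with made-up hits, using the 18-level D. rerio numbering.
lookup = {"Escherichia coli": 1, "Saccharomyces cerevisiae": 5, "Homo sapiens": 16}
gene_hits = [
    ("Homo sapiens", 1e-50),
    ("Saccharomyces cerevisiae", 2e-7),
    ("Escherichia coli", 0.5),  # fails the cut-off, so it is ignored
]
print(assign_phylostratum(gene_hits, lookup, youngest_ps=18))
print(founder_count([1, 1, 1]), founder_count([4, 4, 4, 4]))
```

In this toy case the yeast hit is the most distant significant match, so the gene is placed at the Opisthokonta level rather than at the origin of cellular organisms.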

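The overrepresentation statistics described in the Methods can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text does not give the exact log-odds formula, so the sketch uses the log of the 2x2 odds ratio with 0.5 added to every cell to avoid division by zero; the two-tailed hypergeometric p-value is taken as the summed probability of all outcomes no more likely than the observed one; and the FDR step is the Benjamini-Hochberg step-up procedure. All function names and the toy numbers are hypothetical.

```python
from math import comb, log
from typing import List


def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(X = k) when n genes are drawn from N genes, K of which are TFs."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def two_tailed_hypergeom(k: int, N: int, K: int, n: int) -> float:
    """Sum the probabilities of all outcomes no more likely than k (assumed definition)."""
    lo, hi = max(0, n - (N - K)), min(n, K)
    p_obs = hypergeom_pmf(k, N, K, n)
    total = sum(
        p
        for p in (hypergeom_pmf(x, N, K, n) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-9)  # small tolerance for floating-point ties
    )
    return min(1.0, total)


def log_odds(k: int, N: int, K: int, n: int) -> float:
    """Log odds ratio of the 2x2 table, with 0.5 added to each cell (assumption)."""
    a = k + 0.5              # TFs inside the phylostratum
    b = n - k + 0.5          # non-TFs inside the phylostratum
    c = K - k + 0.5          # TFs outside the phylostratum
    d = N - K - n + k + 0.5  # non-TFs outside the phylostratum
    return log((a * d) / (b * c))


def benjamini_hochberg(pvals: List[float], alpha: float = 0.05) -> List[bool]:
    """Return, per test, whether it is significant at FDR level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            max_rank = rank
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        significant[i] = rank <= max_rank
    return significant


# Toy genome: 1000 genes, 50 TFs; three phylostrata given as (genes, TFs).
N_GENES, N_TFS = 1000, 50
strata = [(100, 12), (300, 15), (600, 23)]
pvals = [two_tailed_hypergeom(k, N_GENES, N_TFS, n) for n, k in strata]
for (n, k), p, sig in zip(strata, pvals, benjamini_hochberg(pvals)):
    print(n, k, round(log_odds(k, N_GENES, N_TFS, n), 3), p, sig)
```

Exact binomial coefficients keep the sketch dependency-free, but they become slow for genome-scale counts; a statistics library would be the practical choice there.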
1. Wunderlich Z, Mirny LA (2009) Different gene regulation strategies revealed by analysis of binding motifs. Trends Genet 25:434–440.
2. Latchman DS (1997) Transcription factors: An overview. Int J Biochem Cell Biol 29:1305–1312.
3. Latchman DS (2008) Eukaryotic transcription factors (Elsevier/Academic Press, Netherlands), 5th edition.
4. Papavassiliou A (1997) Transcription factors in eukaryotes (Landes Bioscience, Austin, TX).
5. Spitz F, Furlong EEM (2012) Transcription factors: from enhancer binding to developmental control. Nat Rev Genet 13:613–626.
6. Vaquerizas JM, Kummerfeld SK, Teichmann SA, Luscombe NM (2009) A census of human transcription factors: function, expression and evolution. Nat Rev Genet 10:252–63.
7. Levine M, Tjian R (2003) Transcription regulation and animal diversity. Nature 424:147–51.
8. Charoensawan V, Wilson D, Teichmann SA (2010) Genomic repertoires of DNA-binding transcription factors across the tree of life. Nucleic Acids Res 38:7364–77.
9. Meyerowitz EM (2002) Plants compared to animals: the broadest comparative study of development. Science 295:1482–5.
10. Domazet-Lošo T, Tautz D (2010) A phylogenetically based transcriptome age index mirrors ontogenetic divergence patterns. Nature 468:815–8.
11. Kalinka AT et al. (2010) Gene expression divergence recapitulates the developmental hourglass model. Nature 468:811–4.
12. Irie N, Kuratani S (2011) Comparative transcriptome analysis reveals vertebrate phylotypic period during organogenesis. Nat Commun 2.
13. Quint M et al. (2012) A transcriptomic hourglass in plant embryogenesis. Nature 490:98–101.
14. Levin M, Hashimshony T, Wagner F, Yanai I (2012) Developmental milestones punctuate gene expression in the Caenorhabditis embryo. Dev Cell 22:1101–8.
15. Charoensawan V, Wilson D, Teichmann SA (2010) Lineage-specific expansion of DNA-binding transcription factor families. Trends Genet 26:388–93.
16. Weirauch MT, Hughes TR (2011) in A Handbook of Transcription Factors, ed Hughes TR (Springer Netherlands, Dordrecht), pp 25–73.
17. Lang D et al. (2010) Genome-wide phylogenetic comparative analysis of plant transcriptional regulation: a timeline of loss, gain, expansion, and correlation with complexity. Genome Biol Evol 2:488–503.
18. Shelest E (2008) Transcription factors in fungi. FEMS Microbiol Lett 286:145–51.
19. Sebé-Pedrós A, de Mendoza A, Lang BF, Degnan BM, Ruiz-Trillo I (2011) Unexpected Repertoire of Metazoan Transcription Factors in the Unicellular Holozoan Capsaspora owczarzaki. Mol Biol Evol 28:1241–54.
20. Degnan BM, Vervoort M, Larroux C, Richards GS (2009) Early evolution of metazoan transcription factors. Curr Opin Genet Dev 19:591–9.
21. Rayko E, Maumus F, Maheswari U, Jabbari K, Bowler C (2010) Transcription factor families inferred from genome sequences of photosynthetic stramenopiles. New Phytol 188:52–66.
22. Shiu S, Shih M, Li W (2005) Transcription factor families have much higher expansion rates in plants than in animals. Plant Physiol 139:18–26.
23. Bürglin TR (2011) in A Handbook of Transcription Factors, ed Hughes TR (Springer Netherlands, Dordrecht), pp 95–122.
24. Ruiz-Trillo I et al. (2007) The origins of multicellularity: a multi-taxon genome initiative. Trends Genet 23:113–8.
25. Glockling SL, Marshall WL, Gleason FH (2013) Phylogenetic interpretations and ecological potentials of the Mesomycetozoea (Ichthyosporea). Fungal Ecol:1–11.
26. Aury J-M et al. (2006) Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.
27. Albertin W, Marullo P (2012) Polyploidy in fungi: evolution after whole-genome duplication. Proc Biol Sci 279:2497–509.
28. De Smet R et al. (2013) Convergent gene loss following gene and genome duplications creates single-copy families in flowering plants. Proc Natl Acad Sci USA 110:2898–903.
29. Van de Peer Y, Maere S, Meyer A (2009) The evolutionary significance of ancient genome duplications. Nat Rev Genet 10:725–732.
30. Srivastava M et al. (2010) The Amphimedon queenslandica genome and the evolution of animal complexity. Nature 466:720–726.
31. Pick KS et al. (2010) Improved phylogenomic taxon sampling noticeably affects nonbilaterian relationships. Mol Biol Evol 27:1983–7.
32. Philippe H et al. (2009) Phylogenomics revives traditional views on deep animal relationships. Curr Biol 19:706–12.
33. Nosenko T et al. (2013) Deep metazoan phylogeny: When different genes tell different stories. Mol Phylogenet Evol 67:223–233.
34. Sebé-Pedrós A, Zheng Y, Ruiz-Trillo I, Pan D (2012) Premetazoan Origin of the Hippo Signaling Pathway. Cell Rep 1:1–8.
35. Price DC et al. (2012) Cyanophora paradoxa Genome Elucidates Origin of Photosynthesis in Algae and Plants. Science 335:843–847.
36. Cock JM et al. (2010) The Ectocarpus genome and the independent evolution of multicellularity in brown algae. Nature 465:617–21.
37. Radakovits R et al. (2012) Draft genome sequence and genetic transformation of the oleaginous alga Nannochloropis gaditana. Nat Commun 3.
38. Peters AF et al. (2008) Life-cycle-generation-specific developmental processes are modified in the immediate upright mutant of the brown alga Ectocarpus siliculosus. Development 135:1503–12.
39. Coelho SM et al. (2011) OUROBOROS is a master regulator of the gametophyte to sporophyte life cycle transition in the brown alga Ectocarpus. Proc Natl Acad Sci USA 108:11518–23.
40. Bouget FY, Berger F, Brownlee C (1998) Position dependent control of cell fate in the Fucus embryo: role of intercellular communication. Development 125:1999–2008.
41. King N (2004) The unicellular ancestry of animal development. Dev Cell 7:313–25.
42. Rokas A (2008) The origins of multicellularity and the early history of the genetic toolkit for animal development. Annu Rev Genet 42:235–51.
43. Arbeitman MN et al. (2002) Gene Expression During the Life Cycle of Drosophila melanogaster. Science 297:2270–2275.
44. Xiang D et al. (2011) Genome-Wide Analysis Reveals Gene Expression and Metabolic Network Dynamics during Embryo Development in Arabidopsis. Plant Physiology 156:346–356.
45. Belmonte MF et al. (2013) Comprehensive developmental profiles of gene activity in regions and subregions of the Arabidopsis seed. Proc Natl Acad Sci USA 110:435–44.
46. Valentine JW, Tiffney BH, Sepkoski JJ (1991) Evolutionary dynamics of plants and animals: a comparative approach. Palaios 6:81–8.
47. Turner LM, Hoekstra HE (2008) Causes and consequences of the evolution of reproductive proteins. Int J Dev Biol 52:769–80.
48. Burt DB (2001) Evolutionary stasis, constraint and other terminology describing evolutionary patterns. Biol J Linn Soc 72:509–517.
49. Kurland CG, Collins LJ, Penny D (2006) Genomics and the Irreducible Nature of Eukaryote Cells. Science 312:1011–1014.
50. Collen J et al. (2013) Genome structure and metabolic features in the red seaweed Chondrus crispus shed light on evolution of the Archaeplastida. Proc Natl Acad Sci USA 110:5247–52.
51. Lohse MB et al. (2013) Identification and characterization of a previously undescribed family of sequence-specific DNA-binding domains. Proc Natl Acad Sci USA 110:7660–7665.
52. Grabherr MG et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29:644–652.
53. Punta M et al. (2012) The Pfam protein families database. Nucleic Acids Res 40:D290–D301.
54. Wilson D, Charoensawan V, Kummerfeld SK, Teichmann SA (2008) DBD--taxonomically broad transcription factor predictions: new content and functionality. Nucleic Acids Res 36:D88–D92.
55. Maddison WP, Maddison DR (2011) Mesquite: a modular system for evolutionary analysis. Version 2.75. Available at: http://mesquiteproject.org.
56. Domazet-Lošo T, Brajković J, Tautz D (2007) A phylostratigraphy approach to uncover the genomic history of major adaptations in metazoan lineages. Trends Genet 23:533–539.
57. Domazet-Lošo T, Tautz D (2010) Phylostratigraphic tracking of cancer genes suggests a link to the emergence of multicellularity in metazoa. BMC Biol 8:66.
58. Carvunis A-R et al. (2012) Proto-genes and de novo gene birth. Nature 487:370–4.
59. Flicek P et al. (2011) Ensembl 2011. Nucleic Acids Res 39:D800–806.
60. Altschul SF et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402.
61. Tautz D, Domazet-Lošo T (2011) The evolutionary origin of orphan genes. Nat Rev Genet 12:692–702.
62. Blair J, Hedges S (2005) Molecular phylogeny and divergence times of deuterostome animals. Mol Biol Evol 22:2275–2284.
63. Delsuc F, Brinkmann H, Chourrout D, Philippe H (2006) Tunicates and not cephalochordates are the closest living relatives of vertebrates. Nature 439:965–968.
64. Delsuc F, Tsagkogeorga G, Lartillot N, Philippe H (2008) Additional molecular support for the new chordate phylogeny. Genesis 46:592–604.
65. Telford M, Bourlat SJ, Economou A, Papillon D, Rota-Stabelli O (2008) The evolution of the Ecdysozoa. Philos Trans R Soc Lond B Biol Sci 363:1529–1537.
66. Budd G, Telford M (2009) The origin and evolution of arthropods. Nature 457:812–817.
67. Heimberg A (2010) microRNAs reveal the interrelationships of hagfish, lampreys, and gnathostomes and the nature of the ancestral vertebrate. Proc Natl Acad Sci USA 107:19379–19383.
68. Rivals I, Personnaz L, Taing L, Potier M (2007) Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics 23:401–407.
69. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol 57:289–300.
70. Hampl V et al. (2009) Phylogenomic analyses support the monophyly of Excavata and resolve relationships among eukaryotic "supergroups". Proc Natl Acad Sci USA 106:3859–64.
71. Derelle R, Lang BF (2012) Rooting the eukaryotic tree with mitochondrial and bacterial proteins. Mol Biol Evol 29:1277–89.
72. Burki F, Okamoto N, Pombert J-F, Keeling PJ (2012) The evolutionary history of haptophytes and cryptophytes: phylogenomic evidence for separate origins. Proc Biol Sci 279:2246–54.
73. Torruella G et al. (2012) Phylogenetic relationships within the Opisthokonta based on phylogenomic analyses of conserved single-copy protein domains. Mol Biol Evol 29:531–44.
74. Brown MW, Kolisko M, Silberman JD, Roger AJ (2012) Aggregative Multicellularity Evolved Independently in the Eukaryotic Supergroup Rhizaria. Curr Biol 22:1–5.

Figure S1. Heatmap depicting the number of different protein domain architectures of each TF in a genome. For each TF type, the accompanying domains to the DBD have been analysed, defining complete protein domain architectures. The total number of different domain architectures (e.g. PBC domain//Homeobox_KN, MEIS//Homeobox_KN, …) for each DBD domain (e.g. Homeobox_KN) has been counted for each species. Raw data in Table S1. Further taxonomic information in Table S2.
[Heatmap: rows are DBD families grouped into paneukaryotic, Unikonta, Holozoa, Metazoa, Fungi, Archaeplastida and Embryophyta clusters; columns are species; a colour key with histogram accompanies the plot.]

Figure S2. Linear graph comparing the number of given TFs and the number of different TF protein domain architectures in a particular genome.
[Four panels: Myb DNA-binding, zf-C2H2, Homeobox and HLH. Left axis: number of genes; right axis: number of architectures; x-axis: species, grouped as unicellular Holozoa, Metazoa, Amoebozoa, Fungi, Embryophyta, Chlorophyta, Stramenopila, Alveolata and Excavata.]

Figure S3. Cluster analysis of eukaryotes based on TFome content. In red, the AU (Approximately Unbiased) p-value is indicated. In green, the BP (Bootstrap Probability) value is indicated.
[Dendrogram titled "Cluster dendrogram with AU/BP values (%)"; y-axis: height (0.0 to 1.0); leaves are species; labelled groups: Fungi, Metazoa, unicellular Holozoa, Amoebozoa and Viridiplantae.]
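To make the clustering behind Figure S3 more concrete, here is a small pure-Python sketch of complete-linkage hierarchical clustering on an absolute-correlation distance. It is not the pvclust procedure itself: it omits the multiscale bootstrap that produces the AU and BP values, and it assumes that "the absolute value of sample correlation" corresponds to a distance of 1 − |r| between the TF count profiles of two species. The function names and the toy profiles are hypothetical.

```python
from math import sqrt
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long profiles (undefined for constant input)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def abscor_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Distance 1 - |r| (assumed reading of the Methods)."""
    return 1.0 - abs(pearson(x, y))


def complete_linkage(
    profiles: Dict[str, Sequence[float]]
) -> List[Tuple[Set[str], Set[str], float]]:
    """Agglomerate species; return the merges as (cluster_a, cluster_b, height)."""
    clusters: List[FrozenSet[str]] = [frozenset([name]) for name in profiles]

    def dist(c1: FrozenSet[str], c2: FrozenSet[str]) -> float:
        # Complete linkage: the distance between the two farthest members.
        return max(abscor_distance(profiles[a], profiles[b]) for a in c1 for b in c2)

    merges = []
    while len(clusters) > 1:
        height, i, j = min(
            (dist(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        )
        merges.append((set(clusters[i]), set(clusters[j]), height))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Made-up TF counts per DBD family for three hypothetical species.
toy = {"sp_a": [1, 2, 3, 4], "sp_b": [2, 4, 6, 8], "sp_c": [4, 1, 3, 2]}
for left, right, height in complete_linkage(toy):
    print(sorted(left), sorted(right), round(height, 3))
```

Because the distance uses |r|, two species whose TF profiles are proportional cluster at height 0 regardless of total TF number, so the grouping reflects the composition of the TFome rather than its size.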

Figure S4

A B C

40 853 22 - 91 19 17 - 189 66 20 - 5 - - 411Tfs 16 294 9227 7 4 1 77 24 15 7 2 - - 2 4 131Tfs 55 588 87 17 231 284 7 10 61 2 3 3 - 71 -Tfs 8 39 5-15 5 6-28 23 10 - 3 - - 311Tfs Fs 5 34 7211 3 4 1 21 16 922- -1 3 131Tfs Fs 17 115 9416 97 5 5 36 2 2 3 - 61 -Tfs Fs 3 3 3

Drosophila transcription factors *** *** *** 2 Zebrafish transcription 2 Drosophila TF founders 2 *** * *** factors *** Arabidopsis transcription factors *** * *** *** ** *** *** *** Zebrafish TF founders *** * *** ** Arabidopsis TF founders resented 1 1 1 *** * *** *

* Over-rep

0 0 0

Under-represented

Log-odds *

-1 -1 -1 *** ** * ** *** * ** ** ** * * *** * *** -2 -2 ** *** -2 *** ** *** * ** * *** *** *** *** *** *** *** *** *** -3 *** *** -3 *** -3 *** *** 1 2 345 6 789 10 11 12 13 14 15 16 17 18 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 1 2 344 565 6 789 10 11 12 13 14 15 16 Danio rerio Drosophila melanogaster Arabidopsis thaliana Actinopterygii Drosophila Arabidopsis Euteleostomi 18 Diptera 20 Brasicaceae 16 Vertebrata 17 Endopterygota 19 eurosids II 15 Olfactores 16 Insecta 18 rosids 14 Chordata 15 Pancrustacea 17 core eudicotyledons 13 Deuterostomia 14 Arthropoda 16 eudicotyledons 12 Bilateria 13 Ecdysozoa 15 Magnoliophyta 11 Eumetazoa 12 Protostomia 14 Spermatophyta 10 Metazoa 11 Bilateria 13 Tracheophyta 9 Choanoflagellida/Metazoa 10 Eumetazoa 12 Embryophyta 8 Filozoa 9 Metazoa 11 Viridiplantae 7 Holozoa 8 Choanoflagellida/Metazoa 10 Archaeplastid a 6 Opisthokonta 7 Filozoa 9 Bikonta 5 Apusozoa/Opisthokonta 6 Holozoa 8 Eukaryota 4 Unikonta 5 Opisthokonta 7 Cellular org. 3 Eukaryota 4 Apusozoa/Opisthokonta 6 2 Cellular org. 3 Unikonta 5 1 2 Eukaryota 4 1 Cellular org. 3 2 1 !

Figure S4. Statistical analysis of TFs in a phylostratigraphic map in D. rerio, D. melanogaster and A. thaliana. Arrows indicate the strongest significant over-representations. To explore the sequence diversity of TFs underlying the statistical signals, we re-evaluated over-representation patterns using an estimate of the frequency of founder genes across phylostrata (56). The patterns obtained largely agree with the initial analyses, indicating that substantial protein sequence diversity underlies the observed signals, rather than an extreme duplication of a few novel sequences. However, there are several notable exceptions. In both animal species, i.e. D. rerio and D. melanogaster, there is a loss of signal at the origin of Eukaryotes (ps2), indicating that the repertoire of TFs that originated at ps2 in animals is largely the result of later duplication events. In A. thaliana a similar pattern can be observed at the origin of Bikonta (ps3). In addition, the founder gene analysis revealed some novel signals that point to evolutionary periods that were hidden in the initial analysis. For instance, both zebrafish and fruit fly profiles show additional signals at the origin of Eumetazoa (ps10), whereas in A. thaliana there is significant over-representation at the origin of Magnoliophyta (ps9).

Figure S5. A) Average expression level of the TFome during development in Drosophila melanogaster; the area shaded in red is the proposed phylotypic stage (10). B) Transcriptome Age Index (TAI) of the TFome during development in Drosophila melanogaster. The higher the TAI, the younger the TFs that are expressed, and vice versa. The grey area around the TAI values corresponds to ± one standard error of the mean. C) Relative expression of transcription factors sorted by significantly enriched phylostrata during development in Drosophila melanogaster. For easier comparison, relative expression is shown in relation to the highest (0) and lowest (1) expression values across developmental stages (see Methods). Cl, cleavage; GBE, germ band elongation; GBR, germ band retraction; HI, head involution; Df, differentiation.
[Figure S5 panels. A and B: x-axis, ontogeny (embryo, larvae, pupae, adult), with stages grouped as cleavage, germ band elongation, germ band retraction, head involution, differentiation, larvae, pupae and adult. C: y-axis, relative expression level; phylostrata shown: (2) Eukaryota, (5) Opisthokonta, (6) Holozoa, (9) Metazoa.]

Figure S6. A) Average expression level of the TFome during development in Arabidopsis thaliana; the area shaded in red is the proposed phylotypic stage (10). B) Transcriptome Age Index (TAI) of the TFome during development in Arabidopsis thaliana. The higher the TAI, the younger the TFs that are expressed, and vice versa. The grey area around the TAI values corresponds to ± one standard error of the mean. C) Relative expression of transcription factors sorted by significantly enriched phylostrata during development in Arabidopsis thaliana. For easier comparison, relative expression is shown in relation to the highest (0) and lowest (1) expression values across developmental stages (see Methods). Z, zygote; Q, quadrant; G, globular; H, heart; T, torpedo; B, bent cotyledon; M, mature.
[Figure S6 panels. A and B: x-axis, ontogeny (embryo stages, in DAP). C: y-axis, relative expression level; embryo datasets from Xiang et al. 2011 and Belmonte et al. 2013; phylostrata shown: (2) Eukaryota, (3) Bikonta, (4) Archaeplastida, (5) Viridiplantae, (6) Embryophyta.]
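The TAI shown in panel B of Figures S5 and S6 can be illustrated with a short sketch. It assumes the usual definition of the index as an expression-weighted mean of phylostratum ranks, TAI_s = Σ(ps_i · e_is) / Σ(e_is), for each stage s. The gene names, phylostratum assignments and expression values below are invented toy data, not values from this study, and the function name is hypothetical.

```python
def transcriptome_age_index(phylostrata, expression):
    """Expression-weighted mean phylostratum for one developmental stage.

    phylostrata: dict gene -> phylostratum rank (1 = oldest).
    expression:  dict gene -> expression level in that stage.
    """
    total = sum(expression.values())
    if total == 0:
        raise ValueError("no expression recorded for this stage")
    weighted = sum(phylostrata[gene] * level for gene, level in expression.items())
    return weighted / total


# Hypothetical toy data (not from the thesis).
phylostrata = {"tf_old": 2, "tf_mid": 6, "tf_young": 9}
stage_early = {"tf_old": 1.0, "tf_mid": 1.0, "tf_young": 2.0}
stage_mid = {"tf_old": 3.0, "tf_mid": 1.0, "tf_young": 0.0}

tai_early = transcriptome_age_index(phylostrata, stage_early)
tai_mid = transcriptome_age_index(phylostrata, stage_mid)
```

In this toy case the stage dominated by the old TF has the lower TAI, which is the sense in which a higher TAI indicates that younger TFs are expressed.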

Table S3

No.  Taxonomic group    Lost domains
1    Choanoflagellata   zf C2HC, zf BED, RHD, Runt
2    Nucleariida        T-box, zf-BED, WRKY, GCR1_C, NDT80
3    Apusozoa           zf TAZ, zf BED, TEA, CG-1, YABBY, NDT80
4    Stramenopile       zf NF X1* (present in Blastocystis hominis), GATA
5    Alveolata          zf TAZ, zf NF X1, HLH, Zn_Clus, zf GRF
6    Rhizaria           zf TAZ, bZIP2, Homeobox, GATA
7    Haptophyta         zf TAZ, zf BED, Whirly, WRKY, zf NF X1, HLH, Homeobox_KN* (present in Isochrysis galbana and Prymnesium parvum)
8    Excavata           zf TAZ, zf BED, Whirly, CG-1, AP2, HLH, HSF
9    Cryptophyta        zf TAZ, zf BED, Whirly, WRKY, zf NF X1, YL1, GATA
10   Glaucophyta        zf TAZ, Whirly, CG-1, YABBY, WRKY, PLATZ, YL1
11   Rhodophyta         zf TAZ, zf BED, Whirly, CG-1, YABBY, WRKY, zf NF X1, AP2, bZIP2* (present in Chondrus crispus), Tub, HMG_box, CBFB NFYA

Table S3. Secondary losses; data from Figure 3. *Absent in the sampled genomes, but present in other species of the group (retrieved by tblastn in EST collections or in the non-redundant nucleotide NCBI database).

DISCUSSION

Genomic evolutionary patterns in the protozoan/metazoan border: gene innovation and co-option

Since the sequencing of the first eukaryotic genomes, it has been clear that the frontier between metazoans and the rest of eukaryotes did not involve a huge leap in terms of protein gain. Starting from an ancestral toolkit of 3413 KOGs (eukaryotic orthologous groups), bilaterians invented only 1358 new KOGs (Koonin et al. 2004). However, the taxon sampling of that study was extremely limited. Homo sapiens, Drosophila melanogaster and Caenorhabditis elegans are not the best taxa from which to infer the urmetazoan gene repertoire: the two ecdysozoan species have suffered major gene losses, so the species-specific gains of H. sapiens were overestimated and the urmetazoan repertoire was underestimated (Putnam et al. 2007; Simakov et al. 2013). The eukaryotic out-group of that pioneering study was also very limited: the fungal genomes sampled were secondarily simplified and there was just one bikont, the plant Arabidopsis thaliana (Figure 9). Nevertheless, further analyses with improved taxon sampling revealed similar tendencies. For example, a phylostratigraphic analysis of the origins of D. melanogaster proteins showed that slightly more than 50% of the genes have ancient eukaryotic or bacterial origins, and that the other major peaks of gene innovation lie at the root of Bilateria and at the species level, not at the origin of metazoans (Domazet-Lošo et al. 2007). In another study, about 80% of the inferred common eumetazoan proteome had ancient (pre-metazoan) origins, leaving 1584 metazoan innovations out of 7766 (Putnam et al. 2007). The data obtained from the UNICORN project allowed a better resolution of that event, increasing the size of the inferred urmetazoan genome to ~10,000 proteins (roughly twice the 5313 of Koonin et al. 2004) and setting the number of urmetazoan-specific protein gains at 1235 (Figure 9, Fairclough et al. 2013). Thus, the metazoan ancestor only had to innovate little more than 10% of its proteome content to undergo the evolutionary transition to multicellularity.
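
The proportions quoted in the paragraph above can be checked with simple arithmetic. The sketch below only divides the gain counts cited in the text by the corresponding repertoire sizes; taking 10,000 as the urmetazoan total is an assumption that rounds the "~10,000" given above, and the function name is hypothetical.

```python
def innovation_fraction(gained, total):
    """Fraction of a repertoire that is inferred to be newly gained."""
    return gained / total


# Counts as cited in the text.
eumetazoan = innovation_fraction(1584, 7766)    # Putnam et al. 2007
urmetazoan = innovation_fraction(1235, 10_000)  # Fairclough et al. 2013; total rounded

print(f"eumetazoan innovations: {eumetazoan:.3f}")
print(f"urmetazoan innovations: {urmetazoan:.3f}")
```

Both fractions fall between one tenth and one fifth of the repertoire, which is the basis for describing the protein gain at this transition as modest.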
Figure 9. A) Gain and loss of KOGs (clusters of orthologous genes) across the evolution of eukaryotes. Dme (Drosophila melanogaster), Cel (Caenorhabditis elegans), Hsa (Homo sapiens), Sce (Saccharomyces cerevisiae), Spo (Schizosaccharomyces pombe), Ecu (Encephalitozoon cuniculi), Ath (Arabidopsis thaliana). From Koonin et al. 2004. B) Gain and loss of orthologous genes (OrthoMCL) across Opisthokonta. Data include C. owczarzaki, choanoflagellates and basal metazoans. From Fairclough et al. 2013.

The automatic detection of orthologs in big datasets is, however, a delicate issue, and the reciprocal best-hit approach (used in the studies listed above) is known to be a rather imprecise way to tackle the question (Gabaldón & Koonin 2013). Phylogenetic inference usually outperforms similarity-based orthology assignment because of the complex nature of eukaryotic genes, which often have intricate duplication histories and chimeric protein domain compositions. There are two possible ways to detect genome-wide proteomic evolutionary events: one is automatic phylogenetic inference for all the genes in a genome (the phylome), and the other is reducing the unit of homology from genes to protein domains. The phylome is computationally very intensive, so it cannot cover a broad taxon sampling and may therefore lose some information, and it still has problems dealing with multi-domain proteins. In contrast, the analysis of protein domains is rather fast and is based on the smallest evolutionary units, which are confidently detected by Hidden Markov Models (Punta et al. 2012; Gabaldón & Koonin 2013). Protein domains are conserved regions that can be present in completely different genes, and more than once within a single gene. They are essentially modular structural features that retain very basic functions, such as protein-protein interaction or DNA binding. Domain conservation is thus less informative at the functional level than conservation of the whole gene (Gabaldón & Koonin 2013). However, this approach helps with domain architectures, as it reduces the complexity of multi-domain proteins to single independent units. Overall, it provides a complementary view to the inference of gene orthology. Using protein domains to understand metazoan origins, we found that only 235 domains were gained relative to the holozoan repertoire of 4978, representing less than 5% innovation (Results R4).
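
Reducing the unit of homology to domains turns the comparison into set operations on domain repertoires, which can be sketched as follows. This is a minimal illustration and not the procedure used in Results R4: the rule applied here (a domain counts as a lineage gain if it occurs in at least one in-group genome and in no out-group genome) is an assumption, and the genome names and domain labels are placeholders rather than real annotations.

```python
def lineage_gains(ingroup, outgroup):
    """Domains found in at least one in-group genome and in no out-group genome."""
    in_union = set().union(*ingroup.values())
    out_union = set().union(*outgroup.values())
    return in_union - out_union


# Hypothetical toy repertoires; labels are placeholders, not Pfam annotations.
metazoan_genomes = {
    "animal_1": {"dom_A", "dom_B", "dom_D"},
    "animal_2": {"dom_A", "dom_C", "dom_E"},
}
unicellular_genomes = {
    "protist_1": {"dom_A", "dom_B"},
    "protist_2": {"dom_A", "dom_C", "dom_F"},
}

gained = lineage_gains(metazoan_genomes, unicellular_genomes)

# Proportion implied by the counts cited in the text (235 gains, repertoire of 4978).
reported_fraction = 235 / 4978
```

A rule this simple cannot distinguish a true gain from an ancient domain lost in every sampled out-group genome, which is one reason broad taxon sampling matters for this kind of inference.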


Both approaches downplay protein innovation in the transition to multicellularity, as both protein and protein domain gains are rather average compared to other nodes of the eukaryotic tree of life (Fairclough et al. 2013; Zmasek & Godzik 2011; Cock et al. 2010). This global picture clearly stresses the importance of exaptation in evolutionary transitions. As François Jacob stated, evolution works as a tinkerer: old pieces are re-used for new functions (Jacob 1977). The ancient origin of the majority of the proteome highlights the importance of gene co-option in the emergence of evolutionary innovations, as shown extensively in the results section (Results R1, R2, R3, R4, R5 and R6). The ancient nature of protein domains makes them ideal candidates for molecular tinkering, as their constant recycling into new genes shaped metazoan genomes. Domain accretion was the term coined to characterize metazoan proteins, which are usually more prone to multi-domain architectures than those of other lineages. However, if the total number of domain combinations is normalized by the total number of domains present in a given genome, only fungi and embryophytes among eukaryotes seem to have lower rates of domain combinatorial richness than metazoans, holozoans and other bikonts (Zmasek & Godzik 2012). Analyzing a wide taxonomic dataset, Zmasek and Godzik found that 43% of the domain architectures restricted to a lineage belonged to holozoans, reinforcing the view of high total domain combination richness and extending it from metazoans to all holozoans. Nevertheless, domain fusion is very prone to homoplasy, which makes it difficult to identify general tendencies across long evolutionary times without looking closely at specific cases (Leonard & Richards 2012). For example, we have identified an increase in domain architectures in metazoans in the case of transcription factors, MAGUKs and GPCRs, but a lot of domain homoplasy in the RGS gene family (Results R1, R3, R6).
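
The normalization mentioned above can be made concrete with a small sketch. It assumes one simple way of measuring combinatorial richness: the number of distinct unordered domain pairs that co-occur in a protein, divided by the number of distinct domains in the genome. The exact measure used by Zmasek & Godzik may differ, and the proteomes and domain labels below are invented toy data.

```python
from itertools import combinations


def combination_richness(proteome):
    """Distinct co-occurring domain pairs per distinct domain.

    proteome: iterable of proteins, each given as a list of domain labels.
    """
    domains = set()
    pairs = set()
    for architecture in proteome:
        unique = sorted(set(architecture))
        domains.update(unique)
        # Sorting first makes each unordered pair appear in one canonical form.
        pairs.update(combinations(unique, 2))
    if not domains:
        raise ValueError("proteome contains no domains")
    return len(pairs) / len(domains)


# Hypothetical toy proteomes with the same four domains.
genome_rich = [["A", "B", "C"], ["A", "D"], ["B"]]
genome_poor = [["A"], ["B", "C"], ["D"]]

richness_rich = combination_richness(genome_rich)
richness_poor = combination_richness(genome_poor)
```

Both toy genomes contain the same four domains, so the difference in the index reflects only how often those domains are combined, not the size of the domain repertoire.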
When domains are analyzed just by their presence and abundance (independently of their role in multi-domain proteins), they reveal fundamental metazoan characteristics, such as enrichment in gene regulation, signaling and apoptosis (Results R4, Zmasek & Godzik 2011). Nevertheless, there is also an important component of domain loss in metazoan genomes (Figure 10). Metazoans have lost many domains involved in metabolic pathways, thereby simplifying their biochemical potential (Zmasek & Godzik 2011). This simplification may be explained by the emerging role of the microbiota associated with metazoans, which may compensate for the metabolic loss through symbiotic partners (Zmasek & Godzik 2011; McFall-Ngai et al. 2013).

Figure 10. Tendencies in the number of protein domains gained (a) and lost (b) in the evolutionary path from the origin of eukaryotes to the human lineage. From Zmasek & Godzik 2011.
[Figure 10 panels. y-axis: number of distinct protein domains; x-axis: ancestral nodes from LECA to the human lineage. Categories in (a): DNA repair; G-protein coupled receptor protein signaling pathway; cell surface receptor linked signal transduction; immune response; regulation of apoptosis; regulation of transcription; signal transduction. Categories in (b): carbohydrate metabolic process; cellular amino acid metabolic process; cofactor biosynthetic process; lipid metabolic process; nucleotide metabolic process; polysaccharide metabolic process; secondary metabolic process; vitamin biosynthetic process.]

The general tendencies described so far, though, are only the big picture, which sometimes does not reflect the underlying mechanistic evolutionary process. Coding genes are regulated at all levels, from expression to post-translational modification. Ultimately, it is the interaction between different proteins at a given moment in a particular cell that defines molecular systems. To understand how the basic cellular molecular machines that characterize metazoan multicellularity and development emerged, a detailed view of the evolutionary histories of discrete subsets of genes provides functionally significant evidence. Apoptosis, cell cycle, growth signaling, epithelial formation and neuronal specification are each characterized by an amalgam of genes with different evolutionary origins (Srivastava et al. 2010). Moreover, animal development is also characterized by genes of different origins: early and late development are dominated by newer genes, while middle development is characterized by ancient genes (Domazet-Lošo & Tautz 2010a). In the following sections I will discuss my results from the perspective of the evolution of cellular molecular machines or pathways, and their implications for the emergence of metazoan multicellularity from unicellular protozoans.

MAGUKs and the Post-Synaptic Density complex assembly

In the first part of the results section (Results R1) we analyzed the evolutionary history of MAGUK proteins. MAGUKs are involved in many functions, from cell signaling to cell adhesion, polarizing cells asymmetrically by scaffolding specific protein complexes in specific parts of the cell. MAGUKs share a core domain architecture formed by very ancient domains: Guanylate kinase, SH3 and PDZ. The three domains are already present in bacterial and archaeal proteins, but the particular domain architecture is a product of domain shuffling at the root of holozoans. After the assembly of the chimeric protein, an initial burst of duplication gave rise to several paralogous families. Each gene family then diversified from a common theme by incorporating new domains. Thus, an initial step of domain rearrangement followed by duplication, and neo-functionalization through further domain rearrangements, make up the evolutionary history of this gene family, which can be considered quite typical of metazoan proteins. Nevertheless, the roots of the gene family lie in the protozoan ancestors, as the four main lineages of MAGUKs (the MPP, DLG, MAGI and CACNB families) are encoded in unicellular holozoans. Moreover, choanoflagellates have evolved a lineage-specific repertoire of MAGUKs. Thus, metazoans re-used a subset of pre-metazoan MAGUKs for multicellular functions, not only conserving the plesiomorphic gene types but also diversifying new gene families at different time-points of metazoan phylogeny. Interestingly, many MAGUKs are involved in the post-synaptic density, a protein complex that forms a membrane region specialized in signaling plasticity in neurons (Figure 11A, Sakarya et al. 2007; Alié & Manuel 2010). Several proteins accompany MAGUKs in that complex, and most of them are found in metazoans that lack a proper nervous system, such as sponges and placozoans (Sakarya et al. 2007). Some of those proteins are co-expressed in the same cell type of the sponge larva, suggesting a somewhat coordinated protein complex before it was co-opted into a proper neuron (Sakarya et al. 2007).
But the post-synaptic scaffolding complex is not unique to animals, as choanoflagellates and C. owczarzaki also possess many of its components (Results R4, Alié & Manuel 2010). Most of the genes conserved in protists are scaffolding proteins such as MAGUKs, Shank or Homer. It is easy to imagine a set of scaffolding proteins being used to polarize signaling structures in a unicellular context and later being co-opted into a protein complex involved in the signaling of a very specialized cell type. Two further points about MAGUKs and the post-synaptic density complex are illustrative. First, the role of proteins in the unicellular context is not easily predicted: still unpublished experimental work on the choanoflagellate ortholog of Shank seems to challenge some views on its subcellular localization and binding partners (Pawel Burkhardt, unpublished work). Second, the presence of many components of a molecular system does not necessarily imply interaction between its parts. An analysis of co-expression patterns of the post-synaptic density complex and related synaptic functions in a transcriptomic dataset showed that, in sponges, most of these genes are not co-regulated as a system; only some small modular subsets are, which may work as small cellular machines (Figure 11B and 11C). In contrast, eumetazoans present a higher degree of co-regulation, as expected given the presence of a nervous system (Conaco et al. 2012). It would be interesting to follow up this kind of analysis in the transcriptomic datasets available for S. rosetta and C. owczarzaki (Fairclough et al. 2013; Sebé-Pedrós et al. n.d.). Although the post-synaptic density is based on protein-protein interactions, the cis-regulatory landscape alone points towards a step-wise assembly of its complex interactome. In a model of evolution that links gene origins to cell-type evolution, a group of proteins first originated and diversified in the unicellular ancestors, forming the building blocks of a scaffolding system. In a second step, those proteins worked as small cellular machines, not yet operating as a coordinated system in a neuron-less lineage. In a third step, the whole system was assembled into a multi-protein complex that defines a basic cell type, the neuron. This is a tinkering tale, in which different levels of organization emerge through evolution from previous pieces, with gene family, protein complex and functional system showing a nested evolutionary pattern.

Figure 11. A) Schematic representation of the synapse genes and their evolutionary origins. Colors correspond to the cladogram below. B) Network showing the co-expression of genes belonging to the neuronal gene repertoire in different metazoans. Genes with the same color belong to the same co-expression module (module size: blue > red > yellow > green). C) Pearson correlation of the co-expression of the synapse genes; an asterisk indicates significant correlation. Aqu (Amphimedon queenslandica), Ami (Acropora millepora), Cel (Caenorhabditis elegans), Dme (Drosophila melanogaster), Dre (Danio rerio), Xtr (Xenopus tropicalis). From Conaco et al. 2012.
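
The kind of summary shown in Figure 11C, an average Pearson correlation across the expression profiles of a gene set, can be sketched as below. This is an illustrative assumption about how such an average could be computed, not the procedure of Conaco et al. 2012; gene names and expression profiles are invented toy data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mean_pairwise_correlation(profiles):
    """Average Pearson correlation over all gene pairs in a gene set."""
    pairs = list(combinations(profiles.values(), 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Hypothetical expression profiles across four samples.
coordinated = {
    "gene_a": [1, 2, 3, 4],
    "gene_b": [2, 4, 6, 8],
    "gene_c": [1, 3, 5, 7],
}
uncoordinated = {
    "gene_a": [1, 2, 3, 4],
    "gene_b": [2, 4, 6, 8],
    "gene_c": [4, 3, 2, 1],
}

avg_coordinated = mean_pairwise_correlation(coordinated)
avg_uncoordinated = mean_pairwise_correlation(uncoordinated)
```

In the second toy set two genes still track each other perfectly, yet the set-wide average is low: a small co-regulated module can coexist with weak co-regulation of the system as a whole, which is the pattern described for sponges.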

The emergence of Tyrosine Kinases and the “Writer-Reader-Eraser” system

Tyrosine kinases were among the first developmental genes found in a protist, raising the question of what these proteins do in a unicellular context (King & Carroll 2001). Receptor Tyrosine Kinases (RTKs) are involved in many developmental pathways and are considered one of the key developmental signaling systems of Metazoa (Gerhart 1999; King 2004). Eukaryotic tyrosine kinases are not as widespread as their parental family, the Serine-Threonine kinases, which are much more abundant across eukaryotic genomes. The receptor type (RTK), which has a transmembrane motif, was considered exclusive to metazoans (Goldberg et al. 2006). However, the complete genome sequence of the choanoflagellate M. brevicollis revealed a very expanded RTK repertoire, richer than that of many metazoans (Manning et al. 2008; Pincus et al. 2008). Nevertheless, neither domain composition nor phylogenetic analysis of the kinase domain revealed orthology relationships to any metazoan type. In sponges the panorama changes, as most of the RTK families typical of bilaterians are conserved, together with extensive lineage-specific diversification (Srivastava et al. 2010). Our work on this protein family found a similar pattern in filasterean genomes, with an expanded RTK repertoire whose orthology relationships could not be traced to any metazoan or choanoflagellate RTK family, nor even between C. owczarzaki and M. vibrans (Results R3). Even between M. brevicollis and S. rosetta, only 20% of the RTK families are conserved in both genomes (Fairclough et al. 2013). The predominance of lineage-specific diversification in this family suggests a species-specific use, probably in sensing the environment. Tyrosine kinase signaling is very specific in its substrate, the phosphorylation of tyrosine residues, and does not interfere with the more common Serine-Threonine phosphorylation.
The SH2 domain, which binds phosphorylated tyrosine residues, and Phospho-Tyrosine Phosphatases (PTPs), which remove phosphate from tyrosine residues, form together with TKs a “Writer-Reader-Eraser” system (Figure 12, Lim & Pawson 2010). Counterintuitively, both SH2 and PTP domains are found in organisms with no tyrosine kinases. However, some serine/threonine kinases have evolved the capacity to phosphorylate tyrosine residues, with reported examples in Dictyostelium discoideum (Goldberg et al. 2006). This partly explains the pre-tyrosine-kinase distribution of those protein domains; in addition, the deleterious effects of stochastic tyrosine phosphorylation are reduced by the presence of PTPs. Nonetheless, choanoflagellates and metazoans clearly expanded the whole “Writer-Reader-Eraser” system (Liu et al. 2011). Our results in C. owczarzaki also support this correlation, with an expansion of SH2 and PTP that is more modest than in choanoflagellates but much greater than in Fungi (Results R4). We also show that holozoans acquired the phosphotyrosine binding (PTB) domain, an extra “reader” in tyrosine signaling. The genome sequence of the amoebozoan Acanthamoeba castellanii yielded consistent results, as there seems to be an independent acquisition of RTKs and a moderate increase in SH2 and PTP domains compared to other amoebozoans, although the classification of A. castellanii RTKs as metazoan-type RTKs is not well supported (see Results R3). If the RTKs of A. castellanii are homologous to those of holozoans, Fungi must have lost this signaling pathway, and its absence in chytrid fungi and nucleariids (the sister-group to Fungi, see Figure 6) indicates that this would be an ancient loss. The emergence of tyrosine kinase signaling allowed a new phosphorylation code in metazoans, which increased the number of different phosphorylation signaling pathways able to work at the same time in the same cell type.
In a recent report, the level of tyrosine phosphorylation in the basal metazoan Trichoplax adhaerens was shown to be much higher than expected, around 9% of the phosphoproteome, which is 3-fold the average of the bilaterians analyzed so far (Figure 12, Ringrose et al. 2013). The authors hypothesize that in its initial steps the phosphotyrosine signaling system was much more widespread in metazoan proteomes, and became further constrained in more complex lineages. The loss of tyrosine residues seems to be under positive selection along metazoan evolution, avoiding the deleterious effects of spurious phosphorylation and suggesting a more tightly controlled signaling system (Tan et al. 2009). Ongoing phosphoproteomics projects on unicellular lineages will either validate the hypothesis of an initial burst followed by constraint and reduction, or reveal the T. adhaerens case to be a species-specific expansion.

Figure 12. A) Bar chart depicting in blue the percentage of phosphorylated tyrosines in the analyzed proteomes, and in red the percentage of tyrosine among all the amino acids of the proteome. B) Bar chart depicting in blue the ratios between readers (SH2-containing proteins) and writers (tyrosine kinases), and between erasers (PTP-containing proteins) and writers (TKs). Figures adapted from Ringrose et al. 2013, with new data courtesy of Hiroshi Suga.
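
The ratios plotted in Figure 12B can be expressed as a short calculation. The sketch below assumes that each ratio is simply the number of proteins containing the reader (SH2) or eraser (PTP) domain divided by the number of tyrosine kinases in the same genome; the function name and the counts are hypothetical and are not data from the figure.

```python
def reader_eraser_ratios(n_writers, n_readers, n_erasers):
    """Return (reader/writer, eraser/writer) ratios for one genome.

    Returns None when the genome has no tyrosine kinases, since the
    ratios are undefined without writers.
    """
    if n_writers == 0:
        return None
    return n_readers / n_writers, n_erasers / n_writers


# Hypothetical counts, for illustration only.
toy_expanded = reader_eraser_ratios(n_writers=100, n_readers=120, n_erasers=40)
toy_no_tk = reader_eraser_ratios(n_writers=0, n_readers=3, n_erasers=5)
```

The explicit None case reflects the point made in the text that SH2 and PTP domains also occur in organisms without tyrosine kinases, for which these ratios cannot be computed.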

Overall, TKs originated and diversified in a unicellular context, at the same time as their signaling associates, forming a cellular molecular machine that allows a new kind of signaling within the cell. This new signaling system was quite promiscuous in the early steps of metazoan evolution, but became more constrained as the group evolved into more complex forms. Many specific RTK families became tightly linked to particular organs and cell behaviours, being completely integrated into developmental programs. We thus hypothesize that the transition from a signaling system exposed to the medium in unicellular holozoans to secreted ligand-receptor pairs within the organism in metazoans allowed the fixation and conservation of particular gene families. Nevertheless, it is noteworthy that RTKs never ceased to expand in eumetazoan genomes through further lineage-specific diversifications, showing that TKs are still plastic and prone to diversify (D’Aniello et al. 2008).

Signaling pathways and bimodal evolution: cytoplasmic conservation versus extracellular divergence

Of the seven signaling pathways considered to be the main players in metazoan development, only RTKs are found in unicellular holozoans. Notch, TGF-Beta, WNT, Hedgehog, JAK/STAT and hormone receptors are not present in unicellular species in their canonical metazoan form (Results R4), although the sequencing of 16 choanoflagellate transcriptomes has provided interesting results regarding both the Notch and TGF-Beta pathways (Daniel Richter, personal communication). Nevertheless, many other signaling pathways present in metazoans are involved in numerous processes and are sometimes mis-assigned as non-developmental. A look at the foundational paper of John Gerhart (1999) reveals many signaling systems that have in fact been found in non-metazoans, such as Target of Rapamycin (Tor) (Shertz et al. 2010), Ca2+ channels (Cai & Clapham 2011), Cadherins (Nichols et al. 2012), Integrins (Sebé-Pedrós et al. 2010), Akt (Results R4) or Hippo/Warts (Sebé-Pedrós et al. 2012). Within this group, G Protein Coupled Receptors (GPCRs) represent an interesting case study, as they are known to be conserved in eukaryotes and have been functionally characterized in several model systems outside metazoans, such as fungi, D. discoideum and plants. Most GPCR types indeed have their origins in the Last Eukaryotic Common Ancestor (Results R3). Nevertheless, a closer look retrieved few genes in unicellular holozoans or elsewhere that are orthologous to metazoan families, either structurally (in terms of domain architecture) or phylogenetically (Results R3). We therefore checked the whole associated signaling system, including G proteins, RGS, Ric8, GRK, Arrestins and Phosducins. Among the transduction machinery, many elements were conserved in unicellular holozoans, even at the level of specific gene families. Specific GPCRs are linked to specific combinations of intracellular signaling transducers (most importantly, to combinations of different G alpha, beta and gamma subunits).
The presence of most of those elements in a unicellular context indicates that most of that combinatorial potential was already in place in unicellular holozoans, and that metazoans expanded their GPCRs largely by plugging a new array of receptors into ancient machinery. GPCRs and tyrosine kinases show parallel stories: pre-metazoan origins and diversification but scarce orthology in the case of receptors, and a very conserved cytoplasmic repertoire, mainly involved in signal transduction. This bimodal evolutionary pattern fits well with what has been shown for other signaling pathways. Hippo/Warts is basically a signal transduction pathway, coupled to extracellular receptors that are absent in unicellular holozoans (Sebé-Pedrós et al. 2012). Akt and Tor are also transduction pathways, as are many of the MAPK kinases identified in the C. owczarzaki genome (Results R4). Moreover, the cadherin subtypes found in the unicellular choanoflagellates are of unknown function and lack the β-catenin module, and most classical cadherins are unique to metazoans (Nichols et al. 2012). Notch signaling has some cytoplasmic elements of ancient origin that are present in unicellular choanozoans (Gazave et al. 2009), including a receptor-less form of Notch (Results R4). Hedgling is also a conserved element between choanoflagellates and early-branching metazoans, in which the Hedgehog signaling domain is present in a cadherin protein, but there is nothing like the secreted version of Hedgehog typical of eumetazoans (Adamska et al. 2007; King et al. 2008). The Toll pathway signals through the transcription factor NF-κB, present in C. owczarzaki, but the receptors and some other components are missing in unicellular holozoans (Gilmore & Wolenski 2012). From this evidence we hypothesize that receptors and proteins exposed to the extracellular space represent one of the most drastic differences between animals and their unicellular ancestors. Receptors, originating from duplication, domain rearrangement or de novo origin, were plugged into an already present transduction toolkit. Signaling in metazoans did not involve a very dramatic process of innovation in terms of protein domains; new genes mainly originated by re-use of elements already present in the unicellular ancestors. The convergent use of EGF, Sushi, Fibronectin-3 and other protein domains in the extracellular region of non-orthologous RTKs of both metazoans and unicellular holozoans clearly illustrates this point (Results R3). Most of that extracellular signaling system, as stated in the previous section, was internalized into a controlled environment, the organism itself. The co-evolution of ligands and receptors stabilized many of those signaling pathways, making them prevalent and recurrently used in developmental processes.
Conversely, some receptor families involved in sensing the environment have fast evolutionary rates, being mainly lineage-specific or even species-specific and very prone to duplication, such as the mammalian or the insect odorant receptors (both GPCR families). Integrins are an interesting exception to this pattern, as the structure of both integrin alpha and beta is almost completely conserved throughout the Obazoa (Breviates, Apusozoans and Opisthokonts) (Sebé-Pedrós et al. 2010; Brown et al. 2013). Their role in unicellular protists is still a mystery. In metazoans they are known to bind the extracellular matrix in focal adhesion complexes, and their role is a mixture of signaling and adhesion, as they connect the actin cytoskeleton to the extracellular environment. Integrins seem to be conserved in those protist species that retain amoeboid movement, such as the apusozoan Thecamonas trahens, the breviate Pygsuia biforma, ichthyosporeans and filastereans, and lost in those that do not, including fungi and choanoflagellates. The sequencing of the transcriptome of C. owczarzaki suggests a role for integrins in the formation of the aggregative stage in that species, in which the integrin adhesome is significantly upregulated (Sebé-Pedrós et al. n.d.). Most surprisingly, many proteins with domains typical of the metazoan extracellular matrix are also over-represented in that stage, suggesting a putative role for integrins in adhesion to the secreted matrix that C. owczarzaki shows in the aggregative stage. In the case of the integrin adhesome, co-regulation of the whole system tells us that the cellular molecular machine was operationally in place before the transition to multicellularity.

Transcription factors and the emergence of Gene Regulatory Networks

The bimodal evolution of extracellular and cytoplasmic proteins stated above is also supported by the conservation of the ultimate downstream effectors of signaling, the transcription factors (TFs). Many metazoan TFs were missing in the choanoflagellate genomes, suggesting that many TF families were innovations of metazoans and were the key to understanding the transition to multicellularity (King et al. 2008; Rokas 2008). Our study on C. owczarzaki first, and on several other unicellular holozoans later, revealed a quite complex repertoire of metazoan-type TFs in unicellular holozoan lineages (Results R5 and R6). Many of the transcription factors usually associated with signaling pathways, for example STAT or CSL, are present in holozoans even though the signaling pathway itself is not there (Results R4). Overall, many of the TFs found in unicellular holozoans are responsible for homeostatic functions of the cell in metazoans, for example control of cell proliferation (e.g. Myc, Runx, TEAD), stress response (p53, NF-κB) or metabolic processes (SREBP, Mlx/MondoA). Congruently, cell proliferation control and stress response gene repertoires have similar ages to the TFs involved, as shown in more systematic studies (Domazet-Lošo & Tautz 2010b; Zmasek & Godzik 2013). Missing were those TFs involved in cell differentiation and patterning of metazoan body plans, such as bHLH group A, Homeobox genes belonging to the ANTP and Paired classes, Forkhead type I and Sox genes (Results R5). Thus, many TF types involved in the development of metazoans were invented in later phases of metazoan evolution, through duplication and retention of previously present domains. Moreover, some TF types that also have important roles in development were invented completely de novo, such as Ets, hormone receptors or Double-sex.
To test how TF origin influenced its role in development, we classified the TFs according to their phylogenetic origin and followed their expression patterns along the development of two metazoan model species (D. rerio and D. melanogaster). The results indicate that TF ages are important in different parts of development. Interestingly, genes of metazoan origin are predominant in gastrulation (Results R6). Another interesting feature of the transition to multicellularity regarding TF types is the shift of the TFome profile. As stated above, many TF types increased their abundance relative to others in metazoan genomes, such as HMGbox, Homeobox, Forkhead, zf-C2H2 or T-box, all involved in cell differentiation and developmental patterning (Results R6). Unicellular holozoans, despite being paraphyletic and despite the many secondary losses in choanoflagellates, share a common profile with respect to metazoans. Sponges lie in between, having a TFome profile intermediate between those of unicellular holozoans and eumetazoans. The system-level change of the whole TFome may have been made possible by the acquisition of multicellularity, as many of those TF types are not very prone to diversify their binding sites (Jolma et al. 2013). To avoid spurious binding specificities, many TF types evolved new ways to bind DNA through the acquisition of new domains, sometimes through protein-protein interactions or changes in binding specificity (Results R6). The increase of heterodimeric TF families in metazoans may also be related to the need to increase combinatorial outputs to avoid transcriptional noise (Results R5; Amoutzias et al. 2004). When bZIP dimeric interactions were screened in a high-throughput approach, it appeared that yeast and choanoflagellates have more homodimeric genes than metazoans, showing a pattern of increasing combinatorial capacity in multicellular species (Reinke et al. 2013).
Another adaptive explanation for the increase of certain TF types lies in the robustness requirements of complex regulatory networks. Supporting that hypothesis, it has been shown that endogenous genes of the same TF type are able to rescue wild-type phenotypes when a paralog is knocked down, overcoming stochastic failure (Burga et al. 2011). But most likely, it was the emergence and fixation of complex gene regulatory networks related to development that changed and expanded the TFome profile in metazoan genomes.
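The classification of TFs by phylogenetic origin and the tracking of their expression along development, described above, can be illustrated with a minimal sketch. It is not the pipeline used in Results R6: the function name, the origin labels and the toy expression values are all hypothetical, and the sketch only computes, for each stage, the share of total TF expression contributed by each origin class.

```python
from collections import defaultdict


def expression_share_by_origin(tf_origin, expression_by_stage):
    """Return, per stage, the fraction of total TF expression per origin class.

    tf_origin: dict mapping TF name -> origin class (hypothetical labels).
    expression_by_stage: dict mapping stage -> {TF name: expression level}.
    """
    shares = {}
    for stage, expression in expression_by_stage.items():
        totals = defaultdict(float)
        for tf, level in expression.items():
            if tf in tf_origin:  # ignore genes that were not classified
                totals[tf_origin[tf]] += level
        grand_total = sum(totals.values())
        shares[stage] = {
            origin: (value / grand_total if grand_total else 0.0)
            for origin, value in totals.items()
        }
    return shares


# Toy data, invented purely for illustration.
tf_origin = {"tf_a": "pre-metazoan", "tf_b": "metazoan", "tf_c": "metazoan"}
expression_by_stage = {
    "gastrulation": {"tf_a": 10.0, "tf_b": 20.0, "tf_c": 10.0},
    "adult": {"tf_a": 30.0, "tf_b": 5.0, "tf_c": 5.0},
}
shares = expression_share_by_origin(tf_origin, expression_by_stage)
```

In this toy input the metazoan-origin class dominates the hypothetical gastrulation stage, which is the kind of pattern the comparison is designed to detect; real data would need normalization across stages before such shares are compared.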

Some TF families present in unicellular holozoans have tissue-specific patterns in their metazoan counterparts; for example, Brachyury is a marker of gastrulation in many lineages (Scholz & Technau 2003), and Grainyhead is a marker for epithelial epidermis (Traylor-Knowles et al. 2010). The downstream regulatory networks that those TFs govern in unicellular holozoans may be very similar to those of metazoans, having conserved or similar roles in both unicellular and multicellular contexts (e.g. Brachyury governing cell movement, Grainyhead governing cell-wall formation). Alternatively, their function may have changed drastically. Our hypothesis is that TFs in a unicellular context have shallower regulatory networks in terms of their connection to downstream effector genes. If this is true, the functional linkage between a given TF and a cellular function should be more volatile in evolutionary terms, as has been shown in yeast (Hogues et al. 2008; Thompson et al. 2013). It is known that gene regulatory networks can be very plastic in terms of genotype, through the neutral turnover of connections between TFs and their downstream targets, while conserving a very robust phenotype, namely the same expression patterns with different TFs involved (Wagner 2011). We hypothesize that it is the emergence of metazoan development that made those networks more complex, by a process of intercalary evolution (Davidson & Erwin 2006). In multicellular development, the series of iterated cell divisions followed by cell-differentiation processes need to have specific spatial and temporal coordinates. Thus TFs are plugged into the cis-regulation of other TFs, fuelling the emergence of more hierarchical regulatory networks, with intermediate steps between higher-ranking TFs and downstream effector genes.
The progressive embedding of TFs into highly connected regulatory networks increases the number of pleiotropic effects, making possible mutations more deleterious and fixing regulatory networks. Fixation of regulatory networks is what made some TFs master regulators of specific functions, as in the famous cases of Pax6 and eye formation or the Hox complex in anteroposterior patterning, frozen accidents of evolution conserved by developmental constraints. To escape from pleiotropy, TFs have to duplicate and diverge fast to avoid deleterious effects when acquiring a new domain of expression (Chen & Rajewsky 2007). But duplication and neo-functionalization can mobilize already established regulatory networks into new contexts and domains of expression. The conservation of gene regulatory networks is, on the one hand, constrained, but on the other hand flexible, allowing rapid co-option of regulatory networks. This duality fits with the hourglass model of development (Figure 13), in which early and late developmental stages are more prone to acquire newer genes, as they are more flexible and under the scope of natural selection, while the core of development is much more constrained, conservative and dominated by older genes (Kalinka & Tomancak 2012). In consonance with the “phylotypic stage” hypothesis, our results on the peak activity of transcription factors in the phylotypic stage of both fruit fly and zebrafish reveal the major role of regulatory networks at that developmental point (Results R6).

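[Figure 13 panels: transcriptome divergence in fruit fly; expression correlation among mouse, zebrafish and frog; transcriptome age in zebrafish (ancient versus recent); expression divergence between Xenopus laevis and Xenopus tropicalis. Stages indicated: cleavage, blastula, pharyngula; the phylotypic period is marked with asterisks.]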

Figure 13. Genomic evidence for the hourglass model and the phylotypic stage. The first panel depicts the divergence of the transcriptomes of Drosophila species along development, where the phylotypic period is the least divergent. The second shows the correlation of expression levels of orthologous genes between mouse, fish, frog and chicken, with a significant Pearson correlation at the pharyngula stage. The third shows the Transcriptome Age Index along zebrafish development, where older genes predominate in the phylotypic stage. The fourth shows the divergence of orthologous gene transcription between two different frog species, with major divergence in early development. From Kalinka & Tomancak 2012.
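The Transcriptome Age Index mentioned in the caption is commonly defined as the expression-weighted mean of gene ages: for a stage s, TAI_s = sum_i(ps_i * e_is) / sum_i(e_is), where ps_i is the phylostratum rank of gene i (1 being the oldest) and e_is its expression at stage s. The following is a minimal sketch of that formula, assuming this standard definition; the gene names, ranks and expression values are invented for illustration and do not come from the cited studies.

```python
def transcriptome_age_index(phylostrata, expression_by_stage):
    """Compute the expression-weighted mean phylostratum for each stage.

    phylostrata: dict mapping gene -> phylostratum rank (1 = oldest).
    expression_by_stage: dict mapping stage -> {gene: expression level}.
    Lower values indicate a transcriptome dominated by older genes.
    """
    tai = {}
    for stage, expression in expression_by_stage.items():
        genes = [g for g in expression if g in phylostrata]
        total = sum(expression[g] for g in genes)
        if total == 0:
            raise ValueError(f"no expression recorded for stage {stage!r}")
        weighted = sum(phylostrata[g] * expression[g] for g in genes)
        tai[stage] = weighted / total
    return tai


# Toy data, invented purely for illustration.
phylostrata = {"old_gene": 1, "mid_gene": 2, "young_gene": 3}
expression_by_stage = {
    "early": {"old_gene": 10.0, "mid_gene": 10.0, "young_gene": 20.0},
    "mid": {"old_gene": 30.0, "mid_gene": 10.0, "young_gene": 0.0},
    "late": {"old_gene": 10.0, "mid_gene": 10.0, "young_gene": 20.0},
}
tai = transcriptome_age_index(phylostrata, expression_by_stage)
oldest_stage = min(tai, key=tai.get)
```

With these toy values the middle stage has the lowest index, mimicking the hourglass shape in which the phylotypic period expresses the oldest genes.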

Genomic architecture and early metazoan evolution

Most of the evolutionary histories in this thesis, and the general patterns seen in the first section of the discussion, are centered on the evolution of the metazoan proteome. Gene family expansion by gene duplication has been a common leitmotiv, but are metazoan proteomes real outliers among eukaryotes in terms of protein numbers? In what has been called the G-value paradox, complex multicellularity does not correlate with the total number of protein coding genes (Lozada-Chávez et al. 2011). Vertebrates have undergone two rounds of whole genome duplication (even three in the case of teleosts), and harbor the richest proteomes among metazoans, with around 20,000-25,000 protein coding genes (with the exception of the 49,000 protein coding genes of the asexual tetraploid rotifer Adineta vaga (Flot et al. 2013)). Embryophytes surpass metazoan numbers; nevertheless they are also multicellular and very prone to whole genome duplications. But some exceptional unicellular species clearly contradict the G-value correlation: the haptophyte Emiliania huxleyi has ~30,000 protein coding genes, the ciliate Paramecium tetraurelia has ~40,000 and the dinoflagellate Symbiodinium minutum has ~42,000 (Aury et al. 2006; Read et al. 2013; Shoguchi et al. 2013). Similarly, the C-value paradox shows that genome size does not correlate with multicellularity, being mostly influenced by introns, transposable elements and non-coding DNA (Lozada-Chávez et al. 2011). The bioenergetics of eukaryotes allows huge genomes compared to prokaryotes, and many genome properties are, in fact, most likely explained by non-adaptive processes related to small population sizes (Lane & Martin 2010; Lynch & Conery 2003). But some of those properties, although they can be seen as spandrels of genome evolution, were later co-opted for potentially adaptive functions. Here I will discuss some aspects of metazoan genome architecture that may be informative about the transition to multicellularity, such as intron density, alternative splicing, non-coding RNA and non-coding intergenic regions. High intron density characterizes metazoan genes, and introns are usually longer than in other eukaryotic lineages. When the origin of those introns is assessed from a phylogenetic perspective, it seems that most of them appeared in two bursts of evolutionary novelty: one at the last common ancestor of choanoflagellates and metazoans, and a second at the emergence of metazoans (Csuros et al. 2011). Interestingly, C. owczarzaki has fewer introns than the inferred ancestor of Opisthokonts, so its low intron density is possibly a secondarily derived simplification.
The emergence of many of those introns may be older, as ichthyosporean genomes are as intron rich as those of choanoflagellates (unpublished work). Intron density seems to be a non-adaptive genomic property, but it may facilitate evolutionary processes such as protein domain shuffling. The higher the intron density, the more likely it is that non-homologous recombination gives rise to new functional coding regions with new domain architectures by exon shuffling. The presence of intron-rich genes also allows more combinations of alternative splicing events. In most unicellular eukaryotes, alternative splicing is based on intron retention, where non-spliced mRNA is stored or degraded for regulatory purposes, and does not imply new protein configurations (McGuire et al. 2008). But exon skipping allows many functionally divergent isoforms from a single gene, allowing a more diverse proteome without the emergence of many new genes (Romero et al. 2006; Nilsen & Graveley 2010). Alternative splicing through exon skipping can be a tightly regulated layer of protein diversity, in contrast to the huge proteomes seen in protistan species. As an important exception, the rhizarian Bigelowiella natans has massive alternative splicing, with as much exon skipping as the central nervous system of vertebrates (Curtis et al. 2012). Additionally, we have found that metazoans have richer RNA-binding repertoires than their unicellular relatives (Results R4; Kerner et al. 2011). RNA-binding proteins, similarly to transcription factors, can recognize variable splicing sites, adding an extra layer of regulatory complexity in metazoan genomes (Ray et al. 2013). Another feature that may be characteristic of metazoan genomes is the presence and use of non-coding RNAs as regulatory elements. The most studied are microRNAs, a type of small non-coding RNA that mediates gene downregulation by promoting mRNA degradation. There are microRNAs in plants and animals, and microRNA-like RNAs in fungi, but their common origin is not clear, though some of the synthesis machinery is shared (Lozada-Chávez et al. 2011; Chen & Rajewsky 2007). Current sampling reveals that more complex metazoans tend to have more complex miRNA repertoires than simpler phyla (Erwin et al. 2011). No comparison can be made to unicellular holozoans, as neither the sequenced choanoflagellates nor C. owczarzaki have microRNA synthesis machinery, both cases being lineage-specific secondary losses (Results R4; Grimson et al. 2008). Little evolutionary information is available about other types of non-coding RNAs, such as long non-coding RNAs. But recent reports support their pervasive role in gene regulation, and they seem to be more abundant than previously thought, even more abundant than protein coding genes.
Although non-coding RNAs seem to have small phenotypic effects, they could provide robustness by buffering transcriptional noise (Hornstein & Shomron 2006). Long intergenic regions characterize metazoan genomes (Levine & Tjian 2003). Once considered junk DNA, their functional significance is still a matter of debate. Recent reports from the ENCODE project suggest that almost 80% of the genome, most of it non-coding, is functional (The ENCODE Project Consortium 2012). Intergenic regions bear cis-regulatory information for the nearby genes: enhancers, insulators, promoters and other elements. An interesting correlation exists between intergenic distance and complex gene regulation in the genomes of D. melanogaster and C. elegans: the more complex the regulation of a gene (specific spatial and temporal patterns), the longer its intergenic region, compared to house-keeping genes such as ribosomal proteins, which are expressed in all cells. Most of those genes with complex regulation are members of signaling pathways and transcription factors important during development (Nelson et al. 2004). We decided to test this hypothesis from an evolutionary perspective, taking the same categories analyzed in Nelson et al. 2004 and plotting the intergenic distance relative to the median of the genome for various species (Figure 14). Interestingly, our results show that many of those functional categories have genes with significantly larger spacer DNA in all metazoan genomes, including early-branching metazoans such as T. adhaerens and N. vectensis. This difference, as in the case of ecdysozoans, is more acute in 5’ intergenic distances, where most regulatory elements are. What we did not expect is the conservation of this pattern in unicellular species, including unicellular holozoans, fungi and D. discoideum. Therefore it seems that regulatory complexity for those Gene Ontologies is not an innovation of metazoan genomes, but an ancestral feature of eukaryotes. Nevertheless, the enlargement of intergenic regions relative to the mean of the organism is greater in metazoans, and more GOs show significant deviation from the genomic mean. The genomic architecture of metazoans is shaped by functional information, but this trend is more similar to that of unicellular eukaryotes than expected.

[Figure 14 panels: C. elegans and C. owczarzaki. Number of genes per category, in the order All, Ribos, Metab, Carbo, Recep, Differ, TF, Comm, Signal. C. elegans: n = 19947, 134, 4172, 404, 606, 443, 533, 123, 743. C. owczarzaki: n = 8579, 165, 2907, 217, 63, 108, 96, 102, 519.]

Figure 14. Box plots representing average intergenic distance in the genomes of C. elegans and C. owczarzaki. In red the 5’ intergenic regions, in blue the 3’ intergenic regions. Genes are classified according to Gene Ontologies (Ribos: Ribosomal, Metab: Metabolism, Carbo: Carbohydrate metabolism, Recep: Receptors, Differ: Cell differentiation, TF: Transcription Factors, Comm: Cell-cell communication, Signal: Signaling). The asterisk reveals statistically significant expansion relative to the rest of the genome. N= total number of genes belonging to the GO category in the given genome. Figure from Supplementary Material of Results R4.
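The comparison behind Figure 14, intergenic distance per functional category relative to the genome-wide median, can be sketched as follows. This is only an illustration under stated assumptions: the function name, gene identifiers, distances and category assignments are hypothetical, and the statistical test used to mark significant expansions in the figure is not reproduced here.

```python
from statistics import median


def relative_intergenic_distance(distances, categories):
    """Return, per category, the median intergenic distance divided by
    the genome-wide median.

    distances: dict mapping gene -> intergenic distance in bp
               (for example the 5' distance to the neighbouring gene).
    categories: dict mapping category name -> list of gene identifiers.
    A value above 1 means the category has longer intergenic regions
    than the typical gene in that genome.
    """
    genome_median = median(distances.values())
    ratios = {}
    for category, genes in categories.items():
        values = [distances[g] for g in genes if g in distances]
        if values:  # skip categories with no annotated genes
            ratios[category] = median(values) / genome_median
    return ratios


# Toy data, invented purely for illustration.
distances = {"g1": 100, "g2": 200, "g3": 300, "g4": 1200, "g5": 1500}
categories = {"Ribos": ["g1", "g2"], "TF": ["g4", "g5"]}
ratios = relative_intergenic_distance(distances, categories)
```

Normalizing by the genome median is what makes the ratios comparable between compact and expanded genomes, which is the point of plotting several species side by side.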

Another feature of genomic architecture typical of metazoans is the conservation of gene synteny. Macro-syntenic blocks are sets of orthologous genes present in the same chromosomal segments in their respective genomes. They are conserved from sponges to human, showing that most metazoans have had low rates of interchromosomal translocation, which has maintained linkage groups over millions of years (Figure 15A; Srivastava et al. 2010; Putnam et al. 2007). But most interestingly, there are also highly conserved micro-syntenic blocks, small groups of genes that are retained next to each other in the genomes of very distant species (Irimia et al. 2012; Simakov et al. 2013). Among the genes that have that configuration, some have related functions, such as histone clusters, but others have developmental functions, such as the Hox cluster (Irimia et al. 2012). The conservation of physical linkage of gene clusters among distant lineages has functional implications, as those genes share regulatory information and are maintained together by purifying selection. Intriguingly, developmental genes are over-represented among conserved micro-syntenic blocks. The case of Genome Regulatory Blocks is special, as the developmental gene, with complex spatiotemporal expression, is attached to a non-developmental gene that simply acts as a bystander, bearing in its gene body and introns regulatory information for the nearby developmental gene while having a completely unrelated expression pattern (Figure 15B). Metazoan micro-syntenic blocks would thus result from complex developmental regulation, which requires many cis-regulatory modules that constrain the evolution of gene order. Confirming this hypothesis, few micro-syntenic blocks are conserved in unicellular species, and those that are involve only genes with bi-directional promoters or similar functions unrelated to development (Irimia et al. 2013; Dávila López et al. 2010).
Overall, metazoan multicellularity and embryonic development influenced genomic architecture, softening the completely neutralist view on the subject.

[Figure 15 panels: A) Human chromosomes (Chr. 2, Chr. 7, Chr. 10, Chr. 12, Chr. 17) carrying HOXA, HOXB, HOXC and HOXD, compared with N. vectensis scaffolds. B) Genome Regulatory Block: developmental gene and bystander gene under co-regulation. Key: cis-regulatory element, exon.]

Figure 15. A) Macrosynteny between the N. vectensis scaffolds and Human chromosomes around the Hox cluster. From Putnam et al. 2007. B) Microsyntenic gene associations in metazoan genomes. From Irimia et al. 2013.

The protistan perspective and the pre-adapted genome

To what extent has the protistological perspective helped our understanding of the transition to multicellularity? Multiple competing hypotheses could have explained the evolutionary transition to metazoans. From a completely adaptive perspective, massive gene innovation would have been required. In contrast, under a completely neutral hypothesis, no abrupt change in gene repertoire would have occurred. From the many sources of genomic evolution evaluated in this work, the resulting picture is more complex, having elements of both perspectives. Comparative genomics allows us to test the different origins of the many landmarks of metazoan multicellularity. Gene innovation was certainly an important source of adaptation to the multicellular life-style, but a large part of that de novo gene gain was due to re-use and shuffling of ancient protein domains or to massive duplications of previously existing gene families. The co-option of many genes, mostly unchanged from unicellular to multicellular genomes, has also gained importance thanks to our work on unicellular holozoans. As a general pattern, we find a broad genomic pre-adaptation to the evolutionary transition under study. The roots of many of the elements that configure extant multicellular organisms originated and evolved in unicellular lineages, smoothing the abrupt change that a completely new life-style theoretically implies. But protists have not only challenged our views on the genomic content of metazoans, but also on the possible roles of that content before the origins of multicellularity. The life cycles of unicellular holozoans (explained in the introduction) are much more complex than previously thought, most of them having some sort of multicellular stage. More interestingly, the emerging data on their genomes in action clearly tell us that processes such as cell differentiation or adhesion are not so different between protists and metazoans. Many developmental genes are differentially expressed in the life cycle of the colonial choanoflagellate S. rosetta, while the whole integrin adhesome seems to have a role in the aggregative stage of C. owczarzaki (Fairclough et al. 2013; Sebé-Pedrós et al. n.d.). Information is still needed on the gene expression changes involved in colony formation in ichthyosporeans, but most likely it will also reveal some pre-metazoan use of the multicellular toolkit. The differential transcriptomes of many of those species have to be complemented by other sources of whole genome information, such as proteomics or epigenomics, to gain further insights into the way these metazoan relatives regulate and use their genomes. Moreover, the recent advances in the transformation of unicellular holozoans open the door to the detailed characterization of certain important proteins (Suga & Ruiz-Trillo 2013). If many of the genes important for multicellularity already had complex temporal expression patterns and interactomic networks, the integration of that genetic system into a multicellular entity is not so drastic, though obvious cis-regulatory changes must have occurred.
This line of thinking has been formalized by the revamping of the “synzoospore theory”, which predicts the co-option of the complex life cycles of unicellular species into the formation of an originally complex and differentiated multicellular urmetazoan (Mikhailov et al. 2009). This theory contradicts the more mainstream “gastrea” hypothesis, based on the simple origins of undifferentiated multicellularity followed by the gain of new cell types in the very early steps of metazoan evolution (Nielsen 2008). The nature of pre-adaptation, and especially of molecular pre-adaptation, is in consonance with other findings. For example, the origin of muscles seems to have been preceded by the ancient origins of many of the genes involved in their formation, in unicellular species or in muscle-less animals such as placozoans and sponges (Steinmetz et al. 2012). Genes such as the striated-type myosin II seem to have contractile functions already in sponges. To what extent, though, were the unicellular holozoans pre-adapted to multicellularity? Now it is clear that many tools were there, some of them performing similar functions to their metazoan counterparts. But unfortunately we cannot know if the unicellular prehistory of metazoans predicted its multicellular output. Somehow the enhanced phagotrophic lifestyle of choanoflagellates (if something like a choanoflagellate was the ancestor of choanoflagellates and metazoans) enabled the transition to the only phagotrophic complex multicellular lineage present today (Cavalier-Smith 2012). Apart from the issues of predictability, are the genomic changes observed in metazoans due to their developmental features? Does the development of a multicellular entity from a unicellular bottleneck involve constraints, and has it shaped the genome in parallel among different eukaryotic groups? The case of plants is the best studied, telling us that there are many similarities in the developmental logic of both groups (Meyerowitz 2002). Many of the genes involved in embryophyte multicellularity also have ancient origins, some of them being present in their unicellular ancestors. The transcription factors and signaling pathways that govern plant development have evolutionary histories similar to those of metazoans (Floyd & Bowman 2007), and plants also have miRNAs and hints of genomic synteny (Axtell & Bowman 2008; Tang et al. 2008). We have less information on phaeophytes and rhodophytes, because few developmental studies have been done in those groups, and most of the genomic repertoire of both lineages is functionally unknown; in the case of the only red alga with a genome available, half of its genome has no detectable homology to any other eukaryote (Cock et al. 2010; Collen et al. 2013).
Only the analysis of more genomes on both sides of the borders of multicellularity across the eukaryotic tree of life, together with general functional information based on the powerful tools provided by functional genomics, will reveal the key steps towards a full understanding of the transition to multicellularity in eukaryotes. Overall, protists have changed our views on the origin of animals, blurring the uniqueness of our clade and claiming their central role in the understanding of multicellularity.

CONCLUSIONS

The main conclusions of the present work are the following:
1. The evolutionary origins of the MAGUK gene family antedate the origins of animals, most families having pre-metazoan origins (CACNB, DLG, MPP and MAGI). Their subsequent diversification through domain shuffling allowed new roles in adhesion and scaffolding.
2. The repertoire of Tyrosine Kinase Receptors in the filastereans C. owczarzaki and M. vibrans has diversified independently of other holozoan lineages. Complex Tyrosine Kinase repertoires distinguish holozoan lineages from other eukaryotic groups, and holozoans share a common set of cytoplasmic tyrosine kinases.
3. G Protein Coupled Receptors underwent a major expansion in metazoan genomes compared to any other eukaryotic group. Nevertheless, all the cytoplasmic machinery was already in place in the unicellular ancestor of animals.
4. Several transcription factors with major roles in animal development are found in the genomes of unicellular holozoans, such as T-box, Runx, Myc/Max, STAT, p53 and Grainyhead.
5. Parallel evolutionary histories characterize the evolution of the TFome in both plants and animals. The total number of Transcription Factors in eukaryotic genomes reveals that both embryonic multicellular lineages are the richest in terms of diversity and absolute numbers, suggesting a putative role of embryonic development in TFome size.
6. The gene transcription patterns of the whole TFome characterize the basic differences between animal and plant development. Peaks of TF activity at gastrulation and the phylotypic stage, followed by a decrease in adult stages of the fruit fly and zebrafish, are typical of the determined development of animals. Plants have major Transcription Factor activity in later development, as expected from the undetermined mode of development that characterizes plants.
7. The genome of C. owczarzaki shows remarkable conservation of metazoan features, some of them secondarily lost in the choanoflagellate genomes. From protein domain presence and abundance to basic features of genomic architecture, C. owczarzaki shows a complex pre-history of metazoan genomes.

BIBLIOGRAPHY

Abedin, M. & King, N., 2010. Diverse evolutionary paths to cell adhesion. Trends in cell biology, 20(12), pp.734–42.

Abedin, M. & King, N., 2008. The premetazoan ancestry of cadherins. Science, 319(5865), pp.946–8.

Adamska, M. et al., 2007. The evolutionary origin of hedgehog proteins. Current Biology, 17(19), pp.R836–R837.

Adamska, M. et al., 2011. What sponges can tell us about the evolution of developmental processes. Zoology, 114(1), pp.1–10.

Adl, S.M. et al., 2005. The new higher level classification of eukaryotes with emphasis on the taxonomy of protists. The Journal of eukaryotic microbiology, 52(5), pp.399–451.

El Albani, A. et al., 2010. Large colonial organisms with coordinated growth in oxygenated environments 2.1 Gyr ago. Nature, 466(7302), pp.100–4.

Alegado, R.A. et al., 2012. A bacterial sulfonolipid triggers multicellular development in the closest living relatives of animals. eLife, 1, p.e00013.

Alié, A. & Manuel, M., 2010. The backbone of the post-synaptic density originated in a unicellular ancestor of choanoflagellates and metazoans. BMC evolutionary biology, 10, p.34.

Amoutzias, G.D. et al., 2004. Convergent evolution of gene networks by single-gene duplications in higher eukaryotes. EMBO reports, 5(3), pp.274–9.

Aury, J.-M. et al., 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature, 444(7116), pp.171–178.

Axtell, M.J. & Bowman, J.L., 2008. Evolution of plant microRNAs and their targets. Trends in Plant Science, 13(7), pp.343–349.

Bonner, J.T., 2001. First Signals: The Evolution of Multicellular Development, Princeton, New Jersey: Princeton University Press.

Boraas, M., Seale, D. & Boxhorn, J., 1998. Phagotrophy by a flagellate selects for colonial prey: A possible origin of multicellularity. Evolutionary Ecology, 12(2), pp.153–164.

Bouget, F.Y., Berger, F. & Brownlee, C., 1998. Position dependent control of cell fate in the Fucus embryo: role of intercellular communication. Development, 125(11), pp.1999–2008.

Brasier, M., Green, O. & Shields, G., 1997. Ediacarian sponge spicule clusters from southwestern Mongolia and the origins of the Cambrian fauna. Geology, 25, p.303.

Brown, M.W. et al., 2012. Aggregative Multicellularity Evolved Independently in the Eukaryotic Supergroup Rhizaria. Current Biology, 22(12), pp.1–5.

Brown, M.W. et al., 2013. Phylogenomics demonstrates that breviate flagellates are related to opisthokonts and apusomonads. Proceedings of the Royal Society B: Biological Sciences.

Burga, A., Casanueva, M.O.O. & Lehner, B., 2011. Predicting mutation outcome from early stochastic variation in genetic interaction partners. Nature, 480(7376), pp.250–253.

Butterfield, N.J., 2010. Animals and the invention of the Phanerozoic Earth system. Trends in Ecology & Evolution, 26(2), pp.81–87.

Butterfield, N.J., 2000. Bangiomorpha pubescens n. gen., n. sp.: implications for the evolution of sex, multicellularity, and the Mesoproterozoic/Neoproterozoic radiation of eukaryotes. Paleobiology, 26(3), pp.386–404.

Butterfield, N.J., 2009. Modes of pre-Ediacaran multicellularity. Precambrian Research, 173(1-4), pp.201–211.

Cai, X. & Clapham, D.E., 2011. Ancestral Ca2+ signaling machinery in early animal and fungal evolution. Molecular biology and evolution, pp.1–10.

Del Campo, J. & Ruiz-Trillo, I., 2013. Environmental Survey Meta-analysis Reveals Hidden Diversity among Unicellular Opisthokonts. Molecular Biology and Evolution, 30(4), pp.802–805.

Carr, M. et al., 2008. Molecular phylogeny of choanoflagellates, the sister group to Metazoa. Proceedings of the National Academy of Sciences of the United States of America, 105(43), pp.16641–6.

Carroll, S.B., Grenier, J. & Weatherbee, S.D., 2005. From DNA to diversity: molecular genetics and the evolution of animal design, Oxford, England: Blackwell Publishing.

Cavalier-Smith, T., 2012. Early evolution of eukaryote feeding modes, cell structural diversity, and classification of the protozoan phyla Loukozoa, Sulcozoa, and Choanozoa. European journal of protistology, pp.1–64.

Chen, K. & Rajewsky, N., 2007. The evolution of gene regulation by transcription factors and microRNAs. Nature Reviews Genetics, 8(2), pp.93–103.

Cock, J.M. et al., 2010. The Ectocarpus genome and the independent evolution of multicellularity in brown algae. Nature, 465(7298), pp.617–21.

Collen, J. et al., 2013. Genome structure and metabolic features in the red seaweed Chondrus crispus shed light on evolution of the Archaeplastida. Proceedings of the National Academy of Sciences of the United States of America, 110(13), pp.5247–52.

Conaco, C. et al., 2012. Functionalization of a protosynaptic gene expression network. Proceedings of the National Academy of Sciences of the United States of America, 109, pp.10612–10618.

Crespi, B.J., 2001. The evolution of social behavior in microorganisms. Trends in ecology & evolution, 16(4), pp.178–183.

Csuros, M., Rogozin, I.B. & Koonin, E.V., 2011. A detailed history of intron-rich eukaryotic ancestors inferred from a global survey of 100 complete genomes. PLoS computational biology, 7(9), p.e1002150.


Curtis, B. et al., 2012. Algal genomes reveal evolutionary mosaicism and the fate of nucleomorphs. Nature, 492(7427), pp.59–65.

D’Aniello, S. et al., 2008. Gene expansion and retention leads to a diverse tyrosine kinase superfamily in amphioxus. Molecular biology and evolution, 25(9), pp.1841–54.

Davidson, E. & Erwin, D., 2006. Gene Regulatory Networks and the Evolution of Animal Body Plans. Science, 311(5762), pp.796–800.

Dávila López, M., Martínez Guerra, J.J. & Samuelsson, T., 2010. Analysis of Gene Order Conservation in Eukaryotes Identifies Transcriptionally and Functionally Linked Genes. PLoS ONE, 5(5), p.e10654.

Domazet-Lošo, T., Brajković, J. & Tautz, D., 2007. A phylostratigraphy approach to uncover the genomic history of major adaptations in metazoan lineages. Trends in Genetics, 23(11), pp.533–539.

Domazet-Lošo, T. & Tautz, D., 2010a. A phylogenetically based transcriptome age index mirrors ontogenetic divergence patterns. Nature, 468(7325), pp.815–8.

Domazet-Lošo, T. & Tautz, D., 2010b. Phylostratigraphic tracking of cancer genes suggests a link to the emergence of multicellularity in metazoa. BMC Biology, 8(1), p.66.

Dunn, C.W. et al., 2008. Broad phylogenomic sampling improves resolution of the animal tree of life. Nature, 452, pp.745–749.

Edgecombe, G.D. et al., 2011. Higher-level metazoan relationships: recent progress and remaining questions. Organisms Diversity & Evolution.

Eitel, M. et al., 2011. New Insights into Placozoan Sexual Reproduction and Development. PLoS ONE, 6(5), p.e19639.

Erwin, D.H. et al., 2011. The Cambrian conundrum: early divergence and later ecological success in the early history of animals. Science, 334(6059), pp.1091–7.

Fairclough, S., Dayel, M.J. & King, N., 2010. Multicellular development in a choanoflagellate. Current Biology, 20(20), pp.R875–R876.

Fairclough, S.R. et al., 2013. Premetazoan genome evolution and the regulation of cell differentiation in the choanoflagellate Salpingoeca rosetta. Genome biology, 14(2), p.R15.

Fisher, R.M., Cornwallis, C.K. & West, S.A., 2013. Group Formation, Relatedness, and the Evolution of Multicellularity. Current Biology, 23(12), pp.1120–5.

Flot, J.-F. et al., 2013. Genomic evidence for ameiotic evolution in the bdelloid rotifer Adineta vaga. Nature, 500(7463), pp.453–7.

Gabaldón, T. & Koonin, E.V., 2013. Functional and evolutionary implications of gene orthology. Nature Reviews Genetics, 14(5), pp.360–366.

Gazave, E. et al., 2009. Origin and evolution of the Notch signalling pathway: an overview from eukaryotic genomes. BMC Evolutionary Biology, 9(1), p.249.


Gerhart, J., 1999. 1998 Warkany lecture: signaling pathways in development. Teratology, 60(4), pp.226–39.

Gilmore, T.D. & Wolenski, F.S., 2012. NF-κB: where did it come from and why? Immunological Reviews, 246(1), pp.14–35.

Glockling, S.L., Marshall, W.L. & Gleason, F.H., 2013. Phylogenetic interpretations and ecological potentials of the Mesomycetozoea (Ichthyosporea). Fungal Ecology, pp.1–11.

Goldberg, J.M. et al., 2006. The Dictyostelium kinome--analysis of the protein kinases from a simple model organism. PLoS genetics, 2(3), p.e38.

Gould, S.J. & Lewontin, R.C., 1979. The spandrels of San Marco and the Panglossian paradigm: a critique of the adaptationist programme. Proceedings of the Royal Society B: Biological Sciences, 205(1161), pp.581–598.

Grimson, A. et al., 2008. Early origins and evolution of microRNAs and Piwi-interacting RNAs in animals. Nature, 455(7217), pp.1193–7.

Grosberg, R.K. & Strathmann, R.R., 2007. The Evolution of Multicellularity: A Minor Major Transition? Annual Review of Ecology, Evolution, and Systematics, 38(1), pp.621–654.

Halder, G., Callaerts, P. & Gehring, W.J., 1995. Induction of ectopic eyes by targeted expression of the eyeless gene in Drosophila. Science, 267(5205), pp.1788–1792.

Hejnol, A. et al., 2009. Assessing the root of bilaterian animals with scalable phylogenomic methods. Proceedings of the Royal Society B: Biological Sciences, 276(1677), pp.4261–70.

Hertel, L.A., Bayne, C.J. & Loker, E.S., 2002. The symbiont Capsaspora owczarzaki, nov. gen. nov. sp., isolated from three strains of the pulmonate snail Biomphalaria glabrata is related to members of the Mesomycetozoea. International journal for parasitology, 32(9), pp.1183–91.

Hogues, H. et al., 2008. Transcription Factor Substitution during the Evolution of Fungal Ribosome Regulation. Molecular cell, 29(5), pp.552–562.

Hornstein, E. & Shomron, N., 2006. Canalization of development by microRNAs. Nature Genetics, 38 Suppl, pp.S20–S24.

Huldtgren, T. et al., 2011. Fossilized nuclei and germination structures identify Ediacaran “animal embryos” as encysting protists. Science, 334(6063), pp.1696–9.

Irimia, M. et al., 2013. Ancient cis-regulatory constraints and the evolution of genome architecture. Trends in Genetics, 29(9), pp.521–8.

Irimia, M. et al., 2012. Extensive conservation of ancient microsynteny across metazoans due to cis-regulatory constraints. Genome research, 22(12), pp.2356–67.

Jacob, F., 1977. Evolution and Tinkering. Science, 196(4295), pp.1161–1166.

Jolma, A. et al., 2013. DNA-binding specificities of human transcription factors. Cell, 152(1-2), pp.327–39.


Kalinka, A.T. & Tomancak, P., 2012. The evolution of early animal embryos: conservation or divergence? Trends in ecology & evolution, 27(7), pp.385–93.

Karpov, S. & Leadbeater, B., 1998. Cytoskeleton structure and composition in choanoflagellates. The Journal of eukaryotic Microbiology, 45(3), pp.361–367.

Kayal, E. et al., 2013. Cnidarian phylogenetic relationships as revealed by mitogenomics. BMC Evolutionary Biology, 13, p.5.

Kerner, P. et al., 2011. Evolution of RNA-binding proteins in animals: insights from genome-wide analysis in the sponge Amphimedon queenslandica. Molecular Biology and Evolution, 28, pp.2289–2303.

King, N. et al., 2008. The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans. Nature, 451(7180), pp.783–8.

King, N., 2004. The unicellular ancestry of animal development. Developmental Cell, 7(3), pp.313–25.

King, N. & Carroll, S.B., 2001. A receptor tyrosine kinase from choanoflagellates: molecular insights into early animal evolution. Proceedings of the National Academy of Sciences of the United States of America, 98, pp.15032–15037.

Knoll, A.H., 2011. The Multiple Origins of Complex Multicellularity. Annual Review of Earth and Planetary Sciences, 39(1), pp.217–239.

Kodner, R.B. et al., 2008. Sterols in a unicellular relative of the metazoans. Proceedings of the National Academy of Sciences of the United States of America, 105(29), pp.9897–902.

Koonin, E.V. et al., 2004. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome biology, 5(2), p.R7.

Koschwanez, J.H., Foster, K.R. & Murray, A.W., 2013. Improved use of a public good selects for the evolution of undifferentiated multicellularity. eLife, 2, p.e00367.

Kusserow, A. et al., 2005. Unexpected complexity of the Wnt gene family in a sea anemone. Nature, 433, pp.156–160.

Lane, N. & Martin, W., 2010. The energetics of genome complexity. Nature, 467(7318), pp.929–34.

Leclère, L. et al., 2012. Maternally localized germ plasm mRNAs and germ cell/stem cell formation in the cnidarian Clytia. Developmental biology, 364(2), pp.236–48.

Leonard, G. & Richards, T.A., 2012. Genome-scale comparative analysis of gene fusions, gene fissions, and the fungal tree of life. Proceedings of the National Academy of Sciences of the United States of America, 109(52), pp.21402–7.

Levine, M. & Tjian, R., 2003. Transcription regulation and animal diversity. Nature, 424(6945), pp.147–51.

Leys, S.P. & Riesgo, A., 2012. Epithelia, an evolutionary novelty of metazoans. Journal of experimental zoology. Part B, Molecular and developmental evolution, 318(6), pp.438–47.


Lim, W.A. & Pawson, T., 2010. Phosphotyrosine Signaling: Evolving a New Cellular Communication System. Cell, 142(5), pp.661–667.

Liu, B.A. et al., 2011. The SH2 Domain-Containing Proteins in 21 Species Establish the Provenance and Scope of Phosphotyrosine Signaling in Eukaryotes. Science Signaling, 4(202), p.ra83.

Love, G.D. et al., 2009. Fossil steroids record the appearance of Demospongiae during the Cryogenian period. Nature, 457(7230), pp.718–21.

Lozada-Chávez, I., Stadler, P.F. & Prohaska, S.J., 2011. “Hypothesis for the modern RNA world”: a pervasive non-coding RNA-based genetic regulation is a prerequisite for the emergence of multicellular complexity. Origins of life and evolution of the biosphere : the journal of the International Society for the Study of the Origin of Life, 41(6), pp.587–607.

Lynch, M., 2007. The frailty of adaptive hypotheses for the origins of organismal complexity. Proceedings of the National Academy of Sciences of the United States of America, 104(Suppl 1), pp.8597–8604.

Lynch, M. & Conery, J.S., 2003. The origins of genome complexity. Science, 302(5649), pp.1401–4.

Magie, C.R. & Martindale, M.Q., 2008. Cell-Cell Adhesion in the Cnidaria: Insights Into the Evolution of Tissue Morphogenesis. The Biological Bulletin, 214(3), pp.218–232.

Maldonado, M., 2004. Choanoflagellates, choanocytes, and animal multicellularity. Invertebrate Biology, 123(1), pp.1–22.

Maloof, A.C. et al., 2010. Possible animal-body fossils in pre-Marinoan limestones from South Australia. Nature Geoscience, 3(9), pp.653–659.

Manning, G. et al., 2008. The protist, Monosiga brevicollis, has a tyrosine kinase signaling network more elaborate and diverse than found in any known metazoan. Proceedings of the National Academy of Sciences of the United States of America, 105(28), pp.9674–9.

Margulis, L., 1981. Symbiosis in Cell Evolution: Microbial Communities in the Archean and Proterozoic Eons, New York: W.H. Freeman and Company.

Martindale, M.Q., Finnerty, J.R. & Henry, J.Q., 2002. The Radiata and the evolutionary origins of the bilaterian body plan. Molecular Phylogenetics and Evolution, 24, pp.358–365.

Maynard-Smith, J. & Szathmáry, E., 1995. The Major Transitions in Evolution, Oxford, England: Oxford University Press.

McFall-Ngai, M. et al., 2013. Animals in a bacterial world, a new imperative for the life sciences. Proceedings of the National Academy of Sciences of the United States of America, 110(9), pp.3229–3236.

McGuire, A.M. et al., 2008. Cross-kingdom patterns of alternative splicing and splice recognition. Genome biology, 9(3), p.R50.

De Mendoza, A., Sebé-Pedrós, A. & Ruiz-Trillo, I., 2013. El origen de la multicelularidad. Investigación y Ciencia, pp.32–39.


Menon, L.R., McIlroy, D. & Brasier, M.D., 2013. Evidence for Cnidaria-like behavior in ca. 560 Ma Ediacaran Aspidella. Geology.

Meyerowitz, E.M., 2002. Plants compared to animals: the broadest comparative study of development. Science, 295(5559), pp.1482–5.

Michod, R., 2000. Darwinian dynamics: evolutionary transitions in fitness and individuality, New Jersey: Princeton University Press.

Michod, R.E., 2007. Evolution of individuality during the transition from unicellular to multicellular life. Proceedings of the National Academy of Sciences of the United States of America, 104(Suppl 1), pp.8613–8.

Mikhailov, K.V. et al., 2009. The origin of Metazoa: a transition from temporal to spatial cell differentiation. BioEssays: news and reviews in molecular, cellular and developmental biology, 31(7), pp.758–68.

Murchison, E.P. et al., 2012. Genome Sequencing and Analysis of the Tasmanian Devil and Its Transmissible Cancer. Cell, 148(4), pp.780–791.

Murgia, C. et al., 2006. Clonal Origin and Evolution of a Transmissible Cancer. Cell, 126(3), pp.477–487.

Nelson, C.E., Hersh, B.M. & Carroll, S.B., 2004. The regulatory content of intergenic DNA shapes genome architecture. Genome biology, 5(4), p.R25.

Nichols, S.A. et al., 2012. Origin of metazoan cadherin diversity and the antiquity of the classical cadherin/β-catenin complex. Proceedings of the National Academy of Sciences of the United States of America, 109(32), pp.13046–51.

Nielsen, C., 2008. Six major steps in animal evolution: are we derived sponge larvae? Evolution and development, 10, pp.241–257.

Nielsen, R., 2009. Adaptionism-30 years after Gould and Lewontin. Evolution; international journal of organic evolution, 63(10), pp.2487–90.

Niklas, K.J. & Newman, S.A., 2013. The origins of multicellular organisms. Evolution and development, 15(1), pp.41–52.

Nilsen, T.W. & Graveley, B.R., 2010. Expansion of the eukaryotic proteome by alternative splicing. Nature, 463(7280), pp.457–463.

Nosenko, T. et al., 2013. Deep metazoan phylogeny: When different genes tell different stories. Molecular phylogenetics and evolution, 67(1), pp.223–233.

Paps, J. et al., 2013. Molecular phylogeny of unikonts: new insights into the position of apusomonads and ancyromonads and the internal relationships of opisthokonts. Protist, 164(1), pp.2–12.

Paps, J. & Ruiz-Trillo, I., 2010. Animals and Their Unicellular Ancestors. Encyclopedia of life sciences, pp.1–8.

Philippe, H. et al., 2009. Phylogenomics revives traditional views on deep animal relationships. Current Biology, 19(8), pp.706–12.


Philippe, H. et al., 2011. Resolving Difficult Phylogenetic Questions: Why More Sequences Are Not Enough. PLoS Biology, 9(3), p.e1000602.

Pick, K.S. et al., 2010. Improved phylogenomic taxon sampling noticeably affects nonbilaterian relationships. Molecular biology and evolution, 27(9), pp.1983–7.

Pincus, D. et al., 2008. Evolution of the phospho-tyrosine signaling machinery in premetazoan lineages. Proceedings of the National Academy of Sciences of the United States of America, 105(28), pp.9680–4.

Pires-daSilva, A. & Sommer, R.J., 2003. The evolution of signalling pathways in animal development. Nature Reviews Genetics, 4(1), pp.39–49.

Prochnik, S.E. et al., 2010. Genomic analysis of organismal complexity in the multicellular green alga Volvox carteri. Science, 329(5988), pp.223–6.

Punta, M. et al., 2012. The Pfam protein families database. Nucleic Acids Research, 40(D1), pp.D290–D301.

Putnam, N.H. et al., 2007. Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science, 317(5834), pp.86–94.

Ramsdale, M., 2012. Programmed cell death in the cellular differentiation of microbial eukaryotes. Current opinion in microbiology, 15(6), pp.646–52.

Ratcliff, W.C. et al., 2012. Experimental evolution of multicellularity. Proceedings of the National Academy of Sciences of the United States of America, 109(5), pp.1595–600.

Ray, D. et al., 2013. A compendium of RNA-binding motifs for decoding gene regulation. Nature, 499(7457), pp.172–177.

Read, B.A. et al., 2013. Pan genome of the phytoplankton Emiliania underpins its global distribution. Nature, 499(7457), pp.209–13.

Reinke, A.W. et al., 2013. Networks of bZIP protein-protein interactions diversified over a billion years of evolution. Science, 340(6133), pp.730–4.

Richards, G.S. et al., 2008. Sponge genes provide new insight into the evolutionary origin of the neurogenic circuit. Current biology, 18(15), pp.1156–61.

Ringrose, J.H. et al., 2013. Deep proteome profiling of Trichoplax adhaerens reveals remarkable features at the origin of metazoan multicellularity. Nature Communications, 4(1408).

Rokas, A., 2008. The origins of multicellularity and the early history of the genetic toolkit for animal development. Annual Review of Genetics, 42, pp.235–51.

Romero, P.R. et al., 2006. Alternative splicing in concert with protein intrinsic disorder enables increased functional diversity in multicellular organisms. Proceedings of the National Academy of Sciences of the United States of America, 103(22), pp.8390–5.

Roure, B., Baurain, D. & Philippe, H., 2013. Impact of Missing Data on Phylogenies Inferred from Empirical Phylogenomic Data Sets. Molecular Biology and Evolution, 30(1), pp.197–214.


Ruiz-Trillo, I. et al., 2008. A phylogenomic investigation into the origin of metazoa. Molecular biology and evolution, 25(4), pp.664–72.

Ruiz-Trillo, I. et al., 2007. The origins of multicellularity: a multi-taxon genome initiative. Trends in Genetics, 23(3), pp.113–8.

Sakarya, O. et al., 2007. A post-synaptic scaffold at the origin of the animal kingdom. Plos ONE, 2(6), p.e506.

Floyd, S.K. & Bowman, J.L., 2007. The Ancestral Developmental Tool Kit of Land Plants. International Journal of Plant Sciences, 168(1), pp.1–35.

Schierwater, B. et al., 2009. Concatenated analysis sheds light on early metazoan evolution and fuels a modern “urmetazoon” hypothesis. PLoS Biology, 7, p.e20.

Scholz, C.B. & Technau, U., 2003. The ancestral role of Brachyury: expression of NemBra1 in the basal cnidarian Nematostella vectensis (Anthozoa). Development Genes and Evolution, 212, pp.563–70.

Sebé-Pedrós, A. et al., 2010. Ancient origin of the integrin-mediated adhesion and signaling machinery. Proceedings of the National Academy of Sciences of the United States of America, 107(22), p.10142.

Sebé-Pedrós, A. et al., 2012. Premetazoan Origin of the Hippo Signaling Pathway. Cell Reports, 1(1), pp.1–8.

Sebé-Pedrós, A. et al., Regulated multicellularity in a close unicellular relative of Metazoa.

Shalchian-Tabrizi, K. et al., 2008. Multigene phylogeny of choanozoa and the origin of animals. Plos ONE, 3(5), p.e2098.

Shertz, C.A. et al., 2010. Conservation, duplication, and loss of the Tor signaling pathway in the fungal kingdom. BMC genomics, 11(1), p.510.

Shoguchi, E. et al., 2013. Draft Assembly of the Symbiodinium minutum Nuclear Genome Reveals Dinoflagellate Gene Structure. Current Biology, 23(15), pp.1399–408.

Simakov, O. et al., 2013. Insights into bilaterian evolution from three spiralian genomes. Nature, 493(7433), pp.526–31.

Sparks, E., Wachsman, G. & Benfey, P.N., 2013. Spatiotemporal signalling in plant development. Nature Reviews Genetics, 14(9), pp.631–644.

Srivastava, M. et al., 2010. The Amphimedon queenslandica genome and the evolution of animal complexity. Nature, 466(7307), pp.720–726.

Srivastava, M. et al., 2008. The Trichoplax genome and the nature of placozoans. Nature, 454(7207), pp.955–60.

Stajich, J.E. et al., 2009. The fungi. Current Biology, 19(18), pp.R840–5.

Steinmetz, P.R.H. et al., 2012. Independent evolution of striated muscles in cnidarians and bilaterians. Nature, 487(7406), pp.231–4.


Suga, H. & Ruiz-Trillo, I., 2013. Development of ichthyosporeans sheds light on the origin of metazoan multicellularity. Developmental biology, 377(1), pp.284–92.

Sumathi, J.C. et al., 2006. Molecular evidence of fungal signatures in the marine protist Corallochytrium limacisporum and its implications in the evolution of animals and fungi. Protist, 157, pp.363–376.

Tan, C.S.H. et al., 2009. Positive selection of tyrosine loss in metazoan evolution. Science, 325(5948), pp.1686–8.

Tang, H. et al., 2008. Synteny and collinearity in plant genomes. Science, 320(5875), pp.486–8.

The ENCODE Project Consortium, 2012. An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414), pp.57–74.

Thompson, D.A. et al., 2013. Evolutionary principles of modular gene regulation in yeasts. eLife, 2, p.e00603.

Tomitani, A. et al., 2006. The evolutionary diversification of cyanobacteria: Molecular-phylogenetic and paleontological perspectives. Proceedings of the National Academy of Sciences of the United States of America, 103(14), pp.5442–5447.

Torruella, G. et al., 2012. Phylogenetic relationships within the Opisthokonta based on phylogenomic analyses of conserved single-copy protein domains. Molecular biology and evolution, 29(2), pp.531–44.

Traylor-Knowles, N. et al., 2010. The evolutionary diversification of LSF and Grainyhead transcription factors preceded the radiation of basal animal lineages. BMC evolutionary biology, 10, p.101.

Wagner, A., 2011. The molecular origins of evolutionary innovations. Trends in Genetics, 27(10), pp.397–410.

Wolf, Y.I. & Koonin, E.V., 2013. Genome reduction as the dominant mode of evolution. BioEssays, 35(9), pp.829–37.

Yin, L. et al., 2007. Doushantuo embryos preserved inside diapause egg cysts. Nature, 446(7136), pp.661–3.

Zmasek, C.M. & Godzik, A., 2013. Evolution of the Animal Apoptosis Network. Cold Spring Harbor Perspectives in Biology, 5(3), p.a008649.

Zmasek, C.M. & Godzik, A., 2011. Strong functional patterns in the evolution of eukaryotic genomes revealed by the reconstruction of ancestral protein domain repertoires. Genome biology, 12(1), p.R4.

Zmasek, C.M. & Godzik, A., 2012. This Déjà Vu Feeling: Analysis of Multidomain Protein Evolution in Eukaryotic Genomes. PLoS Computational Biology, 8(11), p.e1002701.


ANNEX

Annex A1


The Mysterious Evolutionary Origin for the GNE Gene and the Root of Bilateria

Alex de Mendoza*,1,2 and Iñaki Ruiz-Trillo1,2,3

1Departament de Genètica, Universitat de Barcelona, Barcelona, Spain
2Institut de Recerca en Biodiversitat, Universitat de Barcelona, Barcelona, Spain
3Institució Catalana per a la Recerca i Estudis Avançats, Passeig Lluís Companys 23, 08010 Barcelona, Spain

*Corresponding author: E-mail: [email protected].
Associate editor: Manolo Gouy

Abstract

Phylogenomic analyses have revealed several important metazoan clades, such as the Ecdysozoa and the Lophotrochozoa. However, the phylogenetic positions of a few taxa, such as ctenophores, chaetognaths, acoelomorphs, and Xenoturbella, remain contentious. Thus, the findings of qualitative markers or "rare genomic changes" seem ideal to independently test previous phylogenetic hypotheses. We here describe a rare genomic change, the presence of the gene UDP-GlcNAc 2-epimerase/N-acetylmannosamine kinase (GNE). We show that GNE is encoded in the genomes of deuterostomes, acoelomorphs and Xenoturbella, whereas it is absent in protostomes and nonbilaterians. Moreover, the GNE has a complex evolutionary origin involving unique lateral gene transfer events and/or extensive hidden paralogy for each protein domain. However, rather than using GNE as a phylogenetic character, we argue that rare genomic changes such as the one presented here should be used with caution.

Key words: molecular markers, metazoan phylogeny, lateral gene transfer, Xenoturbella, UDP-GlcNAc 2-epimerase/N-acetylmannosamine kinase, acoels.

© The Author 2011. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]

Mol. Biol. Evol. 28(11):2987–2991. 2011 doi:10.1093/molbev/msr142 Advance Access publication May 25, 2011

Letter

Molecular signatures that characterize a clade are an important tool to create or corroborate phylogenetic groupings (reviewed in Telford and Copley 2011). Examples of such molecular synapomorphies include the presence of Hox/ParaHox genes in the ParaHoxozoa (Placozoa, Cnidaria, and Bilateria) (Ryan et al. 2010), an indel in the gene elongation factor-1 alpha as a character of opisthokonts (Steenkamp et al. 2006) or a type of the NADH dehydrogenase subunit 5 mitochondrial gene that is exclusive to protostomes (Papillon et al. 2004).

The monophyly of deuterostomes (the clade that comprises chordates, echinoderms, and hemichordates) has been consistently recovered in phylogenetic and phylogenomic studies (Hejnol et al. 2009; Paps et al. 2009). However, it remains contentious whether Xenoturbella and/or the acoelomorphs (acoels and nemertodermatids) are members of the deuterostomes or basal bilaterians (Ruiz-Trillo et al. 1999, 2002; Bourlat et al. 2003, 2006, 2009; Philippe et al. 2007, 2011; Dunn et al. 2008; Hejnol et al. 2009; Paps et al. 2009; Mwinyi et al. 2010). Thus, the identification of diagnostic molecular synapomorphies for deuterostomes is important to both corroborate previous molecular analyses and independently test the putative deuterostome affiliation of acoels and Xenoturbella.

We here show that the bifunctional enzyme UDP-GlcNAc 2-epimerase/N-acetylmannosamine kinase (GNE) is exclusive to deuterostomes, acoels and Xenoturbella, being absent from all sequenced protostomes and nonbilaterian taxa. Our data show that GNE is encoded in the genomes of all sequenced deuterostomes except for the urochordates Ciona savignyi, Ciona intestinalis, and Oikopleura dioica, most likely an effect of secondary gene loss (D'Aniello et al. 2008; Churcher and Taylor 2009). Moreover, a small fragment of the gene is present in the expressed sequence tags (EST) of Xenoturbella bocki, and we have amplified the gene GNE from the acoel Symsagittifera roscoffensis (GenBank JF826132). We searched publicly available EST data as well as unpublished transcriptome data from nemertodermatids and did not get any hit. However, because a complete genome of a nemertodermatid is not available, we cannot rule out the presence of GNE in this group. Interestingly, the GNE genes encoded by chordates, echinoderms, and hemichordates all share the same nine introns (both in position and phase), whereas the GNE of the acoel S. roscoffensis does not share any intron with deuterostomes (supplementary fig. S1, Supplementary Material online). Unfortunately, we could not elucidate the intron–exon structure of the GNE encoded by X. bocki because only a small cDNA fragment is available.

GNE is known to play an important role in the biosynthesis of sialic acids, which are monosaccharides that act in a wide range of biological and pathological events, such as cellular adhesion, recognition determinants, tumorigenesis, and stem cells (Effertz et al. 1999; Tanner 2005; Weidemann et al. 2010). In mammals, the metabolic precursor of sialic acids is N-acetylneuraminic acid (Neu5Ac), which derives from UDP-N-acetylglucosamine (UDP-GlcNAc). The first two steps of this reaction are catalyzed by the bifunctional enzyme GNE (fig. 1). The bifunctional activity of GNE comes from two different protein domains: the UDP-N-acetylglucosamine 2-epimerase domain (PF02350) (hereafter the "epimerase-2 domain") and a kinase domain known as ROK (Repressor, Open Reading Frame, Kinase; PF00480) (Tanner 2005). The epimerase-2 domain converts UDP-GlcNAc to N-acetylmannosamine (ManNAc), which is subsequently phosphorylated to ManNAc-6-P by the ROK domain (fig. 1). Interestingly, whereas in prokaryotes the epimerase and kinase functions are carried out by two separate enzymes, in bilaterians those two domains have been fused, allowing an allosteric site to appear and thus conferring the potential for new functions to arise.

FIG. 1. (A) Schematic representation of the biosynthesis pathway of CMP-NeuNAc both in bacteria and mammals (adapted from Tanner 2005). Steps 1 and 2 are performed by GNE in deuterostomes. Step 3 is exclusive to bacteria, whereas the additional step 4 has so far only been described in mammals. (B) Cladogram representing GNE evolution among the Bilateria according to different current phylogenetic scenarios. Black dot indicates acquisition of the GNE, cross shows secondary loss, black triangle shows intron gain, and white triangle indicates intron loss.

Thus, to gain further insights into the evolutionary origin of GNE, we performed both exhaustive searches across the public databases and phylogenetic analyses of the two domains independently. Our data show that both domains have a patchy distribution across eukaryotes. The epimerase-2 domain is encoded in a few eukaryotic genomes, being, in contrast, ubiquitous among Archaea and Eubacteria (fig. 2). The phylogenetic analyses show two major clades, one comprising the bilaterian-specific GNE genes within a mostly prokaryotic clade (clade A in fig. 2) and the other comprising most other (nonmetazoan) eukaryotes, also branching within a prokaryotic clade (clade B in fig. 2). Both clades are divided with high nodal support (bootstrap value = 100%, and Bayesian posterior probability = 1.00), and both have specific indel characters (see supplementary fig. S2, Supplementary Material online). The general topology of the epimerase-2 domain is probably due to 1) an extreme case of hidden paralogy (i.e., this protein domain was present at the origin of eukaryotes and lost in all lineages except in deuterostomes, Xenoturbella and Acoela and a few other eukaryotes), 2) domain convergence, or 3) the consequence of several independent lateral gene transfer (LGT) events to the different eukaryotic lineages, one being an LGT event to the last common bilaterian ancestor (and then subsequently lost in protostomes and in urochordates) or to the last common ancestor of deuterostomes (and lost in urochordates) if xenacoelomorphs (Xenoturbella + acoelomorphs) are indeed deuterostomes as recently suggested (Philippe et al. 2011). Interestingly, the sequence from Micromonas sp. falls as sister group to bilaterians. However, the Micromonas gene, in contrast to the deuterostome sequences, has no introns. Moreover, neither Micromonas pusilla (the other congeneric species with its complete genome sequenced) nor any other sequenced chlorophyte (except Ostreococcus lucimarinus, whose epimerase gene is distantly related; see fig. 2) encodes this domain. Thus, the Micromonas epimerase-2 sequence, as well as the Ricinus homolog, which branches within another independent bacterial clade (fig. 2), most probably come from independent LGT events (see Supplementary Material online).

The ROK domain also bears a complex evolutionary history (fig. 3). The phylogenetic analysis shows most eukaryotic sequences in a single monophyletic group that also includes Eubacteria. The bilaterian GNE genes are monophyletic and unrelated to the other eukaryotic sequences (also supported by indel characters, see supplementary fig. S2B, Supplementary Material online). Again, this topology can be explained either by LGT or by hidden paralogy.

Our data show that GNE (the derived gene fusion of epimerase-2 and ROK domains) is exclusive to most deuterostomes, acoelomorphs and Xenoturbella. How this presence/absence scheme is interpreted remains unclear. One could easily propose the presence of the gene GNE as a molecular synapomorphy of deuterostomes (fig. 1B). This would corroborate the recent, although weakly supported, proposal that xenacoelomorphs are deuterostomes (Philippe et al. 2011). However, if Xenacoelomorpha or just the Acoelomorpha are not deuterostomes but basal bilaterians, as most phylogenetic analyses suggest (Ruiz-Trillo et al. 1999, 2002; Hejnol et al. 2009; Paps et al. 2009; Mwinyi et al. 2010), then GNE was secondarily lost in the last common protostome ancestor. This is indeed not a difficult scenario, especially considering that gene loss must already be hypothesized for urochordates (fig. 1B). The analysis of intron composition shows that the GNE encoded by acoels

Evolutionary Origin of the GNE Gene · doi:10.1093/molbev/msr142 · MBE

FIG. 2. Maximum likelihood (ML) phylogenetic tree of the epimerase-2 domain as obtained by RAxML (Whelan and Goldman [WAG] + Γ + I). The tree is rooted using the midpoint-rooted tree option. Nodal support was obtained by RAxML 100 bootstrap replicates (bootstrap value [BV]) and Bayesian posterior probabilities (PP). Both values are shown on key branches. A black dot in the node indicates BV > 95% and PP > 0.95. Eukaryotes are shown in bold. "A_" before species name indicates archaeal sequences. Characteristic domain architectures are shown in key taxa or lineages.
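The support rule stated in the caption (a black dot only when both the bootstrap value and the posterior probability exceed their thresholds) can be written as a small predicate. This is a minimal sketch for illustration, not code used in the study: the function name and the second example node are hypothetical, while the 100% / 1.00 values for the split between clades A and B are those reported in the text.

```python
def strongly_supported(bootstrap_value: float, posterior_probability: float) -> bool:
    """Return True when a node would receive a black dot under the caption's rule.

    Both criteria must hold: BV > 95 (percent) and PP > 0.95.
    """
    return bootstrap_value > 95.0 and posterior_probability > 0.95


# (BV in percent, PP) per node. The first entry uses the support values
# reported in the text; the second is a hypothetical weakly supported node.
nodes = {
    "clade A / clade B split": (100.0, 1.00),
    "hypothetical weak node": (58.0, 0.78),
}

flags = {name: strongly_supported(bv, pp) for name, (bv, pp) in nodes.items()}
```

Requiring both measures at once is deliberately conservative: a node with a high posterior probability but a modest bootstrap value is not flagged.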


FIG. 3. Maximum likelihood (ML) phylogenetic tree of the ROK domain as obtained by RAxML (Whelan and Goldman [WAG] + Γ + I). The tree is rooted using the eukaryotes as outgroup. RAxML 100 bootstrap replicates (bootstrap value [BV]) and Bayesian posterior probabilities (PP) values are shown on key branches. A black dot in the node indicates BV > 95% and PP > 0.95. Eukaryotes are shown in bold. "A_" before species name indicates archaeal sequences. Characteristic domain architectures are shown in key lineages.

and deuterostomes independently evolved their own introns (supplementary fig. S1, Supplementary Material online), which gives some support to the view that acoels are not deuterostomes. However, independent intron evolution within acoels could easily be argued as well, as has been described, for example, in the urochordate O. dioica (Edvardsen et al. 2004). To sum up, although rare genomic changes can be important phylogenetic markers, they should be used with caution because gene loss has been shown to play an important role in eukaryotic evolution (see, e.g., Sebé-Pedrós et al. 2011; Zmasek and Godzik 2011). Additional data, especially from phylogenomic analyses, should be taken into account. Finally, the two domains that make up the GNE gene have complex evolutionary histories, most likely involving LGT events or extreme hidden paralogy.

Supplementary Material

Supplementary material files 1, 2, and 3 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments

We thank Albert Poustka and Pere Martínez for allowing us to use the genomic raw data from S. roscoffensis. We thank Andreas Hejnol for access to Meara stichopi data and helpful discussion. We thank Marta Chiodin for providing the cDNA from S. roscoffensis. We thank Jordi Paps, Lora L. Shadwick, Marta Riutort, Jaume Baguñà, and Pere Martínez for helpful insights and motivation. This work was supported by an Institució Catalana per a la Recerca i Estudis Avançats contract, a European Research Council Starting Grant (ERC-2007-StG-206883), and a grant (BFU2008-02839/BMC) from Ministerio de Ciencia e Innovación (MICINN) to I.R.-T. A.d.M.'s salary was supported by a pregraduate FPI grant from MICINN.

References

Bourlat SJ, Juliusdottir T, Lowe CJ, et al. (14 co-authors). 2006. Deuterostome phylogeny reveals monophyletic chordates and the new phylum Xenoturbellida. Nature 444:85–88.
Bourlat SJ, Nielsen C, Lockyer AE, Littlewood DTJ, Telford MJ. 2003. Xenoturbella is a deuterostome that eats molluscs. Nature 424:925–928.
Bourlat SJ, Rota-Stabelli O, Lanfear R, Telford MJ. 2009. The mitochondrial genome structure of Xenoturbella bocki (phylum Xenoturbellida) is ancestral within the deuterostomes. BMC Evol Biol. 9:107.
Churcher AM, Taylor JS. 2009. Amphioxus (Branchiostoma floridae) has orthologs of vertebrate odorant receptors. BMC Evol Biol. 9:242.
D'Aniello S, Irimia M, Maeso I, Pascual-Anaya J, Jimenez-Delgado S, Bertrand S, García-Fernàndez J. 2008. Gene expansion and retention leads to a diverse tyrosine kinase superfamily in amphioxus. Mol Biol Evol. 25:1841–1854.
Dunn CW, Hejnol A, Matus DQ, et al. (18 co-authors). 2008. Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 452:745–749.
Edvardsen RB, Lerat E, Maeland AD, Flat M, Tewari R, Jensen MF, Lehrach H, Reinhardt R, Seo HC, Chourrout D. 2004. Hypervariable and highly divergent intron-exon organizations in the chordate Oikopleura dioica. J Mol Evol. 59:448–457.
Effertz K, Hinderlich S, Reutter W. 1999. Selective loss of either the epimerase or kinase activity of UDP-N-acetylglucosamine 2-epimerase/N-acetylmannosamine kinase due to site-directed mutagenesis based on sequence alignments. J Biol Chem. 274:28771–28778.
Hejnol A, Obst M, Stamatakis A, et al. (17 co-authors). 2009. Assessing the root of bilaterian animals with scalable phylogenomic methods. Proc Biol Sci. 276:4261–4270.
Mwinyi A, Bailly X, Bourlat SJ, Jondelius U, Littlewood DT, Podsiadlowski L. 2010. The phylogenetic position of Acoela as revealed by the complete mitochondrial genome of Symsagittifera roscoffensis. BMC Evol Biol. 10:309.
Papillon D, Perez Y, Caubit X, Le Parco Y. 2004. Identification of chaetognaths as protostomes is supported by the analysis of their mitochondrial genome. Mol Biol Evol. 21:2122–2129.
Paps J, Baguñà J, Riutort M. 2009. Bilaterian phylogeny: a broad sampling of 13 nuclear genes provides a new Lophotrochozoa phylogeny and supports a paraphyletic basal acoelomorpha. Mol Biol Evol. 26:2397–2406.
Philippe H, Brinkmann H, Martinez P, Riutort M, Baguñà J. 2007. Acoel flatworms are not platyhelminthes: evidence from phylogenomics. PLoS One. 2:e717.
Philippe H, Brinkmann H, Copley RR, Moroz LL, Nakano H, Poustka AJ, Wallberg A, Peterson KJ, Telford MJ. 2011. Acoelomorph flatworms are deuterostomes related to Xenoturbella. Nature 470:255–258.
Ruiz-Trillo I, Paps J, Loukota M, Ribera C, Jondelius U, Baguñà J, Riutort M. 2002. A phylogenetic analysis of myosin heavy chain type II sequences corroborates that Acoela and Nemertodermatida are basal bilaterians. Proc Natl Acad Sci U S A. 99:11246–11251.
Ruiz-Trillo I, Riutort M, Littlewood DT, Herniou EA, Baguñà J. 1999. Acoel flatworms: earliest extant bilaterian Metazoans, not members of Platyhelminthes. Science 283:1919–1923.
Ryan JF, Pang K, Mullikin JC, Martindale MQ, Baxevanis AD. 2010. The homeodomain complement of the ctenophore Mnemiopsis leidyi suggests that Ctenophora and Porifera diverged prior to the ParaHoxozoa. Evodevo 1:9.
Sebé-Pedrós A, de Mendoza A, Lang BF, Degnan BM, Ruiz-Trillo I. 2011. Unexpected repertoire of metazoan transcription factors in the unicellular holozoan Capsaspora owczarzaki. Mol Biol Evol. 28:1241–1254.
Steenkamp ET, Wright J, Baldauf SL. 2006. The protistan origins of animals and fungi. Mol Biol Evol. 23:93–106.
Tanner ME. 2005. The enzymes of sialic acid biosynthesis. Bioorg Chem. 33:216–228.
Telford MJ, Copley RR. 2011. Improving animal phylogenies with genomic data. Trends Genet. 27:186–195.
Weidemann W, Klukas C, Klein A, Simm A, Schreiber F, Horstkorte R. 2010. Lessons from GNE-deficient embryonic stem cells: sialic acid biosynthesis is involved in proliferation and gene expression. Glycobiology 20:107–117.
Zmasek CM, Godzik A. 2011. Strong functional patterns in the evolution of eukaryotic genomes revealed by the reconstruction of ancestral protein domain repertoires. Genome Biol. 12:R4.


Annex A2


GBE

A Broad Genomic Survey Reveals Multiple Origins and Frequent Losses in the Evolution of Respiratory Hemerythrins and Hemocyanins

José M. Martín-Durán1,*,†, Alex de Mendoza2,3,†, Arnau Sebé-Pedrós2,3,†, Iñaki Ruiz-Trillo2,3,4, and Andreas Hejnol1
1Sars International Centre for Marine Molecular Biology, University of Bergen, Bergen, Norway
2Institut de Biologia Evolutiva (CSIC-Universitat Pompeu Fabra), Passeig Marítim de la Barceloneta, Barcelona, Spain
3Departament de Genètica, Universitat de Barcelona, Barcelona, Spain
4Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig Lluís Companys, Barcelona, Spain
†These authors contributed equally to this work.

*Corresponding author: E-mail: [email protected].
Accepted: July 1, 2013
Data deposition: GenBank/DDBJ/EMBL accession numbers of the sequences used in the phylogenetic analyses are indicated in the supplementary data, Supplementary Material online.

Abstract

Hemerythrins and hemocyanins are respiratory proteins present in some of the most ecologically diverse animal lineages; however, the precise evolutionary history of their enzymatic domains (hemerythrin, hemocyanin M, and tyrosinase) is still not well understood. We survey a wide dataset of prokaryote and eukaryote genomes and RNAseq data to reconstruct the phylogenetic origins of these proteins. We identify new species with hemerythrin, hemocyanin M, and tyrosinase domains in their genomes, particularly within animals, and demonstrate that the current distribution of respiratory proteins is due to several events of lateral gene transfer and/or massive gene loss. We conclude that the last common metazoan ancestor had at least two hemerythrin domains, one hemocyanin M domain, and six tyrosinase domains. The patchy distribution of these proteins among animal lineages can be partially explained by physiological adaptations, making these genes good targets for investigations into the interplay between genomic evolution and physiological constraints.

Key words: comparative genomics, hemerythrin, hemocyanin, tyrosinase, respiration, lateral gene transfer.

Introduction

Hemoglobins, hemerythrins, and hemocyanins are three different respiratory proteins present in animals (Terwilliger 1998). Hemoglobins have a Fe–protoporphyrin ring to reversibly bind oxygen and are the most common molecules for oxygen transport and storage in the Bilateria (Weber and Vinogradov 2001). Globin proteins are widespread in the tree of life and, in animals, respiratory globins likely evolved from a membrane-bound ancestor that acquired a respiratory function independently in different lineages (Roesner et al. 2005; Blank and Burmester 2012). In contrast to the widespread hemoglobins, hemerythrins and hemocyanins have been detected in fewer animal groups. Hemerythrins transport oxygen using two Fe2+ ions that bind directly to the polypeptide chain and have been described in a cnidarian (Nematostella vectensis), priapulids, brachiopods, some annelids, and sipunculans (Terwilliger 1998; Bailly et al. 2008). Recently, it has been shown that regulation of iron homeostasis in vertebrates involves an E3 ubiquitin ligase (FBXL5 gene) with an iron-responsive hemerythrin domain in its structure (Salahudeen et al. 2009; Vashisht et al. 2009), although it is not clear how this hemerythrin domain-containing protein is related to invertebrate respiratory hemerythrins. Hemocyanins are large proteins that have copper-binding sites to transport oxygen in arthropods and molluscs (Bonaventura and Bonaventura 1980). Despite their shared name, arthropod and molluscan respiratory hemocyanins are considered to have evolved independently from a common ancestral copper

© The Author(s) 2013. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

Genome Biol. Evol. 5(7):1435–1442. doi:10.1093/gbe/evt102 Advance Access publication July 9, 2013

protein based on protein similarities (Burmester 2001; van Holde et al. 2001). A deeper understanding of the evolutionary history of hemerythrins and hemocyanins, and thus of the different respiratory strategies in animals, is limited by the absence of data for many invertebrate groups and, in particular, for unicellular holozoans and other eukaryotes. In this study, we survey a phylogenetically broad set of genomes and RNAseq data to identify new hemerythrin and hemocyanin proteins, and we then reconstruct their evolution within the eukaryote tree of life, with particular focus on animal lineages.

FIG. 1. Maximum likelihood (ML) phylogenetic tree of the hemerythrin domain as obtained by RAxML. The tree is rooted using the midpoint-rooted tree option. 1,000 replicate bootstrap values (BV, in black) and Bayesian posterior probabilities (BPP, in red) are shown for each node. A black dot in the node indicates BV >95% and BPP >0.95. Metazoan hemerythrins are highlighted by colored rectangles. Domain architectures are shown for major lineages (abbreviations and accession numbers of each domain are listed in supplementary table S2, Supplementary Material online).

In addition to the respiratory hemerythrin sequences previously characterized in animals (Vanin et al. 2006; Bailly et al. 2008; Meyer and Lieb 2010), we identified respiratory hemerythrins in the priapulid Priapulus caudatus, the arthropod Calanus finmarchicus, and the bryozoans Alcyonidium diaphanum and Membranipora membranacea (see supplementary table S1, Supplementary Material online). In bryozoans, no respiratory proteins have been previously described, despite the presence of a circulatory system in these animals (Schmidt-Rhaesa 2007). Unlike other animal hemerythrins, the hemerythrin domain shows a Ca2+-binding EF-hand domain in its N-terminal region in both bryozoan species. Additionally, we identified an E3 ubiquitin ligase containing an F-box domain together with a hemerythrin domain, as in FBXL5, in cnidarians and across bilaterally symmetrical animals (see supplementary table S1, Supplementary Material online), suggesting an ancient origin of the iron-sensing system described in vertebrates. Our phylogenetic analyses show two major clades of hemerythrin-containing proteins (fig. 1), one comprising the metazoan FBXL5 gene and most of the eukaryote hemerythrins (clade A) and the other comprising the metazoan respiratory hemerythrins (including the newly identified sequences from this study) (clade B). As shown in a previous report (Bailly et al. 2008), respiratory hemerythrins are closely related to some Naegleria gruberi hemerythrins and a sequence from the amoebozoan Acanthamoeba castellanii. To eliminate possible bacterial contamination, we checked the gene structure and confirmed that the Acanthamoeba gene has introns within the hemerythrin domain. The two major clades (A, nonrespiratory, and B, respiratory) are separated with high nodal support, which is in agreement with observed structural differences (Histidine being only present in clade B) (Thompson et al. 2012). Interestingly, clade B hemerythrins seem to be more common and highly diversified in prokaryotes (French et al. 2008) than in eukaryotes. Metazoans may have acquired them during an ancient event of lateral gene transfer (LGT), but, according to our phylogeny, it is more likely that clade B hemerythrins are ancient and have been lost in many eukaryotic lineages, being so far only present in three extant distant lineages: Amoebozoa, Excavata, and Metazoa. Clade B prokaryote hemerythrins have been shown to bind oxygen as metazoan respiratory hemerythrins (Xiong et al. 2000) but


have been mostly related to oxygen sensing and aerotaxis processes (Xiong et al. 2000; Isaza et al. 2006) and oxygen supply to other metabolic enzymes (Karlsen et al. 2005). This suggests that the oxygen storage and transport function of hemerythrins may have evolved independently in metazoans, given the lack of functional data for the Excavata and the Amoebozoa. Under this scenario, the roles of respiratory hemerythrins in iron storage, metal detoxification, and immunity observed in some annelids (e.g., the leeches Theromyzon tessulatum and Hirudo medicinalis and the polychaete Neanthes diversicolor) (Baert et al. 1992; Demuynck et al. 1993; Vergote et al. 2004) are secondary specializations of this type of protein. Finally, nonrespiratory hemerythrins (clade A) are quite common in eukaryotes and have recruited several companion domains in different lineages.

Searches for the hemocyanin M domain (arthropod hemocyanins) identified this copper-binding protein in amoebozoans, the fungus Aspergillus niger, the sponge Amphimedon queenslandica, the ctenophore Mnemiopsis leidyi, and the hemichordate Saccoglossus kowalevskii (see supplementary table S1, Supplementary Material online). Therefore it is likely to be a unikont synapomorphy. Our phylogenetic analyses show the monophyly of all metazoan sequences, as well as of the main arthropod protein families (fig. 2). The relationship of the sponge and fungal sequences may be due to an ancient LGT, though the presence of hemocyanins in the more distant amoebozoans does not support that idea. Moreover, the A. niger gene has an N-terminal intron and is located between two other fungal genes, making it less likely to come from a recently incorporated segment of metazoan DNA. Furthermore, a pseudogene with a hemocyanin M domain is present in the fungus Neosartorya fischeri (a congeneric species despite the name), but absent from all 6 other Aspergillus genomes and also from all the other fungi sequenced to date. The newly identified sequences in this study demonstrate that the N-domain of arthropod hemocyanins and related proteins is a specific molecular signature of the Panarthropoda (Onychophora + Arthropoda), although there is some degree of similarity of these regions in nonarthropod sequences. The presence of hemocyanin-like proteins in the tunicate Ciona intestinalis with putative phenoloxidase activity suggested that respiratory hemocyanins evolved from an ancestral prophenoloxidase (Immesberger and Burmester 2004). Given the absence of functional data for the nonbilaterian animals (i.e., the ctenophore M. leidyi and the sponge A. queenslandica) and our phylogeny (fig. 2), this is still the most parsimonious functional explanation for the evolution of the respiratory properties of arthropod hemocyanins.

An extensive search for the tyrosinase domain (molluscan hemocyanin) demonstrated a wide distribution of this copper-binding protein across metazoan lineages (see supplementary table S1, Supplementary Material online), with the remarkable exception of arthropods. The absence in arthropods could be associated with the expansion and diversification of the hemocyanin M domain in this lineage, which can exhibit similar activities to the tyrosinase domain, for example in melanin biosynthesis (Sugumaran 2002). In contrast to a previous analysis (Esposito et al. 2012), our phylogenetic reconstruction of a broader dataset shows that the animal tyrosinase domains group in six independent clades (clades A–F) (fig. 3), which are further supported by the domain architecture of the proteins nested in each clade (e.g., clade D and clade F). The tyrosinase domain that gave rise to the molluscan hemocyanins is related to brachiopod and tunicate sequences (clade B), and the series of duplications that led to the typical arrangement of eight tyrosinase domains in tandem (Bonaventura and Bonaventura 1980) is specific to molluscs. With the exception of clade E, which is restricted to nonbilaterian animals (fig. 3), the other clades exhibit an extremely patchy distribution across bilaterally symmetrical animals (see supplementary table S1, Supplementary Material online), not only between major animal groups but also within the same group (e.g., in molluscs, Crassostrea gigas has only a clade D tyrosinase, Lottia gigantea has clade A and D tyrosinases, and Sepia officinalis has both clade B and D tyrosinases). Despite the poor resolution of deeper nodes, our phylogenetic scenario strongly supports at least three independent origins of metazoan tyrosinases. Clades A, B, and F are well supported (PP > 0.9) and nested with nonmetazoan sequences. The other three clades (clades C, D, and E) are not robustly supported, but have unique domain architectures and do not significantly cluster with other metazoan groups; therefore, they might also come from independent origins. Moreover, our phylogenetic analysis demonstrates that the tyrosinase-containing proteins of plants likely originated through an LGT event from bacteria, corroborated by these proteins exhibiting the same domain architecture (see fig. 3).

Altogether, our data clarify the origins and evolutionary history of the alternative respiratory strategies observed in animals (fig. 4). Respiratory hemerythrins, arthropod hemocyanins, and molluscan respiratory tyrosinases originated independently from enzymatic domains that were most likely already present in the last common metazoan ancestor. Although their function in early branching lineages that do not possess circulatory systems needs to be elucidated (e.g., the function of hemerythrins in the cnidarian N. vectensis or the hemocyanin M domain in sponges and ctenophores), the co-option of these domains for respiratory purposes occurred independently, and most likely took place at the base of the Protostomia (hemerythrin), the (Pan-)Arthropoda (arthropod hemocyanins), and the Mollusca (molluscan tyrosinase "hemocyanins"). Accordingly, the similarities observed between arthropod and molluscan hemocyanins (e.g., use of copper to reversibly bind oxygen as a respiratory strategy, oligomerization, and secretion to the hemolymph) are the result of convergent evolution. The evolutionary history of hemerythrins and hemocyanins is characterized by frequent losses, even after a respiratory function has been acquired when a


FIG. 2. Maximum likelihood (ML) phylogenetic tree of the hemocyanin M domain as obtained by RAxML. The tree is rooted using the amoebozoan hemocyanin genes as outgroup. 1,000 replicate bootstrap values (BV, in black) and Bayesian posterior probabilities (BPP, in red) are shown for each node. A black dot in the node indicates BV >95% and BPP >0.95. Metazoan and panarthropod hemocyanins are highlighted by colored rectangles. The Aspergillus niger hemocyanin sequence is enclosed by a red circle. Domain architectures are shown for major lineages (abbreviations and accession numbers of each domain are listed in supplementary table S2, Supplementary Material online).
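As a reading aid for the bootstrap values (BV) quoted in the caption, the sketch below shows how such a percentage is conventionally obtained: it is the share of bootstrap replicate trees that contain a given clade. This is a toy illustration under stated assumptions, not RAxML's implementation. Each replicate tree is represented simply as a set of clades (frozensets of tip names), and the tip labels and replicates are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set

Clade = FrozenSet[str]


def bootstrap_value(clade: Iterable[str], replicates: List[Set[Clade]]) -> float:
    """Percentage of replicate trees that contain `clade` as a grouping."""
    target = frozenset(clade)
    hits = sum(1 for tree in replicates if target in tree)
    return 100.0 * hits / len(replicates)


# Four hypothetical replicate trees over tips a-d, each reduced to the
# set of non-trivial clades it contains.
replicates = [
    {frozenset({"a", "b"}), frozenset({"a", "b", "c"})},
    {frozenset({"a", "b"}), frozenset({"c", "d"})},
    {frozenset({"a", "c"}), frozenset({"a", "b", "c"})},
    {frozenset({"a", "b"}), frozenset({"a", "b", "d"})},
]

support_ab = bootstrap_value({"a", "b"}, replicates)  # found in 3 of 4 replicates
```

With 1,000 replicates, as in the analyses described here, the same proportion is simply computed over a longer list.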


FIG. 3. Maximum likelihood (ML) phylogenetic tree of the tyrosinase domain as obtained by RAxML. The tree is rooted using the midpoint-rooted tree option. 1,000 replicate bootstrap values (BV, in black) and Bayesian posterior probabilities (BPP, in red) are shown for each node. A black dot in the node indicates BV >95% and BPP >0.95. Metazoan tyrosinase clades are highlighted by colored rectangles. Domain architectures are shown for major lineages (abbreviations and accession numbers of each domain are listed in supplementary table S2, Supplementary Material online).
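The caption states that the tree is midpoint rooted. As a hedged illustration of what that means, the sketch below places the root halfway along the longest tip-to-tip path of an unrooted tree. It is a minimal toy version, not the routine used by any particular tree viewer or by RAxML; the tree, node names, and branch lengths are hypothetical.

```python
from typing import Dict, Tuple

# Unrooted tree as a symmetric adjacency map: node -> {neighbour: branch length}
Tree = Dict[str, Dict[str, float]]


def add_edge(tree: Tree, a: str, b: str, length: float) -> None:
    """Add an undirected branch of the given length between a and b."""
    tree.setdefault(a, {})[b] = length
    tree.setdefault(b, {})[a] = length


def farthest(tree: Tree, start: str) -> Tuple[str, float, Dict[str, str]]:
    """Return the node farthest from `start`, its distance, and parent pointers."""
    dist = {start: 0.0}
    parent: Dict[str, str] = {}
    stack = [start]
    while stack:
        node = stack.pop()
        for nbr, length in tree[node].items():
            if nbr not in dist:
                dist[nbr] = dist[node] + length
                parent[nbr] = node
                stack.append(nbr)
    far = max(dist, key=lambda n: dist[n])
    return far, dist[far], parent


def midpoint(tree: Tree) -> Tuple[str, str, float]:
    """Return (a, b, offset): the root lies on branch a-b, `offset` away from a."""
    # Two sweeps find the two most distant tips (the tree's longest path).
    tip_u, _, _ = farthest(tree, next(iter(tree)))
    tip_v, longest, parent = farthest(tree, tip_u)
    # Walk from tip_v back towards tip_u until half the path is covered.
    remaining = longest / 2.0
    node = tip_v
    while True:
        up = parent[node]
        length = tree[node][up]
        if remaining <= length:
            return node, up, remaining
        remaining -= length
        node = up


# Hypothetical four-tip tree with internal nodes n1 and n2.
toy: Tree = {}
add_edge(toy, "t1", "n1", 0.5)
add_edge(toy, "t2", "n1", 4.0)
add_edge(toy, "n1", "n2", 1.0)
add_edge(toy, "n2", "t3", 2.0)
add_edge(toy, "n2", "t4", 0.5)

root_position = midpoint(toy)  # longest path is t2-t3 (length 7.0)
```

Midpoint rooting assumes roughly clock-like rates across the tree, which is one reason an outgroup is preferred when a suitable one is available.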


FIG. 4. Scenario for the evolution of hemerythrins and hemocyanins in metazoans. As shown in this study, the metazoan ancestor likely had two different hemerythrin domains (oxygen subtype and iron-related subtype), one hemocyanin M domain and six independent tyrosinases (A–F subtypes). In the last common ancestor to cnidarians and bilaterians, the iron-related hemerythrin subtype evolved into FBXL5, an E3 ubiquitin ligase involved in iron homeostasis. The diversification of bilaterally symmetrical animals was accompanied by extensive independent losses of these proteins in the different lineages, which can be partially related to the colonization of new niches and environments. The adaptation of the oxygen hemerythrin subtype to respiratory purposes occurred at least in the last common ancestor of protostome animals. On the contrary, the adaptation of the hemocyanin M domain and the clade B tyrosinases to oxygen transport and storage occurred independently in panarthropods and molluscs, respectively. Results from groups indicated with a superscript 1 originated from RNAseq data.

higher selective pressure against loss could be expected. For instance, the use of tyrosinase as an oxygen transport molecule seems to be absent in some groups of molluscs, such as solenogasters and pteriomorphids (e.g., C. gigantea, as also shown in this study) (Lieb and Todt 2008), in which it was probably replaced by other respiratory proteins that have evolved independently in these lineages. This is the case for gastropods in the group Planorbidae, which lack hemocyanin in their hemolymph and which utilize an extracellular hemoglobin (evolved from an intracellular myoglobin present in the


gastropod radula muscle) as an alternative strategy for oxygen transport. This high molecular mass hemoglobin has a higher affinity for oxygen than the ancestral hemocyanin (Lieb et al. 2006). Similar adaptations are also observed within the Crustacea, such as in branchiopods, ostracods, copepods, cirripeds, and decapods, which lost hemocyanins and evolved hemoglobins as respiratory proteins (Terwilliger and Ryan 2001). In the water flea Daphnia magna, for instance, the tandemly duplicated gene cluster of hemoglobin genes shows multiple hypoxia inducible factor (HIF) binding sites, which dramatically induce the expression of hemoglobins when daphnids are exposed to hypoxia (Kimura et al. 1999; Gorr et al. 2004). The extremely patchy distribution of these proteins across the animal phylogeny can be partially understood by the different biochemical properties of their oxygen-binding domains and the changing physiological needs of each particular animal lineage, which make one or the other respiratory protein more effective in their function as oxygen carriers.

Recent studies show that many enzymatic genes have complex evolutionary histories, with massive gene losses in most of the eukaryote genomes sampled, but retention in certain tips of the tree of life (Allen et al. 2011; de Mendoza and Ruiz-Trillo 2011; Stairs et al. 2011; Attenborough et al. 2012). In contrast, transcription factors, signaling pathways, and adhesion molecules, for instance, can be traced back in a congruent phylogenetic pattern (Pang et al. 2010; Sebé-Pedrós et al. 2010; Srivastava et al. 2010; Sebé-Pedrós et al. 2011). In some cases, the patchy phylogenetic distribution observed in enzymatic families could be explained by multiple events of LGT, although the phylogenetic signal is often not strong enough. Together with gene structure and synteny analysis we do not find strong evidences of LGT, with the exception of plant tyrosinases (see above). Moreover, the study of the evolution of respiratory proteins emerges as an ideal model to study the interplay between molecular evolution, biochemical constraints, and physiological-ecological needs.

Materials and Methods

All potential hemerythrin, hemocyanin, and tyrosinase sequences were identified by HMMER searches against the Protein, Genome, and EST databases at the NCBI (National Center for Biotechnology Information) and against completed genome/transcriptome projects databases publicly available or that are being conducted in our laboratories (sequences available in supplementary file S1, Supplementary Material online) with the default parameters and an inclusive E-value of 0.05. The retrieved sequences were aligned using MAFFT (Katoh et al. 2002) L-INS-i algorithm, and then manually inspected to remove those hits fulfilling one of the following conditions: 1) incomplete sequences with >99% sequence identity to a complete sequence from the same taxa; 2) sequences that showed extremely long branches in the preliminary maximum likelihood trees; and 3) incorrect gene model predictions. The final alignment was carried out using the MAFFT G-INS-i algorithm (for global homology). Maximum likelihood (ML) phylogenetic trees were estimated by RaxML (Stamatakis 2006) and the best tree from 100 replicates was selected. Bootstrap support was calculated from 1,000 replicates. Bayesian inference analyses were performed with PhyloBayes (Lartillot and Philippe 2004), using two parallel runs for 500,000 generations and sampling every 100. Bayesian posterior probabilities (BPP) were used for assessing the statistical support of each bipartition. The domain architecture of all retrieved sequences was inferred by performing a Pfam scan with the gathering threshold as cut-off value. The domain information was used to assess the reliability of each sequence of the initial dataset, to help define protein families according to their architectural coherence, and to assess the level of functional and structural diversification of hemerythrins, hemocyanins, and tyrosinases across the eukaryote lineages.

Supplementary Material

Supplementary files S1, tables S1 and S2 are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).

Acknowledgments

The authors thank J. Ryan for his advice and M. leidyi sequences, members of the S9 lab for collecting material and discussion, G. Richards for reading and improving the manuscript, G. Torruella and X. Grau-Boved for their help, and J. Achatz for his support. This work was funded by the Sars International Centre for Marine Molecular Biology to J.M.M.-D. and A.H., and an ICREA contract, an European Research Council Starting Grant (ERC-2007-StG-206883), and a grant (BFU2011-23434) from Ministerio de Economía y Competitividad (MINECO) to I.R.-T. A.S.-P.’s salary was supported by a pregraduate FPU grant and A.d.M.’s salary from a FPI grant, both from MICINN.

Literature Cited

Allen AE, et al. 2011. Evolution and metabolic significance of the urea cycle in photosynthetic diatoms. Nature 473:203–207.
Attenborough RM, Hayward DC, Kitahara MV, Miller DJ, Ball EE. 2012. A “neural” enzyme in nonbilaterian animals and algae: preneural origins for peptidylglycine alpha-amidating monooxygenase. Mol Biol Evol. 29:3095–3109.
Baert JL, Britel M, Sautiere P, Malecha J. 1992. Ovohemerythrin, a major 14-kDa yolk protein distinct from vitellogenin in leech. Eur J Biochem. 209:563–569.
Bailly X, Vanin S, Chabasse C, Mizuguchi K, Vinogradov SN. 2008. A phylogenomic profile of hemerythrins, the nonheme diiron binding respiratory proteins. BMC Evol Biol. 8:244.
Blank M, Burmester T. 2012. Widespread occurrence of N-terminal acylation in animal globins and possible origin of respiratory globins from a membrane-bound ancestor. Mol Biol Evol. 29:3553–3561.

Martín-Durán et al. GBE

Bonaventura J, Bonaventura C. 1980. Hemocyanins: relationships in their structure, function and assembly. Am Zool. 20:7–17.
Burmester T. 2001. Molecular evolution of the arthropod hemocyanin superfamily. Mol Biol Evol. 18:184–195.
de Mendoza A, Ruiz-Trillo I. 2011. The mysterious evolutionary origin for the GNE gene and the root of bilateria. Mol Biol Evol. 28:2987–2991.
Demuynck S, Li KW, Van der Schors R, Dhainaut-Courtois N. 1993. Amino acid sequence of the small cadmium-binding protein (MP II) from Nereis diversicolor (annelida, polychaeta). Evidence for a myohemerythrin structure. Eur J Biochem. 217:151–156.
Esposito R, et al. 2012. New insights into the evolution of metazoan tyrosinase gene family. PLoS One 7:e35731.
French CE, Bell JM, Ward FB. 2008. Diversity and distribution of hemerythrin-like proteins in prokaryotes. FEMS Microbiol Lett. 279:131–145.
Gorr TA, Cahn JD, Yamagata H, Bunn HF. 2004. Hypoxia-induced synthesis of hemoglobin in the crustacean Daphnia magna is hypoxia-inducible factor-dependent. J Biol Chem. 279:36038–36047.
Immesberger A, Burmester T. 2004. Putative phenoloxidases in the tunicate Ciona intestinalis and the origin of the arthropod hemocyanin superfamily. J Comp Physiol B. 174:169–180.
Isaza CE, Silaghi-Dumitrescu R, Iyer RB, Kurtz DM Jr, Chan MK. 2006. Structural basis for O2 sensing by the hemerythrin-like domain of a bacterial chemotaxis protein: substrate tunnel and fluxional N terminus. Biochemistry 45:9023–9031.
Karlsen OA, et al. 2005. Characterization of a prokaryotic hemerythrin from the methanotrophic bacterium Methylococcus capsulatus (Bath). FEBS J. 272:2428–2440.
Katoh K, Misawa K, Kuma K, Miyata T. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059–3066.
Kimura S, Tokishita S, Ohta T, Kobayashi M, Yamagata H. 1999. Heterogeneity and differential expression under hypoxia of two-domain hemoglobin chains in the water flea, Daphnia magna. J Biol Chem. 274:10649–10653.
Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol. 21:1095–1109.
Lieb B, Todt C. 2008. Hemocyanin in mollusks: a molecular survey and new data on hemocyanin genes in Solenogastres and Caudofoveata. Mol Phylogenet Evol. 49:382–385.
Lieb B, et al. 2006. Red blood with blue-blood ancestry: intriguing structure of a snail hemoglobin. Proc Natl Acad Sci U S A. 103:12011–12016.
Meyer A, Lieb B. 2010. Respiratory proteins in Sipunculus nudus: implications for phylogeny and evolution of the hemerythrin family. Comp Biochem Physiol B Biochem Mol Biol. 155:171–177.
Pang K, et al. 2010. Genomic insights into Wnt signaling in an early diverging metazoan, the ctenophore Mnemiopsis leidyi. Evodevo 1:10.
Roesner A, Fuchs C, Hankeln T, Burmester T. 2005. A globin gene of ancient evolutionary origin in lower vertebrates: evidence for two distinct globin families in animals. Mol Biol Evol. 22:12–20.
Salahudeen AA, et al. 2009. An E3 ligase possessing an iron-responsive hemerythrin domain is a regulator of iron homeostasis. Science 326:722–726.
Schmidt-Rhaesa A. 2007. The evolution of organ systems. Oxford: Oxford University Press.
Sebé-Pedrós A, de Mendoza A, Lang BF, Degnan BM, Ruiz-Trillo I. 2011. Unexpected repertoire of metazoan transcription factors in the unicellular holozoan Capsaspora owczarzaki. Mol Biol Evol. 28:1241–1254.
Sebé-Pedrós A, Roger AJ, Lang FB, King N, Ruiz-Trillo I. 2010. Ancient origin of the integrin-mediated adhesion and signaling machinery. Proc Natl Acad Sci U S A. 107:10142–10147.
Srivastava M, et al. 2010. The Amphimedon queenslandica genome and the evolution of animal complexity. Nature 466:720–726.
Stairs CW, Roger AJ, Hampl V. 2011. Eukaryotic pyruvate formate lyase and its activating enzyme were acquired laterally from a Firmicute. Mol Biol Evol. 28:2087–2099.
Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688–2690.
Sugumaran M. 2002. Comparative biochemistry of eumelanogenesis and the protective roles of phenoloxidase and melanin in insects. Pigment Cell Res. 15:2–9.
Terwilliger NB, Ryan M. 2001. Ontogeny of crustacean respiratory proteins. Am Zool. 41:1057–1067.
Terwilliger NB. 1998. Functional adaptations of oxygen-transport proteins. J Exp Biol. 201:1085–1098.
Thompson JW, et al. 2012. Structural and molecular characterization of iron-sensing hemerythrin-like domain within F-box and leucine-rich repeat protein 5 (FBXL5). J Biol Chem. 287:7357–7365.
van Holde KE, Miller KI, Decker H. 2001. Hemocyanins and invertebrate evolution. J Biol Chem. 276:15563–15566.
Vanin S, et al. 2006. Molecular evolution and phylogeny of sipunculan hemerythrins. J Mol Evol. 62:32–41.
Vashisht AA, et al. 2009. Control of iron homeostasis by an iron-regulated ubiquitin ligase. Science 326:718–721.
Vergote D, et al. 2004. Up-regulation of neurohemerythrin expression in the central nervous system of the medicinal leech, Hirudo medicinalis, following septic injury. J Biol Chem. 279:43828–43837.
Weber RE, Vinogradov SN. 2001. Nonvertebrate hemoglobins: functions and molecular adaptations. Physiol Rev. 81:569–628.
Xiong J, Kurtz DM Jr, Ai J, Sanders-Loehr J. 2000. A hemerythrin-like domain in a bacterial chemotaxis protein. Biochemistry 39:5117–5125.

Associate editor: Emmanuelle Lerat
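The first curation criterion listed in Materials and Methods (removing incomplete sequences with >99% identity to a complete sequence from the same taxon) can be illustrated with a short sketch. This is not the authors' procedure, which was a manual inspection; the `AlignedSeq` record, its `complete` flag, and both function names are hypothetical, and identity is computed here only over alignment columns where the fragment has a residue, which is one of several reasonable conventions.

```python
from dataclasses import dataclass


@dataclass
class AlignedSeq:
    name: str
    taxon: str
    aligned: str    # one row of a multiple sequence alignment, '-' marks a gap
    complete: bool  # hypothetical flag: full-length sequence vs. fragment


def identity(fragment: str, reference: str) -> float:
    """Fraction of identical residues over columns where the fragment is not a gap."""
    pairs = [(a, b) for a, b in zip(fragment, reference) if a != "-"]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def drop_redundant_fragments(seqs, threshold=0.99):
    """Keep every sequence except fragments that exceed the identity threshold
    against a complete sequence from the same taxon."""
    kept = []
    for s in seqs:
        redundant = (not s.complete) and any(
            r.complete
            and r.taxon == s.taxon
            and identity(s.aligned, r.aligned) > threshold
            for r in seqs
        )
        if not redundant:
            kept.append(s)
    return kept


# Toy alignment rows (invented data, for illustration only).
example = [
    AlignedSeq("full", "taxon_A", "MKTAYIAKQR", True),
    AlignedSeq("frag_same", "taxon_A", "--TAYIAK--", False),
    AlignedSeq("frag_other_taxon", "taxon_B", "--TAYIAK--", False),
    AlignedSeq("frag_divergent", "taxon_A", "--TAYLAK--", False),
]
kept_names = [s.name for s in drop_redundant_fragments(example)]
```

In this toy case only the fragment that is identical to a complete sequence from its own taxon is discarded; the fragment from another taxon and the divergent fragment are retained. The other two criteria (long branches and incorrect gene models) depend on tree inspection and gene annotation and are not sketched here.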

SUMMARY IN SPANISH

Title: Comparative genomics at the origins of Metazoa

INTRODUCTION

Major evolutionary transitions are defined by a leitmotiv: units at a lower level of organization assemble into higher-level units, forming more complex entities (Maynard-Smith & Szathmáry, 1995; Michod, 2000). Biological systems are defined by their hierarchical structure, so to understand their evolutionary history we must understand the relationships between the different levels of organization. Genes are part of chromosomes, chromosomes of genomes, genomes of cells, and cells of multicellular organisms. In this thesis I have studied the transition from unicellular protists to multicellular animals, taking into account the different levels of organization involved. Using information obtained from a comparative genomics perspective, I have tried to connect evolution from the gene level to the phenotypic level.

Types of multicellularity and their multiple origins

Multicellularity is a form of cellular life that involves the coexistence of several cells within a single organism. However, this general definition does not capture the full diversity of organizations that the term covers. There are two ways of acquiring multicellularity: clonal division and aggregation (Bonner, 2001). Clonal division implies that cells remain attached after dividing, so the multicellular organism is made up of genetically identical cells. Aggregation, in contrast, implies that cells with different genotypes come together to form a multicellular organism. Within these two routes there are also different degrees of organization, ranging from simple colonies (with a single cell type) to complex multicellularity (in which different cell types specialize in different functions). Not all multicellularities are therefore equal, and neither are their requirements and properties.


Consequently, we find multicellular life forms in different branches of the tree of life. Among prokaryotes, in both eubacteria and archaea, there are aggregative as well as clonal multicellular species (Fisher, Cornwallis, & West, 2013). But it is the eukaryotes that show the greatest diversity of multicellular lineages, which is attributed to the flexibility of their cytoskeleton and to the bioenergetics fuelled by endosymbiotic mitochondria and plastids (Knoll, 2011; Lane & Martin, 2010). To date, at least 26 independent acquisitions of multicellularity have been reported in eukaryotes (Brown, Kolisko, Silberman, & Roger, 2012; Grosberg & Strathmann, 2007; King, 2004). Under a stricter definition of complex multicellularity there are at least seven lineages with independent origins: metazoans, plants, red algae (both bangiales and florideophytes), brown algae, and fungi (some clades within ascomycetes and basidiomycetes) (Knoll, 2011; Niklas & Newman, 2013). Of these, only plants and animals have embryonic development. In sum, the multiple origins of multicellularity imply that it was not a unique event in the history of life but a recurrent transition (Grosberg & Strathmann, 2007).

In fact, the age of the fossil remains and the inferred times of origin of multicellular eukaryotic lineages vary widely between groups (Knoll, 2011). The first red algae are found in sediments 1,200 million years old (Butterfield, 2000), whereas animals do not appear until 600-800 million years ago (Erwin et al., 2011; Knoll, 2011). The first plants are found in rocks 400 million years old, and multicellular fungi do not appear until 350 million years ago (Knoll, 2011; Stajich et al., 2009). Finally, brown algae are only 30 million years old (Cock et al., 2010). The recurrence of the origin of multicellularity, and the fact that it arose at such different times, suggests that it probably has considerable adaptive value.

Adaptive advantages and constraints of multicellularity

The transition from a unicellular to a multicellular lifestyle is a major change, since natural selection ceases to operate at the level of individual cells and instead selects the group as a whole (Michod, 2000). Two schools interpret this phenomenon in opposing ways. The neutralist school suggests that the transition occurs simply through genetic drift and emergent properties (Lynch & Conery, 2003; Newman & Bhat, 2009), whereas the adaptationist school assigns adaptive value to the initial phases of the transition to multicellularity (Grosberg & Strathmann, 2007; King, 2004; Michod, 2000).

Among the adaptationist hypotheses explaining the origin of multicellularity is the one summarized as “too big to be eaten”. It proposes that in ancestral seas phagocytosis was the only form of predation, so acquiring a larger size would improve survival. Multicellularity would thus have evolved as a purely defensive trait. In an experiment with a unicellular alga (Chlorella vulgaris) and its predator (the flagellate Ochromonas vallescia), the alga adopted multicellular forms after several generations of coexistence in the same culture (Boraas, Seale, & Boxhorn, 1998). Moreover, even when the predator was removed from the culture, the multicellular forms of Chlorella persisted over the generations (under photosynthetically favourable conditions). Even so, an increase in size can also be achieved by other processes, such as the multiplication of nuclei in a syncytium. Large amoebae, such as myxomycetes, and even giant bacteria such as Thiomargarita namibiensis have opted for that kind of solution (Lane & Martin, 2010).

Another view is based on the theory of “common goods”. Certain metabolic or physiological processes become easier in multicellular formations, even when these consist simply of undifferentiated cells. As in the previous case, there are experimental data supporting this theory. Yeasts (unicellular fungi) have trouble taking up fructose and glucose when the sucrose concentration in the medium is low. They can develop different strategies to solve this problem (through gene duplication or overexpression of endogenous genes), all of them tested experimentally. But if a unicellular stock culture is grown at low sucrose concentrations for several generations, multicellular lines are eventually obtained (Koschwanez, Foster, & Murray, 2013). This behaviour arose in 10 replicates, and the resulting multicellular colonies formed through defects in cell separation after mitosis.

These approaches aim to explain the first steps of an undifferentiated multicellularity, in which all cells are exactly alike. But cell differentiation brings substantial improvements. For example, a eukaryotic cell faces a structural trade-off: it either divides or moves, because its microtubule organizing centre can anchor only the flagellum or the mitotic spindle at any one time. In a multicellular organism this trade-off is resolved, since some cells can divide while others carry on with locomotion (King, 2004; Margulis, 1981). Cell specialization also makes it possible to combine mutually exclusive metabolic processes. Cyanobacteria need low oxygen concentrations to fix atmospheric nitrogen, so in general they photosynthesize by day and fix nitrogen by night. But some multicellular cyanobacteria have evolved a specialized cell type, the heterocyst. Heterocysts are poorly permeable to oxygen and can fix nitrogen all day while the neighbouring cells supply them with the nutrients they need. These species have also evolved another cell type, the akinete, specialized for resistance and reproduction. Cell differentiation is thus what expands the possibilities of multicellular organisms, allowing new levels of efficiency and opening up unexplored ecological niches.

Even so, evolving multicellularity is not an easy path, because new trade-offs arise that did not exist before. The most important is the differentiation between the somatic line and the germ line. The germ line is responsible for producing offspring, while the somatic line supports the germ line. This can generate genetic conflicts in aggregative multicellular associations. The classic example is the social amoeba Dictyostelium discoideum, which belongs to the amoebozoans. These amoebae generally live as single cells feeding on bacteria in the soil. But when conditions are unfavourable and food is scarce, cells with different genotypes aggregate into a large multicellular “slug”. The slug moves faster than individual cells and explores in search of a suitable site. There it transforms into a fruiting body, also multicellular, that rises into the air so that the resulting spores disperse more effectively (Bonner, 2001; Grosberg & Strathmann, 2007). This structure contains two cell types: those that form the stalk and those that become spores. In most cases, which of the two types a cell becomes is random. But there are certain “selfish” lines that always form the spores. Interestingly, if the “selfish” lines are cultured on their own, they never manage to form the multicellular structure (Bonner, 2001).


Clonal multicellular forms resolve this conflict because, in general, all the cells of the organism derive from the same original cell after several mitotic divisions and therefore share the same genomic information. In addition, most multicellular lineages pass through a bottleneck in the form of an obligatory unicellular stage. This also lets them benefit from meiosis and sexuality, since only at a unicellular stage can two gametes fuse to form a zygote (Grosberg & Strathmann, 2007). All forms of complex multicellularity therefore rest on clonal multicellularity, whereas aggregative forms are usually transient (and sometimes dispensable) phases of the life cycle of the species that display them (Fisher et al., 2013). But clonal division also brings problems, since mutation is inherent even to mitosis. Mechanisms such as apoptosis must therefore exist to control these errors. Nor is this the only function a multicellular organism has to take care of: a whole set of requirements is unavoidable in the process.

The multicellularity toolkit

For a multicellular organism to arise, a set of properties must be in place. These are the basic tools for the evolution of multicellularity: adhesion, intercellular communication, and regulation of differentiation/proliferation.

Adhesion is inherent to multicellularity. If two cells remain attached after dividing, something must be mediating that function. The types of molecules used by the different multicellular lineages vary considerably. For example, lineages that already have sugar-based cell walls, such as fungi and plants, do not need major innovations to form multicellular structures. What is the cell wall in unicellular species is simply extended in multicellular species to enclose more than one cell within the same space. The laboratory experiments on the evolution of multicellularity described above, which generally need few generations to yield colonies, have only been observed in this type of organism (Boraas et al., 1998; Koschwanez et al., 2013; Ratcliff, Denison, Borrello, & Travisano, 2012). The volvocine algae are another good example of this pattern. This lineage ranges from unicellular species such as Chlamydomonas reinhardtii to species with complex colonies such as Volvox carteri (which has at least two cell types). The extracellular matrix that surrounds all the cells of the colony in the colonial species has been shown to be homologous to the cell wall of the unicellular species. In fact, the genomic content of both species is very similar, so no major innovations were needed to evolve adhesion (Abedin & King, 2010). Other lineages with naked cells, that is, cells lacking a cell wall, need proteins to mediate adhesion. This is the case for animals and social amoebae. Animals have a large repertoire of adhesion structures, from hemidesmosomes to focal adhesions, each composed of different proteins (Magie & Martindale, 2008). Many of the proteins involved in adhesion also have a signalling function, since adhesion itself is crucial information for cells within an organism.

Signalling is a process common to unicellular and multicellular organisms. Unicellular organisms have to sense signals from their environment and act accordingly, for instance to find mates or prey, and they can also communicate with other cells of the same species to respond to stimuli in a coordinated way (Crespi, 2001). But multicellular organisms undoubtedly need a more complex signalling system, since all the cells of the organism have to be informed at all times in order to coordinate their processes (King, 2004; Rokas, 2008). Every multicellular lineage has evolved its own cell signalling pathways, specifically adapted to its context. Plants have to deal with cell walls, so their signalling pathways rely on permeable hormones or simple pores for direct cell-to-cell communication (Sparks, Wachsman, & Benfey, 2013). Metazoans, in turn, can use simple heterodimeric proteins for proximal contacts, while they need peptides that travel through the extracellular matrix to communicate between distant tissues (Gerhart, 1999; Pires-daSilva & Sommer, 2003). Intercellular communication always leads to signal transduction cascades that, one way or another, change the state of the cell, very probably initiating its differentiation.

Cell differentiation is the process by which cells with the same genotype can have completely different forms and functions, because only part of their genome is active. Differentiation is therefore intimately linked to the different layers of regulation, from transcription to post-translational modifications. These capabilities are by no means unique to multicellular organisms, but integrating several of these regulatory layers into a complex developmental programme, in which processes are hierarchical and there are hundreds of divisions and cell types, implies greater complexity of regulation overall (Levine & Tjian, 2003).

A special case of regulation is the one that inhibits proliferation. As seen above, conflicts can arise between different groups of cells in multicellular organisms. Inhibition of proliferation thus emerges as one of the essential functions for maintaining the stability of a multicellular entity. Cancers are exactly that: groups of cells that escape the mechanisms controlling proliferation. Indeed, many of the genes involved in carcinogenesis appeared concomitantly with multicellularity (Tomislav Domazet-Lošo & Tautz, 2010).

To really understand how and when these functions developed during the transition to multicellularity, we have only two lines of evidence: searching the fossil record or comparing present-day diversity.

Animal evolution: phylogeny, morphology and fossils

To understand what the ancestors of animals were like, we must first have a clear idea of their phylogenetic affinities. Although the animal tree of life is still much debated, the recent literature is beginning to reveal a basic structure with certain specific questions left to resolve (Edgecombe et al., 2011). The new methods use molecular markers to infer rates of change and divergence but, unlike earlier approaches, they now use hundreds or thousands of markers at once. This method, phylogenomics, carries certain risks, because the automated data selection it requires is not yet a well-established technique, which facilitates systematic errors that receive high support in probabilistic analyses (Philippe et al., 2011; Roure, Baurain, & Philippe, 2013). With this in mind, we review the phylogenetic positions of the earliest-diverging animal phyla. The great majority of animal diversity is found in the bilaterians, animals with bilateral symmetry, as their name indicates, which are characterized by three germ layers, the presence of a mouth and anus, and apparent cephalization in some cases (Nielsen, 2008). They are monophyletic, and almost all studies place them as the most derived group within animals. The non-bilaterians therefore diverged earlier and comprise four phyla: cnidarians, ctenophores, placozoans, and sponges.

Cnidarians are corals and jellyfish, and have only two germ layers (although there is some disagreement on this point). They have a nervous system, musculature, and a digestive system with a single opening (Nielsen, 2008). They are radially symmetrical, although this too has recently been questioned (Matus, Pang, et al., 2006). Most studies place cnidarians as the sister group of bilaterians, forming the clade Eumetazoa (Dunn et al., 2008; Hejnol et al., 2009; Matus, Copley, et al., 2006; Philippe et al., 2009, 2011; Pick et al., 2010; Torruella et al., 2012).

Next comes the monospecific phylum Placozoa. Its only described species, Trichoplax adhaerens, is made up of only 5 cell types, with two layers and no apparent symmetry (Nielsen, 2008; Srivastava et al., 2008). Its position is more variable, but recent studies using more complex evolutionary models (such as the PhyloBayes CAT model) tend to place it as the sister group of eumetazoans (Philippe et al., 2009; Pick et al., 2010; Srivastava et al., 2008; Torruella et al., 2012).

Ctenophores had always been grouped with cnidarians in the superphylum known as coelenterates (Nielsen, 2008; Philippe et al., 2009). They show a high degree of morphological complexity, with a nervous system, sensory organs, musculature (supposedly striated), and radial symmetry (Dunn et al., 2008; Nielsen, 2008). But some studies place ctenophores as the first lineage to diverge from the animal line, which would imply either that they evolved all this complexity independently or that the ancestor of all animals already had all these features and that placozoans and sponges are secondarily simplified (Dunn et al., 2008; Hejnol et al., 2009; Nosenko et al., 2013). Recent reanalyses indicate that the basal position of ctenophores within animals is no more than a methodological problem caused by missing data and the high evolutionary rate of the genes of these organisms (Roure et al., 2013).

Finally there are the sponges, which for the great majority of the community represent the lineage that separated earliest from the rest of the animals. This is not only well supported by molecular data; the earliest fossils found are also from sponges. 24-isopropylcholestanes (derived from C30 sterols typical of demosponges) are detected in sediments 635 million years old (Love et al., 2009), while a sponge-like fossil found in Australia dates to about 660 million years ago (Maloof et al., 2010). All this makes sense, since sponges have a very simple body plan compared with other animals, and are modular and generally lack any kind of symmetry. Their most characteristic cell type is the choanocyte, a flagellated cell with a collar of microvilli that serves to trap the bacteria and other organic particles on which the sponge feeds. They also have pluripotent amoeboid cells (archaeocytes) and the pinacoderm, a group of cells that protects the choanocyte chambers and that some authors consider homologous to the epidermal epithelia of other animals (Adamska, Degnan, Green, & Zwafink, 2011; Leys & Riesgo, 2012; Maldonado, 2004).
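The relationships described in this section can be written down as a small data structure, which makes the sister-group statements easy to check. The sketch below is only an illustration: the nested-tuple encoding and the function names are hypothetical, and sponges and ctenophores are deliberately left unresolved at the root, a simplification chosen here because the text reports disagreement about which lineage diverged first.

```python
# Rooted tree as nested tuples; leaves are strings.
# Cnidaria + Bilateria form Eumetazoa; Placozoa is placed as their sister group.
EUMETAZOA = ("Cnidaria", "Bilateria")
TREE = ("Porifera", "Ctenophora", ("Placozoa", EUMETAZOA))


def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def clades(node):
    """Return every internal clade of the tree as a set of leaf names."""
    if isinstance(node, str):
        return set()
    found = {leaves(node)}
    for child in node:
        found |= clades(child)
    return found


def sisters_of(node, target):
    """Return the leaf sets of the siblings of the subtree whose leaves equal
    `target`, or None if no such subtree exists."""
    if isinstance(node, str):
        return None
    child_sets = [leaves(child) for child in node]
    if target in child_sets:
        return [s for s in child_sets if s != target]
    for child in node:
        result = sisters_of(child, target)
        if result is not None:
            return result
    return None


cnidarian_sisters = sisters_of(TREE, frozenset({"Cnidaria"}))
placozoan_sisters = sisters_of(TREE, frozenset({"Placozoa"}))
```

With this encoding, the sister of cnidarians is the bilaterians and the sister of placozoans is the eumetazoan clade, while the coelenterate grouping (ctenophores plus cnidarians) does not appear among the clades. Encoding an alternative hypothesis, such as ctenophores diverging first, only requires changing `TREE`.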

Animal genomes and the toolkit of embryonic development

Advances in developmental genetics identified genes conserved between different model species, such as the mouse and the fruit fly. These genes were conserved not only in structure but also in function. Thus, the mouse Pax6/Eyeless gene could replace the function of the endogenous fly gene (Halder, Callaerts, & Gehring, 1995). Many more cases were discovered, which allowed certain functions, such as heart or eye development, to be associated with groups of genes conserved at least across all bilaterians (since the common ancestor of vertebrates and insects is also the common ancestor of all bilaterians). The first genomes confirmed this idea, showing a large number of genes shared among different animals and absent from plants and fungi (Koonin et al., 2004). But it remained unclear whether non-bilaterian animals also had these genes, since the morphological evidence indicated that they were simpler. The genome of the cnidarian Nematostella vectensis defied expectations and showed deep genomic conservation; in fact, its genome was more similar to that of vertebrates than were those of the fly or the worm C. elegans (Putnam et al., 2007). Important developmental genes, such as the WNT genes, were widely diversified in the N. vectensis genome and were also expressed in a way very similar to their bilaterian orthologues (Kusserow et al., 2005). The genomes of the other non-bilaterian lineages provided evidence pointing the same way: the genome of Trichoplax adhaerens has almost all the animal-specific signalling pathways and many transcription factors, as does the genome of the sponge Amphimedon queenslandica (Degnan, Vervoort, Larroux, & Richards, 2009; Larroux et al., 2007; Srivastava et al., 2008, 2010). Moreover, they contained genes associated with cell types that these animals lack, such as the complement of genes involved in neuron formation and many genes important for muscles (Alié & Manuel, 2010; Sakarya et al., 2007; Steinmetz et al., 2012). Studies based on in situ hybridization of these genes show that they are expressed during sponge development, which implies that their ancestral function was already to regulate animal embryogenesis (Adamska et al., 2007, 2011; Fortunato et al., 2012; Richards et al., 2008). The set of developmental genes is therefore common to all animals, and is the language through which animal multicellularity is governed. The question that remained was: are these genes unique to animals? To answer it, one had to look at the genomes of their closest unicellular relatives.

The unicellular relatives of animals: the holozoans

Choanoflagellates are a group of heterotrophic protists with a structure very similar to that of sponge choanocytes (Cavalier-Smith, 2012; Karpov & Leadbeater, 1998; King, 2004; Maldonado, 2004). In fact, since the nineteenth century they have been thought to be related to the animal kingdom, as argued by Kent and Haeckel (King, 2004). A choanoflagellate cell has a flagellum at the posterior end (as defined by the direction of locomotion) that beats to trap bacterial prey, surrounded by a basket of microvilli (Karpov & Leadbeater, 1998). Choanoflagellates are found in fresh and marine waters across the planet and play important ecological roles at the microscopic levels of food webs (Carr, Leadbeater, Hassan, Nelson, & Baldauf, 2008; del Campo & Ruiz-Trillo, 2013). They show considerable morphological diversity, some bearing siliceous loricae and others thecae, but most intriguing is that there are quite a few colonial forms (Carr et al., 2008). Although it is not clear that colony formation in choanoflagellates is homologous to animal development, studies in the species Salpingoeca rosetta have yielded interesting results. Colonies form by non-synchronous clonal division (S. Fairclough, Dayel, & King, 2010), in response to a sulfonolipid secreted by a bacterium that is normally prey of the choanoflagellate (Alegado et al., 2012). In addition, the colony forms through incomplete cell divisions, and the cytoplasmic bridges connecting the cells are based on septins, which are also involved in similar structures in other eukaryotes (Dayel et al., 2011). The molecular study of this species is very promising, since its genome has been sequenced and different genes have been shown to be expressed in the different phases of its life cycle (S. R. Fairclough et al., 2013).

Choanoflagellates, together with animals and fungi, form the eukaryotic supergroup known as the opisthokonts (Adl et al., 2005; Cavalier-Smith, 2012; Shalchian-Tabrizi et al., 2008). Within the opisthokonts, however, there are other, less well-known lineages that molecular phylogeny has shown to belong to this group. Holozoans are all those organisms that are close relatives of animals and branched after the origin and divergence of the fungal lineage. Among these are two well-described lineages, the filastereans and the ichthyosporeans. Even so, environmental samples reveal that there are other, still unknown groups within the opisthokonts (del Campo & Ruiz-Trillo, 2013).

The filastereans comprise only two species, Capsaspora owczarzaki and Ministeria vibrans. Both are unicellular amoebae with long actin-based filopodia (Paps, Medina-Chacón, Marshall, Suga, & Ruiz-Trillo, 2013; Paps & Ruiz-Trillo, 2010; Shalchian-Tabrizi et al., 2008; Torruella et al., 2012). Capsaspora owczarzaki was found in the haemolymph of a snail, probably feeding on sporocysts of the parasite Schistosoma mansoni (Hertel, Bayne, & Loker, 2002). Ministeria vibrans, in contrast, lives in the North Sea, feeding on bacteria that it captures with its filopodia (Shalchian-Tabrizi et al., 2008). The available phylogenomic analyses indicate that these are the organisms closest to the clade formed by choanoflagellates and metazoans (Shalchian-Tabrizi et al., 2008; Torruella et al., 2012).

The most basal group within the holozoans is the ichthyosporeans (Paps et al., 2013; Torruella et al., 2012). They are characterized by a life cycle that includes a colonial phase. The cysts generally grow syncytially or by palintomic cell divisions until they reach a point of maturation at which the colony cellularizes and bursts, releasing the progeny as spores, which are sometimes flagellated or amoeboid (Glockling, Marshall, & Gleason, 2013; Suga & Ruiz-Trillo, 2013). Many of them are parasites of animals or live associated with them, inhabiting their digestive systems (Glockling et al., 2013; Paps & Ruiz-Trillo, 2010). Most are osmotrophs and can be grown in the laboratory under axenic conditions. Finally there is the case of Corallochytrium limacisporum. This organism is very similar to the ichthyosporeans but lives on coral reefs, decomposing organic matter. The latest phylogenomic analyses indicate that it belongs to the ichthyosporeans, although certainly to a basal lineage (Guifré Torruella, personal communication; Sumathi, Raghukumar, Kasbekar, & Raghukumar, 2006).

The holozoan genome and the molecular prehistory of metazoans

The first genome of a unicellular organism closely related to animals was that of the choanoflagellate Monosiga brevicollis (King et al., 2008). That pioneering study found many genes important for multicellularity in the choanoflagellate genome; perhaps the most notable were the cadherins, key proteins in animal cell adhesion (Abedin & King, 2008). Receptor tyrosine kinases, one of the major signalling systems in animal embryonic development, were also found (King et al., 2008; Manning, Young, Miller, & Zhai, 2008; Pincus, Letunic, Bork, & Lim, 2008). But many other genes were absent, including the great majority of transcription factors and other signalling pathways. The picture was not clear-cut, however, because many animal genes turned out to be built from parts of genes already present in choanoflagellates, that is, from ancestral protein domains rearranged into new forms during the origin of multicellularity (King et al., 2008; Rokas, 2008). This evolutionary process is known as domain shuffling, since it involves the reuse of old elements to create new ones.

But, as recent genomic data indicate, secondary gene loss is rather common in eukaryotic evolution (Wolf & Koonin, 2013). It was therefore decided to deepen our genomic knowledge of unicellular holozoans, in order to uncover histories that might have gone unnoticed when relying on a single genome. Thus the UNICORN initiative was born, in which laboratories from different countries joined forces with the Broad Institute (at MIT, Boston) to sequence several of these species (Ruiz-Trillo et al., 2007). Pilot projects based on ESTs (cloned cDNA) had already produced some good results in this respect. The ESTs of Capsaspora owczarzaki had revealed the presence of a MAGI, a protein involved in animal cell adhesion, as well as Fascin, involved in cell migration (Ruiz-Trillo, Roger, Burger, Gray, & Lang, 2008). Tyrosine kinases and putative integrins also appeared in the ESTs of Ministeria vibrans (Shalchian-Tabrizi et al., 2008). The UNICORN initiative included several of these species and some more distant ones, such as the apusozoan Thecamonas trahens, sister group to the opisthokonts (Derelle & Lang, 2012; Torruella et al., 2012), and the aggregative amoeba Fonticula alba, sister group to the fungi within the holomycotan lineage (Brown et al., 2012; Brown, Spiegel, & Silberman, 2009).

During my thesis I have analysed the genomes of these species in order to better understand the evolution of the animal genome, and thus to understand which adaptive pressures and innovations were necessary for an evolutionary transition of this magnitude to take place.

OBJECTIVES

The recent genome sequences of species closely related to animals allow us to tackle the question of animal origins from a radically new perspective, the protistological one. Using comparative genomics techniques, we have established which characters are shared between animals and their unicellular cousins and clarified which were innovations specific to animals. My study addressed two specific objectives:

- To analyse the evolutionary history of genes involved in functions intimately related to animal multicellularity, among them the transcription factors and signalling pathways that are fundamental to embryonic development.
- To analyse the gene content of the genome of the filasterean Capsaspora owczarzaki and to unravel its functions, its similarities to animals and its genomic structure.

DISCUSSION AND RESULTS

A genomic panorama at the frontier of multicellularity

Analyses predating those presented in this study already indicated that much of the animal gene content is ancestral. A pioneering study compared the first genomes of yeasts and plants with those of animals (Koonin et al., 2004). Eukaryotes shared about 3000 orthologous genes, whereas animals had evolved only some 1500 shared genes of their own relative to other eukaryotes. This first approximation was strongly affected by a small and unrepresentative sampling of eukaryotic and animal diversity. Later studies, drawing on new genomic data, found similar patterns, such as the systematic survey of the evolutionary origin of all fruit fly genes, in which more than 50% of the genes had eukaryotic origins (T. Domazet-Lošo, Brajković, & Tautz, 2007). The Nematostella genome placed that proportion at 80% of the animal genome (Putnam et al., 2007). Gene innovation is therefore clearly not the main driving force of the evolutionary transition to multicellularity.

As François Jacob postulated, evolution does not work like an engineer, who designs new parts according to need. It works more like a tinkerer, using the parts already at hand to try to build new things, a repurposing of the available material (Jacob, 1977).

This is precisely the idea behind the domain-shuffling view, since most protein domains are already present in unicellular holozoans. As we have shown, the evolution of animals required little more than the acquisition of 235 protein domains out of the roughly 5000 found in animal genomes (Results R4). Moreover, when domain counts are examined in a phylogenetic perspective, they clearly mark the basic functions of animals, since the domains enriched in copy number are those involved in signalling, transcriptional regulation, apoptosis and adhesion (Results R4). But this is not all our work has revealed: a large share of the genes involved in animal development were already present in the unicellular ancestors, not merely as protein domains but as orthologous genes very similar in structure to their animal counterparts (Results R1, R2, R3, R4, R5 and R6). This supports an even more Jacobian view of the transition to multicellularity, since many of the parts needed for multicellularity underwent co-option, which required no innovation.
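The comparison behind the count of animal-specific domains can be pictured as a set difference: a domain is scored as an animal innovation if it occurs in at least one animal genome and in none of the sampled non-animal genomes. The Python sketch below is only an illustration under stated assumptions. The species sampling, the domain assignments and the function name animal_specific_domains are hypothetical toy values chosen for the example, not the thesis data nor necessarily the procedure used in Results R4.

```python
# Toy presence/absence data (hypothetical, for illustration only):
# each species maps to the set of protein domains detected in its genome.
domains_by_species = {
    "Homo sapiens": {"Cadherin", "Pkinase_Tyr", "Homeobox", "T-box", "Wnt"},
    "Amphimedon queenslandica": {"Cadherin", "Pkinase_Tyr", "Homeobox", "T-box", "Wnt"},
    "Capsaspora owczarzaki": {"Pkinase_Tyr", "Homeobox", "T-box"},
    "Monosiga brevicollis": {"Cadherin", "Pkinase_Tyr", "Homeobox"},
    "Saccharomyces cerevisiae": {"Homeobox"},
}
animals = {"Homo sapiens", "Amphimedon queenslandica"}


def animal_specific_domains(domains_by_species, animals):
    """Return domains present in some animal and absent from every non-animal."""
    animal_domains = set().union(
        *(doms for sp, doms in domains_by_species.items() if sp in animals)
    )
    outgroup_domains = set().union(
        *(doms for sp, doms in domains_by_species.items() if sp not in animals)
    )
    return animal_domains - outgroup_domains


innovations = animal_specific_domains(domains_by_species, animals)
print(sorted(innovations))

# Share of animal-specific domains using the figures quoted in the text.
share = 235 / 5000 * 100
print(f"{share:.1f}% of animal domains would be animal-specific")
```

With the figures quoted in the text, 235 out of roughly 5000 domains corresponds to a little under 5% of the animal domain repertoire. A caveat of this kind of scoring is that it depends on taxon sampling: adding a single unicellular genome that contains a domain removes it from the list of innovations.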

For example, the MAGUKs are proteins involved in anchoring and polarizing adhesive and signalling structures (Results R1). We showed that they were present before animals, although the family expanded and diversified further within animals. Interestingly, MAGUKs are part of the scaffolding system of the postsynapse, a structure widely conserved in animals (Alié & Manuel, 2010; Sakarya et al., 2007). This scaffolding system is also largely conserved in unicellular holozoans, as we show in Results R4. But the presence of these genes in these genomes does not necessarily imply that they have the same functions as in animal neurons. Indeed, a study using gene expression data from different animal species shows that this set of genes is not expressed in a coordinated manner in sponges, whereas it begins to be co-regulated in eumetazoans (Conaco et al., 2012). This set of genes therefore has ancient origins, but was not organized into a protein complex until well into animal multicellularity, and was ultimately co-opted into a structure as complex as the animal neuron.

We also discovered that the tyrosine kinase repertoire of the unicellular relatives of animals is very diverse, in some cases more so than that of metazoans themselves (Results R2). Receptor tyrosine kinases are unique to holozoans, but those of the different lineages are not orthologous to one another. Each lineage has diversified its own receptors, which in the unicellular species are probably adapted to sensing environmental changes. This would explain why species as closely related as Capsaspora owczarzaki and Ministeria vibrans share no orthologous receptor, as is also the case among choanoflagellates (S. R. Fairclough et al., 2013). In contrast, most cytoplasmic tyrosine kinases are conserved among the different holozoans, forming a common heritage of the group. The evolution of multicellularity, and with it the internalization of the signalling system into ligand-receptor pairs secreted by the organism itself, entailed the fixation and co-evolution of receptor types, which allows orthologs to be detected across large evolutionary distances. In addition, the machinery associated with tyrosine phosphorylation diversified together with the tyrosine kinases, showing that this new form of phosphorylation-based signalling is independent of the more common serine/threonine phosphorylation.

Other signalling pathways, such as that of the G protein-coupled receptors (GPCRs), have given similar results (Results R3). In all cases, the extracellular receptors are the product of independent diversifications in animals and in unicellular lineages, but the underlying machinery, that is, the machinery involved in signal transduction, is highly conserved between animals and unicellular species. We also see this exemplified in components of the Notch pathway, the Hippo/Warts pathway, the AKT pathway and the MAPKs (Results R4; Gazave et al., 2009; Sebé-Pedrós, Zheng, Ruiz-Trillo, & Pan, 2012). It was thus the reuse of a complex signal transduction system that allowed the rapid evolution of the new signalling systems that govern embryonic development.

The most common elements at the end of signalling pathways, and therefore those responsible for mediating the organism's response to a particular signal, are the transcription factors. Many of them are very important in every aspect of embryonic development, from the control of proliferation to cell differentiation. Yet very few had been found in the genome of the choanoflagellate Monosiga brevicollis (King et al., 2008). We discovered a great diversity of them in the genome of Capsaspora, completely redrawing the evolutionary picture of this class of genes (Results R5). T-box, p53, CSL, STAT and RUNX were already present before the origin of multicellularity, so many of them long predate the functions with which they are typically associated, such as gastrulation or eye formation. Nevertheless, we found that many genes of multigene families, such as the homeobox and bHLH families, evolved later, in the lineage leading to animals, through successive rounds of duplication and divergence. To test whether this evolutionary pattern is similar in other eukaryotic lineages, we carried out a broader analysis of all types of eukaryotic transcription factors within the framework of the most up-to-date phylogeny. We found that plants and animals are the lineages with the greatest diversity and number of transcription factors (Results R6). In addition, to examine their role in development, we analysed the total expression of transcription factors throughout the development of several model species, including the fly Drosophila melanogaster, the fish Danio rerio and the plant Arabidopsis thaliana. In animals, transcription factor expression rises at gastrulation and the phylotypic stage and falls in the adult. In plants, by contrast, expression rises towards the end of development, as expected given that plants have open development, forming new structures (leaves, branches, etc.) throughout their life cycle.
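The idea of following the total expression of transcription factors across development can be sketched as a per-stage sum over the genes annotated as transcription factors. The Python example below is a minimal sketch under stated assumptions: the stage names, gene identifiers and expression values are invented toy data, and the function name total_tf_expression is hypothetical. It does not reproduce the datasets or the normalization used in Results R6.

```python
# Hypothetical developmental stages, in temporal order.
stages = ["early embryo", "gastrula", "phylotypic stage", "adult"]

# Toy expression table (invented values): gene -> expression level per stage.
expression = {
    "tf_a": [1.0, 6.0, 8.0, 2.0],
    "tf_b": [0.5, 4.0, 5.0, 1.0],
    "non_tf": [9.0, 9.0, 9.0, 9.0],  # not a transcription factor, ignored below
}

# Genes annotated as transcription factors (assumed annotation).
tf_genes = {"tf_a", "tf_b"}


def total_tf_expression(expression, tf_genes, n_stages):
    """Sum expression over transcription factor genes for each stage."""
    totals = [0.0] * n_stages
    for gene, values in expression.items():
        if gene not in tf_genes:
            continue
        for i, value in enumerate(values):
            totals[i] += value
    return totals


totals = total_tf_expression(expression, tf_genes, len(stages))
peak_stage = stages[totals.index(max(totals))]
for stage, total in zip(stages, totals):
    print(f"{stage}: {total}")
print("peak:", peak_stage)
```

In this toy table the summed signal peaks at the phylotypic stage and drops in the adult, which mirrors the animal pattern described above. A real analysis would also need to normalize for the total transcriptome at each stage, which this sketch leaves out.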

Overall, our understanding of the evolution of multicellularity has benefited greatly from the genomes of unicellular protists closely related to animals. The near-term prospect is to understand what these developmental genes do in holozoans, although transcriptomic studies are already beginning to give us a good picture of what their functions might be (S. R. Fairclough et al., 2013; Sebé-Pedrós et al., n.d.). The genomic prehistory of animals thus indicates that, as in many other major evolutionary transitions, pre-adaptation played a very important role.

CONCLUSIONS

1. The evolution of the MAGUK gene family predates the origin of animals: members of the CACNB, MAGI, MPP and DLG subfamilies are found in the genomes of unicellular holozoans. Its later diversification took place through protein domain shuffling and successive duplications.

2. The receptor tyrosine kinase repertoires of the filastereans Capsaspora owczarzaki and Ministeria vibrans diversified independently of those of other holozoan lineages. A diverse tyrosine kinase repertoire is a defining feature of holozoan genomes, which share many of the cytoplasmic tyrosine kinases.

3. G protein-coupled receptors underwent greater diversification in animals than in any other eukaryotic lineage, although the signal transduction machinery has changed little relative to unicellular holozoans.

4. Many of the transcription factors important in animal development were already present in the unicellular ancestor of animals, since they are found in the genome of Capsaspora owczarzaki.

5. The evolution of the total transcription factor complement of plants and animals has followed parallel paths, with two major, stepwise phases of expansion: one in the unicellular ancestors of each group, and another at the base of each group. In addition, plants and animals have the largest number of transcription factors among eukaryotes, more than other multicellular lineages such as fungi, brown algae and red algae.

6. The expression patterns of the whole set of transcription factors during animal and plant development mark the key points of both lineages. In animals, transcription factor expression is highest at gastrulation and the phylotypic stage and drops in the adult stage. In plants, by contrast, expression increases as development proceeds.

7. The genome of Capsaspora owczarzaki shows a remarkable level of conservation of much of the machinery involved in animal multicellularity, especially in comparison with the choanoflagellates sequenced to date. From a global point of view based on genome architecture and protein domain content, we show that the genomic prehistory of animals was more complex than previously thought.

ACKNOWLEDGEMENTS

When I started my biology degree, if anyone had assured me that I would end up doing a doctorate in molecular protistology, I would probably have dropped out and devoted myself to something more enriching. Yet here you find me, presenting a thesis on the subject after devoting four years to it. All this is, for better or worse, thanks to my thesis supervisor, Iñaki Ruiz-Trillo, who knew how to persuade a gullible undergraduate to enlist in this project that has taken so many turns. And the truth is that he did not deceive me; quite the contrary, we went from sharing an office with other researchers in a corner of the genetics department to enjoying a giant laboratory at the science park, accompanied by the hiring of postdocs and absurdly expensive machines. And from there we made a final journey to the seafront, a privileged location for a Barcelonan like me, since we generally live with our backs to the sea, unconcerned with what goes on in those realms of mass tourism. But beyond thanking him only for this string of moves and relocations, I also thank him for the way he has managed things over these years. He always respected my initiative, advising me and holding extremely long conversations (which drove more than one office mate to despair) and, in many cases, putting up with outbursts of my temper, which he handled with patience, respect and good humour. Deep down, he knew that if we were in this together it was because of a shared passion for evolutionary biology, and not because of a salary or any kind of hierarchical relationship, which deserves applause.

The next to deserve thanks is Arnau, because without him I would probably never have been able to finish this thesis.
And I do not mean this only literally (the co-authorships speak for themselves), but for the simple fact of having been able to discuss and live science in such a particular way, turning it into a mental playground, a common space into which to pour thoughts, jokes and a dialogue (full of junk, no doubt) that has kept me active over these years. I will also thank him for teaching me many things, not only the perseverance that characterizes him, but rather small details such as his linguistic idioms, his natural outlook on "bimberisme", his ability not to listen to me (in many cases deservedly) and a totally dysfunctional psychological profile that I think deserves more space in the fiction of our time.

Guifré has also accompanied me since the first day. There is little I can thank him for, except perhaps the worsening of my stomach ulcers. Trying to remember, perhaps I can also celebrate his sense of humour, a kind of unstoppable machine for turning everything into a joke, a space in which we have found harmonies and synergies. I highlight a reverential taste for the Beastie Boys' attitude to life, an erudite knowledge of video games, well-crafted impressions and a musical repertoire always ready to turn anything into a chant worth repeating ad nauseam, especially after 5 or 6 in the afternoon in the lab. But we have also shared another passion, the limax and their more mysterious relatives: the zooids (pronounced "zuoids"). All this only makes us prophets of a visionary, but it is this legacy of shared experience and humorous mental projections that we will always carry together and that has characterized the way the lab works (and let us hope that one day it materializes in three seasons and plenty of canned laughter).

The remaining members of the laboratory also deserve their own paragraph, but I see that this grows easily and it already runs to more than 200 pages. So I will say: Ponsy, although you are quite crazy, you have always put up with us (in many cases therapeutically for us) and you have told us things that let us glimpse a totally unsuspected world (how naive my view of humanity would be without knowing about the "Gym"). I know we have not had to suffer as much as you, who at the CSIC plant institute drank ethidium bromide so as not to die of thirst, but even so, I hope we too deserve a corner of your big heart. Hiroshi, even though we had some disagreements on scientific topics and, for example, on the amount of noise a human can stand before going mad, you have taught me a lot.
And not only from a scientific perspective, nor even from a bioinformatics perspective (I have to confess that, despite your efforts, my Ruby is rather useless), but more broadly, for example by revealing a side of Japan that I was completely unaware of.

As for Jordi Paps... it is very hard not to thank him for everything, from the phylogeny classes to the origin of the universe, probably triggered by one of his explosive hiccups of laughter, which have made him eminent among all researchers devoted to animal evolution. A good listener, a good friend, always willing to help and positive: in short, my antonym. To Alex Pérez, thanks for an impeccable lab while it was under his iron control and for the guilty pleasure of blowing things up with dry ice (yes, guilty immaturity seems to be a leitmotif of this acknowledgements section). To Romain, you revealed to us the power of the dRomains, for which we will always be thankful. Thanks also for your funny and iconoclastic perspectives on nearly all well-established topics in evolutionary biology. To Lora and John Shadwick, although your time in the lab was short, you were always friendly and open.

Javier del Campo... also deserves a space of his own. First, thanks for systematically teaching me the "mataleón" technique, showing patience and great rigour, while making clear the role a distinguished postdoc must have with a humble PhD student in a context of first-rate science. Also for confirming my initial suspicions and allowing me another of my beloved "I told you so" moments, owing to his high-pitched and unrelenting tone of voice. And finally for embodying a whole set of attitudes to life, ideological positions and motley cultural tastes on which we did not always agree (euphemistically speaking...), but which always gave rise to debate, a passion the two of us share. To Helena, for being the closest thing to SpongeBob, always positive, with that smile that takes up half your face even when you face the misadventures our friends the protists bring.

To Xavi Grau, alias Boved. First of all, thank you for being called (or for inspiring with your name, it makes no difference) "boved" and/or "bòvid"; euphony has never been sufficiently recognized in this world. Next I should thank you for your knowledge of, and willingness to help with, bash and scripting, but you know I prefer to thank you for your gaps in basic topics of evolutionary biology, which allow me to reassert myself continually and to draw great pleasure from revelling in my unheard-of wisdom.
To Meri, for being so diligent and for having fought shoulder to shoulder against the great plagues of the lab, such as the disappearance of RNA or the mysterious band of less than 100 bp in the genomic DNA extractions. To Matija, for giving us some interesting insights into certain characters "de cuyo nombre no quiero acordarme". To Davies, thanks for your endlessly matey attitude and for being a representative of the haughtiest Barcelona, the one that is not ashamed to say "pèpins" and is proud of its cans of sewer beer. To all of you together, for forming a system with emergent properties, a place where one could spend the days as if they were playtime.

It is also only fair to thank our most direct influences, among them the genetics department and the fauna that has inhabited it over these years. First of all, the members of the Journal Club, a forum for discussion where we learned more than we let on. To Manu and Nacho, for revealing to us the arcana of the most rebellious evolutionary biology and for always being up to date on the truly interesting topics; also for advice and teaching, and for inspiring part of my work with your excellent theses. To Chema, for always being so excited about everything to do with EvoDevo, for that urge to take on everything and for that laudable self-made-man attitude. To Susanna, for keeping alive my chances of completing the thesis, saving me from all the bureaucratic deadlines that threatened me every so often; also for trying to act as our link with the department, where we were nothing but outsiders, and for keeping me company at Kristineberg, on that rather bleak EMBO course. To Johannes, not for anything in particular, just for being there and being you. Your stories are simply epic, and the way you tell them, well, sublime. Finally, also to the community of senior researchers: to Jaume Baguñà and Jordi Garcia for the comparative embryology classes, and to Pere Martínez for his readiness for discussion.

Thanks are also due to our colleagues at the science park for their goodwill and their patience with us. First and especially to Laia, who, despite the ups and downs, always stood by me and was understanding. Also to the other members of the Estebánez Lab and their successors in lab 51b, Antonella's group. There Adriana shone with her own light, quickly fitting into our dynamics despite being a complete stranger to our ways.
Further afield are the people at the Sars in Norway. Well, I guess you will never read this, but thanks for your warm welcome and for making me feel comfortable in your community, especially Maja's lab, but also the others. I really appreciate that, because Bergen in February is hell, and without it loneliness would have killed me (that or the junkies from the park next door).

To the people in Seville, thanks too for their hospitality. To Anita, for injecting the toads with dedication and affection; to Mario, for sharing those gourmet meals in the dive run by the crazy redhead at the UPO; to Ozren, for the good company and his advice (Zelai, I still sigh for those tapas...); to Juan, for squeezing the frogs for me when I was losing hope and for standing up to the cleaning man; and to JL, for his hilarious sense of humour and for buying me drinks at the Montequinto padel club. But if my stay is "memories of a patio in Seville", it is mainly thanks to two great characters. Perico, who put me up in his house, a legendary place and an aesthetic statement of infinite dignity, and whom I also thank for the walks, tapas, manzanillas, the palaeosociology and, above all, the wise conversation. The other is Javi, my personal Virgil in the underworld of the Alameda, my bar companion and faithful conversation partner. But let us drop the subtleties: my partner in extreme "bimberismo", a kind of human relationship stronger than any other, especially when it is cemented by failure. In this category I also include Carlos; among the three of us (plus Arnau when he did not lose his nerve) we have gone at it hard, we have been to dirty corners of the planet looking for absurd birds, facing the pressure of The List, spending our money on an intangible goal and, along the way, having unforgettable experiences. Although our lists need to be washed clean (in some cases, with blood), I would fail at them again a thousand times over. And to Carlos I also owe a visit to the Delta, not forgetting his customary hospitality in La Tallada and in the mountains. And speaking of mountains: Sala, I have beaten you. I know this is the greatest tribute I can offer you.

Thanks are also due to those who are no longer with us. First, to my grandfather, who fed me on many days throughout the first stage of my doctorate.
Although digesting those rice dishes took a long time and did little for my thesis, his dignity and his teachings clearly outweighed these possible drawbacks. I would also like to remember Ferran Lobo, who clearly influenced my vocation and gave me great advice. One piece of it was never to be a worker of science, but a thinker. I do not think this thesis deserves to enter the second category, but surely not the first either, given its lack of applicability and high level of abstraction. As he rightly said, if science and thought are like a torrent advancing in one direction, one must find one's own palm tree and climb it to be at peace. I must also mention my brother and Jimmy Gimferrer, for wanting to create, out of this jumble of words (multicellularity, the flagellum and the protists), a cinematographic experience on the level of “Les transicions evolutives”. I think that, years from now, when I look back, I will appreciate it much more than this hefty book. We may not win Mar del Plata or Locarno, but we will know, because we have seen them. To you too, Carla: thank you for putting up with me as a caged beast, procrastinating, sobbing, clawing at the walls and engaging in the other undignified activities to which a man is reduced when he has to face a thesis. For keeping my den habitable and for the peace you bring. In the same vein, thanks to my mother, for always being willing to help with anything, sometimes even too expeditiously. And to my father, for the same, for giving me a breather from science and for agreeing to revise the English of this text (if there are a thousand errors the fault is purely mine, because revising a text like this is no small ordeal). And finally, thanks to my genome, which has given me a strange and natural vocation that luckily does not abandon me and lets me keep enjoying and taking an interest in this arid profession of science and protists.


“The waves were dead; the tides were in their grave,

The moon their mistress had expir'd before;

The winds were withered in the stagnant air,

And the clouds perish'd; Darkness had no need

Of aid from them--She was the Universe.”

Lord Byron, Darkness.
