LETTER OPEN doi:10.1038/nature11696

Insights into bilaterian evolution from three spiralian genomes

Oleg Simakov1,2, Ferdinand Marletaz1†, Sung-Jin Cho2, Eric Edsinger-Gonzales2, Paul Havlak3, Uffe Hellsten4, Dian-Han Kuo2†, Tomas Larsson1, Jie Lv3, Detlev Arendt1, Robert Savage5, Kazutoyo Osoegawa6, Pieter de Jong6, Jane Grimwood4,7, Jarrod A. Chapman4, Harris Shapiro4, Andrea Aerts4, Robert P. Otillar4, Astrid Y. Terry4, Jeffrey L. Boore4†, Igor V. Grigoriev4, David R. Lindberg8, Elaine C. Seaver9†, David A. Weisblat2, Nicholas H. Putnam3,10 & Daniel S. Rokhsar2,4,11

Current genomic perspectives on animal diversity neglect two prominent phyla, the molluscs and annelids, that together account for nearly one-third of known marine species and are important both ecologically and as experimental systems in classical embryology (refs 1–3). Here we describe the draft genomes of the owl limpet (Lottia gigantea), a marine polychaete (Capitella teleta) and a freshwater leech (Helobdella robusta), and compare them with other animal genomes to investigate the origin and diversification of bilaterians from a genomic perspective. We find that the genome organization, gene structure and functional content of these species are more similar to those of some invertebrate deuterostome genomes (for example, amphioxus and sea urchin) than those of other protostomes that have been sequenced to date (flies, nematodes and flatworms). The conservation of these genomic features enables us to expand the inventory of genes present in the last common bilaterian ancestor, establish the tripartite diversification of bilaterians using multiple genomic characteristics and identify ancient conserved long- and short-range genetic linkages across metazoans. Superimposed on this broadly conserved pan-bilaterian background we find examples of lineage-specific genome evolution, including varying rates of rearrangement, intron gain and loss, expansions and contractions of gene families, and the evolution of clade-specific genes that produce the unique content of each genome.

Molluscs, annelids and numerous smaller phyla typically share stereotyped spiral cleavage patterns, cell-fate assignments and characteristic ciliated trochophore larvae, features that originated in the Precambrian era (refs 3–5). These spiralian phyla are included in the larger lophotrochozoan clade (ref. 6) that is a sister group to the ecdysozoans (arthropods, nematodes and other related phyla) but whose internal branching remains controversial. However, so far the only deeply sequenced lophotrochozoan genomes are those of platyhelminth flatworms (two parasitic schistosomes (refs 7, 8) and a free-living planarian (ref. 9)), whose comparatively rapid rates of genome evolution do not reflect a general condition of lophotrochozoans (see below). In this study, we explore spiralian diversity at the genomic level by comparative analysis of one mollusc and two annelid genomes (Supplementary Note 1).

We assembled the limpet, polychaete and leech genomes from approximately eight-fold random whole-genome shotgun coverage with Sanger dideoxy sequencing reads (Supplementary Note 2). No genetic or physical maps were available for these systems, so we reconstructed each genome as scaffolds (gap-containing sequences). The three genomes reported here each encode an estimated 23,000 to 33,000 protein-coding genes (Table 1, Supplementary Table 2.2.2 and Supplementary Note 2.2. The repetitive landscape of these genomes is discussed in Supplementary Note 3.2).

Comparing the new genomes with other metazoan sequences, we characterized 8,756 modern bilaterian gene families as likely to have arisen from single progenitor genes in the last common bilaterian ancestor (Supplementary Note 3.4). As gene loss is common and highly diverged orthologues can be difficult to detect, this is a conservative lower bound on the number of genes encoded by the last common bilaterian ancestor. Of the 8,756 gene families, 763 were newly identified as being of bilaterian ancestry based on the new spiralian genomes (Supplementary Note 3.4). These newly identified bilaterian families belong to various functional categories (Supplementary Table 3.4.1), the most prominent being members of the G-protein-coupled receptor superfamily and epithelial sodium channels (see below) as well as various metabolic enzymes.

Through subsequent gene duplication, the 8,756 ancestral bilaterian families conservatively account for 47 to 85% of genes in other bilaterian species (70% of human genes; Supplementary Note 3.4). Most of the remaining genes in extant bilaterian genomes share at least one domain with the bilaterian gene families, or have a significant BLAST (Basic Local Alignment Search Tool) hit when compared against sequences from bilaterian gene families, suggesting that they have arisen through descent with modification (Supplementary Note 3.5).

Exon–intron structures are highly conserved between spiralians and other animals; thus we infer that cis-splicing of intron-rich genes was the ancestral state of metazoans, bilaterians and protostomes (Supplementary Note 5.2). In most cases, exon boundaries in the newly sequenced spiralians are precisely conserved between orthologous genes in sequenced deuterostomes (human, sea urchin and amphioxus) and non-bilaterians (sponge and starlet sea anemone). For example, 75% of human introns are present in one or more of the spiralians, whereas only 14% of the same introns are found in Drosophila (refs 10, 11). However, intron gain or loss rates vary markedly among the three spiralians. In particular, H. robusta has substantially more novel introns than do the other two sequenced spiralians (Supplementary Notes 5.2 and 5.3, and Supplementary Fig. 5.2.1), the first of several indicators of a notably dynamic genome in this lineage.

Collectively and individually, the spiralian genomes reported here retain most of the inferred ancestral bilaterian gene families (8,203 out of 8,756, a 94% retention rate, compared with 7,553, or 86%, in human). In contrast, the collective retention rate of only 65% for sequenced flatworms (53% for schistosomes and 60% for Schmidtea) reflects the absence (and presumed

1European Molecular Biology Laboratory, Meyerhofstraße 1, 69117 Heidelberg, Germany. 2Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA. 3Department of Ecology & Evolutionary Biology, Rice University, PO Box 1892, Houston, Texas 77251-1892, USA. 4DOE Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, California 94598, USA. 5Department of Biology, Williams College, Thompson Biology Laboratory, 59 Lab Campus Drive, Williamstown, Massachusetts 01267, USA. 6Children’s Hospital Oakland Research Institute, 5700 Martin Luther King Jr. Way, Oakland, California 94609, USA. 7Hudson Alpha Institute for Biotechnology, 601 Genome Way, Huntsville, Alabama 35806-2908, USA. 8Department of Integrative Biology, University of California, Berkeley, California 94720, USA. 9Kewalo Marine Laboratory, University of Hawaii at Manoa, 41 Ahui Street, Honolulu, Hawaii 96813, USA. 10Department of Biochemistry and Cell Biology, Rice University, Houston, Texas 77251, USA. 11Okinawa Institute of Science and Technology, 1919-1 Tancha, Onna-son, Okinawa 904-0495, Japan. †Present addresses: Department of Zoology, University of Oxford, The Tinbergen Building, South Parks Road, Oxford OX1 3PS, UK (F.M.); Institute of Zoology, National Taiwan University, 1 Roosevelt Road, Taipei 10617, Taiwan (D.-H.K.); Genome Project Solutions, 1024 Promenade Street, Hercules, California 94547, USA (J.L.B.); The Whitney Laboratory for Marine Bioscience, 9505 Ocean Shore Boulevard, St. Augustine, Florida 32080-8610, USA (E.C.S.).

526 | NATURE | VOL 493 | 24 JANUARY 2013 ©2013 Macmillan Publishers Limited. All rights reserved LETTER RESEARCH

Table 1 | Genome sequencing and annotation summary

Measure                                                      Lottia gigantea   Capitella teleta   Helobdella robusta
Size of genome assembly (Mbp)                                348               324                228
Scaffold N50 (Mbp)                                           1.87              0.19               3.06
Repetitive content (%)                                       21                31                 33
GC (%)                                                       33                40                 33
Predicted number of genes                                    23,800            32,389             23,400
Number of genes in orthologous clusters with other species   16,183            20,537             13,820
Number of genes in ancestral bilaterian gene families        10,681            11,911             8,707
Mean number of exons per gene (with ≥2 orthologues)          8                 7                  8
Mean exon length (bp)                                        213               221                203
Mean intron length (bp)                                      787               291                526
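The Scaffold N50 values in Table 1 summarize assembly contiguity. As a minimal sketch of how such a statistic can be computed, the function below sorts scaffold lengths from longest to shortest and returns the length at which the running total first reaches half of the assembled sequence. The function name `n50` and the toy lengths are illustrative assumptions, not the authors' pipeline or data.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    N50 is the length L such that scaffolds of length >= L together
    contain at least half of the total assembled sequence.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (lengths in bp), not data from the paper.
toy_scaffolds = [50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

Because the statistic depends only on the sorted lengths, a few very long scaffolds can raise N50 considerably even when many short scaffolds remain.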

GC, fraction of guanine plus cytosine nucleobases; Scaffold N50, the length such that half of the assembled sequence is in scaffolds longer than this length; Mbp, megabase pairs.

loss) of more than 3,018 ancestral bilaterian gene families in these flatworms. Similar losses are observed for introns (Supplementary Note 5.3), as well as synteny (see below), which indicate a higher rate of genomic turnover in platyhelminths than in the mollusc and annelid genomes reported here.

Against this background of conserved gene content and structure, we find several significantly (P < 0.05) expanded gene families in specific spiralian lineages (Supplementary Note 4.2). The sensory transduction and signalling genes of the G-protein-coupled receptor (GPCR) superfamily in C. teleta are a prime example. All six of the rhodopsin-like GPCRs represented in the KEGG (Kyoto Encyclopedia of Genes and Genomes) neuroactive ligand–receptor interaction pathway are expanded in C. teleta (but not in H. robusta or L. gigantea), as are several other GPCRs (Supplementary Figs 4.3.2 and 4.4.1). Moreover, the C. teleta genome encodes 372 putative GPCRs that are most similar to peptide-binding GPCR subfamilies according to the family classification in the GPCR database (http://www.gpcr.org/7tm/proteinfamily/). This number is considerably higher than that obtained for H. robusta (58), L. gigantea (113), Drosophila (32) or human (120) using the same methods (Supplementary Note 4.4). Most of these expansions occur as tandem duplicates. The C. teleta genome also shows an expansion of the calcium-signalling pathway downstream of GPCR (Supplementary Note 4.3). It is tempting to speculate that these expansions are related to the function of polychaete chemosensory structures such as antennae, palps and cirri (sensory organs), and the nuchal organ (ref. 12). Another notable feature of all three genomes is the presence of several atypical GPCRs with weak similarity to both rhodopsin-like GPCRs and chemosensory receptors described previously in nematodes (Supplementary Fig. 4.4.1). Further studies are needed to determine whether these receptors can be classified as divergent members of previously described GPCR classes or whether they constitute novel groupings as described recently in planarians (ref. 13).

We also find changes in gene content associated with sensory processing in the leech. These changes include expansion of the epithelial sodium channel (ENaC) receptor gene family that functions in the taste-transduction pathway (Supplementary Fig. 4.3.5), and the gap-junction-forming innexin gene family, as well as gene families involved in development (for example, homeobox genes (see below, and Supplementary Notes 4.5 and 4.6)). Both mollusc and annelid genomes are also enriched in specific metabolic enzymes and pathways of unknown relevance (for example, galactoside 2-α-L-fucosyltransferase; see Supplementary Notes 4.2 and 4.3). In general, lineage-specific gene family expansions seem to be the norm in the evolutionary diversification of modern taxa from the bilaterian ancestor, whereas the more ancient unicellular–metazoan (ref. 14) and metazoan–bilaterian transitions are more notably marked by the acquisition of apparently novel (or highly divergent) gene families (Supplementary Table 3.5.1).

We identified 231 putative spiralian-specific gene families whose members are readily aligned across all three spiralians (indicating purifying selection), but which lack obvious orthologues by BLAST in non-spiralian genomes (Supplementary Note 3.6). However, nearly two-thirds of these (188 out of 231; 62%) showed residual similarity to non-spiralian genes using more sensitive Hidden Markov Model methods, which suggests that they belong to ancient bilaterian gene families (Supplementary Note 3.6) that diverged extensively on the stem lineage leading from the bilaterian ancestor to the mollusc–annelid ancestor (‘type II’ novelties; ref. 15). The remaining 43 out of 231 novel gene families are without any significant (E values of less than 0.01) similarities outside of spiralians (‘type I’ novelties; ref. 15). More than one-half of the 231 spiralian novelties are transcribed based on existing expressed sequence tag (EST) evidence, with enriched expression in adult rather than embryonic tissues (Supplementary Notes 2.4 and 3.7), hinting at roles in clade-specific adaptations beyond the early conserved stages of development.

The inference of deep phylogenetic relationships among animal phyla is controversial but has benefitted from the use of multiple orthologous genes as phylogenetic markers (refs 16, 17). Recent EST-based studies provide broad taxonomic representation but rely on a limited number of available genes or are forced to accommodate a substantial amount of missing data (refs 6, 18). In contrast, full genome sequences provide nearly complete sets of orthologues exempt from sampling bias, but can be more sensitive to long-branch attraction artefacts.

To strike a balance between the number of phylogenetically informative characters and possible long-branch artefacts we ranked 1,180

[Figure 1: three phylogenetic trees of the 22 sampled metazoan genomes. Panel a, amino acid tree (scale bar 0.1); panel b, intron tree (scale bar 0.01); panel c, indel tree (scale bar 0.005).]

Figure 1 | Full-genome evidence resolves metazoan relationships and verifies the monophyly of lophotrochozoans and spiralians. a, A protein tree inferred from 299,129 amino acid positions gathered from 827 slow-evolving orthologues using RAxML and modelling heterogeneity of substitution processes using a LG + Γ4 model with each gene partitioned. Strong support is obtained for the monophyly of lophotrochozoans. b, Intron tree obtained from a matrix of 5,377 introns analysed using MrBayes and an asymmetric binary model (probability of gain: 0.01). c, Indel tree reconstructed from a matrix of 1,928 indel sites using a regular binary model. Circles at nodes indicate a bootstrap support of >0.90 (a) or a posterior probability of >0.95 (b and c). In b and c, arrows indicate species that do not follow the protein family tree topology.
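The intron and indel trees in Fig. 1b, c are inferred from presence/absence matrices. The sketch below shows one plausible way to encode such a matrix, with '1' for presence, '0' for absence and '?' for missing data, and to pick out columns in which both states occur in at least two species. The species codes, site labels and both function names are hypothetical; the authors' actual site-detection rules and MrBayes settings are described in their Supplementary Note 5.3 and are not reproduced here.

```python
def presence_absence_matrix(observations, species, sites):
    """Encode intron (or indel) observations as a character matrix.

    observations maps species -> {site: True/False}; a site missing from
    a species' dict is treated as unknown and encoded as '?'.
    Returns {species: string of '1', '0' and '?'} in the order of `sites`.
    """
    matrix = {}
    for sp in species:
        calls = observations.get(sp, {})
        row = []
        for site in sites:
            if site not in calls:
                row.append("?")
            else:
                row.append("1" if calls[site] else "0")
        matrix[sp] = "".join(row)
    return matrix


def informative_sites(matrix):
    """Indices of columns where each state is seen in at least two species."""
    n_sites = len(next(iter(matrix.values())))
    keep = []
    for i in range(n_sites):
        column = [row[i] for row in matrix.values()]
        if column.count("1") >= 2 and column.count("0") >= 2:
            keep.append(i)
    return keep


# Hypothetical observations for illustration only.
species = ["Lgi", "Cte", "Hro", "Hsa", "Dme"]
sites = ["geneA:120", "geneA:344", "geneB:57"]
observations = {
    "Lgi": {"geneA:120": True, "geneA:344": True, "geneB:57": False},
    "Cte": {"geneA:120": True, "geneA:344": True, "geneB:57": False},
    "Hro": {"geneA:120": True, "geneA:344": False},
    "Hsa": {"geneA:120": True, "geneA:344": True, "geneB:57": True},
    "Dme": {"geneA:120": False, "geneA:344": False, "geneB:57": False},
}
matrix = presence_absence_matrix(observations, species, sites)
print(matrix)
print(informative_sites(matrix))
```

Keeping unknown states explicit matters here because an intron position that cannot be assessed in a species is different from one that is confirmed absent.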

clusters of orthologous genes (from 22 complete genomes) by their evolutionary rates (Supplementary Fig. 5.1.1) and identified a set of 827 slowly evolving genes that include 299,129 aligned amino acid positions suitable for deep phylogenetic analysis. These characters strongly support the tripartite view of bilaterians and the monophyly of available lophotrochozoans (annelids, molluscs and platyhelminths) (Fig. 1a; ref. 19); the progressive addition of characters representing more rapidly evolving genes monotonically erodes support for this view, as expected under long-branch attraction (Supplementary Fig. 5.1.2). Although taxon sampling is generally considered critical to resolving deep phylogeny, our analyses show the importance of gene sampling. The rate-stratification approach introduced here could be used to place problematic taxa (for example, acoels, ctenophores and chaetognaths) when appropriate genome data become available.

We also examined the phylogenetic signals in the gain and loss of introns, and insertions or deletions (indels) within coding sequences, incorporating spiralian sequences for the first time. Although few evolutionary reconstructions have been attempted with these characters, they have attractive properties for phylogenetic analysis as change is rare and generally irreversible (refs 20, 21; Supplementary Note 5.3). Phylogenetic reconstruction using binary matrices encoding intron and indel presence or absence recovered the backbone of metazoan phylogeny, and intron data provided strong support for grouping molluscs, annelids and platyhelminths (Fig. 1b). However, the analysis based on indels showed specific discrepancies relative to other data sets, notably the grouping of nematodes and platyhelminths (Fig. 1c). As this grouping is not consistent with either amino acid or intron analyses, we ascribe it to the accelerated genome evolution in these taxa and the low number of phylogenetically informative indel characters. All three trees possess short internal branches near the base of (see ref. 22), which indicates that the diversification into separate lophotrochozoan, deuterostome and ecdysozoan lineages was relatively fast, taking perhaps 30 to 80 million years (Myr) (comparable to the diversification of ; Supplementary Note 5.5).

We also sought evidence for genome-wide functional diversification across metazoan genomes using principal component analysis (Supplementary Note 4.1). Remarkably, this phenetic approach grouped the newly sequenced mollusc and annelids with invertebrate deuterostomes (amphioxus, sea urchin and sea squirt) and non-bilaterian metazoan phyla (cnidarian, placozoan and sponge) (Fig. 2). Given that this grouping includes both bilaterians and non-bilaterian metazoans, cladistic logic implies that these genomes approximate the ancestral bilaterian (and metazoan) genomic repertoire. In contrast, vertebrate genomes form a distinct cluster, and are thus functionally derived relative to this ancestral bilaterian state, partly owing to the diversification of genes related to the vertebrate innate and adaptive immune system that dominate the loadings of principal component 1 (PC1, Fig. 2) (Supplementary Table). The functional coherence of the genes that differentiate currently available ecdysozoan genomes through PC2 is unclear. Although this analysis may be skewed by the more complete functional annotation of vertebrates and classical model systems, other similar analyses less dependent on function confirm the clear separation of vertebrates from other metazoan genomes (Supplementary Note 4.1).

The L. gigantea and C. teleta genomes show extensively conserved macrosynteny with each other, with (including human; see Fig. 3 and Supplementary Note 6) and with several other extant metazoan lineages (sea anemone (ref. 15), placozoan (ref. 23) and demosponge (ref. 14)). In our analyses, conserved macrosynteny requires only conserved linkage between orthologous genes, and is independent of intra-chromosome rearrangements (that is, scrambling of gene order) that are typical in phylogenetically deep comparisons (refs 10, 15, 23). Conserved macrosynteny in L. gigantea and C. teleta involves nearly one-half of the conserved protein coding genes in these species (Supplementary Note 6 and Supplementary Table 6.3.1). In contrast, we found no significant conservation of macrosynteny between H. robusta and other species, implying extensive reorganization in the leech genome relative to the last common spiralian ancestor.

The observed conserved macrosynteny demonstrates the persistence of 17 ancient bilaterian ancestral linkage groups (ALGs) in the common ancestor of L. gigantea and C. teleta. Independent fusions (two in L. gigantea, three in C. teleta) subsequently reduced the number of bilaterian ALGs that remain distinct in these genomes. The conservation of 17 bilaterian ALGs among L. gigantea, C. teleta and various deuterostomes implies that the last common and ancestors also had this organization. Some ecdysozoans like Caenorhabditis elegans (soil nematode), Tribolium castaneum (beetle) and Bombyx mori (moth) (Supplementary Note 6) also show clear evidence of conserved macrosynteny. However, the large number of fusions and rearrangement events make similar reconstruction of the ancestral ecdysozoan ALGs impossible with current data (Supplementary Note 6).

Remarkably, we can also use the L. gigantea and C. teleta genomes to infer ancient translocations between linkage groups (Supplementary Note 6). For example, L. gigantea and C. teleta share a translocation relative to the last common bilaterian ancestor (Fig. 3), indicating that this genomic rearrangement occurred on the stem lineage leading from the bilaterian to the mollusc–annelid node. As noted above, more recent translocations that are not shared between L. gigantea and C. teleta are also evident (Supplementary Note 6). It remains unclear whether these genome reorganizations were causally involved in the radiation of diverse bilaterian lineages, or were simply neutral changes.

[Figure 2: scatter plot of genomes on PC1 (horizontal axis) and PC2 (vertical axis). Legend groups: lophotrochozoans, ecdysozoans, invertebrate deuterostomes, vertebrates, non-bilaterian metazoans.]

Figure 2 | Clustering of metazoan genomes in a multidimensional space of molecular functions. The first two principal components are displayed, accounting for 20% and 15% of variation, respectively. At least three clusters are evident, including a vertebrate cluster (far right), a non-bilaterian metazoan, invertebrate deuterostome or spiralian cluster (centre, top), and an ecdysozoan group (lower left). Drosophila and Tribolium (lower left) are outliers. Aqu, Amphimedon queenslandica (demosponge); Bfl, Branchiostoma floridae (amphioxus); Cel, Caenorhabditis elegans; Cte, Capitella teleta (polychaete); Cin, Ciona intestinalis (sea squirt); Dme, Drosophila melanogaster; Dpu, Daphnia pulex (water flea); Dre, Danio rerio (zebrafish); Isc, Ixodes scapularis (tick); Gga, Gallus gallus (chicken); Hsa, Homo sapiens (human); Hma, Hydra magnipapillata; Hro, Helobdella robusta (leech); Lgi, Lottia gigantea (limpet); Mmu, Mus musculus (mouse); Nve, Nematostella vectensis (sea anemone); Sma, Schistosoma mansoni; Sme, Schmidtea mediterranea (planarian); Spu, Strongylocentrotus purpuratus (sea urchin); Tad, Trichoplax adhaerens (placozoan); Tca, Tribolium castaneum (flour beetle); Xtr, Xenopus tropicalis (clawed frog).
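The clustering in Fig. 2 rests on principal component analysis of per-genome functional content. As a rough illustration of the idea only, the sketch below centres a small genomes-by-categories count table and extracts the first principal component by power iteration on the covariance matrix. The table, the category order and the function names are invented for the example; the normalization and software the authors used are described in their Supplementary Note 4.1 and may differ.

```python
import math


def center_columns(rows):
    """Subtract each column's mean so every feature has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def first_principal_component(rows, iterations=200):
    """Return (unit loading vector, per-row scores) for PC1.

    Uses power iteration on the sample covariance matrix. The fixed
    start vector (1, 2, 3, ...) would fail only if it happened to be
    exactly orthogonal to PC1, which is not the case for the toy data.
    """
    centered = center_columns(rows)
    n_rows, n_feat = len(centered), len(centered[0])
    cov = [[sum(r[i] * r[j] for r in centered) / (n_rows - 1)
            for j in range(n_feat)] for i in range(n_feat)]
    vec = [float(i + 1) for i in range(n_feat)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(n_feat))
               for i in range(n_feat)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    scores = [sum(x * v for x, v in zip(row, vec)) for row in centered]
    return vec, scores


# Hypothetical counts per genome for three functional categories;
# the third column stands in for an immune-related category.
genomes = ["invertebrate_A", "invertebrate_B", "vertebrate_A", "vertebrate_B"]
counts = [[10, 5, 2], [11, 5, 2], [10, 5, 20], [11, 5, 22]]
loadings, scores = first_principal_component(counts)
for name, score in zip(genomes, scores):
    print(name, round(score, 2))
```

In this toy table the third category dominates the PC1 loadings and separates the two made-up vertebrate rows from the others, which mirrors the kind of pattern the text attributes to immune-related genes.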


[Figure 3: panels a–c. Dot plots of L. gigantea, C. teleta and T. adhaerens scaffolds against human chromosome segments grouped into ALGs 9, 17 and 2, with schematic illustrations of a translocation and of intra-chromosomal rearrangements.]

Figure 3 | Macrosynteny between spiralians, humans and Trichoplax. a, The location of genes in scaffolds of L. gigantea, C. teleta and Trichoplax (a non-bilaterian outgroup that conserves synteny) relative to the position of their orthologues in the human genome. The human chromosome segments have been grouped according to their ancestral linkage group (ALG); chromosome segment identifiers are also shown (see ref. 10). Human genes in ALG 2 have their orthologues co-located on a limited set of scaffolds in L. gigantea, C. teleta and Trichoplax, indicating conserved linkage of this group of genes across eumetazoan lineages. In contrast, although ALG 17 and ALG 9 are preserved separately in Trichoplax, scaffolds of L. gigantea and C. teleta have homologous gene content either with ALG 9 or with both ALG 9 and ALG 17, indicating a translocation of one or more chromosome segments from ALG 17 to ALG 9 in the common ancestor of molluscs and annelids, after the divergence of the spiralian and vertebrate lineages. Genes inferred to derive from this translocated segment are shown in red. Subsequent intra-chromosomal rearrangement has dispersed the translocated genes among the genes of ALG 9. b, The scenario in panel a represented schematically on a , with of ancestral and living genomes represented as horizontal blue bars and the translocated segment represented in red. c, The positions of human genes and their L. gigantea, C. teleta and Trichoplax adhaerens orthologues compared in dot plots schematically (and in the real data; see panel a) for three ALGs.
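Figure 3 relies on scaffolds whose genes have orthologues concentrated in particular ALGs. One simple way to ask whether such a concentration exceeds chance is a hypergeometric tail probability, sketched below. This test is a stand-in chosen for illustration, and the gene counts are invented; the statistical procedure the authors actually used is given in their Supplementary Note 6 and is not shown in this text.

```python
from math import comb


def shared_gene_count(scaffold_genes, alg_genes):
    """Number of scaffold genes whose orthologue falls in the given ALG."""
    return len(set(scaffold_genes) & set(alg_genes))


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for a hypergeometric draw.

    population: genes with an assigned orthologue (all ALGs combined)
    successes:  genes belonging to the ALG of interest
    draws:      genes on the scaffold being tested
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(comb(successes, i) * comb(population - successes, draws - i)
               for i in range(k, upper + 1))
    return tail / total


# Hypothetical numbers: 100 genes overall, 10 in the ALG, and a scaffold
# carrying 5 genes of which 4 map to that ALG.
p_value = hypergeom_sf(4, population=100, successes=10, draws=5)
print(p_value)
```

A small tail probability would suggest that the scaffold and the ALG share more genes than random placement would predict, which is the intuition behind calling a linkage conserved.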

The most famous example of conserved microsynteny (conserved tight linkages between orthologous genes) is the Hox complex, an ancient cluster of homeodomain-containing transcription factors with conserved roles in patterning the anteroposterior body axis of animals (ref. 24). In L. gigantea, 11 Hox genes occur as a single cluster that is structurally collinear with intact Hox clusters found in other genomes, and is the first intact cluster found in a lophotrochozoan (Fig. 4). C. teleta Hox genes occur in one-to-one correspondence with their L. gigantea counterparts but lie on three scaffolds, with the scaffold harbouring the posterior class gene post1 clearly disconnected from the main cluster (ref. 25). We therefore infer that the last common mollusc–annelid ancestor had a single 11-gene Hox cluster (Supplementary Note 8) with 3 anterior- and 6 central-class genes, plus 2 posterior-class genes (post1 and post2) that arose by duplication along the spiralian (or lophotrochozoan) stem lineage (ref. 26). In contrast, the Hox complex of H. robusta has fragmented extensively, consistent with the general loss of synteny conservation in H. robusta, and there have been multiple duplications and loss of two mollusc–annelid paralogy groups (the orthologues of the anterior-class proboscipedia and post1). Intriguingly, although the gene rearrangements observed in H. robusta are as extreme as in C. elegans, the leech is not particularly derived with respect to other genomic characters. This lineage may therefore be an interesting model for focused studies on rapid evolution of gene order. We also find other tightly linked groups of anciently duplicated (that is, paralogous) genes, including clusters of deeply diverged gene superfamilies such as the homeodomain (ref. 25), forkhead box (ref. 27) and wingless (ref. 28) gene families that duplicated extensively before the bilaterian radiation but have remained linked (Supplementary Note 7.4).

Overall, we found hundreds of other examples of conserved microsyntenic blocks involving thousands of genes in L. gigantea, C. teleta and other metazoan genomes (Supplementary Note 7.1). We consider a microsyntenic block to be a group of three or more genes whose

B. floridae Hox1 Hox2 Hox3 Hox4 Hox5 Hox6 Hox7 Hox8 H9 H10 H11 H12 H13 H14

T. castaneum Lab Pb zen Dfd Scr Ftz Antp Ubx Abd-A Abd-B

L. gigantea Lab Pb Hox3 Dfd Scr Lox5 Antp Lox4 Lox2 Post2 Post1

C. teleta Lab Pb Hox3 Dfd Scr Lox5 Antp Lox4 Lox2 Post2 Post1

H. robusta LabA LabB Hox3 DfdA DfdB ScrA ScrB ScrC ScrD ScrE Lox5 Antp Lox4A Lox4B Lox2 Post2A Post2B Post2C Hox?

Figure 4 | The Hox gene complement and linkage in the three lophotrochozoan genomes and selected bilaterians. Arrows indicate direction of transcription (orientation between scaffolds is arbitrary). Scaffolds with ends marked by black dots may be part of a larger Hox complex because the Hox gene is at the end of the scaffold. B. floridae, Branchiostoma floridae. Colours indicate unambiguously assigned paralogy groups (Hox1, Hox2, Hox3, Hox4, central class and posterior class).


[Figure 5: panels a–c, diagrams of linked gene clusters across species with scaffold spans in kb; panel d, metazoan tree with branch lengths showing microsynteny change.]

Figure 5 | Examples of conserved orthologous gene clusters. a–c, Clusters of linked genes across diverse species. Within each panel, genes in the same colour are members of the same orthologous group, with the gene identifiers of defining members of the group indicated: C7orf42, (human) chromosome 7 open reading frame 42; Exoc7, exocyst complex component 7; FAM100, family with sequence similarity 100; FoxJ, forkhead box protein J; HDC, histidine decarboxylase; MGRN E3 Ub ligase, mahogunin ring finger E3 ubiquitin ligase; MORC, MORC family CW-type zinc finger; PSMG1, proteasome (prosome, macropain) assembly chaperone 1; ROX/MNT, Max-binding protein family member; SYAP1, synapse-associated protein 1; WBP11, WW domain binding protein 11; ZMPSTE24, zinc metallopeptidase, STE24 homologue. Scaffold positions for all displayed linkages are listed in Supplementary Note 7.2. d, Cumulative rates of microsynteny change plotted on a fixed metazoan tree topology. Branch lengths are proportional to the number of inferred genomic rearrangements. A. queenslandica, Amphimedon queenslandica; C. intestinalis, Ciona intestinalis; D. melanogaster, Drosophila melanogaster; D. pulex, Daphnia pulex; D. rerio, Danio rerio; G. gallus, Gallus gallus; H. magnipapillata, Hydra magnipapillata; I. scapularis, Ixodes scapularis; M. musculus, Mus musculus; N. vectensis, Nematostella vectensis; Ub, ubiquitin; X. tropicalis, Xenopus tropicalis.

orthologues are tightly linked (that is, separated by no more than ten intervening genes) in two or more genomes. Microsyntenic blocks are often, but not always, embedded in a conserved macrosyntenic context. The count of 469 microsyntenic blocks that are putatively preserved from the bilaterian ancestor in at least one protostome and one deuterostome genome is substantially greater than the 157 blocks that would persist by chance in a simple model of genome rearrangement in which gene order is randomized within the macrosynteny blocks defined above (Supplementary Notes 6 and 7), implying either functional constraint on genome organization or intrinsically slow rates of rearrangement in some genomic regions. Considering the deeply diverged bilaterian lineages represented by L. gigantea, C. teleta and amphioxus (Supplementary Table), we found 77 conserved microsyntenic blocks (Supplementary Note 7.1), which in some cases are stably conserved across other metazoan genomes (Fig. 5). It is tempting to speculate that these conserved linkages are due to selection for preserving complex cis-regulatory landscapes (Supplementary Note 7.2).

Although molluscs and annelids are related to flies, nematodes and flatworms within the protostomes, we find that their genomes are in many ways more similar to those of invertebrate deuterostomes (such as amphioxus and sea urchin) as well as non-bilaterian metazoans (such as cnidarians, sponges and placozoans). These similarities reveal features of bilaterian and/or metazoan genomes that have been lost or diverged in many protostome genomes reported so far, and thus enable a more complete reconstruction of genomic features of the last common ancestors of protostomes, bilaterians and metazoans, including gene and chromosome structure and organization. Superimposed on these conserved features are evolutionary innovations (novel gene families and gene-family expansions and losses, as well as large- and small-scale genomic rearrangements) that make each clade unique. Nearly 20 other phylum-level taxa lack even a single genome sequence, and intra-phylum genomic variation can be extensive. Thus, for a comprehensive genomic understanding of the metazoan radiation a far larger sampling of genomes will be needed.

METHODS SUMMARY

Gene families and phylogeny. Orthology relationships were reconstructed for 22 metazoan genomes (Supplementary Fig. 3.3.1) using a phylogenetic clustering approach, which progressively examined reciprocal best scoring BLAST hits at decreasing phylogenetic nodes of a reference animal tree. We recovered 1,235 gene families with orthologous members in all genomes. To assess the effect of fast and slow evolving characters on the tree topology, several phylogenetic approaches were taken (see Supplementary Note 5).

Identification of 8,756 ancestral bilaterian genes. Gene families were considered to be ancestral bilaterian gene families when an orthologous group had at least two protostome and two deuterostome representatives (in-group) or two sequences from either in-group and two from basal (that is, non-bilaterian) metazoans (out-group).

Macrosynteny. Draft genome scaffolds were clustered into ancestral linkage groups (ALGs) based on the locations of orthologous genes in other metazoan genomes, as described previously (ref. 10). We iteratively constructed a parsimonious scenario of chromosome evolution, and ancestral genes were assigned to ancestral ALGs when any other assignment would imply more hops between ALGs in the history of that gene family (Supplementary Note 6).

Microsynteny. Chromosomal locations of orthologous genes in two different species were compared. If another set of orthologous genes was identified within a maximal distance of 10 genes of the previous set, both sets were merged together into a microsyntenic block. Only syntenic blocks with at least three orthologues per species were considered.

Intron and indel identification and phylogeny. Gene families with a maximum of 2 missing species (out of 22) were included. Intron and indel positions were detected using conserved flanking sites (3 out of 8 amino acids); no gaps were allowed to flank introns. For indels, the flanking amino acids had to be conserved. Phylogenetic inference based on presence or absence was computed with MrBayes as described in Supplementary Note 5.3.

Received 7 January; accepted 24 October 2012.
Published online 19 December 2012.

1. Wilson, E. B. The cell-lineage of Nereis. A contribution to the cytogeny of the annelid body. J. Morphol. 6, 361–480 (1892).
2. Conklin, E. G. The Embryology of Crepidula: a Contribution to the Cell Lineage and Early Development of Some Marine Gasteropods (Ginn & Company, 1897).
3. Henry, J. Q., Hejnol, A., Perry, K. J. & Martindale, M. Q. Homology of ciliary bands in spiralian trochophores. Integr. Comp. Biol. 47, 865–871 (2007).
4. Fedonkin, M. A. & Waggoner, B. M. The Late Precambrian fossil Kimberella is a mollusc-like bilaterian organism. Nature 388, 868–871 (1997).
5. Maloof, A. C. et al. The earliest Cambrian record of animals and ocean geochemical change. Geol. Soc. Am. Bull. 122, 1731–1774 (2010).
6. Dunn, C. W. et al. Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 452, 745–749 (2008).
7. Berriman, M. et al. The genome of the blood fluke Schistosoma mansoni. Nature 460, 352–358 (2009).
8. The Schistosoma japonicum Genome Sequencing and Functional Analysis Consortium. The Schistosoma japonicum genome reveals features of host–parasite interplay. Nature 460, 345–351 (2009).
9. Robb, S. M., Ross, E. & Sanchez Alvarado, A. SmedGD: the Schmidtea mediterranea genome database. Nucleic Acids Res. 36, D599–D606 (2008).
10. Putnam, N. H. et al. The amphioxus genome and the evolution of the chordate karyotype. Nature 453, 1064–1071 (2008).


11. Raible, F. et al. Vertebrate-type intron-rich genes in the marine annelid Platynereis dumerilii. Science 310, 1325–1326 (2005).
12. Purschke, G. Sense organs in polychaetes (Annelida). Dev. Hydrobiology 179, 53–78 (2005).
13. Zamanian, M. et al. The repertoire of G protein-coupled receptors in the human parasite Schistosoma mansoni and the model organism Schmidtea mediterranea. BMC Genomics 12, 596 (2011).
14. Srivastava, M. et al. The Amphimedon queenslandica genome and the evolution of animal complexity. Nature 466, 720–726 (2010).
15. Putnam, N. H. et al. Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science 317, 86–94 (2007).
16. Telford, M. J. & Copley, R. R. Improving animal phylogenies with genomic data. Trends Genet. 27, 186–195 (2011).
17. Delsuc, F., Brinkmann, H. & Philippe, H. Phylogenomics and the reconstruction of the tree of life. Nature Rev. Genet. 6, 361–375 (2005).
18. Philippe, H. et al. Phylogenomics revives traditional views on deep animal relationships. Curr. Biol. 19, 706–712 (2009).
19. Adoutte, A. et al. The new animal phylogeny: reliability and implications. Proc. Natl Acad. Sci. USA 97, 4453–4456 (2000).
20. Roy, S. W. & Gilbert, W. Resolution of a deep animal divergence by the patterns of intron conservation. Proc. Natl Acad. Sci. USA 102, 4403–4408 (2005).
21. Roy, S. W. & Irimia, M. Rare genomic characters do not support Coelomata: intron loss/gain. Mol. Biol. Evol. 25, 620–625 (2008).
22. Rokas, A., King, N., Finnerty, J. & Carroll, S. B. Conflicting phylogenetic signals at the base of the metazoan tree. Evol. Dev. 5, 346–359 (2003).
23. Srivastava, M. et al. The Trichoplax genome and the nature of placozoans. Nature 454, 955–960 (2008).
24. Duboule, D. The rise and fall of Hox gene clusters. Development 134, 2549–2560 (2007).
25. Frobius, A. C. & Seaver, E. C. Capitella sp. I homeobrain-like, the first lophotrochozoan member of a novel paired-like homeobox gene family. Gene Expr. Patterns 6, 985–991 (2006).
26. de Rosa, R. et al. Hox genes in brachiopods and priapulids and protostome evolution. Nature 399, 772–776 (1999).
27. Shimeld, S. M., Boyle, M. J., Brunet, T., Luke, G. N. & Seaver, E. C. Clustered Fox genes in lophotrochozoans and the evolution of the bilaterian Fox gene cluster. Dev. Biol. 340, 234–248 (2010).
28. Cho, S. J., Valles, Y., Giani, V. C. Jr, Seaver, E. C. & Weisblat, D. A. Evolutionary dynamics of the wnt gene family: a lophotrochozoan perspective. Mol. Biol. Evol. 27, 1645–1658 (2010).

Supplementary Information is available in the online version of the paper.

Acknowledgements Work conducted by the US Department of Energy Joint Genome Institute is supported by the Office of Science of the US Department of Energy under contract no. DE-AC02-05CH11231. Early work on the analysis of these genomes was also supported by the Gordon and Betty Moore Foundation through a grant to the Center for Integrative Genomics at the University of California. D.S.R. thanks R. Melmon for a grant in support of this project. N.H.P. is supported by the National Science Foundation (NSF) under grant EF-0850294; D.A.W. is supported by the National Institutes of Health (NIH; RO1 GM 074619) and NSF (I0S-0922792); and O.S. is supported by Boehringer Ingelheim Fonds. E.E.-G. was supported by a training grant from the US National Human Genome Research Institute. We thank A. Meyer and J. Gerhart for helpful discussions.

Author Contributions This study was conceived by J.L.B., D.R.L., D.A.W., E.C.S. and D.S.R. The project was led by D.S.R. with N.H.P. and O.S. The three genomes were assembled by H.S. and annotated by A.Y.T., R.P.O., A.A. and I.V.G. Comparative analyses of gene complements were carried out by N.H.P., P.H. and O.S. Phylogenomic analyses were carried out by F.M. and O.S. Macro- and microsynteny were analysed by N.H.P., O.S. and J.L. GPCR complements were analysed by T.L., O.S. and D.A. Innexin analysis was carried out by O.S., D.A.W. and D.-H.K. Hox gene complements were analysed by D.A.W., D.-H.K., S.-J.C. and E.C.S. Additional analyses, data and materials were contributed by U.H., J.A.C., J.G., P.d.J., K.O., R.S. and E.E.-G. The main paper was written by O.S., F.M., N.H.P., D.A.W., E.C.S. and D.S.R. with input from other authors. O.S. wrote the Supplementary Information with input from other authors.

Author Information The Lottia gigantea whole-genome shotgun project has been deposited in GenBank under the accession AMQO00000000. The sequence version described in this paper is deposited under accession AMQO01000000 (PRJNA175706). The Helobdella robusta whole-genome shotgun project has been deposited in GenBank under the accession AMQM00000000. The sequence version described in this paper is the first version, AMQM01000000 (PRJNA175704). The Capitella teleta whole-genome shotgun project has been deposited in GenBank under the accession AMQN00000000. The sequence version described in this paper is the first version, AMQN01000000 (PRJNA175705). Reprints and permissions information is available at www.nature.com/reprints. The authors declare no competing financial interests. Readers are welcome to comment on the online version of the paper. Correspondence and requests for materials should be addressed to N.H.P. ([email protected]) or D.S.R. ([email protected]).

This work is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported licence. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-sa/3.0
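
Illustrative sketch (microsynteny). The microsynteny rule in the Methods Summary (merge sets of orthologous genes lying within 10 genes of each other; keep blocks with at least three orthologues per species) can be illustrated with a minimal Python sketch. This is not the pipeline used for the analyses reported here, which is described in Supplementary Note 7. The names OrthoPair, linked and microsyntenic_blocks are hypothetical, genes are located by their rank along a scaffold rather than by base-pair coordinate, and "within 10 genes" is assumed to mean no more than ten intervening genes in both species.

```python
from collections import defaultdict
from typing import NamedTuple


class OrthoPair(NamedTuple):
    """One orthologous gene pair between species A and B (hypothetical type).

    index_a and index_b are gene ranks along the scaffold, not base pairs.
    """
    scaffold_a: str
    index_a: int
    scaffold_b: str
    index_b: int


def linked(p, q, max_intervening=10):
    """True if p and q share scaffolds in both species and are separated by
    at most max_intervening genes in each species (an assumed reading of
    the 10-gene rule)."""
    if p.scaffold_a != q.scaffold_a or p.scaffold_b != q.scaffold_b:
        return False
    return (abs(p.index_a - q.index_a) - 1 <= max_intervening
            and abs(p.index_b - q.index_b) - 1 <= max_intervening)


def microsyntenic_blocks(pairs, max_intervening=10, min_orthologues=3):
    """Single-linkage merge of linked pairs, then filter small blocks."""
    parent = list(range(len(pairs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Quadratic pairwise comparison: fine for a sketch, slow for genomes.
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            if linked(pairs[i], pairs[j], max_intervening):
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, pair in enumerate(pairs):
        groups[find(i)].append(pair)

    blocks = []
    for members in groups.values():
        genes_a = {p.index_a for p in members}
        genes_b = {p.index_b for p in members}
        # Require at least min_orthologues distinct genes in each species.
        if len(genes_a) >= min_orthologues and len(genes_b) >= min_orthologues:
            blocks.append(sorted(members))
    return sorted(blocks)


if __name__ == "__main__":
    example = [
        OrthoPair("s1", 0, "t1", 5),
        OrthoPair("s1", 3, "t1", 2),
        OrthoPair("s1", 8, "t1", 9),
        OrthoPair("s1", 40, "t1", 50),  # too far from the first three
        OrthoPair("s1", 42, "t1", 51),  # forms only a two-gene set
    ]
    for block in microsyntenic_blocks(example):
        print(block)
```

Single-linkage merging is used because the text describes sets being merged whenever a further set falls within the distance threshold; gene order and orientation within a block are not constrained in this sketch, and other choices would be equally defensible.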

24 JANUARY 2013 | VOL 493 | NATURE | 531 ©2013 Macmillan Publishers Limited. All rights reserved