
Supplementary Information

The draft genomes of soft-shell turtle and green sea turtle yield insights into the development and evolution of the turtle-specific body plan

Zhuo Wang1*, Juan Pascual-Anaya2*, Amonida Zadissa3*, Wenqi Li4*, Yoshihito Niimura5, Zhiyong Huang1, Chunyi Li4, Simon White3, Zhiqiang Xiong1, Dongming Fang1, Bo Wang1, Yao Ming1, Yan Chen1, Yuan Zheng1, Shigehiro Kuraku2, Miguel Pignatelli6, Javier Herrero6, Kathryn Beal6, Masafumi Nozawa7, Juan Wang1, Hongyan Zhang4, Lili Yu1, Shuji Shigenobu7, Junyi Wang1, Jiannan Liu4, Paul Flicek6, Steve Searle3, Jun Wang1,8,9, Shigeru Kuratani2, Ye Yin4†, Bronwen Aken3†, Guojie Zhang1,10,11†, Naoki Irie2†

*: Equally contributed co-first authors. †: To whom correspondence and requests for materials should be addressed.

1 BGI-Shenzhen: Beishan Industrial Zone, Yantian District, Shenzhen 518083, China
2 RIKEN Center for Developmental Biology, 2-2-3 Minatojima-minami, Chuo-ku, Kobe, Hyogo 650-0047, Japan
3 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
4 BGI-Japan, Kobe KIMEC Center BLDG. 8F, 1-5-2 Minatojima-minamicho, Chuo-ku, Kobe City, Hyogo 650-0047, Japan
5 Medical Research Institute, Tokyo Medical and Dental University, 1-5-45 Yushima, Bunkyo-ku, Tokyo 113-8510, Japan
6 European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, United Kingdom
7 NIBB Core Research Facilities, National Institute for Basic Biology, Nishigonaka 38, Myodaiji, Okazaki 444-8585 Aichi, Japan
8 Department of Biology, University of Copenhagen, DK-1165 Copenhagen, Denmark
9 King Abdulaziz University, Jeddah 21589, Saudi Arabia
10 China National GeneBank, BGI-Shenzhen, 518083, China
11 Centre for Social Evolution, Department of Biology, University of Copenhagen, DK-2200, Copenhagen, Denmark

Nature Genetics: doi:10.1038/ng.2615

Contents

Supplementary Note
Genomic DNA sequencing and assembly
Prediction of protein sequences for phylogenetic analysis
Phylogenetic analysis
Divergence time estimation
Gene family contraction and expansion in turtle lineages
Extensive expansion of olfactory receptors (ORs)
Turtle-specific gene loss and pseudogenization
Genes with accelerated evolutionary rate
Hourglass-like gene expression divergence during embryogenesis and un-shifted phylotype
Developmental timetables by expression profile
Genes that characterize the stages after the phylotypic period
Molecular development of the carapacial ridge
Genomic DNA extraction
Library construction and sequencing
K-mer estimation of genome size
Raw read filtering
Error correction of short libraries before assembly
Genome assembly
Repeat annotation
Assessment of genome assembly coverage
Sanger-based quality check of the soft-shell turtle genome
Whole genome alignment
Dataset used for gene predictions
Gene prediction pipeline for two turtles
Additional gene prediction for soft-shell turtle by the Ensembl-prediction-pipeline
[...] analysis
Turtle ultra conserved non-coding elements (TUCNE)
Turtle-specific genes
Phylogenetic tree reconstruction
Divergence time estimation
Gene loss analysis
Statistical analysis of expansion and contractions (E/C genes)
Pseudogene / frame-shifted gene detection
Predictions of Olfactory Receptor (OR) Genes
Genes of accelerated evolutionary rate in the turtle lineage
Embryo sampling and mRNA extraction
RNA-Seq for transcriptome identification
De novo transcriptome assembly
RNA-Seq for gene expression analysis
Obtaining gene expression scores from RNA-Seq data
Comparison of gene expression profiles
Wnt gene identification, cloning and whole-mount in situ hybridization
microRNA extraction, prediction, and expression analysis
microRNA target predictions
Statistical tests
Software and computation environment

Supplementary Figures and Tables

Supplementary Note

The turtle’s anatomical features (Supplementary Figure 1), which are atypical for an amniote, have been the subject of an extensive debate regarding their evolutionary origin. To clarify the phylogenetic position of turtles with a genome-scale data set and to further investigate the origin of their unique body plan from an evolutionary developmental perspective, we determined the genome sequences of a soft-shell turtle (P. sinensis) and a green sea turtle (C. mydas) and performed further analyses as described in this supplementary text.

Genomic DNA sequencing and assembly

We utilized massively parallel short read sequencing with 8 types of short- to long-insert libraries (Supplementary Table 1) to determine the genomic sequences of a soft-shell turtle and a green sea turtle; each of these libraries was constructed from a single female individual. As shown in Supplementary Table 2, our assembly strategy allowed us to obtain an N50 scaffold size longer than 3 megabases.

To assess the quality of our assembly, we performed three different evaluations as described below. First, we checked the GC content of the sequenced DNA. The mode GC contents for soft-shell turtle and green sea turtle (44% and 43%, respectively) were comparable to those of [...], [...], and anole lizard (Supplementary Figure 2A), and a more detailed GC content distribution analysis (Supplementary Figure 2B-C) revealed a low average depth island that was found only in soft-shell turtle, which may reflect the ZW karyotype of the female sample (female soft-shell turtles are known to have the ZW karyotype73, while green sea turtles have 56 diploid chromosomes in both sexes74). We then evaluated the coverage of two turtle assemblies against 248 core eukaryotic genes (CEGs) and confirmed that both of the assemblies covered more than 70% of the complete sequences of core eukaryotic genes and more than 90% including partial coverage (Supplementary Table 3).

Finally, for the soft-shell turtle genome sequence, we compared the scaffolds to independently constructed and sequenced (by Sanger-based sequencing) fosmid clones, and we found that the weighted average coverage rate for five fosmid clones was greater than 95% (96.7%, see Supplementary Table 4). In addition, the assembled genome sequences from soft-shell turtle were assessed to evaluate the coverage rate of transcribed regions (Supplementary Table 5). BLAT alignment of the Expressed Sequence Tag (EST) clusters against the assembled genome (with more than 50% sequence length alignment) showed that 98.6% of EST clusters generated by Illumina short reads were covered by the assembly. Similarly, 98.1% of the 79,305 EST clusters generated by 454 sequencing were covered by the assembly, suggesting high coverage of gene regions by the assembled genome.

Although the overall characteristics of the turtle genomes were comparable to other vertebrates (Supplementary Figure 4-5), the repetitive elements in the turtle genomes were higher in frequency than in other sauropsid genomes (Supplementary Table 6).

The Whole Genome Shotgun project data for soft-shell turtle and green sea turtle have been deposited in DDBJ / EMBL / GenBank under the accession numbers AGCU00000000 and AJIM00000000, respectively. The genome versions described in this paper are the first versions for these genomes.

Prediction of protein sequences for phylogenetic analysis

While recent studies have shed light on the transcription and regulation of non-coding genomic elements, protein-coding genes have provided the measure of evolutionary relationships with higher fidelity because their cross-species relationships, namely their orthologies, are more reliably inferable within the established methodological framework75. We therefore performed gene predictions for soft-shell turtle (Ps-BGI_gene), green sea turtle (Cm_gene), saltwater crocodile, and American alligator (Supplementary Table 7) to determine the conserved gene sets of protein coding genes. The lengths of predicted genes, coding sequences, exons, introns, and exon numbers of the genes in two species are comparable to other vertebrate species (Supplementary Figure 4). Additionally, the evolutionarily conserved profiles of genes in two turtle species were found to be comparable when compared to the existing genome-determined species (Supplementary Figure 5). A Gene Ontology (GO) analysis was also performed for soft-shell turtle and green sea turtle. Our analysis indicates that the predicted gene sets cover more than 89% of conserved eukaryotic genes (Supplementary Table 8). The basic statistics of the two genomes are summarized in Supplementary Table 9.

Phylogenetic analysis

Despite efforts using both morphological and molecular approaches, the origin of turtles remains controversial, and researchers debate between three major hypotheses1,76 (Supplementary Figure 6): [I] turtles are the sister group to the lizard-snake-tuatara (Lepidosauria) clade9,10, [II] turtles are the sister group to birds and crocodilians (the Archosauria)77,78, or [III] turtles are basal to the Diapsida (a clade composed of the Archosauria and the Lepidosauria79). The main inconsistency among molecular-based analyses, including a recent study using micro RNA (miRNAs)9, is presumably due to a lack of comprehensive data (e.g., without a genome sequence, we cannot distinguish whether an miRNA is absent from the genome or is present yet expressed under the detection level of the sequencing protocol). We therefore performed phylogenetic analyses using the gene sets from two turtles (green sea turtle and soft-shell turtle) that were sequenced in this project together with 10 vertebrates with sequenced genomes
(chicken, [...], saltwater crocodile, American alligator, anole lizard, dog, human, platypus, and medaka).

Using concatenated sequences constructed from 1,113 single-copy orthologues (Supplementary Table 10) with a variety of sequence sets (including amino acid sequences, whole coding sequences, 1st, 2nd, 1st & 2nd, and 3rd codon positions), we have performed phylogenetic reconstruction analyses (Supplementary Figure 6). To increase the robustness of our data, phylogenetic trees were made using two different programs, RAxML80 and PhyML47,48 (Supplementary Table 10). The sequence length and log-likelihood value for each set of sequences are shown in Supplementary Table 10.

Based on our analysis using various datasets, turtles were clearly grouped with the Archosauria, while the Lepidosauria served as a sister group. The significance of this result was supported by statistical tests, in which alternative tree topologies were explicitly rejected (Supplementary Table 11). The tree topology inferred with the third codon position was different, as turtles were grouped together with crocodiles and formed a sister group to birds. This could be largely due to the faster evolutionary rate at the third codon position (Supplementary Figure 6) and the saturation effect. We therefore decided not to rely on the data from the third codon position in the phylogenetic tree inferences, as commonly practiced81,82.

Divergence time estimation

Based on the phylogenetic tree and data obtained from the 1st and 2nd codon positions, we next focused on estimating the divergence time of each species using the Bayesian MCMC method in PAML together with several fossil records that can be used as calibrating time points (Supplementary Table 12-13). The divergence between soft-shell turtles and green sea turtles was estimated to have occurred approximately 114.5 million years ago (Mya); using the 95% credibility interval, this divergence could have occurred approximately 216.4 - 38.2 Mya. The crocodilian/bird split and the turtle/(crocodilian/bird) split were estimated to have occurred 241.2 Mya and 257.4 Mya, respectively (Supplementary Figure 7). We also used several other methods and data sets to estimate divergence times, which are listed in Supplementary Table 12. Note that the estimated time ranges for turtles/birds did not differ to a large extent regardless of the method or data set used.

Gene family contraction and expansion in turtle lineages

After obtaining a statistically reliable phylogenetic tree for turtle evolution and predicted gene sets, we explored genes that underwent a large expansion in gene family size during turtle evolution. We found several turtle lineage-specific expansions of gene families by analyzing 12 vertebrate genomes using the CAFÉ program83 (Supplementary Table 14).

Olfactory receptor (OR) genes are a typical example of genes that have evolved in vertebrates following a birth-and-death model. Indeed, we find many changes in the size of the OR gene families, notably a very large expansion of the OR52 gene family in the soft-shell turtle genome. Other OR families like OR10 are also expanded while a few subfamilies are contracted. A more detailed study on the expansion of olfactory receptor genes is provided in a subsequent section.

Several gene families involved in the immune system and more specifically in the innate immune response are also expanded in the turtle lineage. We also observe several family expansions of zinc-finger proteins.

Turtle lineage-specific gene family contractions were also evaluated by the CAFÉ program and included OR protein subfamilies and some zinc-finger families.

Supplementary_Table 15.xls:
Nucleotide sequences of genes that are predicted to be expanded in the turtle lineage.

Supplementary_Table 16.xls:
Table of gene family IDs and genes that are predicted to be contracted / expanded in the turtle lineage.

Examination of the gene repertoire in selected gene families

It is of great interest whether gene families other than the previously mentioned families exhibit a standard pattern in their gene repertoires or any other unexpected feature. For example, lineage-specific gene duplications observed across gene families are recognized as signs of whole genome duplication unique to that lineage84. To obtain a snapshot of the turtle protein-coding gene repertoire, we focused on a few select gene families that are known to exhibit a relatively high level of variation in gene retention among major vertebrate lineages.

First, we focused on the prolactin-releasing hormone receptor (PRLHR) family, in which large differences in gene repertoires were previously documented between vertebrate lineages85. In the soft-shell turtle genome, orthologues of all four PRLHR subtype genes (PRLHR1-4) have been retained (Supplementary Figure 8), as was reported for the anole lizard85. This suggests that PRLHR1, which is present in the anole lizard and soft-shell turtle genomes but absent in the chicken genome, was most likely lost somewhere in the archosaurian lineage after its separation from the turtle lineage. Functional characterization of the identified turtle genes and their relatives should highlight the biological significance of these changes in the gene repertoire.

Second, we made an attempt to identify soft-shell turtle orthologues of selected genes that were absent in chicken, namely Pax486. As a result, no putative orthologue was identified for any of these genes. Their absence marks a relatively ancient characteristic of the gene repertoire that was possibly established, at the latest, at the radiation of extant sauropsids and would not account for phenotypic differences between turtles and other sauropsid lineages.

Third, we found eye globin (also called globin E or GbE), which has been reported only in birds (including chicken)87 and is considered to be involved in supplying oxygen to the thick avian retina88. Our search in the soft-shell turtle genome detected a possible GbE orthologue (Ensembl Gene ID ENSPSIG00000002766), and we further confirmed the orthology with the chicken GbE by reconstructing a molecular phylogenetic tree (Supplementary Figure 8). Remarkably, among vertebrates, turtles are the group with the highest rate of gene retention in the focused globin gene family, which delineates the turtle-specific gene repertoire that is different from anole lizards and birds. Our study on turtle genomes identified the first GbE sequences outside the avians. Although its exact role remains to be clarified87, the characterization of turtle and bird GbE functionality may account for the defining
phenotypic characteristics of the turtle-archosaurian clade, and the extraordinarily thick plexiform layer of the P. scripta retina89 deserves further study. If crocodilian genomes have retained the GbE orthologue, this would also be of great interest. This result, which was initially obtained based on the soft-shell turtle resource, was later confirmed with the green sea turtle genome sequence information. Overall, we did not observe any sign of whole genome duplication(s) that were unique to the turtle lineage.

Extensive expansion of olfactory receptors (ORs)

Diverse odor molecules in the environment are detected by olfactory receptors (ORs) expressed in the olfactory epithelium of the nasal cavity. Typically, mammalian genomes harbor ~1,000 OR genes, which form the largest multi-gene family in vertebrates46,90. From the genome sequences of Chinese soft-shell turtle and green sea turtle, we have identified 1,137 and 254 intact (potentially functional) OR genes, respectively. The number for the soft-shell turtle is the largest among the non-mammalian vertebrates examined so far91. We also found hundreds of OR pseudogenes and OR gene fragments in both the soft-shell turtle and green sea turtle genomes. The amino acid and nucleotide sequences of intact OR genes identified in this study are available as Supplementary_Table_17.xls and Supplementary_Table 18.xls, respectively, through the online version of the paper at http://www.nature.com/ng/

Supplementary_Table 17.xls:
Amino acid sequences of intact OR genes identified from genome sequences. “Pesi” and “Chmy” represent soft-shell turtle and green sea turtle genes, respectively. Each gene name contains a scaffold number, the initial and the terminal positions of the gene, and a transcriptional direction.

Supplementary_Table 18.xls:
Nucleotide sequences of intact OR genes identified from genome sequences. Gene names are the same as those in Supplementary_Table 17.xls

The OR genes of Osteichthyes (teleost fishes and tetrapods) are classified into seven groups (α–η), each of which corresponds to at least one ancestral gene in the last common ancestor of osteichthyans91,92. It was reported that genes from groups α and γ are present only in tetrapods (mammals, birds, reptiles, and amphibians), while genes from groups δ, ε, ζ, and η are present exclusively in amphibians and bony fishes (Supplementary Table 19). Genes from group β are present in both tetrapods and bony fishes. Mammalian OR genes are usually classified into Class I and Class II genes. The former corresponds to groups α and β, while the latter corresponds to group γ91,92.

We identified OR genes belonging to groups α, β, and γ in the two turtle genomes. This observation is consistent with the distribution of OR groups in amniotes that have been previously examined. However, interestingly, we found that group α (Class I) OR genes are largely expanded in the turtle lineage, which is unique among amniotes. Group α genes generally represent 10-20% of OR genes in mammals, and this proportion is much smaller in birds and lizards (Supplementary Table 20). However, group α genes represent >45% of OR genes in the two turtle species.

Genome-scale analyses demonstrated that a drastic expansion of group α genes has occurred in the turtle lineage. There are three major turtle-specific clades, each of which contains >100 soft-shell turtle OR genes (Figure 2). An estimation of the numbers of ancestral genes suggested that the soft-shell turtle, for example, acquired more than 500 functional group α OR genes after its separation from the green sea turtle (Supplementary Table 20, Figure 2).

Several studies have demonstrated that aquatic mammals (cetaceans, sirenians, and pinnipedians) have a greater proportion of OR pseudogenes than terrestrial mammals13,14,93. Moreover, Kishida and Hikida94 reported that the fraction of OR pseudogenes in fully aquatic viviparous sea snakes is significantly higher compared with oviparous sea snakes, which depend on a terrestrial environment for laying eggs. Contrary to these observations, our analyses showed that the OR gene repertoires of the two turtle species (soft-shell turtle and green sea turtle) have expanded despite their adaptation to aquatic life.

In fact, several lines of evidence indicate that aquatic turtles have good olfactory abilities in general. Endres et al.49 reported that sea turtles can detect airborne odorants as well as water-soluble odorants and proposed that this ability may play a role in navigation and/or foraging under natural conditions. The dynamic expansion of group α (Class I) OR genes in turtles implies the importance of these groups of genes for the turtles’ living environment. It was suggested that ligands of Class I ORs tend to be hydrophilic while ligands of Class II ORs tend to be hydrophobic95. Therefore, the expansion of Class I genes may reflect the turtles’ reliance on an aquatic environment.

We next examined the distribution of OR genes in genomes. In general, the distribution of OR genes in mammalian genomes has the following features90,96: (i) OR genes form genomic clusters that are scattered on many chromosomes, and (ii) Class I and Class II OR genes are located in distinct genomic clusters and do not coexist within a single cluster. Our analysis clarified that these two features are also characteristic of both soft-shell turtle and green sea turtle OR gene families. As shown in Figure 2c-d, the OR genes of soft-shell turtle are arrayed in tandem in a fairly regular pattern in a contig. The largest Class I and Class II OR gene clusters were found in scaffold 55 and scaffold 145 in the soft-shell turtle genome, which contain 53 and 41 intact Class I and Class II OR genes, respectively. These contigs do not include any other genes. We did not find any cases in which intact Class I and Class II genes are present together within one contig.

Turtle-specific gene loss and pseudogenization

To clarify the functional aspect of genes lost in the turtle lineage, we performed a more detailed gene loss analysis, together with enriched GO detection using GOstat97. The genes lost in turtles (GLT) were defined as genes that cannot be found in either of the two turtle genomes but can be found in both archosaurians (either in American alligator, saltwater crocodile, chicken, or zebra finch) and mammals (either in human, dog, or platypus). Consistent with the family expansion/contraction analysis, the ontology “olfactory receptor activity” was found to have a high number of gene losses (Supplementary Table 21). In relation to the evolution of body plan,
the loss of genes having the GO assignment of “multicellular organismal process” was also statistically significant. A similar observation was further implied by the protein family-level analysis (Supplementary Table 22), as Kruppel-associated box, GPCR rhodopsin-like, and 7TM were found to be over-represented in the turtle-lost protein families. One of the surprising findings was that we found loss of the hunger-stimulating hormone ghrelin98 in two turtles (Supplementary Table 23). The loss was also confirmed by manual investigation including BLAST search against two turtle genomes (only partial-hit was observed in both genomes, and no EST cluster was found in the soft-shell turtle genome) and de novo assembled soft-shell turtle ESTs (partial hit corresponding to the BLAST against the genome). Furthermore, we also investigated genes that presumably underwent pseudogenization in turtles and found two genes that were predicted to have undergone complete pseudogenisation (see also Methods). Interestingly, two of these three genes, Forkhead box M1 (FoxM1) and serine/arginine-rich splicing factor 1, were related to developmental processes. In particular, the FoxM1 gene is known to be an indispensable gene: the knock-out phenotype is embryonic lethal, and the gene is also known to be involved in the development of liver, heart, lung and blood vessels in mice99. Based on these results, however, no definitive conclusion can be made for the turtle body plan evolution, and the effects of the loss of these genes deserve further investigation.

Supplementary_Table 23.xls:
List of genes that were predicted to be lost in two turtles, but exist in any of chicken, zebra finch, anole lizard, or X. tropicalis.

Genes with accelerated evolutionary rate

To identify genes that have experienced positive selection and accelerated evolutionary rate in the turtle lineage, we compared the soft-shell turtle genome and green sea turtle genome and calculated the dN/dS ratio of the coding sequences (Supplementary Table 24). Among the coding sequences, several genes, MGST3, ABCB1, FAH, RFC4, HEATR2, APOBEC2, SCYL3, PDC, and METTL15, were found to have a dN/dS ratio higher than 1, which implies positive selection after the split of these two turtle lineages. Of note is the gene with the highest evolutionary rate, microsomal glutathione S-transferase 3 (MGST3), which is reported to be involved in protecting cells from oxidative stress100. Interestingly, disruption of the homolog MGST-like reduces life-span in Drosophila15. The relationship between accelerated MGST3 evolution and longevity deserves further investigation.

Hourglass-like gene expression divergence during embryogenesis and un-shifted phylotype

In general, all animals build their complex body from a single fertilized egg. The process begins with establishing basic polarity information (such as information regarding body axes) followed by the further addition of polarity and topological information. The important role of the earliest (or upstream) developmental processes fits well with the hypothesis inspired by von Baer25 and Haeckel101, or the idea that earlier embryonic processes are typically better conserved in evolution (funnel-like model)102. Meanwhile, observations that vertebrate cleavage and gastrulation patterns are rather divergent among species led to the formulation of the “developmental hourglass model” hypothesis16,17. This model predicts divergent early stages and more highly conserved stages around the so-called “phylotypic period”19,21,22,103,104, which is considered to be the source of the basic vertebrate body plan. Recent molecular studies18-21 supported developmental hourglass-like divergence in vertebrates18,20,21 and Drosophila19; however, none of the analyses was performed in a non-model organism.

To examine whether turtles, which have a rather atypical anatomy for amniotes, also follow hourglass-like16,17 divergence during embryogenesis, we compared the whole embryonic gene expression profiles of soft-shell turtle embryos against chicken embryos using 11602 orthologous genes with non-biased slim GO profiles (Supplementary Figure 9-10). We first explored the number of genes expressed during each developmental stage. Although the predicted gene number for soft-shell turtle (18175) differs largely from that of chicken (16736), the number of genes expressed (with at least one tag mapped to their coding region) was comparable between the two species (Supplementary Figure 11). Interestingly, the overall number of genes detected during embryogenesis was also comparable among the various embryonic stages (see also Supplementary Figure 13).

We next compared the similarity of expression profiles between turtle and chicken embryos with 11602 one-to-one orthologues to evaluate the conserved nature of the embryonic stages. Consistent with the hourglass model, the mid-embryonic stages exhibited a higher expression similarity compared to the earliest and the latest stages of the sampled embryos (Supplementary Figure 14). This result could not be explained by the moderate and monotone increase of the numbers of genes expressed during embryogenesis (Supplementary Figure 13). While the above results suggest an hourglass-like divergence between turtle and chicken embryos, we cannot exclude the possibility that these results were biased by the pairs of turtle-chicken embryos we selected. We therefore performed more robust all-to-all comparisons to corroborate the hourglass-like divergence. As shown in Supplementary Figure 14, all-to-all comparison analysis supported the highest conservation at mid-embryonic stages, and importantly, this result did not change regardless of which distance calculation method or normalization method was used.

The essence of the hourglass model resides in the waist region. This region represents the conserved phylotypic period, which is believed to illustrate the basic body plan of vertebrates16,17. However, as indicated by the observations of von Baer25 and Haeckel101, relatively closely related species such as turtles and birds remain similar in appearance until late embryogenesis. This observation prompts the expectation that the most conserved period emerges at later developmental stages than the phylotype when comparisons are made between relatively small clades (inner or red part of the nested hourglasses model in Fig. 3a), whereas the vertebrate phylotype still emerges when comparisons are made among far-related vertebrate species (outer or blue part of the nested hourglasses model in Fig. 3a). Thus, the hierarchical relationship between phylogeny and ontogeny still becomes valid (as von Baer once proposed) after the phylotypic period. This idea, or the nested hourglass model18, can be tested by identifying the most conserved developmental stages between turtles
and chicken and assessing whether it corresponds to the previously identified vertebrate phylotypic stage in chicken8, namely stage HH16. As shown in Supplementary Table 25, our statistical analysis, which is based on the hierarchical Bayes method, robustly demonstrated that TK11 of soft-shell turtle and HH16 of chicken show the maximum expression similarity among the stages we have analyzed. All combinations of normalization and similarity calculation methods we have utilized supported the same result. The result suggests that these stages have a highly conserved nature in terms of gene regulation.

Notably, chicken HH16 is the stage at which the phylotypic period was observed in our previous study8, which compared the embryonic gene expression profiles of mouse, zebrafish, and Xenopus. The identified turtle stage appears similar in external appearance to the chicken phylotypic period and also shows a similar repertoire of organ primordia shared among the phylotypic periods of four other vertebrates (Supplementary Table 26). In accordance with this, the identified turtle/chicken phylotypic stages exhibited the shared expression of developmental toolkit genes105 or genes known to be involved in developmental process in various animals (Fig. 4a, Supplementary Table 27). Furthermore, the group of genes that was associated with the GO assignment of “multicellular organismal development” showed a significantly similar expression level between the embryos (Supplementary Table 28). This result implied that the conserved nature of the phylotypic stage is more encoded by regulatory sequences and less by the primary sequences of coding genes. Considering the different sizes of their eggs (around 1.5 cm for turtle eggs, and 5 cm for chicken eggs) and the actual time required for development24,61, a similarity in both gene expression and morphology is surprising. Taken together, the maximal similarities of the expressed gene repertoire between the turtle TK11 embryo and chicken phylotypic period, together with the shared anatomical features, demonstrate that even animals that have an atypical body plan (e.g., turtles) follow the hourglass model with a conserved phylotypic period.

Additionally, a temporal shift to later stages does not seem to occur within relatively small phylogenetic clades, suggesting the highly conserved nature of these phylotypic stages in various vertebrate clades (Supplementary Figure 15). In accordance with this, gene families that possibly experienced expansion or contraction (see Supplementary Table 16) showed lower expression levels, particularly during the mid-embryonic stages (Supplementary Figure 16).

Developmental timetables by expression profile

We next explored whether the current understanding of corresponding developmental timetables24,106 can be reproduced by gene expression similarities. An estimation of corresponding developmental timetables (CDT)* between different species is especially important in the search for developmental changes that [...] we have investigated a revised molecular approach using microarray-generated expression profiles that contain all expressed genes as reported in our previous study21. Despite the fact that the actual time needed for soft-shell turtle and chicken development differs by almost a factor of two, the molecular analysis corresponded reasonably well with the current understanding24,106 (Supplementary Figure 17). This type of molecular approach could be a new strategy for the adjustment of developmental timetables between different species, which in turn would be helpful for estimating the embryogenesis of common ancestors that are now extinct. Finally, the above results and conclusion did not change depending on which soft-shell turtle gene model (Ps-BGI_genes and Ps-ens_genes) was used.

*Corresponding developmental timetables (CDT): Because there are no equal developmental stages between different species due to evolutionary changes (e.g., heterochronic shifts), “corresponding” indicates the pair of stages that are expected to have diverged from the same stage of the last common ancestor with minimum changes in developmental events.

Genes that characterize the stages after the phylotypic period

The result that turtle-chicken embryogenesis also follows the developmental hourglass model implies that the gene regulation that characterizes the turtle morphology also becomes evident after the phylotypic period. We, therefore, searched for genes that potentially explain the turtle-specific characteristics that appear after the phylotypic period, such as the sequence of ossification events, including the shell108. By searching for turtle genes that become more highly expressed after the phylotypic period (excluding orthologous genes that also show increasing expression in chicken embryogenesis), we found 233 genes that increased after the phylotypic period (turtle IAP) (Supplementary Figure 18). As expected from the well-ossified and collagen-rich anatomy of turtles, these IAP genes showed enriched GO assignments related to ossification and extracellular matrices (Fig. 4b). We further narrowed down the period into TK13 and TK15 to identify genetic programs that potentially explain turtle-specific morphogenesis, such as CR formation, the axial arrest of the ribs, and the folding of the body wall that occurs during this period23. Based on the phylotypic period and the CDT predicted above, the expression profile of the TK13-15 period was compared to HH19-28, and we found that the genes highly expressed in these turtle stages, but not in the corresponding chicken stages, have enriched GOs of “collagen fibril organization” and “positive regulation of chondrocyte differentiation” (data not shown). However, the results do not necessarily indicate that these genes are the key players that cause turtle-specific morphogenesis, and their analysis awaits further investigation.

have occurred during evolution, including heterochrony and Molecular development of the carapacial ridge heterotopy107. However, both morphological and candidate molecular The carapacial ridge (CR) of turtles represents a major approaches (e.g., using Hox expression as a marker) have subjective 1 morphological innovation within vertebrates . However, the biases; the morphological approach is often limited to structures that molecular pathways leading to its development are still not well have clearly distinguishable elements, and the molecular candidate known. In previous work, we have shown that several genes are approach is often confined to genes that have known functions. Thus, 28,109 specifically involved in CR formation . Among them, genes

8

downstream of the Wnt/β-catenin signaling pathway, such as the transcription factor LEF-1, are expressed in both the CR mesenchyme and ectoderm28. Accordingly, β-catenin protein translocation to the nucleus was also shown28. Thus, a Wnt gene must be upstream of this signaling cascade, and this gene was likely co-opted for the innovation of the CR in the turtle lineage. However, no Wnt gene identified so far in turtle has been detected in the CR, and this part of the pathway has been thought to be activated by HGF/c-Met signaling109. Thus, to identify the complete set of Wnt genes, we have performed a screening of the soft-shell turtle genome together with the extensive RNA-seq and have identified a total of 20 members of the Wnt family (Fig. 5), including Wnt10b, which seems to be lacking in the bird genomes (Garriok et al.110 and our observations) and Wnt11b, which has been lost in mammals85. The expression patterns illustrated by whole-mount in situ hybridizations show that out of 20 Wnt genes analyzed, only Wnt5a was expressed in the CR at TK14 (Fig. 5), the stage at which the CR becomes more apparent. Although Wnt5a is typically involved in non-canonical Wnt signaling (i.e., exerting its function independently of β-catenin), this depends on the receptors present, and it has been shown that Wnt5a can also activate the canonical pathway111. Nonetheless, further experiments are needed to clarify whether Wnt5a is responsible for β-catenin translocation in CR cells.

microRNAs (miRNAs) are a type of non-coding RNA, 21-23 nucleotides in length, that regulate genes either by inhibiting translation or directing mRNA cleavage via binding to complementary sequences. These sequences mainly occur in the 3’UTR of mRNAs112. Moreover, miRNAs are thought to have an important role in development113. To investigate the miRNA repertoire associated with CR formation, we have performed a small RNA-seq with various tissues (CR, limbs, and body walls) micro-dissected from soft-shell turtle embryos at stage TK14 (Fig. 5b-c, see also Online Method for library construction and miRNA prediction, Supplementary Figure 19-22). For each tissue, we analyzed the small RNA sequences with more than 62,000X sequencing depth (calculated against the predicted mature sequences). From the mapped reads, miRDeep2 software114 predicted a total of 715 miRNAs expressed in the CR, with 564 in limbs and 868 in body walls (see Supplementary Table 29 and Supplementary Table 30.xls). In total, 1082 unique miRNAs were predicted in soft-shell turtle, and mature sequences of 22% of them were found to match 100% in the green sea turtle genome (Supplementary Figure 20). Among the 1082 unique miRNAs, 212 were found to be specific for CR and were not detected in either the limbs or the body walls (Supplementary Figure 19). For example, miR-187 was found to be one of the most highly expressed (741 reads, ~1.5% of CR-specific reads) miRNAs in the CR (Supplementary Figure 19b-c). This unexpectedly high number of specific miRNAs suggests that miRNAs have an important regulatory function in the CR, which opens the door to a new field of research on the unique morphological novelty of turtles.

Provided that Wnt signaling plays a critical role in turtle development, certain circumstantial evidence should be observable. We first confirmed that none of the Wnt signaling components are lacking in either of the two turtle genomes (Supplementary Figure 21a,b, Supplementary Table 31). In addition, Wnt5a and some of its possible downstream components were found to be the potential targets of miRNAs expressed in the CR, body wall, and limbs (Supplementary Figure 21c, Supplementary Table 32), suggesting that Wnt signaling components are also regulated at the level of protein translation. Of note, Wnt5a is predicted to be controlled in these three tissues by different miRNAs, implying that, although these are transcribed in all three tissues, their translation regulation may be different and may be important for the differential patterning and development of these structures. Moreover, the important downstream components Tcf7 and β-catenin were predicted to be targeted by tissue-specific miRNAs, which may explain why there is Wnt5a transcription in the body wall, but that it lacks β-catenin nuclear translocation, as was reported previously28. We also found 6 non-coding regions in the 10kb-upstream region of Wnt5a that are conserved in the two turtles (Supplementary_Table 33.xls) with 100% identity but not conserved in human, chicken or anole lizard genomes. Taken together, our results suggest that the expression of the Wnt cascade components and their Wnt ligands are differentially controlled by multiple miRNAs in each tissue, implying that the Wnt cascade co-option in the CR might also be accompanied by evolution of miRNA regulation.

Finally, we believe that the turtle genome and RNA sequences presented here are invaluable tools for the investigation of key facets of the development of morphological innovations, such as the CR and its associated aspects1,23.

Supplementary_Table 30.xls: Predictions of miRNAs from P. sinensis TK14 carapacial ridge, limb and body wall.

Supplementary_Table_33.xls: Turtle conserved (with 100% identity) non-coding elements located within 10kb upstream range of each gene. Regions shorter than 30bp and regions with a good alignment with human, chicken or anole lizard are excluded.

All of the sequenced data have been made available and are freely accessible from the online databases. The details are summarized in Supplementary Table 34.

Genomic DNA extraction
For the soft-shell turtle, 3.58 mg of DNA (256 µg/ml) was extracted from 8 ml of whole blood of an anesthetized (diethyl ether) female (anatomically confirmed) purchased from a local farmer in Japan. For the green sea turtle, a total of 4 mg of high-quality DNA was extracted from the whole blood of a female individual provided by the G10K (http://www.genome10k.org/) project (originally collected in Ocean Park, Hong Kong). DNA extraction was performed by an overnight treatment of whole blood with proteinase K, followed by phenol extraction and ethanol extraction using a glass rod. No column-based kits were used for the extraction step to avoid fragmentation of the sample DNA. The purity and integrity (especially regarding length) of the DNA samples were confirmed by a Qubit Fluorometer and agarose gel electrophoresis, respectively.
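
As a quick consistency check on the soft-shell turtle figures given in the extraction paragraph, the short Python sketch below converts the reported mass and concentration into an implied solution volume and a yield per millilitre of blood. It assumes, which the text does not state explicitly, that the 256 µg/ml concentration applies to the whole 3.58 mg preparation; the variable names are illustrative only.

```python
# Reported values for the soft-shell turtle DNA preparation (taken from the text).
total_dna_mg = 3.58              # total DNA extracted, in mg
concentration_ug_per_ml = 256.0  # reported DNA concentration, in ug/ml
blood_volume_ml = 8.0            # volume of whole blood used, in ml

UG_PER_MG = 1000.0

# Assumption: the reported concentration describes the entire 3.58 mg preparation.
solution_volume_ml = total_dna_mg * UG_PER_MG / concentration_ug_per_ml

# DNA recovered per ml of whole blood.
yield_ug_per_ml_blood = total_dna_mg * UG_PER_MG / blood_volume_ml

print(f"Implied DNA solution volume: {solution_volume_ml:.2f} ml")
print(f"Yield per ml of whole blood: {yield_ug_per_ml_blood:.1f} ug")
```

Under that assumption the preparation corresponds to roughly 14 ml of DNA solution, or a little under 0.45 mg of DNA per millilitre of blood.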

Library construction and sequencing
For both the soft-shell turtle and green sea turtle genomes, we first constructed three different short-insert (170 bp, 500 bp, 800 bp) mate-pair libraries from the genomic DNA samples and sequenced them using the Illumina HiSeq 2000 system to obtain data for survey analyses (e.g., genome size, GC content, complexity). After obtaining basic information for genome size and complexity, we further constructed 2 Kb, 5 Kb, 10 Kb, 20 Kb and 40 Kb mate-pair libraries from the same DNA sample and sequenced them for further assembly. For long-insert (>1 Kb) mate-pair libraries, approximately 20-50 µg of genomic DNA was fragmented, biotin labeled, self-ligated to form circularized DNA, merged at the two ends of the DNA fragment, broken into linear DNA fragments again, enriched using biotin/streptavidin, and prepared for sequencing. For both the soft-shell turtle and green sea turtle genomes, we have constructed a total of 18 and 17 libraries, respectively.

K-mer estimation of genome size
Genome size can be estimated by analyzing the occurrence and distribution of K-mers with following formula:
Estimated genome size (bp) = K-mer number / depth
Based on the rate of occurrence of K-mers in each genome, the read depths for soft-shell turtle and green sea turtle were estimated as 38 and 36.1, respectively, leading to genome size estimations of approximately 2.0 Gbp for soft-shell turtle and 2.2 Gbp for green sea turtle.

Raw read filtering
HiSeq raw reads with the following features were regarded as low quality reads and were filtered out (discarded) according to previously published methods32: [1] Reads with more than 10 bp aligned to the adapter sequence (allowing ≤ 3 bp mismatch); [2] Reads with more than 2% N residues; [3] Reads containing polyA structures; [4] Possible PCR duplicates (two paired end reads with completely identical sequences); [5] Small-insert library reads that have 40 bases with quality scores ≤ 7; [6] Large-insert library reads that have more than 30 bases with quality scores ≤ 7; [8] Small-insert reads with more than 10 bp of overlapping reads between read1 and read2 (allowing 10% mismatch).

Error correction of short libraries before assembly
Artificial K-mers generated from sequencing errors normally occur at a low frequency. This indicates that K-mer frequency information can be used for the correction of reads with a low frequency of K-mers. We used K=17 because 4^17 = 17,179,869,184 (> 17 Gbp) is larger than the estimated soft-shell turtle and green sea turtle genome size (2.1 Gbp and 2.2 Gbp, respectively); this is also sensitive enough to identify reads embracing the K-mer.

Based on the K-mer distribution curve, frequencies lower than the turning point (13 for soft-shell turtle and 10 for green sea turtle) were considered to be low frequency K-mers. We then constructed a hash table storing the frequencies of all 17-mers (which occupied 16 G bytes of memory) and tested whether the substitution of any residues with the other three nucleotides could lead to a high frequency of 17-mers. Starting from regions of high K-mer frequency, we extended our correction base by base to both sides of the low-frequency K-mer regions that are considered to have potential erroneous sites. When all 17-mers that covered the altered residue were changed into high frequency ones, we assumed that the erroneous residue was indeed an error and corrected it. If the erroneous sites could not be corrected, we trimmed the low frequency K-mers from the reads. We did not correct for the reads from long-insert libraries, which were only used in scaffolding and can tolerate some sequencing errors. In total, we corrected 0.21% of the bases and trimmed 2.39% of the bases from the filtered reads.

Genome assembly
We assembled the genome with the filtered and corrected (=clean) data described above using SOAPdenovo31,32 (updated version based on SOAPdenovo1.05) software. The assembly was carried out using the following steps:

[1] Contig construction. The reads from short-insert size (less than 1 K) library data were split into K-mers and used to construct a de Bruijn graph. The graph was then simplified to achieve the contigs by removing tips, merging bubbles and solving repeats.

[2] Scaffold construction. All the sequenced reads were re-aligned onto the contig sequences. Scaffolds were then constructed by weighting the rate of consistent and conflicting paired-ends relationships, and only the scaffolds supported by a high weight of paired-end relationships were kept in the assembly.

[3] Gap filling. We retrieved the read pairs that had one end that uniquely mapped to the contig with the other located in the gap region and carried out a local assembly for these collected reads to fill the gaps. The parameters used for the SOAPdenovo assembler are as follows: the options of “K=27, -d, -M 2” were set for the soft-shell turtle and “K=35” was set for the green sea turtle.

Repeat annotation
Repeat detection was performed with the program RepeatMasker33 using the repeat library issued on April 18, 2012, by Genetic Information Research Institute (GIRI, http://www.girinst.org/). The categorization of the repeat types in Supplementary Figure 4 is based on the classification produced by RepeatMasker. Prior to the gene prediction process, the repeat elements were further screened and masked using a combination of homology-based and de novo approaches. For homology-based prediction of repeats, we used the known repeat library in the Repbase database34 (version 2008-08-01, Repbase-16.02) with the software RepeatMasker33 (version 3.2.6) and RepeatProteinMask to identify TEs at the DNA and protein level, respectively. De novo prediction of repeats involved building a de novo repeat library using RepeatModeler35 and subsequently employing RepeatMasker to find repeats in the genome and classify the repeats. We searched for tandem repeats in the genome with the TRF (Tandem Repeats Finder)36 program.

Assessment of genome assembly coverage
The coverage of the assemblies was assessed using the CEGMA115 program (version 2.3)116. Coverage of 248 core eukaryotic genes (CEGs) that are present in a wide range of taxa was assessed to
measure the completeness of genome assembly. CEGMA combines TBLASTN (blast-2.2.25), genewise (wise2.2.3), hmmer (hmmer-3.0), and geneid (geneid v1.4) to find gene models.

Sanger-based quality check of the soft-shell turtle genome
The CopyControl Fosmid Library Production Kit with the pCC1FOS vector (EPICENTRE Biotechnologies) was used to construct the fosmid libraries from the same DNA sample that was used for genome sequencing. After randomly selecting 5 fosmid clones, we sheared the DNA (using Gene Machines) into fragments of approximately 1~3 kb in length and ligated these into the pUC118 vector. Cloned DNA fragments were transformed into E. coli by electroporation, plated and grown overnight on LB plates containing X-gal, IPTG and ampicillin. Sanger sequencing was performed with an ABI 3730 to approximately 6-fold coverage. We then assembled the Sanger reads by overlapping and filled gaps by further rounds of sequencing to obtain complete maps of the fosmid clones. Alignment of the assembled genome sequences to the five fosmid clones was performed using the BLAST algorithm (blastn, E-value = 1e-20).

Whole genome alignment
Whole genome pair-wise alignments were generated by LASTZ37,38, which is known to have a higher sensitivity than BLAST and is suitable for genome comparisons. The parameters used for the LASTZ program are as follows: T=2, C=2, H=2000, Y=3400, L=6000, K=2200. T=2 is short for “--seed=12of19 --notransition”, which sets the seed pattern and whether transitions are allowed in the lastz alignment; C=2 is short for “--chain --gapped”, which means performing chaining after lastz alignment, and gap is allowed in the alignment; “H=2000, Y=3400, L=6000, K=2200” are short for “--inner=2000, --ydrop=3400, --gappedthresh=6000, --hspthresh=2200”, which are thresholds controlling the alignment process. The main chains of the two selected genomes were calculated by chainNet37, which is designed to link aligned segments into larger structures such as chains and nets. The alignments were used as a basis for comparative analysis, e.g., detecting conserved blocks, comparing indels within genomes, and calculating evolutionary distances.

Dataset used for gene predictions
Genomic sequences, together with registered sequences, were used for predicting the gene sets of soft-shell turtle and green sea turtle. For soft-shell turtle, additional RNA-Seq data of 146.7 G bases were used for predicting genes. In addition to the gene sets prepared for the two turtles, we performed an additional prediction for soft-shell turtle to ensure the robustness of our results (e.g., comparative expression analysis). The additional prediction for soft-shell turtle was performed by taking advantage of the Ensembl prediction pipeline (Ps-ens_gene). Both of these soft-shell turtle gene sets can be accessed through the Ensembl website (http://www.ensembl.org/Pelodiscus_sinensis/Info/Index). The sequencing depth for the miRNA was higher than 62,000 for each sample (see the following section for more details).

Gene prediction pipeline for two turtles
[1] Gene prediction for the two turtle genomes: The BGI annotation pipeline employs both the ab initio approach (GENSCAN, AUGUSTUS) and a homolog-based approach (Western clawed frog, chicken, human, anole lizard, bottlenose dolphin) against the repeat-masked genome, and then consolidates the results with the GLEAN41 program. Homolog-based gene prediction employs GeneWise43 as the core program. Homologous proteins of other species (human, anole lizard, chicken, bottlenose dolphin and Western clawed frog from the Ensembl 61 release) were mapped to the genome using TBLASTN (Legacy Blast56 ver. 2.2.23) with an E-value cutoff of 1e-5. The aligned sequences were then filtered and passed to GeneWise43 along with the query sequences for searching accurate spliced alignments. Source evidence generated from the above two approaches was integrated by GLEAN41 to produce a consensus gene set. Gene functions were also assigned according to the best match with the alignments using BLASTP and the SwissProt117 and TrEMBL databases. The motifs and domains of genes were annotated by InterProScan44 against protein databases such as ProDom, PRINTS, Pfam, SMART, PANTHER and PROSITE. Gene Ontology45 IDs for each gene were obtained from the corresponding InterPro entries. All genes were aligned against KEGG118 proteins, and the pathway in which the gene might be involved was derived from the matching genes in KEGG. Both of the soft-shell turtle gene sets (Ps-ens_gene and Ps-BGI_gene) are available through the Ensembl website (http://www.ensembl.org/Pelodiscus_sinensis/Info/Index), and the C. mydas gene set is available through the NCBI website (Accession number: AJIM01000000).

[2] Gene prediction for crocodilians: For gene predictions in crocodilian genomes, two primary assemblies119 (saltwater crocodile and American alligator) were provided from the crocodile genome consortium (http://www.crocgenomes.org/) courtesy of the project coordinator, Dr. David A. Ray. The same BGI prediction pipeline used for the green sea turtle was applied to the gene prediction of the saltwater crocodile. For the American alligator, the annotation procedure was different from that of the saltwater crocodile only in the integration step. The predicted gene sets were further analyzed for orthology and used for phylogenetic analysis.

[3] Gene family identification: We identified gene families with TreeFam32 using the following steps: [1] BLASTP was used to compare all the protein sequences within the database containing sequences of all species with E-values less than 1e-7. [2] HSP segments were concatenated between the same pair of proteins with solar; this was followed by identification of homologous gene-pair relationships among protein sequences with Bit-score. [3] Gene families were constructed by clustering with hcluster_sg, the algorithm of which is similar to average hierarchical clustering.

Additional gene prediction for soft-shell turtle by the Ensembl-prediction-pipeline
[1] Raw Compute Stage: This initial stage involved searching for sequence patterns as well as aligning proteins and cDNAs to the genome. The annotation process of the high-coverage Chinese soft-shell turtle assembly began with the raw compute stage whereby
the genomic sequence was screened for sequence patterns (including repeats) using RepeatMasker33 (version 3.2.8 with the parameters ‘-nolow -species “pelodiscus_sinensis” -s’), RepeatModeler120 (version open-1.0.5 to obtain a repeats library and filtered for an additional RepeatMasker run), Dust121 and TRF36. A combination of all the repeat analyses (RepeatMasker, RepeatModeler, Dust and TRF) brought the total proportion of the masked genome to 43.59%. Transcription start sites were predicted using Eponine-scan122 and FirstEF123. CpG islands longer than 400 bases and tRNAs124 were also predicted. Genscan125 was run across RepeatMasked sequence, and the results were used as input for UniProt126, UniGene127 and Vertebrate RNA128 alignments by WU-BLAST129 (passing only Genscan results to BLAST is an effective way of reducing the search space and therefore the computational resources required). This resulted in 378476 UniProt, 328450 UniGene and 322092 Vertebrate RNA sequences aligning to the genome.

[2] Targeted Stage: This stage involved the generation of coding models from Chinese soft-shell turtle evidence. Turtle protein sequences were downloaded from the public databases UniProt, SwissProt/TrEMBL126 and RefSeq127. Models of the coding sequences (CDS) were produced from the proteins using Genewise43 and Exonerate130. The generation of transcript models using turtle-specific data is referred to as the “Targeted stage”. This stage uncovered 32 of the 33 turtle proteins that were used to build coding models. However, none of these models were used in subsequent analyses, as they were overridden with longer models from the Similarity stage.

[3] cDNA and EST Alignment: Turtle cDNAs and ESTs were downloaded from Genbank, clipped to remove polyA tails, and aligned to the genome using Exonerate. Of the 334 turtle cDNAs, 216 sequences aligned, while 142 of the 178 ESTs aligned. The cutoffs for both data sets were 90% coverage and 90% identity. Contig sequences generated by the Chinese soft-shell turtle consortium using 454 sequencing were also aligned to the genome. Of the initial set of 84,680, 54,739 aligned with a cut-off of 90% coverage and 95% identity.

[4] Similarity Stage: This stage involved the generation of additional coding models using proteins from related species. Due to the scarcity of turtle-specific protein and cDNA evidence, the majority of gene models were based on proteins from other species. UniProt alignments from the raw compute step were filtered, and only sequences belonging to UniProt's Protein Existence (PE) classification level 1 and 2 were kept. WU-BLAST was rerun for these sequences and the results were passed to Genewise43 to build coding models. The generation of transcript models using data from related species is referred to as the “similarity stage”. This stage resulted in 53,646 coding models.

[5] Alignment of Ensembl chicken and anole lizard translations. Ensembl chicken and anole lizard translations were aligned against the turtle genome. The cutoff values for coverage and identity were set at 80% and 60%, respectively. Of the chicken translations, 14,935 of the 16,736 total retrieved translations aligned. From the 17,805 lizard translations, 17,264 sequences aligned above the set thresholds. The resulting coding models were taken through all subsequent steps.

[6] Filtering Coding Models: Coding models from the Similarity stage were filtered using modules such as TranscriptConsensus and LayerAnnotation. RNA-Seq spliced alignments supporting introns were used to assist in filtering the set. Apollo software131 was used to visualise the results of filtering.

[7] Addition of RNA-seq models: The largest set of turtle-specific evidence was from paired-end RNA-seq, which was used where appropriate to help inform our gene annotation. A set of 1.2 billion reads that passed QC were aligned to the genome using BWA, resulting in 1.1 billion (87.6%) reads aligning and properly pairing. The Ensembl RNA-Seq pipeline was used to process the BWA alignments and create a further 120 million split read alignments using Exonerate. The split reads and the processed BWA alignments were combined to produce 21,417 transcript models in total (one transcript per locus). The predicted open reading frames were compared to Uniprot Protein Existence (PE) classification level 1 and 2 proteins using WUBLAST; models with no BLAST alignment or poorly scoring BLAST alignments were discarded. The resulting models were added into the gene set where they produced a novel model or splice variant. In total, 10892 models were added.

[8] Addition of UTRs to coding models: The set of coding models was extended into the untranslated regions (UTRs) using turtle cDNA and contigs from the 454 sequencing project. This resulted in 5,935 of 32,470 coding models with UTRs. In addition, 10,892 RNA-Seq models also contributed to the addition of UTRs to the final models.

[9] Generating multi-transcript genes: The above steps generated a large set of potential transcript models, many of which overlapped one another. Redundant transcript models were removed, and the remaining unique set of transcript models were clustered into multi-transcript genes where each transcript in a gene has at least one coding exon that overlaps a coding exon from another transcript within the same gene. The final gene set of 18,272 genes included 4,603 genes built only using proteins from other species and 8,070 genes built only from RNA-Seq evidence. Overall, 3,322 genes had a mixture of RNA-Seq evidence and evidence from proteins of other species. A further 1,263 genes were supported only by Ensembl chicken or Ensembl lizard translations. The remaining 917 genes contained transcripts from all four sources. The final set of 20,752 transcripts included 12,384 transcripts with support from RNA-Seq evidence, 8,616 transcripts with support from proteins of other species and 2,236 transcripts with support from Ensembl chicken or lizard data. A small subset of the transcripts (2,581) was supported by evidence from two sources.

[10] Pseudogenes, non-coding genes, Stable Identifiers: The gene set was screened for potential pseudogenes. Before public release the transcripts and translations were given external references (cross references to external databases), while translations were searched for domains/signatures of interest and labeled where appropriate. Stable Identifiers were assigned to each gene, transcript, exon and translation. (When annotating a species for the first time, these identifiers are auto-generated. In all subsequent annotations the stable identifiers are propagated based on comparison of the new gene set to the previous gene set.) Small structured non-coding genes were added using annotations taken from RFAM132 and miRBase133. The
final gene set consists of 18188 protein coding genes, including mitochondrial genes, containing 20752 transcripts. A total of 97 pseudogenes and 1018 ncRNAs were identified.

Gene ontology analysis
Over-represented GO IDs were investigated by testing the bias in frequency to other GO IDs among certain gene sets (e.g., genes that were expressed differentially) using the total defined GO files as a control distribution. Fisher’s exact test was used for this analysis, with an alpha level of 0.01. Developmental genes were defined as genes that have developmental GOs, and developmental GOs were defined as GOs having GO:0032502 (developmental process) as an ancestor. A total of 5659 developmental GOs were extracted from ver. 1.2 (downloaded on Sep. 2nd, 2012) of the gene ontology obo formatted file using the obo edit program (http://oboedit.org/).

Turtle ultra conserved non-coding elements (TUCNE)
Turtle conserved non-coding regions have been determined using a pairwise alignment between P. sinensis and C. mydas genomes and filtering out: (i) regions with a good alignment with human, chicken or anole lizard; (ii) regions that are coding; (iii) final elements that are shorter than 30 bp (this can happen if an element partially overlaps a coding element). The turtle ultra conserved regions were searched using a 30bp sliding window with perfect matches in the P. sinensis - C. mydas alignment and no match in any other alignment. After removing all regions that correspond to repeats (and deleting elements shorter than 30bp), the conserved regions were further filtered into those that reside within the upstream region (10kb upstream from the start codon) of each gene.

Turtle-specific genes
The genes existing in both the soft-shell turtle and green sea turtle genomes, but not in C. mydas, A. mississippiensis, C. porosus, G. gallus, T. guttata, A. carolinensis, C. familiaris, H. sapiens, O. anatinus, and X. tropicalis, were extracted. Then enrichment analysis for the turtle-specific genes was performed based on the algorithm presented by GOstat97, with the soft-shell turtle genes (Ps-ens_genes) as the control. The p-value was approximated by the chi-square test. Fisher’s exact test was used when any expected value was below 5, which would make the chi-square test inaccurate. This program was implemented as a pipeline134. To provide succinct results in the GO and IPR enrichment analyses, if one of the items was ancestral to another and the enriched gene list of these two items was the same, the ancestral item was deleted from the results. To adjust for multiple testing, we calculated the False Discovery Rate (FDR) using the Benjamini-Hochberg method151 for each class. This identified 8 enriched GO categories and 3 enriched IPR categories.

Phylogenetic tree reconstruction
... families, we next extracted CDS sequences from each single-copy family and made CDS sequence alignments guided by its amino acid alignments created by MUSCLE program60. The sequences were then concatenated to one super gene sequence for each species. Codon position 1, position 2, position 3, position 1+2 sequences were extracted from CDS alignments and were concatenated and aligned, and respectively used for building trees, along with protein, CDS sequences. Then, PhyML47,48 was applied to construct the phylogenetic tree under HKY85+gamma or GTR+gamma model for nucleotide sequences and JTT+gamma model for protein sequences. aLRT values were taken to assess the branch reliability in PhyML. Also, RAxML80 was applied for the same set of sequences to build phylogenetic tree under GTR+gamma or JTT+gamma model for nucleotide and protein sequences respectively; 1,000 times of rapid bootstrap were employed to assess the branch reliability in RAxML.

Divergence time estimation
The same set of codon position 1+2 sequences that was used for phylogenetic tree construction was used for estimating divergence time. Fossil calibration times were set as described in Table 1.1. The PAML mcmctree (PAML version 4.5)50-52 program was used to determine split times with the approximate likelihood calculation method and the “correlated molecular clock” and “REV” substitution model. The shape and scale parameters describing the gamma prior for the overall substitution rates were set according to the substitution rate per time unit computed by PAML baseml52. The alpha parameter for gamma rates at sites was set to that computed by PAML baseml. The MCMC process of PAML mcmctree was set to sample 10,000 times with the sample frequency set to 5,000 after a burn-in of 5,000,000 iterations. The fine-tuned parameter was set to make the acceptance proportions fall in the interval (20%, 40%). The other parameters were set at the default values. Tracer (v1.5.0) was applied to check convergence, and two independent runs were performed to confirm convergence. Additionally, codon position 1, codon position 2, and protein sequences were used to estimate divergence time with similar methods with PAML mcmctree.

When the multidivtime135,136 program was used to calculate split time, the MCMC chain was run for 10,000,000 generations as burnin and approximately 50,000,000 generations to calculate posterior distributions. The Western clawed frog was identified as an outgroup taxon in the estimation and was discarded from the tree by multidivtime. Other parameters were set as suggested in the manual. Likelihood values in the first (baseml) and second (estbNew135,136) steps were checked to ensure the global optimizations were reached. Meanwhile, two independent runs were performed to check for convergence.

For r8s137, the maximum likelihood trees inferred by RAxML
(with branch lengths) were used as input to calculate split times in the global molecular clock with default settings. Phylogenetic tree reconstruction To reconstruct the phylogenetic tree, we used single-copy gene Gene loss analysis families conserved among P. sinensis (Ps-BGI_gene), C. mydas, A. We used the protein sequences of the two turtles and their related mississippiensis, C. porosus, G. gallus, T. guttata, A. carolinensis, C. species (Gallus gallus, , Xenopus tropicalis and familiaris, H. sapiens, O. anatinus, and X. tropicalis. Single-copy Taeniopygia guttata) blasted against the human protein sequences gene families were defined as follows: in each family each species (downloaded from Ensembl gene v.68). The proteins with blast hits to has just one gene copy. By using above determined single-copy gene

13

Nature Genetics: doi:10.1038/ng.2615 human proteins (threshold of identity 30 and align ratio 30) were we chose the longest transcript as the homolog in the two turtles, and identified as the homologs of the human protein. Subsequently, the TblastN56 was used to identify the probable location of each gene in human proteins that lacked homologs in both of the turtles but had the turtle genomes with the parameter of “-m 8 -a 4 -F F -e 1e-5”. homologs in some of the other related species (Gallus gallus, Anolis GeneWise43 was then used to identify the gene structure of turtles carolinensis, Xenopus tropicalis and Taeniopygia guttata) were with the parameter of “-genesf -trev -quiet”. After obtaining the gene identified as turtle lost genes. structures for each species, the genes that were defined to be pseudogenes in both turtles were used for further analysis. Statistical analysis of gene family expansion and contractions Subsequently, manual checking was performed to test whether these (E/C genes) mutations were falsely caused by bad assembly results or incorrect We generated pairwise whole genome alignments for anole lizard - homolog prediction, and the false positive cases were filtered out. soft-shell turtle and anole lizard - green sea turtle using LASTZ38,53. Transcripts of the human genes were mapped to the turtle genomes to Subsequently, we created three-way alignments using MULTIZ54, ascertain whether possible splicing variants exist that the which showed approximately 61 Mbp of conserved genome mutated exon(s). alignment sequences. When an anole lizard gene fell in an area of conserved sequence (overlap>100 bp), and there was no homologous Predictions of Olfactory Receptor (OR) Genes gene in the corresponding aligned sequences of soft-shell turtle and The methods we utilized for identifying OR genes were performed green sea turtle (Ps-ens_gene) we hypothesized that a potential gene as described previously55 with minor modifications. 
We first loss occurred at that locus for the turtle genomes. Further, we added a conducted TBLASTN56 searches against the genome sequences of a 100-Kb extension at each end of the conserved region for the two given species with known OR genes as queries. For query sequences, turtle genomes to find genes by GENEWISE (2.2.0) using the we used 119 functional OR genes in human, mouse, and zebrafish, corresponding anole lizard gene as a query. When the alignment rate all of which show 50% or less amino acid identities to one another. of a predicted homologous gene fragment in the synteny locus was We did this rather than using the same 920 OR genes that were used less than 30% compared to the query sequence, and there were no in our previous study55 to reduce the computational time. The query large gaps in the neighboring genome sequence, we considered it a genes include OR genes from groups δ, ε, ζ, and η that are absent gene loss. Additionally, when there was a frame-shift or premature from the amniote genomes91. We used an E-value cutoff of 1e-5 for termination in the predicted homologous gene fragment at the the TBLASTN searches. The OR gene identification methods were synteny locus, we also considered this to be a gene loss. For each the same as those previously described55 with the exception of this gene family we estimated rates of gene gain and loss using the CAFE first-round TBLASTN search. 83 software . This software models gene family evolution as a To construct the phylogenetic tree in Figure 1C, the translated stochastic birth-and-death process where genes are gained and lost amino acid sequences of OR genes were aligned using the program independently along each branch of a phylogenetic tree. We used a E-INS-i in MAFFT57. Poisson correction distances were calculated species tree containing the Chinese soft-shell turtle, the green sea after all the alignment gaps were eliminated. 
We then constructed a turtle plus 10 additional species (human, dog, platypus, chicken, phylogenetic tree from these distances by the neighbor-joining (NJ) zebra finch, anole lizard, alligator, crocodile, western clawed frog and method58 using the program LINTREE59 (available at medaka). In CAFE, the lambda parameter describes the rate of https://homes.bio.psu.edu/people/faculty/nei). change as the probability that a gene family either expands or The numbers of OR genes in ancestral species and the numbers of contracts (via gene gain and loss) per gene per million years. We gene gains and losses in evolution (Figure 2b) were calculated by the estimate over 0.001 gene gains and losses per million years for both reconciled tree method described in Niimura et al.55. We used 70% as turtles and the lineage from their common ancestors. CAFE assumes the bootstrap value cutoff. Group β genes were used as the out-group at least one member of each gene-family in the root of the species for the estimation. The programs are available at tree, for this reason only the gene families having at least 1 gene at http://bioinfo.tmd.ac.jp/~niimura/software.html. the root of the species tree were used. Genes in these expanded or contracted families were defined as E/C genes (See also Genes of accelerated evolutionary rate in the turtle lineage supplementary Table 16). Homologous genes in soft-shell turtle, green sea turtle, and other We then analyzed gene gains and losses at the gene-family level related species (chicken, zebra finch, anole lizard, Xenopus tropicalis focusing on those families having significant changes in any of the and platypus) were first detected by the all-versus-all blastp program. two turtles or in their ancestors (two turtles and sauropsis). In these The orthologs were defined by a reciprocal best blast hit between analyses, we run CAFE several times grouping the gene families by human and the other species. 
The full orthologous gene sets were lowest common ancestor to preserve the assumption of at least one aligned using the program MUSCLE 60. We then compared a series member at the root of the tree. The significant families were further of evolutionary models within the likelihood framework using the inspected manually. phylogenetic tree obtained by our analysis. A branch model52 was used to detect the average length (ω) across the tree (ω0), the ω of the

Pseudogene / frame-shifted gene detection ancestor of soft-shell turtle, the green sea turtle branch (ω2) and the ω Human proteins (Ensembl release 61) were used as the reference of all of the other branches (ω1). proteins to detect the pseudogenes in turtles. Firstly, for each gene,
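The reciprocal-best-hit rule used above to define orthologs can be illustrated with a minimal sketch. It assumes that BLAST results have already been reduced to (query, subject, bit score) rows; the function names and gene identifiers below are hypothetical and are not taken from the authors' pipeline, which is not described at this level of detail.

```python
def best_hits(rows):
    """Keep the highest-scoring subject for every query.

    rows: iterable of (query_id, subject_id, bit_score) tuples,
    e.g. parsed from tabular BLAST output (assumed input format).
    """
    best = {}
    for query, subject, score in rows:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_to_b, b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)


# Hypothetical example data; IDs and scores are invented for illustration.
human_vs_turtle = [("HS1", "PS1", 250.0), ("HS1", "PS2", 90.0),
                   ("HS2", "PS2", 120.0), ("HS3", "PS2", 300.0)]
turtle_vs_human = [("PS1", "HS1", 245.0), ("PS2", "HS3", 310.0),
                   ("PS2", "HS2", 118.0)]

orthologs = reciprocal_best_hits(best_hits(human_vs_turtle),
                                 best_hits(turtle_vs_human))
print(orthologs)
```

In this toy input, HS2 has PS2 as its best hit, but PS2 points back to HS3, so HS2 is left without an ortholog; only mutually best pairs are kept.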


Embryo sampling and mRNA extraction
Fertilized soft-shell turtle and chicken eggs were obtained from local farms in Japan. Turtle eggs were obtained during the breeding season, which ranges from June to September. After allowing the eggs to grow in a humidified incubator (30 °C for turtle eggs and 38 °C for chicken eggs), the developmental stages were determined based on previous descriptions (the TK stage for turtle24 and the HH stage for chicken61, Supplementary Figure 13), and the eggs were collected. Amniotic membranes (a yellow-like region in the gastrula ~ TK9 embryos for turtle embryos, and a primitive streak ~ HH16 for chicken embryos) were removed before mRNA extraction, and more than two individual embryos were pooled for each sample.
For the mRNA extraction, excised staged embryos (2 ~ 60 individuals depending on stage) were quickly frozen, crushed in liquid nitrogen, and total RNA was extracted using the RNeasy Lipid Tissue Kit (QIAGEN, cat. #: 74804, 75842). After testing the integrity and purity of the RNA with gel electrophoresis and an absorption spectrometer, the mRNA was extracted using the Ambion MicroPoly(A) Purist kit (Life Technologies, cat. # AM1919). To obtain samples with minimal rRNA contamination, up to 100 µg of total RNA was used for this step. The quality of purified mRNA samples was checked with an Agilent 2100 BioAnalyzer before preparing the libraries for sequencing.

RNA-Seq for transcriptome identification
[1] 454 Titanium sequencing: Six independent libraries were constructed from soft-shell turtle embryos at the gastrula, TK5, TK7, TK12, TK18, and TK26 stages. For each sample, sequencing libraries were constructed by following the manufacturer’s standard protocol.
[2] HiSeq Strand-specific RNA-Seq: Approximately equal amounts of mRNAs from each embryonic stage were mixed together, and libraries were constructed for further sequencing. Two libraries were prepared by methods that retain strand-specific information. One of these is a dUTP-based method62,63, which was modified to comply with Illumina’s TruSeq RNA sample prep kit (Ps_stranded_RNA-Seq_dUTP); the other is an original method developed at BGI (Ps_stranded_RNA-Seq_BGI). In brief, the dUTP-based method takes advantage of complementary strand-specific dUTP degradation. The original method begins with fragmentation of mRNA and reverse transcription to synthesize first strand cDNA. Next, the second strand of cDNA was synthesized, the ends were repaired, 3' adenosines were added along with adapters, and agarose gel electrophoresis was run to retrieve the fragments. For the version developed at BGI, after digesting the products with the Uracil-N-glycosylase (UNG) enzyme, we PCR-amplified and gel-purified to obtain the cDNA library (Ps_stranded_RNA-Seq_BGI). Sequencing was performed with Illumina HiSeq 2000 (paired-end, 100 bp) followed by read clean up with cutadapt138 (ver 1.0) with the options “-q 20 -O 4 -e 0.1 -m 50 --discard”139. In brief, we trimmed low-quality (quality score lower than Phred 20) ends and adapter sequences (minimum overlap 4 bp, allowing 10% of mismatch) and discarded reads shorter than 50 bp. The RNA-Seq data are available through DRA under the accession number DRA000567 (ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/).
[3] HiSeq non-stranded RNA-Seq: The deep sequencing data for “RNA-Seq for gene expression analysis (see following sections)” was also used for transcriptome identification. See “Gene prediction by Ensembl-prediction-pipeline” for details regarding how we integrated the data for transcriptome identification and gene predictions.

De novo transcriptome assembly
Using the two RNA-Seq reads with strand information, we first merged the pair-reads that had overlapping regions using SeqPrep140 (with default settings). Next, de novo transcriptome assembly was performed by Trinity141,142 with default parameter settings except the heap space setting for the butterfly program (--bflyHeapSpace 300G).

RNA-Seq for gene expression analysis
Samples at each stage were composed of at least two individual embryos to obtain ample RNA and to average the fluctuations derived from individual embryos. Biological replicates for each developmental stage were created from an independent sample pool so that proper statistical populations could be estimated. For each mRNA sample prepared (see Embryo sampling and mRNA extraction section), a barcoded sequencing library was constructed using the standard protocol of the TruSeq RNA Sample Prep Kit (Illumina) with a minor modification for RNA fragmentation (four minutes instead of eight minutes at 94 °C, no poly (A) selection). Multiplex sequencing with 101bp single-end reads was performed on an Illumina HiSeq2000 instrument at the NIBB Core Research Facilities. This was followed by raw data processing, base calling and quality control by the manufacturer’s standard pipeline using RTA, OLB and CASAVA. The quality of the output sequences was inspected using the FastQC program (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/). Adapter sequences were trimmed using the cutadapt program138 (ver 1.0) with the options “-q 20 -O 4 -e 0.1”.
Sequencing depth of RNA-Seq: The curves in Supplementary Figure 9A suggest that the HiSeq reads for each sample were large enough to cover almost all the expressed genes in each sample, which is practically accepted as being less biased by read depth. Nonetheless, unfavorable biases are noteworthy, as noted by a recent study143. We therefore made a depth-adjusted dataset for further expression analyses by randomly selecting 10 million mapped reads. Random selection of mapped reads is more advantageous than simply making a random selection from raw reads. This is because mapped reads are theoretically free of the bias that arises from different levels of rRNA contamination.

Obtaining gene expression scores from RNA-Seq data
Clean reads of each sample were mapped to the genome using the aln command of bwa software65 (ver. 0.5.9-r16) with the -t 12 option. For reads from turtle and chicken samples, mapping was performed against the assembled soft-shell turtle genome and the chicken genome sequences downloaded from the Ensembl database (Ensembl66), respectively. SAMtools66, BEDtools67 and the DEGseq package68 for the R program were used for calculating count data that mapped to coding regions. Only the coding sequences of soft-shell turtle and chicken gene sets were used as references for mapped tag counting.

Comparison of gene expression profiles
In comparing the gene expression profiles of soft-shell turtle and chicken embryos, we first determined the genes to compare by defining one-to-one orthologous genes by Reciprocal Best BLAST Hit (RBBH). A BLAST search was performed between the coding sequences of soft-shell turtle and chicken using the BLASTP algorithm of NCBI BLAST+64 with an E-value of 1e-5. In total, we found 11602 ortholog-pairs. We next normalized gene expression scores by two independent methods (RPKM and TMM normalization69) to deduce a robust conclusion from these methods. Normalization was performed with all samples at once so that similarity scores obtained from each sample pair were analytically comparable. After obtaining the normalized expression scores of 11602 genes for each sample, the scores were log (base=2) transformed and plotted to make a scattergraph to compare the expression scores of two samples. Next, the distributions of the scattergraphs were evaluated either by the Pearson product-moment correlation coefficient, the Spearman correlation coefficient, total Euclidean distances (t-Euclidean), or total Manhattan distances (t-Manhattan) to estimate the similarities in the gene expression profiles of the two samples being compared. Because t-Euclidean and t-Manhattan distances require a rectangular coordinate system as a prerequisite, the mean quantile normalization was utilized to meet this requirement before the calculation of distance. For analysis with mapped-10M reads, two independent random selections of mapped-10M reads were performed, and we averaged the mapped counts to make hypothetical data files.
Most conserved stages were estimated by comparing the similarities of gene expression profiles among all pairs of turtle-chicken embryos (9 turtle stages x 8 chicken stages = 72 pairs). Combinations of two biological replicates for each sample yielded 2 x 2 = 4 data points for each turtle-chicken pair, and these values were compared by the Welch two-sample t-test or the Wilcoxon signed-rank test based on satisfactions of statistical requirements. The Holm-corrected alpha level was applied for these multiple comparisons. Only results reproduced by the dataset from two different normalizations (RPKM70, TMM69) were considered to be significant.
For Reciprocal Best Hit Stages (RBTS) analysis, stages that exhibited the most similar gene expression profiles were tested by the Welch two-sample t-test or the Wilcoxon signed-rank test based on satisfactions of statistical requirements with a Holm-corrected alpha level.

soft-shell turtle genes that showed a significant increase in expression level after the phylotypic period (IAP).
The turtle IAP genes were screened by the following criteria: (1) the mean expression level after the phylotypic period (stages that begin to show turtle-specific morphologies, TK15-TK23) is more than five times higher (Wilcoxon test, alpha level = 0.01) than those of earlier stages (gastrula, neurula, TK7 and TK9); (2) the average expression levels of the chicken orthologs in HH28 and HH38 are not more than five times higher than those in the Prim-HH14 stages.

Wnt gene identification, cloning and whole-mount in situ hybridization
Wnt genes were identified by TBLASTN against the genomic sequence or Illumina RNA-seq data assembled by Trinity141, using the corresponding Hox amino acid sequences of mouse, chicken or anole lizard as queries. The corresponding genomic sequences were retrieved and a gene model was predicted by GeneScan (ver. 1.0)125 or GeneWise2 (ver. 2.1.20). The orthology of the predicted genes was confirmed by BLASTP against the NCBI GenBank database. cDNA was generated from mRNA extracted from different stages of soft-shell turtle using the GeneRacer Kit (Invitrogen). We designed primers to clone partial sequences of the 20 Wnt genes. The sequences of the primers, vectors and GenBank accession numbers of the different genes are listed in Supplementary Table 27. The clones were used to synthesize digoxigenin-labeled RNA probes according to the manufacturer’s protocol (Roche). The whole-mount in situ hybridizations were performed as previously described28.

microRNA extraction, prediction, and expression analysis
We microdissected limbs, body walls and CRs from 20 embryos of the Chinese soft-shell turtle. The microdissections were performed in cold PBS, and the tissues were stored in RNAlater (Ambion) at -20 °C. The small RNA phase was extracted from the dissected tissues using the mirVana™ microRNA Isolation Kit (Ambion, Life Technologies). Small RNA libraries were prepared and sequenced using a HiSeq2000 platform (Illumina) at the Beijing Genomics Institute. We obtained 24,168,754 reads for the CR, 24,772,037 reads for the limbs and 28,025,813 reads for the body walls. The sequencing depth for each tissue was 63,020X for CR, 84,540X for limb, and 62,495X for body wall tissues (calculated against predicted mature sequences). We used miRDeep2 (ver. 2.0.0.3) software, which is based on the Perl programming language71. Briefly, the soft-shell turtle genome v1.0 was indexed by bowtie-build software using default parameters. The reads were manipulated using the mapper module of miRDeep2 to trim adapters, eliminate reads smaller than 18 nucleotides long, collapse those with the same sequence and map them to the genome index. The reads that mapped to the genome were used by the miRDeep2 core algorithm to predict probable hairpin structures around a stack of a minimum number of reads determined by the internal controls of the software71. We used the miRDeep2 module with default parameters except for the score cutoff, for which we used the option -b 5. This was the value that yielded a signal-to-noise ratio larger than 15, which yields a condition that is 50% more stringent than that which was used in the description of the algorithm71. Additionally, we selected mature microRNA sequences and their precursors from chicken, zebra finch and Anolis carolinensis from miRBase v.18 to serve as microRNAs from related species. Using these microRNAs, miRDeep2 identified predicted microRNAs with the same seed sequence. As a result, the programs generated a list of microRNAs for every sample (Supplementary_Table 23.xls). Only microRNA predictions that were lower than a significant Randfold alpha level (P < 0.05, mononucleotide shuffling, 999 permutations, see Friedländer et al. 200871 for details) were taken into account for subsequent comparisons. The predictions were manipulated by normal bash shell scripts to sort unique sequences and compare common vs. differential predictions among samples. The predicted miRNAs were aligned against the chicken or green sea turtle genomes using bowtie with the -k 1 and --best parameters, allowing 0 (-v 0), 1 (-v 1) or 3 (-v 3) mismatches.

microRNA target predictions
Annotated 3’-UTRs of P. sinensis transcripts were obtained from Ensembl build 68 using BioMart. Because some genes produce different alternatively spliced transcripts, the number of genes is smaller than that of transcripts (see Supplementary Table 25). The prediction of the targets was performed with miRanda72 according to the method reported by John et al.144

Statistical tests
To avoid an inflated type I error rate, an alpha level of 0.01 (with further Bonferroni correction in case of multiple comparisons) was accepted for statistical significance throughout the analyses unless otherwise specified. Statistical methods were carefully chosen to properly reflect the population of interest. The Welch two-sample t-test was used for two-sample comparison when the data passed the Kolmogorov–Smirnov test for normal distribution; otherwise, the Wilcoxon signed-rank test was used.

Software and computation environment
Data processing and command pipelining were performed using customized Python, Perl scripts, R (http://www.R-project.org/) and C shell scripts. Heavy calculations were performed using the cluster computers at RIKEN, NIBB, BGI, and the Sanger Institute.

* Species names (general name: binomial name)
Green sea turtle: C. mydas
Soft-shell turtle: P. sinensis
Chicken: G. gallus
Zebra finch: T. guttata
Saltwater crocodile: C. porosus
American alligator: A. mississippiensis
Anole lizard: A. carolinensis
Dog: C. familiaris
Human: H. sapiens
Platypus: O. anatinus
Western clawed frog: X. tropicalis
Medaka: O. latipes
Bottlenose dolphin: T. truncatus
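The multiple-comparison handling described under "Statistical tests" (a base alpha level of 0.01 with Bonferroni correction) and the Holm-corrected alpha level used for the stage comparisons can be sketched as follows. This is a minimal illustration of the two textbook procedures, not the authors' scripts; the function names and the p-values are invented for the example.

```python
def bonferroni_significant(p_values, alpha=0.01):
    """Flag p-values that pass alpha divided by the number of comparisons."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def holm_significant(p_values, alpha=0.01):
    """Holm step-down procedure.

    p-values are visited from smallest to largest; the k-th smallest
    (k = 0, 1, ...) is compared with alpha / (m - k), and testing stops
    at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, index in enumerate(order):
        if p_values[index] <= alpha / (m - rank):
            significant[index] = True
        else:
            break  # all larger p-values are also non-significant
    return significant


# Hypothetical p-values from four comparisons (illustrative only).
p_values = [0.0004, 0.004, 0.03, 0.002]
print(bonferroni_significant(p_values))
print(holm_significant(p_values))
```

With these example values, the second comparison (p = 0.004) fails the Bonferroni threshold of 0.0025 but passes under Holm, which shows why the step-down procedure is the less conservative of the two while still controlling the family-wise error rate.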



Supplementary Figures and Tables

Supplementary Figure 1. Skeletal structure of a Chinese soft-shell turtle. The skeletal structure of a Chinese soft-shell turtle (P. sinensis) is illustrated here. The turtle shell consists of a dorsal shell called a carapace (colored in blue) and a ventral shell called a plastron. Note that the carapace consists of axial skeleton (vertebrae and ribs) and dermal bone, which is one of the unique features of the turtle. Many of the trunk muscles are also absent in adult turtles. Unlike other tetrapods, the shoulder blade (scapula, colored in red) of the turtle is located ventral to the axial skeleton.

                        soft-shell turtle                        green sea turtle
Insert   Read Length    Total Data   Sequence    Physical       Total Data   Sequence    Physical
Size     (bp)           (G bases)    Depth (X)   Depth (X)      (G bases)    Depth (X)   Depth (X)
170 bp   100            61.38        29.23       24.85          51.83        23.56       20.03
500 bp   100            32.83        15.63       39.08          46.08        20.95       52.36
800 bp   100            22.77        10.84       43.38          24.22        11.01       44.04
2 Kbp    49             54           25.71       524.76         30.22        13.74       280.33
5 Kbp    49             32.76        15.60       795.98         15.87        7.21        368.04
10 Kbp   49             13.66        6.5         663.52         8.06         3.66        373.84
20 Kbp   49             3.31         1.58        321.87         3.18         1.45        294.99
40 Kbp   49             1            0.48        195.30         1.48         0.67        274.58
Total                   221.71       105.58      2608.74        180.94       82.25       1708.21

Supplementary Table 1. Paired-end DNA libraries sequenced for soft-shell turtle and green sea turtle genome assembly. Sequencing depths were calculated for high-quality clean data based on genome size (2.1 Gb for soft-shell turtle and 2.2 Gb for green sea turtle). The basic statistics were performed on cleaned sequencing data.
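The depth columns of Supplementary Table 1 can be related to the other columns with a short calculation. Sequence depth as total data divided by genome size follows from the caption. The physical-depth formula below (each read pair spans one insert but contributes two reads) is an assumption that is not stated in the source, although it reproduces the tabulated values to within rounding; the function names are hypothetical.

```python
def sequence_depth(total_gbases, genome_size_gb):
    """Sequenced bases per genome base (both arguments in gigabases)."""
    return total_gbases / genome_size_gb


def physical_depth(total_gbases, genome_size_gb, insert_bp, read_bp):
    """Depth counted over whole inserts rather than over the sequenced ends.

    Assumption: every read pair yields 2 * read_bp sequenced bases and
    spans insert_bp bases of the genome.
    """
    depth = sequence_depth(total_gbases, genome_size_gb)
    return depth * insert_bp / (2 * read_bp)


# Soft-shell turtle, 170 bp library (genome size 2.1 Gb, 100 bp reads).
print(round(sequence_depth(61.38, 2.1), 2))
print(round(physical_depth(61.38, 2.1, 170, 100), 2))

# Green sea turtle, 500 bp library (genome size 2.2 Gb, 100 bp reads).
print(round(physical_depth(46.08, 2.2, 500, 100), 2))
```

Small differences in the last digit relative to the table are expected, because the table appears to have been computed from already rounded intermediate values.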


                          Contig                     Scaffold
soft-shell turtle         Size (bp)       Number     Size (bp)       Number
N90                       5,526           100,520    442,219         865
N80                       9,577           71,950     1,039,533       542
N70                       13,470          53,421     1,686,594       376
N60                       17,505          39,658     2,486,603       266
N50                       21,907          28,859     3,331,964       190
Longest                   177,994         ----       16,023,048      ----
Total Size                2,114,220,409   ----       2,210,337,521   ----
Total Number (>=100 bp)   ----            265,137    ----            76,151
Total Number (>=2 kb)     ----            143,208    ----            4,548

                          Contig                     Scaffold
green sea turtle          Size (bp)       Number     Size (bp)       Number
N90                       3,730           116,789    427,510         737
N80                       8,131           79,427     1,258,672       451
N70                       12,065          57,970     2,114,478       315
N60                       16,060          42,643     3,012,194       229
N50                       20,392          30,830     3,777,511       162
Longest                   177,532         ----       22,916,839      ----
Total Size                2,139,401,126   ----       2,236,138,468   ----
Total Number (>=100 bp)   ----            561,968    ----            352,958
Total Number (>=2 kb)     ----            140,840    ----            5,442

Supplementary Table 2. Basic statistics of the assembled genomes of soft-shell turtle and green sea turtle. The statistics data were generated based on the original assembly files both for soft-shell turtle and green sea turtle.
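For readers unfamiliar with the N50 to N90 rows of Supplementary Table 2, the sketch below computes such statistics under the commonly used definition: sequences are sorted from longest to shortest, and Nxx is the length at which the running total first reaches xx% of the assembly size, with "Number" being how many sequences were needed. That this matches the exact convention of the assembler used here is an assumption, and the function name and toy lengths are hypothetical.

```python
def nxx(lengths, fraction=0.5):
    """Return (size, number) for the Nxx statistic of a set of sequence lengths.

    size   : length of the sequence at which the cumulative sum, taken from
             the longest sequence downwards, first reaches fraction * total
    number : how many sequences were accumulated to get there
    """
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy scaffold lengths (illustrative only, not taken from the assemblies).
toy_lengths = [2, 8, 3, 2, 4, 8, 3, 2]
print(nxx(toy_lengths, 0.5))
print(nxx(toy_lengths, 0.9))
```

Because shorter sequences are needed to reach a larger fraction of the total, the size column decreases and the number column increases from N50 to N90, as in the table.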

Supplementary Figure 2. GC content of the two turtle genomes. (a) A comparison of the GC contents of five vertebrates. The x-axis indicates GC content and the y-axis indicates the proportion of the bin number divided by the total windows. We used 500-bp bins (with a 250-bp overlap) sliding along the genome. The data shows that the mode GC content of the soft-shell turtle genome and green sea turtle genome are approximately 44% and 43%, respectively, which is similar to that of the human, chicken and anole lizard genomes. (b-c), GC content and sequencing depth. A scatter plot was made by sliding 50 kb non-overlapping windows against the assembled soft-shell turtle (b) and green sea turtle genome (c), and the GC content and average depth were calculated within the sliding window. The x-axis in the scatter plot represents GC content (%), whereas the y-axis represents the average depth. The average depth was obtained by aligning the filtered reads onto the assembled genome sequence using SOAPaligner and allowing 3 mismatches for 49-bp reads and 5 mismatches for 100-bp reads. Depth frequency was then calculated for each of the genome bases. Summary graphs are provided as histograms illustrating the frequency at various depths (histogram at the right) and GC contents (histogram at the top). The percentage of sequencing depths below 10 was less than 3% in both genomes, indicating an extremely high sequencing depth covering the whole genome.
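The windowing used for panel (a) of Supplementary Figure 2 (500-bp bins sliding with a 250-bp overlap) can be sketched as follows. This is an illustrative reimplementation rather than the authors' code; in particular, ignoring ambiguous bases such as N when computing the percentage is an assumption, and the function name is hypothetical.

```python
def gc_content_windows(sequence, window=500, step=250):
    """GC percentage for each sliding window along a sequence.

    A step of window / 2 gives the 50% overlap described in the caption.
    Only A, C, G and T are counted; windows without any called base
    (for example, assembly gaps made of N) are skipped.
    """
    sequence = sequence.upper()
    values = []
    for start in range(0, len(sequence) - window + 1, step):
        chunk = sequence[start:start + window]
        called = sum(chunk.count(base) for base in "ACGT")
        if called == 0:
            continue
        gc = chunk.count("G") + chunk.count("C")
        values.append(100.0 * gc / called)
    return values


# Tiny example with a 4-bp window and 2-bp step so the result is easy to check.
print(gc_content_windows("GGCCAATT", window=4, step=2))
```

A histogram of the returned values, normalised by the number of windows, gives the kind of distribution plotted in panel (a).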


              soft-shell turtle               green sea turtle
              Proteins   Completeness (%)     Proteins   Completeness (%)
Complete      197        79.44                205        82.66
  Group 1     47         71.21                53         80.3
  Group 2     43         76.79                44         78.57
  Group 3     48         78.69                51         83.61
  Group 4     59         90.77                57         87.69
Partial       234        94.35                243        97.98
  Group 1     61         92.42                66         100
  Group 2     54         96.43                56         100
  Group 3     56         91.80                59         96.72
  Group 4     63         96.92                62         95.38

Supplementary Table 3. Coverage rate of core eukaryotic genes (CEGs) in the assembled genomes by CEGMA. “Complete” indicates that proteins from 248 core eukaryotic genes (CEGs) were covered by the genome assembly with an alignment length longer than 70%. “Partial” indicates that CEG proteins were covered by the assembly with a coverage rate that exceeded a pre-computed minimum alignment score. For soft-shell turtle, the coverage rates were calculated from assembly PelSin_1.0 as obtained from NCBI after the original assembly had been uploaded; for green sea turtle, they were calculated from the original assembly before processing and uploading to NCBI.

Supplementary Figure 3. Quality check of the assembled genome by alignment with five randomly picked, Sanger-sequenced control libraries. Red bars and blue bars indicate the scaffold sequences and the Sanger-sequenced control clone sequences, respectively. Scaffold sequences that aligned well to the fosmid clone sequences are indicated by yellow polygons. Further analysis clarified that the region covered by the zhbcxa clone is rich in repetitive sequences (data not shown), which explains its low alignment quality. Sanger sequencing was performed with an ABI 3730 to approximately 6-fold coverage. Sanger reads were assembled by overlap, and gaps were filled by further rounds of sequencing to obtain complete maps of the fosmid clones. Alignment of the assembled genome sequences to the five fosmid clones was performed using the BLAST algorithm (blastn, E-value = 1e-20).


| Fosmid ID | Length (bp) | Coverage ratio (%) | Alignment blocks | Aligned scaffolds | Aligned scaffold length (bp) | Gaps | Gap length (bp) | Gap ratio (%) |
|---|---|---|---|---|---|---|---|---|
| zhbaxa | 33,547 | 97.05 | 6 | 1 | 5,958,399 | 2 | 936 | 2.79 |
| zhbbxa | 40,286 | 97.97 | 6 | 1 | 5,187,697 | 2 | 200 | 0 |
| zhbcxa | 33,442 | 89.4 | 1, 4, 21 | 3 | 743,541 | 4 | 202 | 0.91 |
| zhbexa | 39,977 | 99.77 | 7 | 1 | 5,664,467 | 0 | 0 | 0.07 |
| zhbfxa | 34,888 | 98.54 | 10 | 1 | 7,370,469 | 1 | 100 | 0.29 |

Supplementary Table 4. High coverage of the assembled genome against Sanger-sequenced control libraries.

| Dataset | EST cluster length | Number of EST clusters | Total length (bp) | Coverage rate of the EST clusters by the assembly (%) | With > 90% sequence in one scaffold: number | With > 90% sequence in one scaffold: percent | With > 50% sequence in one scaffold: number | With > 50% sequence in one scaffold: percent |
|---|---|---|---|---|---|---|---|---|
| Ps_454_mRNA-Seq | > 0 bp | 79,305 | 80,382,077 | 97.11 | 70,476 | 88.87 | 77,758 | 98.05 |
| Ps_454_mRNA-Seq | > 200 bp | 71,973 | 79,257,231 | 97.09 | 63,529 | 88.27 | 70,481 | 97.93 |
| Ps_454_mRNA-Seq | > 500 bp | 51,023 | 71,828,930 | 97.07 | 44,787 | 87.78 | 49,928 | 97.85 |
| Ps_454_mRNA-Seq | > 1000 bp | 22,793 | 52,043,715 | 96.93 | 19,530 | 85.68 | 22,174 | 97.28 |
| Ps_dUTP_RNA-Seq and Ps_strand_RNA-Seq assembled together | > 0 bp | 307,462 | 287,345,309 | 96.29 | 258,997 | 84.24 | 303,128 | 98.59 |
| Ps_dUTP_RNA-Seq and Ps_strand_RNA-Seq assembled together | > 200 bp | 307,462 | 287,345,309 | 96.29 | 258,997 | 84.24 | 303,128 | 98.59 |
| Ps_dUTP_RNA-Seq and Ps_strand_RNA-Seq assembled together | > 500 bp | 152,410 | 228,387,625 | 96.17 | 125,122 | 82.10 | 150,251 | 98.58 |
| Ps_dUTP_RNA-Seq and Ps_strand_RNA-Seq assembled together | > 1000 bp | 69,049 | 171,690,255 | 96.12 | 55,980 | 81.07 | 68,187 | 98.75 |

Supplementary Table 5. EST cluster coverage rate by the soft-shell turtle genome. Transcriptome data generated by 454 sequencing and Illumina sequencing (see Methods section) were assembled to produce 103,823 and 355,258 EST clusters, respectively. We then mapped whole-genome sequencing reads to these EST clusters using SOAP and filtered out the EST clusters with low alignment (< 30% alignment or average depth < 5) to remove potentially misassembled clusters. The remaining 79,305 EST clusters generated by 454 sequencing and 307,462 EST clusters generated by Illumina sequencing were used to assess the genome assembly quality.
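The filtering rule stated in the legend of Supplementary Table 5 (discard clusters with < 30% alignment or average depth < 5) can be sketched as below. This is a minimal illustration, not the authors' implementation: the ClusterStats record, its field names, and the choice to average depth over the full cluster length are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    """Per-cluster summary of genomic reads mapped to one EST cluster (hypothetical record)."""
    name: str
    length: int         # cluster length in bp
    aligned_bases: int  # bases covered by at least one mapped genomic read
    total_depth: int    # sum of per-base read depth over the cluster


def keep_cluster(c, min_aligned_fraction=0.30, min_mean_depth=5.0):
    """True if the cluster passes both thresholds given in the table legend."""
    aligned_fraction = c.aligned_bases / c.length
    mean_depth = c.total_depth / c.length  # assumption: averaged over all bases
    return aligned_fraction >= min_aligned_fraction and mean_depth >= min_mean_depth


def filter_clusters(clusters):
    """Clusters retained for assessing the assembly."""
    return [c for c in clusters if keep_cluster(c)]
```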


Supplementary Figure 4. Basic statistics of predicted genes of soft-shell turtle (Ps-BGI_gene) and green sea turtle. The x-axis indicates the length (bp) of each genetic feature and the y-axis indicates the percentage of genes that have the corresponding length. Features of the predicted (GLEAN) gene sets for soft-shell turtle and green sea turtle are shown together with those of Western clawed frog, chicken and anole lizard (from Ensembl release 64).

Supplementary Figure 5. Evolutionarily conserved profiles of genes in two turtles. (a-b) Protein orthology comparisons of predicted genes. (a) Gene orthologues of chicken, zebra finch, green sea turtle, soft-shell turtle, saltwater crocodile, American alligator, anole lizard, dog, human, platypus, Western clawed frog and medaka. 1:1:1 indicates conserved single-copy genes. N:N:N indicates genes that are not 1:1:1, or genes that have sub-family homologs in any of the 12 species. “Sauropsida specific” indicates genes that are present in at least two Sauropsida species and in no non-Sauropsida species. “Non-Sauropsida specific” indicates genes that are present in at least two non-Sauropsida species and in no Sauropsida species. “Patchy” indicates orthologs that are present in at least one Sauropsida species and at least one non-Sauropsida species, but not in all 12 species. SD, species-specific duplicated genes; ND, species-specific genes. (b) Venn diagram showing the shared orthologous groups among the genomes of green sea turtle, soft-shell turtle, anole lizard and human. Numbers of miRNAs are not to scale in the Venn diagram.


Proportion (%) of total nucleotide bases, by repeat type:

| Category | Subcategory | Lizard | Soft-shell turtle | Green sea turtle | Chicken |
|---|---|---|---|---|---|
| SINE | ALU | 0.01 | 0.00 | 0.00 | 0.00 |
| SINE | MIR | 0.18 | 0.52 | 1.08 | 0.23 |
| SINE | Total | 0.26 | 1.03 | 2.09 | 0.35 |
| LINE | LINE1 | 0.19 | 0.01 | 0.01 | 0.03 |
| LINE | LINE2 | 0.13 | 0.04 | 0.17 | 0.01 |
| LINE | L3/CR1 | 3.06 | 6.26 | 5.73 | 4.56 |
| LINE | Total | 3.47 | 6.32 | 5.91 | 4.60 |
| LTR elements | ERVL | 0.00 | 0.00 | 0.00 | 0.05 |
| LTR elements | ERVL-MaLRs | 0.00 | 0.00 | 0.00 | 0.00 |
| LTR elements | ERV_class I | 0.04 | 0.01 | 0.01 | 0.02 |
| LTR elements | ERV_class II | 0.00 | 0.00 | 0.00 | 0.01 |
| LTR elements | Total | 0.04 | 0.03 | 0.04 | 0.09 |
| DNA elements | hAT-Charlie | 0.04 | 0.09 | 0.07 | 0.10 |
| DNA elements | TcMar-Tigger | 0.38 | 0.05 | 0.06 | 0.01 |
| DNA elements | Total | 0.46 | 0.34 | 0.51 | 0.17 |
| Unclassified | | 0.01 | 0.08 | 0.18 | 0.04 |
| Total interspersed repeats | | 4.24 | 7.79 | 8.73 | 5.24 |
| Small RNA | | 0.12 | 0.30 | 0.59 | 0.06 |
| Satellites | | 0.00 | 0.00 | 0.00 | 0.00 |
| Simple repeats | | 1.38 | 0.31 | 0.27 | 0.56 |
| Low complexity | | 0.69 | 0.41 | 0.41 | 0.54 |
| Total number of bases masked | | 6.43 | 8.53 | 9.46 | 6.34 |

Supplementary Table 6. General statistics of repeat elements in anole lizard, two turtles and chicken.

| Gene set | Number of predicted coding genes | Average CDS length (bp) | Average exons per gene | Average exon length (bp) | Average intron length (bp) |
|---|---|---|---|---|---|
| Soft-shell turtle (Ps-ENS_gene) | 18,188 | 1,587 | 7 | 259 | 4,644 |
| Soft-shell turtle (Ps-BGI_gene) | 23,649 | 1,268 | 7 | 187 | 4,211 |
| Green sea turtle | 19,633 | 1,456 | 8 | 179 | 3,860 |
| Saltwater crocodile | 17,795 | 972 | 5 | 189 | 2,715 |
| American alligator | 17,611 | 1,184 | 7 | 170 | 2,690 |

Supplementary Table 7. General statistics of predicted protein-coding genes of turtles and crocodiles.

| Type | Ps-BGI_gene: number | Ps-BGI_gene: percent (%) | Ps-ens_gene: number | Ps-ens_gene: percent (%) | Cm_gene: number | Cm_gene: percent (%) |
|---|---|---|---|---|---|---|
| Total KOGs | 430 | - | 430 | - | 438 | - |
| One KOG aligns one gene | 398 | 92.56 | 408 | 94.88 | 390 | 89.04 |
| CDS overlap > 0.8 | 163 | 37.91 | 278 | 64.65 | 178 | 40.64 |
| CDS overlap > 0.5 | 281 | 65.35 | 382 | 88.84 | 287 | 65.53 |
| One KOG aligns several genes | 3 | - | 4 | - | 3 | - |
| One KOG aligns zero genes | 29 | - | 18 | - | 45 | - |

Supplementary Table 8. Coverage rate of CEGMA-predicted Eukaryotic Orthologous Group (KOG) genes by annotated gene sequences. “Cm_gene” and “Ps-BGI_gene” indicate gene sets predicted by BGI’s annotation pipeline for the green sea turtle and soft-shell turtle, respectively. “Ps-ens_gene” indicates a gene set predicted by the Ensembl annotation for the soft-shell turtle. The table reports the overlap, in gene numbers, between the KOG genes predicted by CEGMA and the three gene sets in the paper. The coverage of the assemblies was assessed using the CEGMA (ref. 115) program (version 2.3) (ref. 116).


| Genome feature | Soft-shell turtle | Green sea turtle |
|---|---|---|
| Genome size (Gb) | 2.21 | 2.24 |
| Total clean data (Gb) | 221.71 | 180.94 |
| Assembly contig N50 length (Kb) | 21.91 | 20.39 |
| Assembly scaffold N50 length (Mb) | 3.33 | 3.78 |
| Number of scaffolds (> 2 Kb) | 4,548 | 5,442 |
| Percent of repeat (%) | 42.47 | 37.35 |
| GC content (%) | 44.4 | 43.5 |

| Gene set feature | Soft-shell turtle (Ps-ens_gene) | Soft-shell turtle (Ps-BGI_gene) | Green sea turtle (gene) |
|---|---|---|---|
| Number of genes | 19,327 | 23,649 | 19,633 |
| Genes with InterPro annotation | 16,062 | 14,719 | 14,164 |
| Genes with GO annotation | 15,154 | 12,492 | 12,043 |
| Total length of coding region (bp) | 27,939,847 | 29,981,919 | 28,581,024 |

Supplementary Table 9. Genome features of the soft-shell and green sea turtles.

| Sequence type | Gene number | Sequence length (bp) | RAxML log(likelihood) | RAxML model | PhyML log(likelihood) | PhyML model |
|---|---|---|---|---|---|---|
| Protein | 1,113 | 913,304 | -8,187,925 | JTT+gamma | -8,188,695 | JTT+gamma |
| Position 1 | 1,113 | 913,304 | -4,565,465 | GTR+gamma | -4,565,465 | GTR+gamma |
| Position 2 | 1,113 | 913,304 | -3,808,462 | GTR+gamma | -3,833,686 | HKY85+gamma |
| Position 1+2 | 1,113 | 1,826,608 | -8,422,863 | GTR+gamma | -8,422,862 | GTR+gamma |
| Position 3 | 1,113 | 913,304 | -7,708,981 | GTR+gamma | -7,708,981 | GTR+gamma |
| CDS | 1,113 | 2,739,912 | -16,558,373 | GTR+gamma | -16,579,999 | HKY85+gamma |

Supplementary Table 10. Single-copy gene families identified for phylogenetic analysis.

| Tree # | Tree topology (a) | ΔlogL ± S.E. | P-value, KH-test | P-value, SH-test | P-value, AU-test |
|---|---|---|---|---|---|
| 1 | (((((HS,CF),OA),((((TG,GG),(CP,AM)),(PS,CM)),AC)),XT),OL) | ML | 1.00 | 1.00 | 1.00 |
| 2 | (((((HS,CF),OA),(((TG,GG),((CP,AM),(PS,CM))),AC)),XT),OL) | 1069.4 ± 160.9 | 0.00 | 0.00 | 0.00 |
| 3 | (((((HS,CF),OA),(((CP,AM),((TG,GG),(PS,CM))),AC)),XT),OL) | 3778.2 ± 122.3 | 0.00 | 0.00 | 0.00 |
| 4 | (((((HS,CF),OA),((((TG,GG),(CP,AM)),AC),(PS,CM))),XT),OL) | 5119.2 ± 127.5 | 0.00 | 0.00 | 0.00 |

Supplementary Table 11. Statistical assessment of the phylogenetic position of the turtles using the maximum-likelihood method. (a) The first letters of the genus and species names are shown for the individual species involved in the analysis (see Supplementary Figure 6 for details).


[Supplementary Figure 6 image. Panel a: the three hypotheses [I], [II] and [III]. Panel b: trees of the 12 species labelled CDS, position 1+2, position 1, position 2 and position 3, with bootstrap values of 100 on the internal branches.]

Supplementary Figure 6. Phylogenetic analysis supports a close relationship between turtle and bird/crocodilian lineages. (a) Three major hypotheses of turtle origin, illustrating turtles as the [I] sister group to the lizard-snake-tuatara (Lepidosauria) clade, [II] sister group to birds and crocodilians (Archosauria), or [III] outside the Diapsida (a clade composed of Archosauria and Lepidosauria). (b) Phylogenetic trees of 12 species constructed with RAxML under the GTR/JTT+gamma model based on CDS, peptides, 1st codon position, 2nd codon position, 3rd codon position, and 1st + 2nd codon positions of the codon sequences of 1,113 genes. Each tree was run with 1,000 replications. All internal branches of the trees were 100% bootstrap supported. The trees shown are based on CDS, peptides, 1st + 2nd codon positions, 1st codon position, 2nd codon position, and 3rd codon position sequences. Note that all tree topologies, except those from codon position 3 and CDS, support hypothesis [II] of panel a, and none of the trees support hypotheses [I] or [III]. The slightly different topology of the trees constructed with CDS and codon position 3 is presumably due to mutation saturation in the 3rd codon position, as observed in long branches.

| Method | Sequence type | Soft-shell turtle / green sea turtle: age (Mya) | 95% CI (Mya) | Crocodiles / birds: age (Mya) | 95% CI (Mya) | Turtles / birds: age (Mya) | 95% CI (Mya) |
|---|---|---|---|---|---|---|---|
| PAML mcmctree (refs. 50-52) | Codon 1+2 | 114.5 | 38.2-216.4 | 241.2 | 234.7-250.0 | 257.4 | 248.3-267.9 |
| PAML mcmctree (refs. 50-52) | Codon 1 | 114.1 | 35.6-219.4 | 241.1 | 234.8-250.0 | 257.4 | 248.0-268.3 |
| PAML mcmctree (refs. 50-52) | Codon 2 | 117.5 | 36.2-216.2 | 241.2 | 234.8-250.1 | 257.8 | 248.7-268.3 |
| PAML mcmctree (refs. 50-52) | Protein | 118.3 | 30.3-226.2 | 241.1 | 234.8-250.0 | 259.1 | 248.9-271.0 |
| Multidivtime (refs. 135-136) | Codon 1 | 188.7 | 170.8-203.8 | 237.8 | 235.1-243.9 | 255.6 | 250.4-260.7 |
| Multidivtime (refs. 135-136) | Codon 2 | 185.8 | 165.4-202.0 | 238.7 | 235.1-246.5 | 259.8 | 253.3-267.0 |
| r8s (ref. 137) | Codon 1 | 122.4 | - | 235.0 | - | 245.7 | - |
| r8s (ref. 137) | Codon 2 | 136.7 | - | 235.0 | - | 247.0 | - |
| r8s (ref. 137) | Protein | 125.2 | - | 235.0 | - | 247.2 | - |

Supplementary Table 12. Estimation of divergence time between turtles and crocodilians / birds. Several different sets of sequences were taken as input to estimate split times. The fossil calibration times used for the PAML mcmctree and r8s analyses are described in Supplementary Table 13, whereas only four calibrations, (2), (3), (4) and (7) in Supplementary Table 13, were used for the Multidivtime analysis.


| # | Species 1 | Species 2 | Lower bound (Mya) | Upper bound (Mya) | Reference |
|---|---|---|---|---|---|
| (1) | Alligatoridae | Crocodylidae | 70.60 | 83.50 | Brochu et al. 1999 (ref. 145) |
| (2) | Galliformes | Passeriformes | 66.00 | 86.50 | Benton et al. 2007 (ref. 146) |
| (3) | Aves | Crocodylidae | 235.00 | 250.40 | Benton et al. 2007 (ref. 146) |
| (4) | Euarchontoglires | Laurasiatheria | 95.30 | 113.00 | Benton et al. 2007 (ref. 146) |
| (5) | Homo sapiens | Ornithorhynchus anatinus | 162.50 | 191.10 | Benton et al. 2007 (ref. 146) |
| (6) | Sauropsida | Mammalia | 312.30 | 330.40 | Benton et al. 2007 (ref. 146) |
| (7) | Mammalia | Amphibia | 330.40 | 350.10 | Benton et al. 2007 (ref. 146) |
| (8) | Homo sapiens | Oryzias latipes | 416.00 | 421.75 | www.fossilrecord.net (ref. 147) |

Supplementary Table 13. Calibration time points used in split time estimation.

[Supplementary Figure 7 image. Panels a and b: time-calibrated trees of the 12 species, with node ages (Myr ago) and their ranges in parentheses, drawn against a geological time scale running from the Silurian (444 Myr ago) through the Devonian, Carboniferous, Permian, Triassic, Jurassic, Cretaceous and Paleogene to the present.]

Supplementary Figure 7. Estimated divergence time of vertebrate lineages. Divergence times of 12 species were estimated by PAML mcmctree (refs. 50-52) based on the 1st + 2nd codon positions (a) and the 2nd codon position (b). Myr, million years. The eight calibration times (dark red circles) were adopted from Brochu et al. 1999 (ref. 145), Benton et al. (ref. 146) and www.fossilrecord.net (ref. 147), and comprise the Alligatoridae-Crocodylidae divergence (70.6-83.5 Myr ago), Galliformes-Passeriformes divergence (66-86.5 Myr ago), Aves-Crocodylidae divergence (235-250.4 Myr ago), Euarchontoglires-Laurasiatheria divergence (95.3-113 Myr ago), Homo sapiens-Ornithorhynchus anatinus divergence (162.5-191.1 Myr ago), Sauropsida-Mammalia divergence (312.3-330.4 Myr ago), Mammalia-Amphibia divergence (330.4-350.1 Myr ago) and Homo sapiens-Oryzias latipes divergence (416-421.75 Myr ago).


| Family ID | Gene counts in the 12 species | Turtles/Others ratio | Annotation |
|---|---|---|---|
| 1092 | 0 1 0 16 20 0 4 318 105 6 30 10 | 24.3 | Olfactory receptor, Class I |
| 1369 | 0 4 11 72 65 0 12 159 27 24 135 57 | 2.4 | Olfactory receptor, Class II |
| 1358 | 0 131 7 43 45 0 0 142 41 0 1 8 | 3.9 | Olfactory receptor, Class II |
| 1395 | 0 3 1 11 10 1 2 125 54 5 50 26 | 8.2 | Olfactory receptor, Class I |
| 1388 | 0 4 0 5 4 0 0 112 44 4 13 5 | 22.3 | Olfactory receptor, Class I |
| 1178 | 0 5 5 31 24 2 3 23 128 6 8 21 | 7.2 | Zinc finger protein |
| 936 | 8 34 28 48 37 4 34 74 22 52 30 52 | 1.5 | Immunoglobulin V-set |
| 873 | 1 2 40 5 21 0 4 58 30 19 22 73 | 2.4 | Zinc finger protein |
| 1377 | 0 0 12 49 28 0 0 75 10 79 3 12 | 2.3 | Olfactory receptor, Class II |
| 925 | 11 52 2 58 14 0 0 51 25 8 17 45 | 1.8 | Immunoglobulin V-set |
| 9945 | 1 3 2 0 2 0 0 70 1 0 2 3 | 27.3 | HAT dimerization |
| 1347 | 0 5 9 67 71 1 3 54 12 19 53 34 | 1.3 | Olfactory receptor, Class II |
| 926 | 0 8 18 80 24 3 8 40 24 17 70 39 | 1.2 | Immunoglobulin V-set |
| 641 | 1 6 4 6 6 3 3 41 20 4 7 7 | 6.5 | Peptidase S1/S6, chymotrypsin/Hap |
| 1389 | 3 10 0 11 12 0 1 41 11 8 40 11 | 2.7 | Olfactory receptor, Class I |
| 1361 | 0 0 0 5 5 0 0 45 6 1 13 8 | 8.0 | Olfactory receptor, Class II |

Supplementary Table 14. Gene families expanded in the turtle lineage. For each species, the number of genes belonging to each gene family is shown. The 12 species are dog, human, platypus, medaka, chicken, zebra finch, anole lizard, Western clawed frog, American alligator, saltwater crocodile, soft-shell turtle (Ps-BGI_gene) and green sea turtle; counts are given in the column order of the original table, in which the eighth and ninth values belong to the two turtle species. Gene families are listed in descending order with respect to the number of soft-shell turtle and green sea turtle genes. Only the gene families containing at least one non-turtle gene are shown. “Turtles/Others” was calculated as the mean number of genes in the two turtle species divided by the mean number among the 10 non-turtle species, and the gene families with a “Turtles/Others” value > 5 are depicted in bold.
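The “Turtles/Others” value defined in the legend of Supplementary Table 14 can be reproduced with a short calculation. The sketch below is illustrative only; the function name is hypothetical, and the split of the family 1092 counts into turtle and non-turtle values is an assumption (the pair 318 and 105 is taken as the turtle counts because it reproduces the published ratio).

```python
def turtles_others_ratio(turtle_counts, other_counts):
    """Mean gene count in the turtles divided by the mean count in the non-turtle species."""
    mean_turtles = sum(turtle_counts) / len(turtle_counts)
    mean_others = sum(other_counts) / len(other_counts)
    if mean_others == 0:
        # The table only lists families with at least one non-turtle gene.
        raise ValueError("ratio undefined: no genes in the non-turtle species")
    return mean_turtles / mean_others


# Family 1092 (Olfactory receptor, Class I), counts as read from the table row.
family_1092 = turtles_others_ratio(
    turtle_counts=[318, 105],
    other_counts=[0, 1, 0, 16, 20, 0, 4, 6, 30, 10],
)
print(round(family_1092, 1))  # 24.3, matching the tabulated value
```

Here the turtle mean is 211.5 and the non-turtle mean is 8.7, so the family exceeds the threshold of 5 used for highlighting.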


[Supplementary Figure 8 image. Panel a: maximum-likelihood tree of PRLHR sequences grouped as PRLHR1, PRLHR2, PRLHR3 and PRLHR4 (scale bar, 0.1 substitutions / site). Panel b: orthology table of PRLHR1-PRLHR4 for turtles, elephant shark, teleost fishes, chicken, anole lizard, eutherian mammals, sea lamprey, Xenopus tropicalis, platypus and opossum. Panel c: maximum-likelihood tree of globin sequences, including the Globin E, Globin Y and cyclostome Hbs groups (scale bar, 0.2 substitutions / site).]

Supplementary Figure 8. PRLHR and globin gene family members are well retained in the turtle lineage. (a) Molecular phylogenetic tree inferred with the maximum-likelihood method. Turtle sequences are in red, and non-turtle Sauropsida sequences are in blue. Although the topology of the tree is not always consistent with the results of our phylogenetic analysis (Supplementary Figure 6), this is likely due to the low bootstrap values in this analysis. (b) Orthology table showing the identified gene repertoire of diverse vertebrates. (c) A molecular phylogenetic tree including vertebrate myoglobin, globin E, globin Y, cytoglobin, and cyclostome haemoglobin (Hbs) sequences was inferred with the maximum-likelihood method. The grouping of these genes was supported in a previous study by Hoffmann et al. (ref. 148), in which other globin subfamilies, such as neuroglobin, were excluded from this group. As in panel a, the topology of the tree is not always consistent with the results of our phylogenetic analysis (Supplementary Figure 6), which is likely due to the low bootstrap values in this analysis.


OR gene groups (columns): α (Class I), β (Class I), γ (Class II), δ, ε, ζ, η
Turtles: √ √ √
Birds: √ √
Lizards: √ √
Mammals: √ √ √
Frogs: √ √ √ √ √ √
Bony Fishes: √ a) √ √ √ √

Supplementary Table 19. Turtles have group α, β, and γ OR genes. “√” indicates the presence of OR genes. a) One gene from group γ is present in the zebrafish genome, but members of this group of OR genes are absent from all other fish genomes examined (ref. 91).

| Species | Class I, α (%) | Class I, β | Class II, γ | Number of intact genes | Number of pseudogenes or gene fragments | Total number of OR genes |
|---|---|---|---|---|---|---|
| Soft-shell turtle | 532 (46.8) | 1 | 604 | 1137 | 607 | 1744 |
| Green sea turtle | 158 (62.2) | 1 | 95 | 254 | 595 | 849 |
| Chicken | 9 (4.3) | 0 | 202 | 211 | 222 | 433 |
| Zebra finch | 2 (1.1) | 0 | 180 | 182 | 362 | 544 |
| Anole lizard | 1 (0.9) | 0 | 115 | 116 | 34 | 146 |
| Human | 61 (15.4) | 0 | 335 | 396 | 425 | 821 |
| Rat | 132 (10.9) | 2 | 1073 | 1207 | 560 | 1767 |
| Dog | 159 (19.6) | 1 | 651 | 811 | 289 | 1100 |
| Western clawed frog | 8 (1.0) | 14 | 752 | 824 * | 814 | 1638 |

Supplementary Table 20. A large number of OR genes is found in turtles. The entire OR gene repertoire of the zebra finch genome (ref. 149) was identified using the same methods as for the two turtle genomes. The numbers of OR genes for chicken, anole lizard, and Western clawed frog were adopted from Niimura 2009 (ref. 91), those for human from Matsui et al. 2010 (ref. 150), and those for rat and dog from Niimura et al. 2007 (ref. 55). * The sum of the group α, β, and γ genes is not equal to this number because it includes OR genes that belong to other groups.


| GO_ID | GO_Term | GO_Class | GO_level | Gene_Num | FDR |
|---|---|---|---|---|---|
| GO:0050909 | sensory perception of taste | BP | 7 | 27 | 1.46E-31 |
| GO:0008527 | taste receptor activity | MF | 6 | 11 | 1.18E-14 |
| GO:0050912 | detection of chemical stimulus involved in sensory perception of taste | BP | 6 | 11 | 2.10E-12 |
| GO:0007600 | sensory perception | BP | 5 | 28 | 2.35E-10 |
| GO:0009593 | detection of chemical stimulus | BP | 4 | 12 | 1.96E-06 |
| GO:0045095 | keratin filament | CC | 5 | 12 | 5.29E-06 |
| GO:0019012 | virion | CC | 2 | 4 | 3.69E-05 |
| GO:0030345 | structural constituent of tooth enamel | MF | 5 | 4 | 9.51E-05 |
| GO:0051606 | detection of stimulus | BP | 3 | 14 | 1.80E-04 |
| GO:0019236 | response to pheromone | BP | 5 | 3 | 3.36E-04 |
| GO:0016503 | pheromone receptor activity | MF | 6 | 3 | 3.36E-04 |
| GO:0042612 | MHC class I protein complex | CC | 5 | 5 | 1.15E-03 |
| GO:0004930 | G-protein coupled receptor activity | MF | 5 | 27 | 3.32E-03 |
| GO:0070198 | protein localization to chromosome, telomeric region | BP | 8 | 3 | 3.69E-03 |
| GO:0004866 | endopeptidase inhibitor activity | MF | 5 | 10 | 9.28E-03 |
| GO:0033038 | bitter taste receptor activity | MF | 7 | 2 | 9.93E-03 |
| GO:0034509 | centromeric core chromatin assembly | BP | 7 | 2 | 9.93E-03 |
| GO:0005576 | extracellular region | CC | 2 | 54 | 9.93E-03 |
| GO:0038023 | signaling receptor activity | MF | 3 | 36 | 1.08E-02 |
| GO:0032205 | negative regulation of telomere maintenance | BP | 4 | 3 | 1.12E-02 |
| GO:0007186 | G-protein coupled receptor protein signaling pathway | BP | 5 | 32 | 1.15E-02 |
| GO:0030101 | natural killer cell activation | BP | 5 | 5 | 1.15E-02 |
| GO:0004888 | transmembrane signaling receptor activity | MF | 4 | 34 | 1.15E-02 |
| GO:0042379 | chemokine receptor binding | MF | 6 | 6 | 1.15E-02 |
| GO:0050913 | sensory perception of bitter taste | BP | 8 | 3 | 1.36E-02 |
| GO:0005125 | cytokine activity | MF | 5 | 11 | 1.64E-02 |
| GO:0004867 | serine-type endopeptidase inhibitor activity | MF | 6 | 7 | 1.82E-02 |
| GO:0033557 | Slx1-Slx4 complex | CC | 4 | 2 | 2.04E-02 |
| GO:0000783 | nuclear telomere cap complex | CC | 5 | 3 | 2.05E-02 |
| GO:0002690 | positive regulation of leukocyte chemotaxis | BP | 5 | 5 | 2.34E-02 |
| GO:0032103 | positive regulation of response to external stimulus | BP | 4 | 8 | 2.91E-02 |
| GO:0001664 | G-protein-coupled receptor binding | MF | 5 | 10 | 2.91E-02 |
| GO:0008009 | chemokine activity | MF | 6 | 5 | 2.91E-02 |
| GO:0000930 | gamma-tubulin complex | CC | 4 | 3 | 2.95E-02 |
| GO:0010792 | DNA double-strand break processing involved in repair via single-strand annealing | BP | 9 | 2 | 3.23E-02 |
| GO:0008821 | crossover junction endodeoxyribonuclease activity | MF | 9 | 2 | 3.23E-02 |
| GO:0031848 | protection from non-homologous end joining at telomere | BP | 7 | 2 | 3.23E-02 |
| GO:0019763 | immunoglobulin receptor activity | MF | 5 | 2 | 3.23E-02 |
| GO:0032816 | positive regulation of natural killer cell activation | BP | 6 | 3 | 3.30E-02 |
| GO:0003008 | system process | BP | 3 | 41 | 3.64E-02 |
| GO:0048520 | positive regulation of behavior | BP | 4 | 6 | 3.70E-02 |
| GO:0007004 | telomere maintenance via telomerase | BP | 8 | 3 | 3.78E-02 |
| GO:0050877 | neurological system process | BP | 4 | 32 | 4.39E-02 |
| GO:0032202 | telomere assembly | BP | 6 | 2 | 4.44E-02 |
| GO:0032211 | negative regulation of telomere maintenance via telomerase | BP | 5 | 2 | 4.44E-02 |

Supplementary Table 21. GO enrichment analysis of genes lost in the turtle lineage. An enrichment analysis for the genes lost in both turtles was performed based on the algorithm presented by GOstat (ref. 97), using human genes as the background. The p-value was approximated using the chi-square test. Fisher’s exact test was used when any gene count was below 5, because the chi-square test is inaccurate for such small counts. This program was implemented as a pipeline (ref. 134). To provide succinct results in the GO and IPR enrichment analyses, if one of the items was ancestral to another and the enriched gene lists of these two items were the same, the ancestral item was deleted from the results. To adjust for multiple testing, we calculated the False Discovery Rate (FDR) using the Benjamini-Hochberg method (ref. 151) for each class.
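For readers unfamiliar with the Benjamini-Hochberg adjustment named in the legend of Supplementary Table 21, a minimal generic sketch is given below. It is a textbook formulation of the step-up procedure, not the code of the pipeline used in the paper, and the function name is hypothetical. In the paper the adjustment was applied separately to each GO class.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (FDR), returned in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # stay monotone: each is the minimum of p * m / rank over all
    # ranks at or above the current one.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For example, the p-values 0.01, 0.04, 0.03 and 0.005 are adjusted to 0.02, 0.04, 0.04 and 0.02, respectively.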


| IPR_ID | IPR_Title | Gene_Num | P-value |
|---|---|---|---|
| IPR001909 | Krueppel-associated box | 28 | 1.70E-18 |
| IPR000725 | Olfactory receptor | 80 | 1.70E-18 |
| IPR000276 | GPCR, rhodopsin-like, 7TM | 82 | 3.07E-17 |
| IPR015880 | Zinc finger, C2H2-like | 39 | 1.47E-13 |
| IPR007087 | Zinc finger, C2H2-type | 40 | 1.37E-12 |
| IPR006689 | ARF/SAR superfamily | 8 | 0.000481 |
| IPR019954 | Ubiquitin conserved site | 7 | 0.000821 |
| IPR007960 | Mammalian taste receptor | 24 | 0.000876 |
| IPR002472 | Palmitoyl protein thioesterase | 8 | 0.001884 |
| IPR017907 | Zinc finger, RING-type, conserved site | 8 | 0.001884 |
| IPR020904 | Short-chain dehydrogenase/reductase, conserved site | 10 | 0.002125 |
| IPR000626 | Ubiquitin | 9 | 0.004343 |
| IPR019955 | Ubiquitin supergroup | 9 | 0.004343 |
| IPR021925 | Protein of unknown function DUF3538 | 7 | 0.006798 |
| IPR002198 | Short-chain dehydrogenase/reductase SDR | 10 | 0.01048 |
| IPR022734 | Apolipoprotein M | 7 | 0.018826 |
| IPR002347 | Glucose/ribitol dehydrogenase | 10 | 0.018826 |
| IPR018957 | Zinc finger, C3HC4 RING-type | 9 | 0.028263 |

Supplementary Table 22. IPR enrichment analysis of genes lost in both turtles.

| Symbol | Gene name | ω0 (average) | ω1 (other) | ω2 (turtle) | p-value |
|---|---|---|---|---|---|
| MGST3 | microsomal glutathione S-transferase 3 | 0.1216 | 0.1028 | 5.6803 | 5.93E-04 |
| ABCB1 | ATP-binding cassette, sub-family B (MDR/TAP), member 1 | 0.1386 | 0.129 | 2.8737 | 0.00E+00 |
| FAH | fumarylacetoacetate hydrolase (fumarylacetoacetase) | 0.1293 | 0.1087 | 2.2988 | 2.06E-04 |
| RFC4 | replication factor C (activator 1) 4, 37kDa | 0.1224 | 0.1044 | 1.6834 | 1.84E-06 |
| HEATR2 | HEAT repeat containing 2 | 0.1817 | 0.1676 | 1.5883 | 1.82E-03 |
| APOBEC2 | apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like 2 | 0.1129 | 0.0986 | 1.5813 | 1.51E-03 |
| SCYL3 | SCY1-like 3 (S. cerevisiae) | 0.2308 | 0.214 | 1.2154 | 1.46E-03 |
| PDC | phosducin | 0.0796 | 0.0701 | 1.2149 | 4.60E-04 |
| METTL15 | methyltransferase like 15 | 0.1535 | 0.1401 | 1.1898 | 1.56E-03 |
| MFSD1 | major facilitator superfamily domain containing 1 | 0.0968 | 0.0804 | 0.8332 | 1.29E-04 |
| EIF4H | eukaryotic translation initiation factor 4H | 0.0642 | 0.0543 | 0.7455 | 1.06E-03 |
| SLC38A4 | solute carrier family 38, member 4 | 0.1571 | 0.1409 | 0.61 | 1.59E-04 |
| DDX20 | DEAD (Asp-Glu-Ala-Asp) box polypeptide 20 | 0.1845 | 0.1699 | 0.5791 | 1.58E-03 |
| ABCE1 | ATP-binding cassette, sub-family E (OABP), member 1 | 0.0174 | 0.0157 | 0.5744 | 4.70E-04 |
| MSH3 | mutS homolog 3 (E. coli) | 0.1465 | 0.1328 | 0.5679 | 6.37E-05 |
| ADHFE1 | alcohol dehydrogenase, iron containing, 1 | 0.1608 | 0.138 | 0.5627 | 7.66E-04 |
| HRH3 | histamine receptor H3 | 0.1155 | 0.1012 | 0.5527 | 2.12E-03 |
| SLC34A2 | solute carrier family 34 (sodium phosphate), member 2 | 0.1541 | 0.1382 | 0.5371 | 1.02E-04 |
| NDFIP2 | Nedd4 family interacting protein 2 | 0.1436 | 0.1198 | 0.5183 | 1.37E-04 |
| MIPEP | mitochondrial intermediate peptidase | 0.1417 | 0.1311 | 0.5172 | 1.67E-04 |
| VAT1L | vesicle amine transport protein 1 homolog (T. californica)-like | 0.0894 | 0.0731 | 0.4804 | 4.25E-07 |
| CPNE3 | copine III | 0.1177 | 0.1016 | 0.4724 | 8.55E-07 |
| TRAK2 | trafficking protein, kinesin binding 2 | 0.1613 | 0.1399 | 0.4705 | 2.56E-09 |
| RAD21 | RAD21 homolog (S. pombe) | 0.0414 | 0.0359 | 0.4642 | 6.64E-05 |
| VPS13C | vacuolar protein sorting 13 homolog C (S. cerevisiae) | 0.1519 | 0.1436 | 0.4557 | 4.63E-04 |
| CLUAP1 | clusterin associated protein 1 | 0.0968 | 0.0836 | 0.4252 | 1.22E-03 |
| COL11A1 | collagen, type XI, alpha 1 | 0.1149 | 0.1062 | 0.4087 | 1.14E-05 |
| SLC40A1 | solute carrier family 40 (iron-regulated transporter), member 1 | 0.1181 | 0.1097 | 0.4027 | 1.47E-03 |
| ZBTB37 | zinc finger and BTB domain containing 37 | 0.0834 | 0.0635 | 0.4026 | 0.00E+00 |
| JAK1 | Janus kinase 1 | 0.0679 | 0.0613 | 0.3983 | 2.54E-06 |
| DENND4C | DENN/MADD domain containing 4C | 0.1589 | 0.1514 | 0.3907 | 8.86E-04 |
| ANKIB1 | ankyrin repeat and IBR domain containing 1 | 0.072 | 0.0614 | 0.3874 | 1.48E-06 |
| MKX | mohawk homeobox | 0.1158 | 0.0994 | 0.3772 | 4.79E-05 |
| PPID | peptidylprolyl isomerase D | 0.1228 | 0.11 | 0.3669 | 2.93E-04 |
| DEPDC1B | DEP domain containing 1B | 0.1614 | 0.1491 | 0.3611 | 8.20E-04 |
| MORC3 | MORC family CW-type zinc finger 3 | 0.1335 | 0.1229 | 0.3555 | 5.79E-04 |
| IBTK | inhibitor of Bruton agammaglobulinemia tyrosine kinase | 0.1418 | 0.1355 | 0.3474 | 2.05E-03 |
| KIT | v-kit Hardy-Zuckerman 4 feline sarcoma viral oncogene homolog | 0.1645 | 0.1515 | 0.3467 | 5.49E-04 |
| GALNS | galactosamine (N-acetyl)-6-sulfate sulfatase | 0.1199 | 0.1099 | 0.3425 | 7.55E-04 |
| TMEM106B | transmembrane protein 106B | 0.1017 | 0.0869 | 0.339 | 1.27E-04 |
| OSBPL8 | oxysterol binding protein-like 8 | 0.1013 | 0.0934 | 0.3303 | 1.96E-05 |
| NUP107 | nucleoporin 107kDa | 0.1175 | 0.1022 | 0.3254 | 4.12E-10 |
| ITSN2 | intersectin 2 | 0.137 | 0.1256 | 0.3149 | 1.51E-06 |
| NR5A2 | nuclear receptor subfamily 5, group A, member 2 | 0.0359 | 0.0282 | 0.298 | 1.12E-07 |
| WDR69 | WD repeat domain 69 | 0.0983 | 0.0859 | 0.2973 | 5.62E-05 |
| BMPR1B | bone morphogenetic protein receptor, type IB | 0.0513 | 0.0432 | 0.2935 | 5.81E-08 |
| TTC21B | tetratricopeptide repeat domain 21B | 0.1212 | 0.1145 | 0.2913 | 4.80E-04 |
| MYBPC1 | myosin binding protein C, slow type | 0.0894 | 0.0829 | 0.2901 | 1.99E-05 |
| PARP1 | poly (ADP-ribose) polymerase 1 | 0.0781 | 0.0715 | 0.2887 | 3.51E-04 |
| ADCY1 | adenylate cyclase 1 (brain) | 0.0743 | 0.0688 | 0.2869 | 6.97E-05 |

Supplementary Table 24. Genes with accelerated evolutionary rates in the turtle lineage. Orthologous genes between soft-shell turtle, green sea turtle, and other related species (chicken, zebra finch, anole lizard, Xenopus tropicalis and platypus) were aligned using the program MUSCLE (ref. 60) and compared with a series of evolutionary models in the likelihood framework using the phylogenetic tree obtained by our analysis. A branch model (ref. 52) was used to detect the average dN/dS ratio (ω) across the tree (ω0), the ω of the branch ancestral to the soft-shell turtle and the green sea turtle (ω2), and the ω of all of the other branches (ω1).

Supplementary Figure 9. Early to late stages of turtle and chicken embryos. Representative images of turtle and chicken embryos used in this study are shown (images are not to scale). TK (Tokita-Kuratani) stages for turtle (ref. 24) and HH stages for chicken (ref. 61) are denoted for each image.


Supplementary Figure 10. Gene Ontology (Biological Process) profiles of 11,602 turtle-chicken orthologous genes. Cumulative bar plots showing biological process-related slim Gene Ontologies (slim GO) of C. mydas (Cm), P. sinensis (Ps) and G. gallus (Gg) genes. The slim GOs of the 11,602 orthologous genes used for the comparative expression analysis are shown between the Ps and Gg bar plots (Ps_gg, soft-shell turtle 11,602 orthologs; Gg_ps, chicken 11,602 orthologs). Importantly, the similar profiles of Ps_gg and Gg_ps confirm that the gene set used for the GXP analysis is comprehensive and not biased in terms of the GO-predicted biological processes.

Supplementary Figure 11. Saturating sequencing depth and detected gene number in each sample. Plots of the expressed gene count versus the simulated RNA-Seq mapped read number show that the RNA-Seq reads for each sample are at a quasi-saturation point. The red bar represents the percentile of the gene repertoire detected in either sample; the blue lines indicate the height of the total gene number. (a-b) Detected genes were counted against random selections of mapped reads for soft-shell turtle (a) and chicken (b) samples. (c-d) Detected genes among the 11,602 orthologues were counted against random selections of mapped reads for soft-shell turtle (c) and chicken (d) samples. The actual read numbers are larger than the numbers of mapped reads. The x-axis indicates the total number of reads for each sample, and the y-axis indicates the number of genes detected (at least one tag mapped to gene coding regions). Tips of the colored pins indicate the read number and expressed genes of each sample. Error bars in each curve indicate the standard deviation produced from two independent random selections. Ps-ens_gene sets were used for this analysis. Essentially the same saturation curve was also observed for the Ps-BGI_gene sets (data not shown).
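The saturation analysis described for Supplementary Figure 11 (counting detected genes in random selections of mapped reads, with two independent selections per depth) can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the input format (one gene ID per mapped read), the function names and the fixed random seed are all hypothetical.

```python
import random
import statistics


def detected_genes(read_gene_ids, n_reads, rng):
    """Number of distinct genes hit by a random subset of n_reads mapped reads."""
    return len(set(rng.sample(read_gene_ids, n_reads)))


def saturation_curve(read_gene_ids, depths, replicates=2, seed=0):
    """(depth, mean detected genes, standard deviation) for each subsampling depth.

    read_gene_ids holds, for every mapped read, the ID of the gene whose
    coding region it maps to (assumed input format). A gene counts as
    detected when at least one sampled read maps to it.
    """
    rng = random.Random(seed)
    curve = []
    for n_reads in depths:
        counts = [detected_genes(read_gene_ids, n_reads, rng)
                  for _ in range(replicates)]
        sd = statistics.stdev(counts) if replicates > 1 else 0.0
        curve.append((n_reads, statistics.mean(counts), sd))
    return curve
```

A curve that flattens as the depth approaches the full number of mapped reads corresponds to the quasi-saturation described in the legend.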


Supplementary Figure 12. Hourglass-like gene expression divergence between selected stages of soft-shell turtle and chicken. Gene expression divergences were calculated for five embryos from soft-shell turtle and chicken with various distance calculation methods and datasets of various depth. (a-d) Expression divergence calculated with the all-reads data. (a) Expression divergence calculated with the 1 - Pearson correlation coefficient. (b) Expression divergence calculated with the 1 - Spearman correlation coefficient. (c) Expression divergence calculated with the total Euclidean distance. (d) Expression divergence calculated with the total Manhattan distance. (e-g) Expression divergence calculated with the mapped-10M reads data. (e) Expression divergence calculated with the 1 - Pearson correlation coefficient. (f) Expression divergence calculated with the total Euclidean distance. (g) Expression divergence calculated with the total Manhattan distance. P values were calculated by ANOVA with heteroskedasticity. Note that similar tendencies are observed among the various distance evaluation methods. Essentially the same results were obtained for the RPKM-normalized data set (data not shown). A similar saturation curve was also observed for Ps-BGI_gene sets (data not shown). Error bars: S.D.
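The four divergence measures named in this legend can be written down compactly. The sketch below is only an illustration in plain Python and assumes that each embryo is represented by a vector of normalized expression values over the same ordered list of orthologous genes; the function names and the toy vectors are hypothetical and are not taken from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def expression_divergence(x, y):
    """Return the four divergence scores for one pair of expression vectors."""
    return {
        "1-pearson": 1.0 - pearson(x, y),
        "1-spearman": 1.0 - spearman(x, y),
        "euclidean": math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))),
        "manhattan": sum(abs(a - b) for a, b in zip(x, y)),
    }


# Hypothetical toy vectors for one turtle embryo and one chicken embryo.
turtle = [1.0, 2.0, 3.0, 4.0]
chicken = [2.0, 4.0, 6.0, 8.0]
scores = expression_divergence(turtle, chicken)
print(scores)
```

The correlation-based scores ignore overall scale, whereas the Euclidean and Manhattan distances do not, which is one reason to compare several measures as the legend does.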

Supplementary Figure 13. The number of genes expressed does not correlate with the highest GXP similarity in the mid-embryonic stages. The bar-plots indicate the number of genes detected (> 1 read count) during soft-shell turtle (left) and chicken (right) embryogenesis (read depth controlled data: mapped 10 M). Although the number of genes detected in each developmental stage showed statistically significant differences (tested by ANOVA, Alpha level = 0.01), no correlation with the conserved expression profiles found in the mid-embryonic stages (Supplementary Figure 18) was observed. In contrast, the number of genes detected in each developmental stage showed a moderate increase during development (5-6 % increase between the earliest and latest stages). Essentially the same saturation curve was also observed for the Ps-BGI_gene sets (data not shown). Error bars: S.D.


Supplementary Figure 14. All-to-all comparison of turtle-chicken embryos reveals the period of most similar GXP. Gene expression divergence scores were compared among all the combinations of paired embryos from soft-shell turtle and chicken (9 Ps stages x 8 Gg stages = 72 combinations) using 11602 orthologous genes. See Supplementary Table 21 for the statistical testing of this result. Panels in (a) represent similarity scores made from the all-reads dataset, and panels in (b) represent those from the mapped-10M reads dataset. Distance calculation methods and normalization methods (in brackets) are shown at the top of each panel. Similar results were also observed for Ps-BGI_gene sets (data not shown). Error bars: S.D.


Distance method | All reads | Randomly mapped-10 M reads

1 - Pearson correlation coefficient: PS_TK11 ↔ GG_HH16, PS_TK11 ↔ GG_HH16, PS_TK15 ↔ GG_HH28, PS_TK15 ↔ GG_HH28

1 - Spearman correlation coefficient: PS_TK11 ↔ GG_HH16, PS_TK15 ↔ GG_HH28, PS_TK13 ↔ GG_HH16, PS_TK11 ↔ GG_HH16, PS_TK11 ↔ GG_HH14, PS_TK15 ↔ GG_HH28, PS_TK13 ↔ GG_HH28, PS_TK13 ↔ GG_HH19, PS_TK11 ↔ GG_HH19, PS_TK11 ↔ GG_HH11

Total Euclidean distance: PS_TK11 ↔ GG_HH16, PS_TK11 ↔ GG_HH16, PS_TK13 ↔ GG_HH16, PS_TK11 ↔ GG_HH14

Total Manhattan distance: PS_TK11 ↔ GG_HH16, PS_TK13 ↔ GG_HH16, PS_TK11 ↔ GG_HH16, PS_TK11 ↔ GG_HH14, PS_TK13 ↔ GG_HH16, PS_TK15 ↔ GG_HH28, PS_TK11 ↔ GG_HH14, PS_TK7 ↔ GG_HH11, PS_TK15 ↔ GG_HH28, PS_TK13 ↔ GG_HH19, PS_TK11 ↔ GG_HH11, PS_TK11 ↔ GG_HH19

Supplementary Table 25. Pairs of turtle-chicken embryos with the highest expression similarity. Pairs of turtle-chicken embryos with the most similar gene expression profiles are shown. Statistical tests (Welch two-sample t-test or Wilcoxon test, depending on whether the statistical prerequisites were satisfied, with a Holm-corrected alpha level; all results p < 0.01) were performed to test the significance of the pairs of embryos with the highest similarity in gene expression. Only results that were reproduced (statistically significant) in the datasets from both normalization methods (RPKM70, TMM69) are shown. Although the results varied depending on the dataset (all reads and 10 M-mapped reads), the normalization method (RPKM and TMM), and the distance calculation method (1 - Pearson, 1 - Spearman, total Euclidean, and total Manhattan), the PS_TK11 ↔ GG_HH16 pair was robustly supported by these analyses.
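The Holm correction mentioned in this legend is a step-down procedure over the sorted p values. The sketch below shows the standard procedure in plain Python as an illustration only; the function name `holm_reject` and the example p values are hypothetical, and the study's own statistical scripts are not reproduced here.

```python
def holm_reject(p_values, alpha=0.01):
    """Holm step-down procedure: return one reject/keep flag per p value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p value (rank = k - 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p values are retained
    return reject


# Hypothetical p values from four pairwise comparisons.
example_p = [0.001, 0.004, 0.02, 0.003]
print(holm_reject(example_p, alpha=0.01))
```

Compared with a plain Bonferroni threshold of alpha / m for every test, the thresholds here relax as smaller p values are rejected, while the family-wise error rate is still controlled.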

Category | Developmental event / organ structures | Soft-shell turtle TK11; X. laevis stage 28 / stage 31; Mouse E9.5; Chicken HH16; D. rerio 24 hpf
Axial structures | Rhombomere | + + + + / + +
Axial structures | Neural crest cells | + + + + / + +
Axial structures | Notochord | + + + + / + +
Axial structures | Somite | + + + + / + +
Axial structures | Neural tube / neural folds partially fused | + + + + / + +
Pharyngeal | Pharyngeal arch | + + + + / + +
Olfactory | Olfactory pit / placode | + + + + / + +
Otic | Auditory system / Otic placode | + + + + / + +
Optic | Lens / lens placode | + + + – / + +
Cardiovascular system | Aortic arches / Heart with chambers | + + + – / + +
Kidney | Mesonephric duct anlagen | + + + + / + +
Epidermal | Epidermis | + + + + / + +
Supplementary Table 26. Anatomical structures shared between TK11 turtle embryo and phylotypic periods of four vertebrate species. +, structure or its rudiment observed; –, not observed. The anatomical features of each embryonic stage were adopted from Kimmel et al.152, Faber et al.153, Hamburger et al.61, and Matthew et al.154. The phylotypic periods of mouse, chicken, X. laevis, and zebrafish were adopted from Irie et al.21 (2011).


Common description | Number of genes in soft-shell turtle | Ensembl gene family ID
Hox genes | 35 | ENSFM00250000001319, ENSFM00320000100119, ENSFM00500000269866, ENSFM00500000269956, ENSFM00500000269958, ENSFM00500000270089, ENSFM00500000270237, ENSFM00500000270471, ENSFM00500000272808, ENSFM00500000273322, ENSFM00500000273516, ENSFM00500000274321, ENSFM00590000916375, ENSFM00600000921180, ENSFM00670001235443, ENSFM00670001235520, ENSFM00670001235520, ENSFM00680001319051
Paired box pax | 7 | ENSFM00420000140576, ENSFM00440000236852, ENSFM00440000236845, ENSFM00420000140576
FGF / FGF receptor | 22 | ENSFM00500000271792, ENSFM00500000271542, ENSFM00500000273333, ENSFM00480000262783, ENSFM00500000272017, ENSFM00500000271676, ENSFM00500000270249, ENSFM00500000270250, ENSFM00500000271664, ENSFM00500000269773, ENSFM00500000275261, ENSFM00400000131981, ENSFM00250000000093
BMP / BMP receptor | 16 | ENSFM00570000851071, ENSFM00500000269724, ENSFM00320000100141, ENSFM00440000236850, ENSFM00430000230170, ENSFM00570000851071, ENSFM00250000000213
Hedgehog (N-term, C-term) | 6 | ENSFM00250000000992, ENSFM00250000000992, ENSFM00250000001574, ENSFM00250000008297, ENSFM00500000270772
Wnt | 19 | ENSFM00670001235288, ENSFM00670001235321, ENSFM00670001235288, ENSFM00670001235377, ENSFM00500000270223, ENSFM00500000270294
Notch / Notch ligands | 6 | ENSFM00500000269589, ENSFM00570000851057, ENSFM00570000851117
 | 7 | ENSFM00250000000221
Activin receptors | 6 | ENSFM00250000000213
TGF-β / receptor | 5 | ENSFM00250000000213, ENSFM00250000005418, ENSFM00250000000840
Supplementary Table 27. Identified developmental toolkit gene families in soft-shell turtle. Developmental toolkit genes105 in soft-shell turtle classified by Ensembl family IDs.


GO ID | GO description | No. of genes within 2-fold range | No. of genes out of 2-fold range | % of genes in 2-fold range
GO:0006611 | protein export from nucleus | 8 | 0 | 100%
GO:0034968 | histone lysine methylation | 10 | 0 | 100%
GO:0071526 | semaphorin-plexin signaling pathway | 9 | 0 | 100%
GO:0018024 | histone-lysine N-methyltransferase activity | 16 | 1 | 97%
GO:0016571 | histone methylation | 11 | 1 | 92%
GO:0030032 | lamellipodium assembly | 14 | 1 | 92%
GO:0000045 | autophagic vacuole assembly | 14 | 2 | 90%
GO:0006367 | transcription initiation from RNA polymerase II promoter | 19 | 2 | 89%
GO:0009987 | cellular process | 15 | 2 | 88%
GO:0005643 | nuclear pore | 13 | 2 | 87%
GO:0019005 | SCF ubiquitin ligase complex | 16 | 3 | 86%
GO:0032436 | positive regulation of proteasomal ubiquitin-dependent protein catabolic process | 15 | 3 | 85%
GO:0006418 | tRNA aminoacylation for protein translation | 21 | 4 | 83%
GO:0016607 | nuclear speck | 28 | 6 | 82%
GO:0000776 | kinetochore | 32 | 8 | 81%
GO:0004812 | aminoacyl-tRNA ligase activity | 21 | 5 | 80%
GO:0006200 | ATP catabolic process | 27 | 8 | 78%
GO:0005741 | mitochondrial outer membrane | 23 | 7 | 78%
GO:0005819 | spindle | 23 | 7 | 77%
GO:0016023 | cytoplasmic membrane-bounded vesicle | 26 | 8 | 76%
GO:0030529 | ribonucleoprotein complex | 34 | 12 | 74%
GO:0004386 | helicase activity | 58 | 20 | 74%
GO:0010468 | regulation of gene expression | 39 | 14 | 74%
GO:0031625 | ubiquitin protein ligase binding | 54 | 20 | 73%
GO:0071013 | catalytic step 2 spliceosome | 35 | 14 | 72%
GO:0090090 | negative regulation of canonical Wnt receptor signaling pathway | 39 | 16 | 71%
GO:0003714 | transcription corepressor activity | 41 | 17 | 71%
GO:0005488 | binding | 201 | 84 | 71%
GO:0006511 | ubiquitin-dependent protein catabolic process | 63 | 27 | 70%
GO:0007275 | multicellular organismal development | 54 | 23 | 70%
Supplementary Table 28. Groups of genes that show similar expression levels in turtle/chicken phylotypic stages. Genes grouped by their GO IDs were investigated for similar expression levels (less than 2-fold change) between soft-shell turtle TK11 and chicken HH16 embryos.
Groups having a significantly higher ratio (compared to all genes) of similar expression are listed (Fisher’s exact test, alpha level = 0.01; the table is restricted to GO groups having more than 70% of similarly expressed genes). All of the results were corroborated by all of the data sets (all reads, mapped 10 M reads, RPKM normalization, and TMM normalization). The number of genes and percentiles were averaged within the four different data sets and are rounded off to the closest whole number.
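The two steps in this legend, classifying a gene as similarly expressed (less than 2-fold change) and testing whether a GO group is enriched for such genes, can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code of the study: the pseudocount that guards against zero expression and the one-sided form of Fisher's exact test are assumptions made here, and all function names are hypothetical.

```python
from math import comb, log2


def within_two_fold(turtle_expr, chicken_expr, pseudocount=1.0):
    """True when the expression ratio is below 2-fold (pseudocount is an assumption)."""
    ratio = (turtle_expr + pseudocount) / (chicken_expr + pseudocount)
    return abs(log2(ratio)) < 1.0


def percent_in_range(n_in, n_out):
    """Percentage of a group's genes that fall within the 2-fold range."""
    return round(100 * n_in / (n_in + n_out))


def fisher_greater(group_in, group_out, rest_in, rest_out):
    """One-sided Fisher's exact test: P(at least group_in similar genes in the group)."""
    n_group = group_in + group_out
    n_in = group_in + rest_in
    total = n_group + rest_in + rest_out
    upper = min(n_group, n_in)
    # Hypergeometric tail: sum over all tables at least as extreme as the observed one.
    tail = sum(
        comb(n_in, x) * comb(total - n_in, n_group - x)
        for x in range(group_in, upper + 1)
    )
    return tail / comb(total, n_group)


print(within_two_fold(10, 15), within_two_fold(10, 40))
print(percent_in_range(11, 1))
print(fisher_greater(3, 0, 0, 3))
```

Here `rest_in` and `rest_out` are the counts for all genes outside the GO group, so the four arguments form the 2 x 2 contingency table of group membership against similarity.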


Supplementary Figure 15. Turtle-chicken divergence also follows the hourglass model. (a) The horizontal width of the hourglass model16,17 represents the evolutionary divergence observed during embryogenesis (which flows upward in this drawing); similarity is greatest at the phylotypic period (the constricted part of the model). The possible phylotypic periods of mouse (E9.5), chicken (HH16), X. laevis (stages 28-31), and zebrafish (24 h post fertilization) were identified by similarity of gene expression profiles and reported previously21. Although recent studies19-21 supported the idea that the model could be expanded to include bilaterian or larger phylogenetic groups104, the exact relationship between the vertebrate hourglass and that of bilaterian or larger animal groups104 remains to be clarified. Moreover, the hierarchical relationship of evolution and development proposed by von Baer25, that “The general features of a large group of animals appear earlier in development than do the specialized features of a smaller group”, could still be explained by the nested hourglass model with a “later-shifted hourglass” (b); however, this was not supported by our investigation (Figure 2). Although direct evidence is yet to be provided, the hourglass model was originally proposed together with mechanisms that would make the phylotypic period the most conserved stage. Circles and Hox on the right side of the model (a) represent explanations regarding the cause of phylotypic period conservation, particularly modularity17,103 and Hox co-linearity16, respectively. However, the possibility that the developmental system itself could be a major reason for this conservation is still under discussion22. Additionally, the “early flexibility” issue, that is, the mechanism by which vertebrate embryogenesis tolerated divergence during early development while conserving the subsequent phylotypic period, is not well understood21,155.
Embryos in the dark circle represent the possible vertebrate phylotypic periods identified in our previous study21 and in the current study (for soft-shell turtle). However, the exact relationship of these stages to the vertebrate phylotypic period awaits further study that directly compares all other vertebrates, especially cyclostomes. The hourglass model in (a) was adapted with permission from Irie N., Kuratani S.17.


Supplementary Figure 16. Expanded/contracted genes in the turtle lineage show a low expression level during embryogenesis. The expression levels of genes in families predicted to have experienced significant expansion or contraction in the turtle lineage (both soft-shell turtle and green sea turtle) were analyzed. (a) Genes in families with significant expansion or contraction in the turtle lineage (E/C genes, 244 genes; see the Online Methods for the prediction of E/C genes) showed significantly lower levels of expression throughout soft-shell turtle embryogenesis (Wilcoxon test, Bonferroni-corrected multiple comparison, p < 0.01). Y-axis, relative expression level; X-axis, developmental stages. The lowest average expression level was observed at the phylotypic period of soft-shell turtle (TK11). TMM-normalized, 10 M-mapped data sets were used for this plot. The RPKM-normalized data set also showed similar results and supported the same conclusion. (b) A large portion (65-79%) of the E/C genes were not expressed (no tag count in the read-depth-controlled data set [mapped 10 M]) during embryogenesis (upper, blue bar plots), and the proportion of expressed E/C genes was significantly lower than that of all genes (bottom, green bar plots; Wilcoxon test, p < 0.01 at all stages). The highest unexpressed ratio of E/C genes was observed at the phylotypic period (TK11). Error bars: S.D.


Supplementary Figure 17. Molecular estimation of the correspondence between turtle and chicken developmental timetables. Pairs of turtle-chicken embryos with reciprocal best transcriptome similarities (RBTS) are connected by arrowed lines. Only RBTS supported by all the distance calculation methods and both normalizations are drawn. Multiple lines extending from a single embryonic stage indicate that there were no significant differences in expression similarity. Left: RBTS embryos estimated with the all-reads dataset. Right: RBTS embryos estimated with the mapped-10M reads dataset. Similar results were also observed for Ps-BGI_gene sets (data not shown). For RBTS identification, stages that exhibited the most similar gene expression profiles were tested by the Welch two-sample t-test or the Wilcoxon signed-rank test, depending on whether the statistical requirements were satisfied, with a Holm-corrected alpha level.
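The reciprocal-best criterion can be made concrete with a small distance matrix. The sketch below is a simplified illustration: it assumes a matrix `dist[i][j]` of expression divergence between turtle stage i and chicken stage j (hypothetical values), and it omits the significance testing described in the legend, so ties and statistically indistinguishable stages are not handled.

```python
def reciprocal_best_pairs(dist):
    """Return (turtle_index, chicken_index) pairs that are each other's best match."""
    n_rows, n_cols = len(dist), len(dist[0])
    # Most similar chicken stage for every turtle stage (smallest divergence in the row).
    best_chicken = [min(range(n_cols), key=lambda j: dist[i][j]) for i in range(n_rows)]
    # Most similar turtle stage for every chicken stage (smallest divergence in the column).
    best_turtle = [min(range(n_rows), key=lambda i: dist[i][j]) for j in range(n_cols)]
    return [(i, j) for i, j in enumerate(best_chicken) if best_turtle[j] == i]


# Hypothetical divergence matrix: 3 turtle stages (rows) x 3 chicken stages (columns).
toy_dist = [
    [0.1, 0.5, 0.9],
    [0.4, 0.2, 0.6],
    [0.8, 0.3, 0.7],
]
rbts_pairs = reciprocal_best_pairs(toy_dist)
print(rbts_pairs)
```

In the toy matrix the third turtle stage is closest to the second chicken stage, but that chicken stage is closer to the second turtle stage, so no line would be drawn for it.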


Supplementary Figure 18. Genes increasingly expressed after the phylotypic period in turtle. (a: Same as Fig. 4b but illustrated for comparison with the chicken ortholog dataset on the right.) The expression dynamics of soft-shell turtle genes that showed a significant increase in expression level after the phylotypic period (IAP). Each line represents the average expression level of each IAP gene calculated from the biological replicates of each stage. The turtle IAP genes were screened by the following criteria: (1) the mean expression level after the phylotypic period (stages that begin to show turtle-specific morphologies, TK15-TK23) is more than five times higher (Wilcoxon test, alpha level = 0.01) than that of earlier stages (gastrula, neurula, TK7 and TK9). (2) The chicken orthologs (if there are any) of turtle IAP genes do not show such increases (the average expression levels in HH28 & HH38 are not more than five times higher than those in the Prim-HH14 stages). The names of the three genes with the highest expression levels in TK23 are shown (right panel). Consequently, 233 turtle IAP genes were found. The chicken orthologs of turtle IAP genes (206 genes) are also shown in the left panel.
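The two screening criteria for IAP genes translate into a simple filter. The following sketch covers only the fold-change part of the criteria and leaves out the Wilcoxon test; the data layout (a pair of early-stage and late-stage expression lists per gene), the gene names, and the function names are all hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def shows_increase(early, late, fold=5.0):
    """True when the late-stage mean exceeds `fold` times the early-stage mean."""
    return mean(late) > fold * mean(early)


def screen_iap(turtle_expr, chicken_expr, orthologs, fold=5.0):
    """Keep turtle genes that increase late while their chicken ortholog (if any) does not."""
    iap_genes = []
    for gene, (early, late) in turtle_expr.items():
        if not shows_increase(early, late, fold):
            continue  # criterion 1 fails
        ortholog = orthologs.get(gene)
        if ortholog is not None and shows_increase(*chicken_expr[ortholog], fold):
            continue  # criterion 2 fails: the chicken ortholog shows the same increase
        iap_genes.append(gene)
    return iap_genes


# Hypothetical toy data: gene -> (early-stage values, late-stage values).
toy_turtle = {
    "geneA": ([1, 1, 2, 2], [10, 12, 11]),
    "geneB": ([5, 5], [6, 7]),
    "geneC": ([1, 1], [20, 20]),
}
toy_chicken = {
    "geneA_gg": ([1, 2], [2, 3]),
    "geneC_gg": ([1, 1], [30, 30]),
}
toy_orthologs = {"geneA": "geneA_gg", "geneC": "geneC_gg"}
iap = screen_iap(toy_turtle, toy_chicken, toy_orthologs)
print(iap)
```

A turtle gene without a chicken ortholog passes the second criterion automatically, which mirrors the "if there are any" wording of the legend.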

Tissues | Unique sequences | miRNAs | Detected hairpin precursor with significant Randfold p-value
CR | 715 | 671 | 577
Limbs | 564 | 531 | 459
Body walls | 868 | 798 | 680
Supplementary Table 29. miRNAs detected in turtle embryonic tissues. Of these predictions, 94 ± 1% are estimated to be true positives.


Gene: PsWnt11b, PsWnt10b, PsWnt10a, PsWnt16, PsWnt11, PsWnt9b, PsWnt9a, PsWnt8b, PsWnt8a, PsWnt7b, PsWnt7a, PsWnt5b, PsWnt5a, PsWnt3a, PsWnt2b, PsWnt6, PsWnt4, PsWnt3, PsWnt2, PsWnt1

Expected band size: 675, 993, 841, 648, 612, 945, 965, 498, 896, 783, 674, 877, 733, 857, 957, 803, 803, 838, 684, 607

Forward Primer Sequence: AAGGAGCTGTGCAAGAGGAA, GGGGAAAGGACCGTCTTTGG, GCATCTGCCGGAAGACGAAG, TGGAGACGACGTGTAAGTGC, CGAGGAGTGCCAGTACCAGT, CAAAGACGGGCATTAAGGAA, TCCGCTATGGCAGGTGGAAC, AGTCGAGAGGCTGCATTCAC, CCTCAGCCCTGCAGGGCATC, CCTGGAGCTCATGCACAGTA, TCATGGAGTGCCAGTACCAA, TGAGTCAAGCCATGTCAAGC, CGTTAGGCCAGCAATACACC, TGGAACTGCAACACCCTCCA, TCCTGCACGTGTGACTATCG, GGCACCGCTCTACTACCTTG, CCCTTCAGCTCTCCACTCAC, TCTCGCGTCCTTCCTCTGCT, ACAA GCAACGGCAACTCTG, AGATCGCCACCCACGAGT

Reverse Primer Sequence: ACAGCAGCAGAAGGGCTAAG, ACCAGTGGAAGGTGCAGTTG, GGTCTGGTGGTGTCTCAGGT, GGTTGTGTGCGTTGATGAAG, AAAGTTGGGGGAGTTCTCGT, GGTCGACGATCTCTGTGCAC, TGCCACTTGAGGAACAACTG, CCTCGACCGCAACACATCAG, CACAGGCAGTTCTCCTCCAG, TTGTCATGAGTCCGCTTGAG, TGTTGCACTTCACAAAGCAG, GCCTCTCCCACAGCACATCA, AACTTGCACTCGCACTTGGT, AAGCTCACAGCTTCCCATGT, ATCTTCCTTGCGGATTGGTA, GAACGCTCCGTGTCTTTCTC, CGCTTCATGGTCCCCTTCAC, CCCCTA TTTGCACACGAACT, GTCTCTGGTCCCAAAGGAG, AGAAGCGCTCCTTGAGCA

Antisense PCR or restriction enzyme: M13R/PsWnt7b_R2, M13R/PsWnt16_F2, M13R/PsWnt11b_F, M13R/PsWnt5b_F2, M13F/PsWnt8b_F3, M13R/PsWnt11_F, M13R/PsWnt9a_F, M13R/PsWnt8a_F, M13R/PsWnt5a_F, M13R/PsWnt3a_F, M13R/PsWnt2b_F, M13R/PsWnt2_F3, M13F/PsWnt9b_F, M13R/PsWnt3_F, M13R/PsWnt1_F, M13F/PsWnt4_F, EcoRV, NotI, NotI, NotI

Genbank ID: JQ968433, JQ968433, JQ968433, JQ968433, JQ968433, JQ968433, JQ968433, JQ968433, JQ968433, JQ968433, JQ968433, JQ968433, JQ968433, JQ968433, JQ968433, JQ968433, JQ968433, JQ968433, JQ968433, JQ968433

Supplementary Table 31. Primers, vectors and GenBank accession numbers of soft-shell turtle wnt genes.


MicroRNAs from | Targets predicted*: Ensembl Transcripts | Targets predicted*: Ensembl Genes | Specific targets predicted**: Ensembl Transcripts | Specific targets predicted**: Ensembl Genes
Carapacial ridge | 8358 | 7888 | 146 | 131
Body wall | 8670 | 8150 | 294 | 249
Limbs | 8119 | 7670 | 124 | 114
Supplementary Table 32. miRNA target prediction statistics. *Ensembl transcripts or genes that have been predicted by miRanda to be putative targets of specific microRNAs in each tissue. For example, the 236 (212 unique sequences) specific microRNAs of the CR are predicted to target 8358 genes in the CR. **From the previous predictions, the three tissues were compared to identify the specific targets for each, i.e., the targets that have been predicted in the CR and not the rest of the body, predicted in the body wall and not the rest of the body, or predicted in the limbs and not the rest of the body.


Supplementary Figure 19. Tissue-specific miRNAs found in the soft-shell turtle. (a) Venn diagram illustrating the number of predicted microRNAs, represented as unique sequences, that are shared among structures or completely specific to each structure (limb, body wall, and CR). Numbers of miRNAs are not to scale in the Venn diagram. (b) Hairpin prediction for miR-187, which is the most abundant microRNA (in terms of number of reads) that is specific to the CR, showing both the mature miRNA (in red) and the star miRNA (in purple). (c) RNA-level evidence for the miR-187 prediction. In the upper section, the relative frequency of reads, which reflects the depth over the predicted precursor sequence, is shown. Below this, the hairpin precursor prediction is shown linearly with the mature sequence in red, the loop in yellow, the expected star sequence in light blue, and the observed star sequence in purple. The reads that mapped to the mature miRNA and miRNA* are listed in the lower section with the read number at the right. The complete output files illustrate several more features of the prediction156 (Friedländer et al., 2012) that are not shown here for clarity. (d) Potential targets of soft-shell turtle transcripts and genes (Ps-ens_genes) predicted by miRanda72.


Supplementary Figure 20. Sequence conservation of soft-shell turtle miRNAs in green sea turtle and chicken. Sequences of soft-shell turtle mature miRNAs were searched in the green sea turtle and chicken genomes. The larger pie charts indicate the percentage (%) of soft-shell turtle mature miRNA sequences (1082 unique miRNAs in total) found in the two species with various allowances of mismatches (0, 1, and 3 bases). The smaller pie charts show the same for the soft-shell turtle miRNAs (393 unique miRNAs in total) that were predicted to target the Wnt and Wnt components listed in Supplementary Figure 29. Note that the miRNAs that are predicted to target Wnt downstream components have a higher ratio of conservation. The alignments were performed using bowtie, with the -k 1 and --best parameters.
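What "found with an allowance of mismatches" means can be shown with a naive search. The sketch below is a conceptual illustration only and does not reproduce bowtie: it scans one strand of a toy sequence with a Hamming-distance check, and the function names and sequences are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def found_with_mismatches(query, genome, max_mismatches):
    """True when some window of the genome matches the query within the allowance."""
    n = len(query)
    return any(
        hamming(query, genome[i:i + n]) <= max_mismatches
        for i in range(len(genome) - n + 1)
    )


def conservation_fraction(queries, genome, max_mismatches):
    """Fraction of query sequences found in the genome within the allowance."""
    hits = sum(found_with_mismatches(q, genome, max_mismatches) for q in queries)
    return hits / len(queries)


# Hypothetical toy sequences (forward strand only; the reverse strand is not searched).
toy_genome = "ACGTACGTTTGACCA"
toy_queries = ["TTGA", "ACGA", "GGGG"]
for allowance in (0, 1, 3):
    print(allowance, conservation_fraction(toy_queries, toy_genome, allowance))
```

Raising the allowance can only increase the fraction found, which is why the pie charts for 0, 1, and 3 mismatches are nested in the proportion of conserved sequences they report.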


Supplementary Figure 21. Conserved molecular components of the Wnt signaling cascade are potential targets of miRNAs. (a) A simplified scheme depicting a general β-catenin-dependent Wnt pathway157. (b) Table representing the genes found in the genomes of P. sinensis and C. mydas. Dark green, found in the gene predictions (Ps-ens_gene; in-house gene predictions for C. mydas); light green, found using blastp against the genome with the human orthologs as queries. The asterisk in FZD8 indicates that it was found by blast against the RNA-Seq data. FZD, frizzled; SMOH, smoothened; DVL, dishevelled-like; Tcf7, transcription factor 7; Tcf7l, transcription factor 7-like; Lef1, lymphoid enhancer-binding factor 1; LRP, low-density lipoprotein receptor-related protein; APC, adenomatous polyposis coli protein; GSK3, glycogen synthase kinase-3. (c) List of genes that are potential targets of miRNAs expressed in each tissue. Genes that are predicted to be targeted by one or more miRNAs expressed in each tissue are colored (yellow, body wall; red, limb; blue, CR).

Supplementary Figure 22. Wnt gene expression in the carapacial ridge. Among the 20 Wnt genes investigated, a close-up view is shown for those expressed in or near the CR. fl, forelimb; hl, hindlimb. The red asterisk marks the CR. Only the Wnt5a gene seems to be clearly expressed in the CR, while Wnt6, Wnt7a and Wnt8a are expressed in more medial parts of the embryo.


Species | Data type | Data base | Web site / URL | Accession No. / Identifier
Soft-shell turtle | Genome | DDBJ/EMBL/GenBank, NCBI | ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/vertebrates_other/Pelodiscus_sinensis/ | AGCU00000000 / PelSin_1.0
Soft-shell turtle | Annotated genes | ensembl | http://www.ensembl.org/Pelodiscus_sinensis/Info/Index/ | PelSin_1.0
Soft-shell turtle | Poly (A) RNA-Seq data (Ps_454_mRNA-Seq, Ps_dUTP_RNA-Seq, Ps_strand_RNA-Seq, staged embryos) | DDBJ DRA | ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/ | DRA000567
Soft-shell turtle | Small RNA | DDBJ DRA | ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/ | DRA000639
Soft-shell turtle | Cloned Wnt genes | GenBank, NCBI | http://www.ncbi.nlm.nih.gov/nuccore/ | JQ968433-52
Green sea turtle | Genome / genes | DDBJ/EMBL/GenBank, NCBI | ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/vertebrates_other/ | AJIM00000000
Green sea turtle | Genome / genes | Short Read Archive, SRA | http://www.ncbi.nlm.nih.gov/sra | SRA050949
Chicken | Poly (A) RNA-Seq data, staged embryos | DDBJ DRA | ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/ | DRA000567
Supplementary Table 34. Summary of data registered on public databases.


References

1. Kuratani, S., Kuraku, S. & Nagashima, H. Evolutionary developmental perspective for the origin of turtles: the folding theory for the shell based on the developmental nature of the carapacial ridge. Evolution & development 13, 1-14 (2011).
2. Rieppel, O. Turtles as hopeful monsters. BioEssays 23, 987-991 (2001).
3. Romer, A.S. Vertebrate Paleontology. 3rd Ed, (University of Chicago Press, Chicago, 1966).
4. Rieppel, O. & deBraga, M. Turtles as diapsid reptiles. Nature 384, 453-455 (1996).
5. Hedges, S.B., Moberg, K.D. & Maxson, L.R. Tetrapod phylogeny inferred from 18S and 28S ribosomal RNA sequences and a review of the evidence for amniote relationships. Molecular biology and evolution 7, 607-33 (1990).
6. Crawford, N.G. et al. More than 1000 ultraconserved elements provide evidence that turtles are the sister group of archosaurs. Biology letters (2012).
7. Chiari, Y., Cahais, V., Galtier, N. & Delsuc, F. Phylogenomic analyses support the position of turtles as the sister group of birds and crocodiles (Archosauria). BMC biology 10, 65 (2012).
8. Tzika, A.C., Helaers, R., Schramm, G. & Milinkovitch, M.C. Reptilian-transcriptome v1.0, a glimpse in the brain transcriptome of five divergent Sauropsida lineages and the phylogenetic position of turtles. EvoDevo 2, 19 (2011).
9. Lyson, T.R. et al. MicroRNAs support a turtle + lizard clade. Biology letters 8, 104-7 (2012).
10. Li, C., Wu, X.C., Rieppel, O., Wang, L.T. & Zhao, L.J. An ancestral turtle from the Late Triassic of southwestern China. Nature 456, 497-501 (2008).
11. Zhong-Qiang, C. & Benton, M.J. The timing and pattern of biotic recovery following the end-Permian mass extinction. Nature Geoscience 5, 375-383 (2012).
12. Niimura, Y. Olfactory receptor multigene family in vertebrates: from the viewpoint of evolutionary genomics. Curr Genomics 13, 103-111 (2012).
13. Hayden, S. et al. Ecological adaptation determines functional mammalian olfactory subgenomes. Genome research 20, 1-9 (2010).
14. Kishida, T., Kubota, S., Shirayama, Y. & Fukami, H. The olfactory receptor gene repertoires in secondary-adapted marine vertebrates: evidence for reduction of the functional proportions in cetaceans. Biology letters 3, 428-30 (2007).
15. Toba, G. & Aigaki, T. Disruption of the microsomal glutathione S-transferase-like gene reduces life span of Drosophila melanogaster. Gene 253, 179-87 (2000).
16. Duboule, D. Temporal colinearity and the phylotypic progression: a basis for the stability of a vertebrate Bauplan and the evolution of morphologies through heterochrony. Development, 135-42 (1994).
17. Raff, A. The shape of life: genes, development, and the evolution of animal form, (University of Chicago Press, 1996).
18. Irie, N. & Sehara-Fujisawa, A. The vertebrate phylotypic stage and an early bilaterian-related stage in mouse embryogenesis defined by genomic information. BMC biology 5, 1 (2007).
19. Kalinka, A.T. et al. Gene expression divergence recapitulates the developmental hourglass model. Nature 468, 811-4 (2010).
20. Domazet-Loso, T. & Tautz, D. A phylogenetically based transcriptome age index mirrors ontogenetic divergence patterns. Nature 468, 815-8 (2010).
21. Irie, N. & Kuratani, S. Comparative transcriptome analysis reveals vertebrate phylotypic period during organogenesis. Nature communications 2, 248 (2011).
22. Richardson, M.K., Minelli, A., Coates, M. & Hanken, J. Phylotypic stage theory. Trends in ecology & evolution 13, 158 (1998).
23. Nagashima, H. et al. Evolution of the turtle body plan by the folding and creation of new muscle connections. Science 325, 193-6 (2009).
24. Tokita, M. & Kuratani, S. Normal Embryonic Stages of the Chinese Softshelled Turtle Pelodiscus sinensis (Trionychidae). Zoological science 18, 705-715 (2001).
25. von Baer, K.E. Uber Entwickelungsgeschichte der Thiere: Beobachtung und Reflektion, (Koenigsberg, 1828).
26. Quint, M. et al. A transcriptomic hourglass in plant embryogenesis. Nature 490, 98-101 (2012).
27. Nagashima, H. et al. On the carapacial ridge in turtle embryos: its developmental origin, function and the chelonian body plan. Development 134, 2219-26 (2007).
28. Kuraku, S., Usuda, R. & Kuratani, S. Comprehensive survey of carapacial ridge-specific genes in turtle implies co-option of some regulatory genes in carapace evolution. Evolution & development 7, 3-17 (2005).
29. Burke, A. Development of the turtle carapace: Implications for the evolution of a novel bauplan. Journal of morphology 199, 363-378 (1989).
30. Hedges, S.B., Dudley, J. & Kumar, S. TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics 22, 2971-2 (2006).
31. Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome research 20, 265-72 (2010).
32. Li, R. et al. The sequence and de novo assembly of the giant panda genome. Nature 463, 311-7 (2010).
33. Smit, A., Hubley, R. & Green, P. RepeatMasker Open-3.0. (1996-2010).
34. Jurka, J. et al. Repbase Update, a database of eukaryotic repetitive elements. Cytogenetic and genome research 110, 462-7 (2005).
35. Smit, A.F.A. & Hubley, R. RepeatModeler. http://www.repeatmasker.org.
36. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research 27, 573-80 (1999).
37. Brudno, M. et al. Automated whole-genome multiple alignment of rat, mouse, and human. Genome research 14, 685-92 (2004).
38. Harris, R.S. LASTZ. http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00/README.lastz-1.02.00a.html
39. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J Mol Biol 268, 78-94 (1997).
40. Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res 34, W435-9 (2006).
41. Elsik, C.G. et al. Creating a honey bee consensus gene set. Genome biology 8, R13 (2007).
42. Kent, W.J. BLAT--the BLAST-like alignment tool. Genome research 12, 656-64 (2002).

43. Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome research 14, 988-95 (2004).
44. Zdobnov, E.M. & Apweiler, R. InterProScan--an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17, 847-8 (2001).
45. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 25, 25-9 (2000).
46. Buck, L. & Axel, R. A novel multigene family may encode odorant receptors: a molecular basis for odor recognition. Cell 65, 175-87 (1991).
47. Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Systematic biology 59, 307-21 (2010).
48. Guindon, S. & Gascuel, O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic biology 52, 696-704 (2003).
49. Endres, C.S., Putman, N.F. & Lohmann, K.J. Perception of airborne odors by loggerhead sea turtles. The Journal of experimental biology 212, 3823-7 (2009).
50. Rannala, B. & Yang, Z. Inferring speciation times under an episodic molecular clock. Systematic biology 56, 453-66 (2007).
51. Yang, Z. & Rannala, B. Bayesian estimation of species divergence times under a molecular clock using multiple fossil calibrations with soft bounds. Molecular biology and evolution 23, 212-26 (2006).
52. Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Molecular biology and evolution 24, 1586-91 (2007).
53. Harris, R. Improved pairwise alignment of genomic DNA. Ph.D. thesis Pennsylvania State University. (2007).
54. Blanchette, M. et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome research 14, 708-15 (2004).
55. Niimura, Y. & Nei, M. Extensive gains and losses of olfactory receptor genes in mammalian evolution. PloS one 2, e708 (2007).
56. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research 25, 3389-402 (1997).
57. Katoh, K., Kuma, K., Toh, H. & Miyata, T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic acids research 33, 511-8 (2005).
58. Saitou, N. & Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution 4, 406-25 (1987).
59. Takezaki, N., Rzhetsky, A. & Nei, M. Phylogenetic test of the molecular clock and linearized trees. Molecular biology and evolution 12, 823-33 (1995).
60. Edgar, R.C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic acids research 32, 1792-7 (2004).
61. Hamburger, V. & Hamilton, H.L. A series of normal stages in the development of the chick embryo. 1951. Developmental dynamics : an official publication of the American Association of Anatomists 195, 231-72 (1992).
62. Levin, J.Z. et al. Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nature methods 7, 709-15 (2010).
63. Parkhomchuk, D. et al. Transcriptome analysis by strand-specific sequencing of complementary DNA. Nucleic acids research 37, e123 (2009).
64. Camacho, C. et al. BLAST+: architecture and applications. BMC bioinformatics 10, 421 (2009).
65. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-60 (2009).
66. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-9 (2009).
67. Quinlan, A.R. & Hall, I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-2 (2010).
68. Wang, L., Feng, Z., Wang, X. & Zhang, X. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics 26, 136-8 (2010).
69. Robinson, M.D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology 11, R25 (2010).
70. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods 5, 621-8 (2008).
71. Friedlander, M.R., Mackowiak, S.D., Li, N., Chen, W. & Rajewsky, N. miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic acids research 40, 37-52 (2012).
72. Enright, A.J. et al. MicroRNA targets in Drosophila. Genome biology 5, R1 (2003).
73. Matsuda, Y. et al. Highly conserved linkage homology between birds and turtles: bird and turtle chromosomes are precise counterparts of each other. Chromosome research : an international journal on the molecular, supramolecular and evolutionary aspects of chromosome biology 13, 601-15 (2005).
74. John W. Bickham, Karen A. Bjorndal, Haiduk, M.W. & Rainey, W.E. The Karyotype and Chromosomal Banding Patterns of the Green Turtle (Chelonia mydas). Copeia 1980, 540-543 (1980).
75. Koonin, E.V. Obituary: Walter Fitch and the orthology paradigm. Briefings in bioinformatics 12, 377-8 (2011).
76. Lyson, T.R., Bever, G.S., Bhullar, B.A., Joyce, W.G. & Gauthier, J.A. Transitional fossils and the origin of turtles. Biology letters 6, 830-3 (2010).
77. Hugall, A.F., Foster, R. & Lee, M.S. Calibration choice, rate smoothing, and the pattern of tetrapod diversification according to the long nuclear gene RAG-1. Systematic biology 56, 543-63 (2007).
78. Kuraku, S. et al. cDNA-based gene mapping and GC3 profiling in the soft-shelled turtle suggest a chromosomal size-dependent GC bias shared by sauropsids. Chromosome research : an international journal on the molecular, supramolecular and evolutionary aspects of chromosome biology 14, 187-202 (2006).
79. Werneburg, I. & Sanchez-Villagra, M.R. Timing of organogenesis support basal position of turtles in the amniote tree of life. BMC evolutionary biology 9, 82 (2009).
80. Stamatakis, A., Ludwig, T. & Meier, H. RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 21, 456-63 (2005).
81. Niehuis, O. et al. Genomic and Morphological Evidence Converge to Resolve the Enigma of Strepsiptera. Current Biology 22, 1-5 (2012).
82. Zhou, X. et al. Phylogenomic analysis resolves the interordinal relationships and rapid diversification of the laurasiatherian mammals. Systematic biology 61, 150-64 (2012).
83. Hahn, M.W., Demuth, J.P. & Han, S.G. Accelerated rate of gene gain and loss in primates. Genetics 177, 1941-9 (2007).

50

Nature Genetics: doi:10.1038/ng.2615 84. Meyer, A. & Van de Peer, Y. From 2R to 3R: evidence for von Charles Darwin reformirte Descendenz-Theorie., a fish-specific genome duplication (FSGD). BioEssays : (Georg Reimer, Berlin, 1866). news and reviews in molecular, cellular and 102. Wimsatt, W.C. Integrating Scientific Disciplines (ed developmental biology 27, 937-45 (2005). Bechtel, P. W.) (Springer, 1986). 85. Kuraku, S. & Kuratani, S. Genome-wide detection of gene 103. Sander, K. The evolution of patterning mechanisms: extinction in early mammalian evolution. Genome Biol gleanings from insect embryogenesis and Evol 3, 1449-62 (2011). spermatogenesis. in Development and evolution. (ed. 86. Manousaki, T., Feiner, N., Begemann, G., Meyer, A. & Goodwin BC, H.N., Wylie CC,) 137-159 (Cambridge Kuraku, S. Co-orthology of Pax4 and Pax6 to the fly University Press, Cambridge, 1983). eyeless gene: molecular phylogenetic, comparative 104. Slack, J.M., Holland, P.W. & Graham, C.F. The zootype genomic, and embryological analyses. Evolution & and the phylotypic stage. Nature 361, 490-2 (1993). Development 13, 448-459 (2011). 105. Sean, B.C., Jennifer, K.G. & Scott D., W. From DNA to 87. Hoffmann, F.G., Opazo, J.C. & Storz, J.F. Differential loss Diversity, (Blackwell publishing, 2001). and retention of cytoglobin, myoglobin, and globin-E 106. Nagashima, H. et al. Turtle-chicken chimera: an during the radiation of vertebrates. Genome Biol Evol 3, experimental approach to understanding evolutionary 588-600 (2011). innovation in the turtle. Developmental dynamics : an 88. Blank, M. et al. Oxygen supply from the bird's eye official publication of the American Association of perspective: globin E is a respiratory protein in the Anatomists 232, 149-61 (2005). chicken retina. The Journal of biological chemistry 286, 107. Hall, B.K. Evo-Devo: evolutionary developmental 26507-15 (2011). mechanisms. The International journal of developmental 89. Guiloff, G.D., Jones, J. & Kolb, H. 
Organization of the biology 47, 491-5 (2003). inner plexiform layer of the turtle retina: an electron 108. Sanchez-Villagra, M.R. et al. Skeletal development in the microscopic study. The Journal of comparative neurology Chinese soft-shelled turtle Pelodiscus sinensis 272, 280-92 (1988). (Testudines: Trionychidae). Journal of morphology 270, 90. Zhang, X. & Firestein, S. The olfactory receptor gene 1381-99 (2009). superfamily of the mouse. Nature neuroscience 5, 124-33 109. Kawashima-Ohya, Y., Narita, Y., Nagashima, H., Usuda, (2002). R. & Kuratani, S. Hepatocyte growth factor is crucial for 91. Niimura, Y. On the origin and evolution of vertebrate development of the carapace in turtles. Evol Dev 13, olfactory receptor genes: comparative genome analysis 260-8 (2011). among 23 species. Genome biology and 110. Garriock, R.J., Warkman, A.S., Meadows, S.M., evolution 1, 34-44 (2009). D'Agostino, S. & Krieg, P.A. Census of vertebrate Wnt 92. Niimura, Y. & Nei, M. Evolutionary dynamics of olfactory genes: isolation and developmental expression of receptor genes in fishes and tetrapods. Proceedings of Xenopus Wnt2, Wnt3, Wnt9a, Wnt9b, Wnt10a, and the National Academy of Sciences of the United States of Wnt16. Dev Dyn 236, 1249-58 (2007). America 102, 6039-44 (2005). 111. Mikels, A.J. & Nusse, R. Purified Wnt5a protein activates 93. McGowen, M.R., Clark, C. & Gatesy, J. The vestigial or inhibits beta-catenin-TCF signaling depending on olfactory receptor subgenome of odontocete whales: receptor context. PLoS Biol 4, e115 (2006). phylogenetic congruence between gene-tree 112. Bartel, D.P. MicroRNAs: genomics, biogenesis, reconciliation and supermatrix methods. Systematic mechanism, and function. Cell 116, 281-97 (2004). biology 57, 574-90 (2008). 113. Stefani, G. & Slack, F.J. Small non-coding RNAs in 94. Kishida, T. & Hikida, T. Degeneration patterns of the animal development. Nat Rev Mol Cell Biol 9, 219-30 olfactory receptor genes in sea snakes. Journal of (2008). 
evolutionary biology 23, 302-10 (2010). 114. Friedländer, M.R., Mackowiak, S.D., Li, N., Chen, W. & 95. Saito, H., Chi, Q., Zhuang, H., Matsunami, H. & Mainland, Rajewsky, N. miRDeep2 accurately identifies known and J.D. Odor coding by a Mammalian receptor repertoire. hundreds of novel microRNA genes in seven animal Science signaling 2, ra9 (2009). clades. Nucleic Acids Res 40, 37-52 (2012). 96. Niimura, Y. & Nei, M. Evolution of olfactory receptor 115. Parra, G., Bradnam, K., Ning, Z., Keane, T. & Korf, I. genes in the . Proceedings of the National Assessing the gene space in draft genomes. Nucleic Academy of Sciences of the United States of America acids research 37, 289-97 (2009). 100, 12235-40 (2003). 116. Parra, G., Bradnam, K. & Korf, I. CEGMA: a pipeline to 97. Beissbarth, T. & Speed, T.P. GOstat: find statistically accurately annotate core genes in eukaryotic genomes. overrepresented Gene Ontologies within a group of Bioinformatics 23, 1061-7 (2007). genes. Bioinformatics 20, 1464-5 (2004). 117. Bairoch, A. & Apweiler, R. The SWISS-PROT protein 98. Sato, T. et al. Structure, regulation and function of . sequence database and its supplement TrEMBL in 2000. Journal of biochemistry 151, 119-28 (2012). Nucleic acids research 28, 45-8 (2000). 99. Kalin, T.V., Ustiyan, V. & Kalinichenko, V.V. Multiple 118. Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of faces of FoxM1 transcription factor: lessons from genes and genomes. Nucleic acids research 28, 27-30 transgenic mouse models. Cell cycle 10, 396-405 (2011). (2000). 100. Kelner, M.J. et al. Structural organization of the 119. St John, J.A. et al. Sequencing three crocodilian microsomal glutathione S-transferase gene (MGST1) on genomes to illuminate the evolution of archosaurs and chromosome 12p13.1-13.2. Identification of the correct amniotes. Genome biology 13, 415 (2012). promoter region and demonstration of transcriptional 120. Smit, A., Hubley, R. RepeatModeler Open-1.0. 
regulation in response to oxidative stress. The Journal of (2008-2010). biological chemistry 275, 13000-6 (2000). 121. Morgulis, A., Gertz, E.M., Schaffer, A.A. & Agarwala, R. A 101. Haeckel, E. Generelle Morphologie der Organismen. fast and symmetric DUST implementation to mask Allgemeine GrundzuXge der organischen low-complexity DNA sequences. Journal of computational Formen-Wissenschaft, mechanisch begruXndet durch die

51

Nature Genetics: doi:10.1038/ng.2615 biology : a journal of computational molecular cell biology 140. SeqPrep Website: https://github.com/jstjohn/SeqPrep. 13, 1028-40 (2006). 141. Grabherr, M.G. et al. Full-length transcriptome assembly 122. Down, T.A. & Hubbard, T.J. Computational detection and from RNA-Seq data without a reference genome. Nature location of transcription start sites in mammalian genomic biotechnology 29, 644-52 (2011). DNA. Genome research 12, 458-61 (2002). 142. Trinity Website. 123. Davuluri, R.V., Grosse, I. & Zhang, M.Q. Computational 143. Tarazona, S., Garcia-Alcalde, F., Dopazo, J., Ferrer, A. & identification of promoters and first exons in the human Conesa, A. Differential expression in RNA-seq: a matter genome. Nature genetics 29, 412-7 (2001). of depth. Genome research 21, 2213-23 (2011). 124. Lowe, T.M. & Eddy, S.R. tRNAscan-SE: a program for 144. John, B. et al. Human MicroRNA targets. PLoS biology 2, improved detection of transfer RNA genes in genomic e363 (2004). sequence. Nucleic acids research 25, 955-64 (1997). 145. Brochu, C. Phylogenetics, Taxonomy, and Historical 125. Burge, C. & Karlin, S. Prediction of complete gene Biogeography of Alligatoroidea. Memoir. Society of structures in human genomic DNA. Journal of molecular Vertebrate Paleontology 6(1999). biology 268, 78-94 (1997). 146. Benton, M.J. & Donoghue, P.C. Paleontological evidence 126. Goujon, M. et al. A new bioinformatics analysis tools to date the tree of life. Molecular biology and evolution 24, framework at EMBL-EBI. Nucleic acids research 38, 26-53 (2007). W695-9 (2010). 147. fissilrecord.net. www.fossilrecord.net 127. Sayers, E.W. et al. Database resources of the National 148. Hoffmann, F.G. et al. Evolution of the Globin Gene Family Center for Biotechnology Information. Nucleic acids in Deuterostomes: Lineage-Specific Patterns of research 38, D5-16 (2010). Diversification and Attrition. Molecular biology and 128. European Nucleotide Archive. evolution (2012). 
129. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, 149. Warren, W.C. et al. The genome of a songbird. Nature D.J. Basic local alignment search tool. Journal of 464, 757-62 (2010). molecular biology 215, 403-10 (1990). 150. Matsui, A., Go, Y. & Niimura, Y. Degeneration of olfactory 130. Slater, G.S. & Birney, E. Automated generation of receptor gene repertories in primates: no direct link to full heuristics for biological sequence comparison. BMC trichromatic vision. Molecular biology and evolution 27, bioinformatics 6, 31 (2005). 1192-200 (2010). 131. Lewis, S.E. et al. Apollo: a sequence annotation editor. 151. Y. Benjamini, Y.H. Controlling the false discovery rate: a Genome biology 3, RESEARCH0082 (2002). practical and powerful approach to multiple testing. 132. Griffiths-Jones, S. et al. Rfam: annotating non-coding Journal of the Royal Statistical Society Series (1995). RNAs in complete genomes. Nucleic acids research 33, 152. Kimmel, C.B., Ballard, W.W., Kimmel, S.R., Ullmann, B. & D121-4 (2005). Schilling, T.F. Stages of of the 133. Griffiths-Jones, S., Grocock, R.J., van Dongen, S., zebrafish. Developmental dynamics : an official Bateman, A. & Enright, A.J. miRBase: microRNA publication of the American Association of Anatomists sequences, targets and gene . Nucleic 203, 253-310 (1995). acids research 34, D140-4 (2006). 153. Faber, J. & Neuwkoop, P.D. Normal Table of Xenopus 134. Chen, S. et al. De novo analysis of transcriptome Laevis (Daudin): A Systematical and Chronological dynamics in the migratory locust during the development Survey of the Development from the Fertilized Egg Till the of phase traits. PloS one 5, e15633 (2010). End of Metamorphosis. , (Garland Publications, 1994). 135. Kishino, H., Thorne, J.L. & Bruno, W.J. Performance of a 154. Matthew, H. & Kaufman, M.H.K. The atlas of mouse divergence time estimation method under a probabilistic development. 2nd Ed., (Academic Press, 1992). model of rate evolution. 
Molecular biology and evolution 155. Kalinka, A.T. & Tomancak, P. The evolution of early 18, 352-61 (2001). animal embryos: conservation or divergence? . Trends in 136. Thorne, J.L., Kishino, H. & Painter, I.S. Estimating the Ecology and Evolution (in press). rate of evolution of the rate of molecular evolution. 156. Friedlander, M.R. et al. Discovering microRNAs from Molecular biology and evolution 15, 1647-57 (1998). deep sequencing data using miRDeep. Nature 137. Sanderson, M.J. r8s: inferring absolute rates of molecular biotechnology 26, 407-15 (2008). evolution and divergence times in the absence of a 157. Xu, L. & Massague, J. Nucleocytoplasmic shuttling of molecular clock. Bioinformatics 19, 301-2 (2003). signal transducers. Nature reviews. Molecular cell biology 138. Martin, M. Cutadapt removes adapter sequences from 5, 209-19 (2004). high-throughput sequencing reads. EMBnet.journal 17(2011). 139. Schmieder, R. & Edwards, R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 27, 863-4 (2011).

