Genes Genet. Syst. (2007) 82, p. 135–144

The genome size evolution of medaka (Oryzias latipes) and fugu (Takifugu rubripes)

Shuichiro Imai1,2, Takashi Sasaki2, Atsushi Shimizu2, Shuichi Asakawa2, Hiroshi Hori1 and Nobuyoshi Shimizu2*

1 Division of Biological Science, Graduate School of Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8602, Japan
2 Department of Molecular Biology, Keio University School of Medicine, 35 Shinanomachi, Shinjuku-ku, Tokyo 160-8582, Japan

(Received 10 January 2007, accepted 29 January 2007)

Evolution of genome size in eukaryotes is often driven by changes in noncoding sequences, for which insertions and deletions (indels) of small nucleotide sequences and amplification of repetitive elements are considered responsible. In this study, we compared the genomic DNA sequences of two fish species, medaka (Oryzias latipes) and fugu (Takifugu rubripes), which show a two-fold difference in genome size (800 Mb vs. 400 Mb). We selected a contiguous DNA sequence of 790 kb from medaka chromosome LG22 (linkage group 22) and compared it precisely with the sequence (387 kb) of the corresponding region of Takifugu. A total of 178 kb was aligned in common between the two fishes, and the remaining sequences (612 kb for medaka and 209 kb for fugu) were rich in various repetitive elements, including many types of unclassified low copy repeats, all of which accounted for more than half (54%) of the genome size difference. Furthermore, we identified a significant difference in the length ratio of the unaligned sequences that are located between the aligned sequences (USBAS), particularly after eliminating known repetitive elements. These USBAS with no repetitive elements (USBAS-nr) were located within introns and intergenic regions. These results strongly indicate that amplification of repetitive elements and compilation of indels are major driving forces of changes in genome size.

Key words: genome size evolution, indels, medaka, repetitive element, Takifugu

Edited by Yoko Satta
* Corresponding author. E-mail: [email protected]

INTRODUCTION

The genome size of organisms is highly diverse among species (Gregory, 2005), and genome size diversity is found even between very closely related species (Wendel and Cronn, 2003; Hickey and Clements, 2005; Boulesteix et al., 2006). This phenomenon is classically referred to as the "C-value paradox" (Thomas, 1971), which represents the discrepancy between the amount of genomic DNA and the developmental complexity of the organism. Previous studies attempted to clarify this paradox by focusing on the adaptive significance of the relationship between genome size and biological traits such as cell size, metabolic rate, and longevity (Chipman et al., 2001; Griffith et al., 2003; Cavalier-Smith, 2005; Hughes and Piontkivska, 2005). However, little or no causal link was found with biological traits, and the paradox remained unsolved.

In contrast to biological traits, the genomic DNA sequences of various species can be directly compared to find out what types of nucleotide sequences have increased or decreased over evolutionary time. In the eukaryote genome, the amount of coding sequence is much smaller than that of non-coding sequence, and hence the latter should have exerted a greater influence on genome size change. In general, changes in the non-coding sequences occur mainly by insertions and deletions (indels) of small nucleotide sequences or by amplification of repetitive elements. In fact, small indels were considered a major driving force of genome size evolution (Petrov et al., 1996; Petrov, 2002a), and the rate of DNA loss through accumulation of small deletions was emphasized as a major driving force for the genome to shrink (Petrov, 2001; Petrov, 2002b). However, those studies utilized somewhat limited sequences, such as transposons and pseudogenes, to investigate indel bias, and hence the information was insufficient to clarify how genomic architecture changed along with genome size evolution.

Recently, amplification of repetitive elements has received more attention as another driving force (Kidwell, 2002; Neafsey and Palumbi, 2003; Boulesteix et al., 2006) because repetitive elements occupy a significant portion of the eukaryote genome, as evidenced for human (International Human Genome Sequencing Consortium, 2001) and other organisms (SanMiguel et al., 1996). As an exception, pufferfishes contain minute amounts of repetitive elements, having the smallest genome (~400 Mb) among vertebrates (Crollius et al., 2000; Aparicio et al., 2002).

However, it is still unknown how genome sizes expand or shrink through changes in the amounts of small indels and repetitive elements. One way to answer this question is to directly compare DNA sequences among appropriate species. Thanks to the genome sequencing projects, enormous amounts of genomic DNA sequence are now available for various species, especially mammals (Thomas et al., 2003; Chapman et al., 2004). However, there are no significant differences in genome size among mammalian species (human 3.4 Gb, chimpanzee 3.7 Gb, mouse 3.3 Gb, rat 3.0 Gb, and cow 3.6 Gb; calculated from the data in the Genome Size Database at http://www.genomesize.com (Gregory, 2005)). Interestingly, the situation is different in fish species. Two pufferfishes, Takifugu rubripes (Aparicio et al., 2002) and Tetraodon nigroviridis (Jaillon et al., 2004), have genomes of almost equal size (400 Mb), whereas medaka has a 2-fold bigger genome (800 Mb) and zebrafish has a 4-fold bigger genome (1700 Mb), showing the genome size diversity. Furthermore, substantial amounts of DNA sequence are available for these fishes, which was considered advantageous for investigating genome size evolution. For a meaningful comparison, it is essential to select proper species that have a high degree of homology in their genomic DNA sequences. It may not be feasible to compare genomic DNA sequences between fish and mammals because these two lineages show a low degree of homology except in coding sequences and regulatory elements (Goode et al., 2003; Thomas et al., 2003). In contrast, medaka and pufferfishes exhibit a high degree of sequence homology despite the 2-fold genome size difference. Thus, we considered medaka and Takifugu an ideal combination for evaluating the effects of indels and repetitive elements on genome size evolution. Ohtsuka et al. (2004) compared a 229 kb medaka sequence with Takifugu, human and mouse; however, it was a gene-poor region and the analytical methods used were not sufficient to analyze amplification of repetitive elements and compilation of indels.

In this study, we utilized approximately 1 Mb of genomic DNA sequence of medaka chromosome LG22 (Sasaki et al., 2004; Shimizu et al., 2006; Sasaki et al., 2007) and the corresponding sequence of the Takifugu genome (Aparicio et al., 2002) to evaluate the basis for the genome size difference and the genomic architecture.

MATERIALS AND METHODS

Sequencing strategy and assembly
For the medaka LG22 DNA sequence, we filled sequence gaps in BAC clones and determined a contiguous sequence for precise comparison of genomic DNA sequences. For the present study, we selected a particular 1 Mb sequence consisting of five BAC clones (Md0172F16, Md0159H14, Md0170F19, Md0147C05, and Md0200E16) from the Medaka BAC Library (Matsuda et al., 2001). These clones were sequenced with a 3730xl DNA Analyzer and a 3100 Genetic Analyzer (Applied Biosystems) as described previously (Kawasaki et al., 1997). DNA sequence assembly was performed using the Phred/Phrap/Consed program (Ewing and Green, 1998; Ewing et al., 1998; Gordon et al., 1998) and sequence gaps were filled by primer walking.

Sequence analysis
The coding sequence was analyzed with BLASTN (Altschul et al., 1990) against the nr database in NCBI and the medaka EST database (Naruse et al., 2004; The TIGR Gene Index Databases, The Institute for Genomic Research, Rockville, MD 20850, http://www.tigr.org/tdb/tgi; Heinz Himmelbauer, unpublished data). GENSCAN was utilized for gene prediction (Burge and Karlin, 1997). To determine orthologous genes between medaka and human, whose genome is the most precisely annotated among the species sequenced so far, putative genes predicted by GENSCAN were analyzed by BLASTP against human genes in the public database. Genomic structures of medaka genes identified with human orthologous genes were determined by Wise2 (available at http://www.ebi.ac.uk/Wise2/). Finally, the exons were determined by the est2genome program (Mott, 1997) and exon-intron boundaries of each medaka gene were confirmed with the DOTTER program (Sonnhammer and Durbin, 1995).

To mask repetitive elements in the medaka genome sequence, we developed a Medaka Repeat Database (ver. 1.0, available at http://biol1.bio.nagoya-u.ac.jp:8000/). For this, we utilized fish repetitive elements (T. rubripes, T. nigroviridis, Lepidiolamprologus elongatus, and Danio rerio) from the public database GIRI (http://www.girinst.org/~server/repbase.html), repetitive elements previously found (Naruse et al., 1992; Koga et al., 2002; Matsuo and Nonaka, 2004), and repetitive elements newly found in 19 Mb of the medaka genome sequence of LG22. The 6914 entries of repetitive elements were classified into 6 categories ("LTR", "LINE", "SINE", "DNA transposon", "Simple Repeat and Low Complexity", and "Unclassified") based on their structures or their homology with known repetitive elements. Using this database, repetitive elements were identified with the RepeatMasker2 program (Smit, A. F. A. and Green, P., RepeatMasker at http://www.repeatmasker.org).

Takifugu genome sequence
The Takifugu genome sequence corresponding to the medaka 1 Mb sequence was searched in the database at the Joint Genome Institute (JGI), Takifugu rubripes ver. 3.0 (Aparicio et al., 2002), and 5 scaffolds (Scaffold940, 1291, 788, 3768, and 183) were identified by BLASTN. For Scaffold183 we utilized a more accurate sequence registered in GenBank (accession number AF411956). Takifugu sequences were masked with the Medaka Repeat Database using RepeatMasker2.

Identification of synteny and sequence alignment
We employed BLASTZ (Schwartz et al., 2000), a pairwise alignment tool using local alignment methods. The program was downloaded from http://pipmaker.bx.psu.edu/pipmaker/ and applied locally with the parameters B = 0, C = 2, H = 2200, T = 0 and W = 6. The result of the alignment was visualized with PipMaker.

To compare genomic sequences in detail, we classified them into aligned and unaligned regions based on the BLASTZ alignment of the two genomic sequences. The unaligned sequences were divided into two classes. One comprised "indels", for which boundaries could be clearly assigned within the aligned region (Fig. 1a) but for which it was difficult to determine whether they were insertions in medaka and deletions in Takifugu or vice versa. Therefore, we defined medaka insertions (or Takifugu deletions) as medaka extra sequences (MES) and Takifugu insertions (or medaka deletions) as Takifugu extra sequences (TES). The other unaligned sequences were those between two sets of aligned sequences and were defined as "unaligned sequence between aligned sequences" (USBAS; Fig. 1b).

Because the Takifugu sequences are still in draft status, there were 30 sequence gaps in the studied region of Takifugu. Those gaps were found only in the USBASs of Takifugu, and their precise lengths could not be evaluated. Thus, prior to analyses, we excluded pairs of USBASs with gaps in Takifugu from both species to allow genomic sequences to be compared between species as accurately as possible.

RESULTS

Characteristics of the selected 1 Mb region of medaka chromosome LG22
Recently, we determined the 19 Mb DNA sequence of medaka chromosome LG22. We selected the particular region covered by five unique BAC clones because the draft sequence annotation predicted that the gene number of this region is approximately the same as that of the entire LG22 sequence (34 genes per 1 Mb; Sasaki et al., 2007). The selected region of 918.9 kb was processed for precise annotation using the gene prediction program GENSCAN and BLAST searches against the public database. We identified 37 genes including 7 novel genes (Fig. 2). The presence of each gene was confirmed by identifying the corresponding sequences in the medaka EST database. The GC content of this chromosomal region was 40.7%, which is nearly identical to the average GC content of the entire chromosome LG22 (40.9%). The total sequence of all the exons in those 37 genes was calculated to be 50.1 kb, of which 38.4 kb was derived from open reading frames (ORFs). In addition, the DNA length of this medaka chromosomal region was roughly 2 times larger than the corresponding region of the Takifugu chromosome (see below), reflecting the 2-fold genome size difference between medaka and Takifugu.

DNA sequence alignment with Takifugu and zebrafish
Analysis using BLASTN against the whole genome shotgun database of Takifugu (JGI Takifugu rubripes ver. 3.0) identified five Takifugu scaffolds (Scaffold940, 788, 1291, 3768, and 183) that show high homology to the selected 918.9 kb medaka DNA sequence. The orientation of the Takifugu scaffolds was determined by comparison with the medaka genomic sequence (Fig. 3). The Takifugu Scaffold788 of 90 kb was an exception, because it showed high homology with a different region of medaka LG22. We assumed that an intra-chromosomal shuffling occurred in the Takifugu (or medaka) lineage during evolution. Among the 37 medaka genes, 33 genes were found in Takifugu and they were located in the same order and direction on the same chromosomal DNA. However, four genes were not found in the Takifugu database for unknown reasons. We identified all the ORFs of the 33 Takifugu genes, and those sequences totaled 36.4 kb. These results, showing a high degree of similarity, suggest that these chromosomal regions of medaka and Takifugu were derived from the same region of a common ancestral chromosome.

Fig. 1. Alignment by BLASTZ produces aligned sequences (filled boxes) and two types of unaligned sequence (blank boxes). (a) One type is indels, clearly assigned within aligned sequences and named medaka extra sequences (MESs) and Takifugu extra sequences (TESs). (b) The other is a corresponding unconserved sequence (USBAS), unaligned due to low homology between medaka and Takifugu, but comparable because it is clipped between corresponding aligned sequences.
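The classification illustrated in Fig. 1 can be sketched in a few lines of Python. This is a minimal sketch under simplifying assumptions, not the procedure used in the study: aligned sequences are represented as hypothetical ungapped, collinear segments with half-open coordinates, a gap present in only one species is treated as an extra sequence (MES or TES), and a gap present in both species is treated as a USBAS. The `Segment` type and `classify_gaps` function are illustrative names and are not part of BLASTZ or PipMaker.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    """One ungapped aligned segment (half-open coordinates in each species)."""
    m_start: int  # medaka start
    m_end: int    # medaka end
    t_start: int  # Takifugu start
    t_end: int    # Takifugu end


def classify_gaps(segments: List[Segment]) -> List[Tuple[str, int, int]]:
    """Label the unaligned stretch between each pair of consecutive segments.

    Returns (label, medaka_length, takifugu_length) tuples, where label is
    "MES", "TES" or "USBAS" (simplified reading of Fig. 1; an assumption).
    """
    ordered = sorted(segments, key=lambda s: s.m_start)
    labelled = []
    for prev, nxt in zip(ordered, ordered[1:]):
        m_gap = nxt.m_start - prev.m_end
        t_gap = nxt.t_start - prev.t_end
        if m_gap < 0 or t_gap < 0:
            raise ValueError("segments overlap or are not collinear")
        if m_gap == 0 and t_gap == 0:
            continue  # segments are directly adjacent, nothing unaligned
        if t_gap == 0:
            label = "MES"    # extra sequence present only in medaka
        elif m_gap == 0:
            label = "TES"    # extra sequence present only in Takifugu
        else:
            label = "USBAS"  # unaligned in both species
        labelled.append((label, m_gap, t_gap))
    return labelled


# Toy coordinates, invented purely for illustration.
example = [
    Segment(0, 100, 0, 100),
    Segment(107, 200, 100, 193),
    Segment(200, 300, 198, 298),
    Segment(900, 1000, 500, 600),
]
result = classify_gaps(example)
print(result)
```

In this toy input, the first gap exists only in medaka, the second only in Takifugu, and the third in both, so the three labels are produced in that order. Real BLASTZ output records indels as gaps inside alignment blocks, so a production parser would need to read those from the alignment itself.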

Fig. 2. Contig and gene map of medaka and Takifugu. Medaka and Takifugu maintain high synteny, except for the black bar in Takifugu scaffold 788, where synteny appears disrupted. Open triangles indicate genes found in both medaka and Takifugu. Filled triangles indicate genes found only in medaka. The gene names are 1: Ppp1r3b, 2: Ptp4a, 3: Md0172F16_novel_1, 4: Gjb5, 5: Mlp, 6: NM_018045, 7: Olig2, 8: Md0170F19_novel_1, 9: Fndc5, 10: Md0170F19_novel_2, 11: Md0170F19_novel_3, 12: Arh, 13: Rhce, 14: Rhd, 15: Smp1, 16: Mgst3, 17: Md0147C05_novel_1, 18: Md0147C05_novel_2, 19: Gcpip, 20: Runx3, 21: Clic4, 22: Srrm1, 23: Tdh, 24: Mtmr9, 25: C8orf13L, 26: Lck, 27: Hdac1, 28: Bclp, 29: Md0200E16_novel_1, 30: Pabpc4, 31: Ppie, 32: Gjb4, 33: Gjb3, 34: Hmgcl, 35: Zbtb5, 36: Gale, 37: Insm1.

Fig. 3. High synteny between medaka and Takifugu, plotted by PipMaker. The arrow indicates the region in which synteny is disrupted between medaka and Takifugu, as seen in Fig. 2.

The sizes of the corresponding regions of the medaka and Takifugu chromosomes were calculated to be 789.6 kb and 387.4 kb, respectively (Table 1). The ratio of total sequence size was 2.04, which reflects the genome size ratio between medaka and Takifugu. Therefore, a detailed comparison of these sequences should provide information on genome size diversity. In these chromosomal regions, alignment by BLASTZ identified a total of 178.2 kb of DNA sequence in common, of which 36 kb were ORFs. To ascertain whether the sequence aligned outside ORFs was conserved in other species, we examined the corresponding region of zebrafish (Ensembl: zebrafish assembly ver. 4). In zebrafish, the corresponding region was divided into several small sub-regions that are assigned to at least three different chromosomes, 13, 17, and 19. The size of one such sub-region was calculated to be 91.1 kb, and its relevant regions were calculated to be 28.1 kb for Takifugu and 56.6 kb for medaka. In these small sub-regions, six genes (Clic4, Srrm1, Tdh, Mtmr9, C8orf13L, and Lck) were found in common (see Fig. 2 for medaka and Takifugu genes), and the total sequence of ORFs was equally 7.6 kb for all three fishes. Furthermore, the pair-wise comparison of those sequences by BLASTZ

determined the total homologous sequence to be 16.2 kb between medaka and Takifugu, 9.4 kb between medaka and zebrafish, and 9.2 kb between Takifugu and zebrafish. The total size of sequences well aligned outside ORFs was calculated to be 8.6 kb for the medaka-Takifugu, 2.6 kb for the medaka-zebrafish, and 2.7 kb for the Takifugu-zebrafish alignment, indicating a 3-fold abundance of homologous sequences in medaka-Takifugu as compared to the two other comparisons.

Table 1. Sequence categories and lengths for medaka and Takifugu

                                       Length (bp)                        Number
Category                               Medaka     Takifugu   Ratio        Medaka   Takifugu   Ratio
Total lengths of regions to compare    789,557    387,407    2.04
Aligned sequences                      178,201    178,201    1.00         5,578    5,578      1.00
Unaligned sequences
  Extra sequences, in ORF              597        458        1.30         87       65         1.34
  Extra sequences, out of ORF          19,302     17,707     1.09         2,574    2,506      1.03
  USBAS                                591,457    191,041    3.10         299      299        1.00
  (USBAS-nr)                           (363,316)  (178,790)  (2.03)       (299)    (299)      (1.00)

Ratio: Medaka/Takifugu. USBAS: Unaligned Sequence Between Aligned Sequence.

Repetitive elements
To clarify the contribution of repetitive elements to the 2-fold genome size difference between medaka and Takifugu, we analyzed the amount and composition of repetitive elements in the regions studied. The medaka 789.6 kb region contained a total of 229.2 kb of repetitive elements (29.0%), whereas the Takifugu 387.4 kb region contained a total of 13.8 kb of repetitive elements (3.6%). This difference (215.4 kb) in the amount of repetitive elements accounts for 53.6% of the total sequence difference (402.2 kb) in the studied region. The types of repetitive elements were quite different between medaka and Takifugu (Table 2). About one-third of the Takifugu repetitive elements were assigned by RepeatMasker2 as "simple repeats" and "low complexity sequences", consistent with the previous whole genome analysis (Aparicio et al., 2002), whereas only 3.0% of the total medaka repetitive elements were assigned to those categories. As many as 648 types of repetitive elements were identified in medaka, located at 1,422 different sites in the studied region, whereas only 29 types of repetitive elements were found at 51 different sites in the Takifugu chromosome. There were repetitive elements common to medaka and Takifugu. These include DNA transposons, SINEs, and non-LTR retrotransposons such as Chaplin, SINE_FR, Maui, REX3, and Expander. Most significantly, the medaka genome contains many copies of various "unclassified repeats", and most of them were found uniquely in medaka and not in Takifugu. In the studied region of medaka, we identified 68 copies of one particular type of "unclassified repeat", amounting to 16.7 kb in total. As many as 566 types of unclassified repeats were found in 134.0 kb of medaka DNA sequence, distributed over 1,090 sites, and their amount corresponded to 58.5% of the total repetitive elements. Unlike medaka, only 7 types of unclassified repeats were found at 12 sites in 1.8 kb of Takifugu DNA sequence. These results indicate that the abundance of low copy repeats is a characteristic feature of the medaka chromosome.

Difference in the length of unaligned sequences
There are large amounts of sequence that are not common between medaka and Takifugu. These "unaligned sequences" amounted to 611.3 kb for medaka and 209.2 kb for Takifugu, and were classified into three types. One type is abundant in medaka and defined as medaka extra sequence (MES), whereas another type is abundant in Takifugu and defined as Takifugu extra sequence (TES). In the unaligned sequences, there were 2661 MESs and 2571 TESs, but their average lengths were as small as 7.47 bp and 7.07 bp, respectively. Therefore, most MESs and TESs are just small indels and do not belong to transposable elements. The total lengths of MES and TES were calculated to be only 19.9 kb and 18.2 kb, respectively. Therefore, we concluded that MESs and TESs were not major driving forces determining the genome size of medaka and Takifugu. The remaining "unaligned sequences" were found in the regions between two aligned sequences, and these sequences were designated as USBAS ("unaligned sequences between aligned sequences"). Here, the USBAS was considered responsible for the size difference of the studied region.

The total length of USBAS in medaka was calculated to be 591.5 kb, which is 3.10 times larger than in Takifugu (191.0 kb) (Wilcoxon matched-pairs signed-rank test, z = –11.92, p < 0.0001), and the length difference was 400.5 kb. The USBAS contains repetitive elements amounting to 228.1 kb for medaka and 12.3 kb for Takifugu. Thus, approximately half of the length difference of USBAS was attributed to the repetitive elements. Therefore, we excluded those repetitive elements and re-evaluated the rest of the sequence as the USBAS with no repetitive elements (USBAS-nr). The maximum size of USBAS-nr

Table 2. Repetitive elements in medaka and Takifugu

                                   Medaka                  Takifugu                Ratio
Category                           Length (bp)   (%)       Length (bp)   (%)       (Medaka/Takifugu)
LTR                                14,696        6.41      55            0.40      267.20
LINE                               37,205        16.23     1,705         12.31     21.82
SINE                               6,610         2.89      1,074         7.76      6.15
DNA transposon                     29,827        13.01     3,985         28.78     7.48
Simple repeats & Low complexity    6,884         3.00      5,249         37.90     1.31
Unclassified                       134,001       58.46     1,780         12.85     75.28
Total                              229,223       100       13,848        100       16.55

was 6,655 bp for medaka and 3,304 bp for Takifugu. It should be noted that not all the USBAS-nrs in medaka are larger than those in Takifugu: 75 out of 296 USBAS-nrs in Takifugu were larger than their counterparts in medaka. The average length of USBAS-nr in medaka was still larger than in Takifugu (Wilcoxon matched-pairs signed-rank test, z = –10.43, p < 0.0001), and the total length was 363.3 kb for medaka and 178.8 kb for Takifugu, a ratio of 2.03. There were 36 USBAS-nrs between medaka and Takifugu and 38 USBAS-nrs between medaka and zebrafish in the Clic4-Lck regions. By further comparison, we found 16 USBAS-nrs whose positions are conserved among the three species medaka, Takifugu and zebrafish. The total length of such position-conserved USBAS-nrs of medaka was only 0.59 times that of zebrafish and 2.35 times that of Takifugu. Therefore, the position-conserved USBAS-nrs also reflect the genome size.

The length ratio for each pair of USBAS-nr between medaka and Takifugu is shown in Fig. 4. The length ratio was log-transformed with base 2 for simplicity; therefore, a value larger than 1 means that the size difference is greater than two-fold. Under this condition, the mean was 0.863 with a standard deviation of 1.543, which is significantly larger than 0 (t = 9.627, n = 296, p < 0.0001, Fig. 4a). The shape of the distribution is different from a normal distribution and significantly broader than normal (kurtosis: b2 = 6.36, p < 0.05).

We also analyzed the location of USBAS-nr and its length ratio within intergenic regions. However, no significant correlation was found between the length ratio of the USBAS-nr within intergenic regions and the distance from the USBAS-nr to the neighboring gene (medaka: r = 0.019, n = 182, p > 0.05; Takifugu: r = –0.049, n = 67, p > 0.05). Interestingly, the average log-transformed length ratio of USBAS-nr within introns (1.094, Fig. 4b) was significantly larger than that within intergenic regions (0.719, Fig. 4c) (t-test, t = 2.248, df = 294, p < 0.05). Therefore, we assumed that the length of each USBAS-nr was affected mostly at random throughout the whole genome region, and there was no bias for the location of USBAS-nr within intergenic regions. However, between medaka and Takifugu, the distribution of the length ratio of each USBAS-nr was broader than a normal distribution, and there was a significant length ratio difference between the USBAS-nr within introns and those within intergenic regions. Although the introns had a higher length ratio in the USBAS-nr, the proportion of the conserved sequence was higher in introns (medaka 22.1%, Takifugu 43.0%) than in intergenic regions (medaka 18.2%, Takifugu 39.0%).
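As a rough check of the one-sample test reported above, the t statistic can be recomputed from the summary values alone (mean 0.863, standard deviation 1.543, n = 296). The sketch below assumes a standard one-sample t-test of the mean log2 ratio against 0; the function name is illustrative and is not part of the original analysis.

```python
import math


def one_sample_t(mean: float, sd: float, n: int, mu0: float = 0.0) -> float:
    """t statistic for testing whether a sample mean differs from mu0."""
    standard_error = sd / math.sqrt(n)
    return (mean - mu0) / standard_error


# Summary values reported for the log2 length ratios of USBAS-nr pairs.
t_value = one_sample_t(mean=0.863, sd=1.543, n=296)

# Back-transform the mean log2 ratio into a fold difference
# (this is the geometric mean of the medaka/Takifugu length ratios).
fold = 2 ** 0.863

print(f"t = {t_value:.2f}, geometric mean ratio = {fold:.2f}")
```

The recomputed statistic, about 9.62, is close to the reported t = 9.627; a small gap is expected because the inputs here are already rounded to three decimals.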

Fig. 4. The distribution of length ratios of medaka and Takifugu USBAS-nr. The length ratios were log-transformed with base 2. (a) The distribution of total length ratios, showing that medaka USBAS-nr are actually twice as large as Takifugu USBAS-nr. The mean value was 0.863 with a standard deviation of 1.543. (b) The distribution of length ratios of USBAS-nrs within introns. (c) The distribution of length ratios of USBAS-nrs within intergenic regions. The mean value within introns was significantly larger than that within intergenic regions (t-test, t = 2.248, df = 294, p < 0.05).

DISCUSSION

Features of DNA sequences between medaka and Takifugu
The studied region of medaka chromosome LG22 was twice as large as the corresponding region of Takifugu, and this size ratio was identical to the genome size ratio between medaka and Takifugu. Moreover, the gene density and GC content of the studied region were almost the same as those of the entire chromosome LG22; therefore, this 1 Mb region was considered representative of the whole genome and suitable as an ideal case for analyzing the genome size difference. In the studied regions, 33 genes were located in the same order and same direction; therefore, gene number was not related to the genome size difference. Furthermore, a small sub-region containing the same 6 genes was common to the three fishes (Takifugu, medaka, and zebrafish), but their size ratio differed as 1 : 2 : 4. Therefore, we concluded that the genome size difference among these three fishes may have been caused by gain or loss of small nucleotide sequences in the non-coding region, not by drastic gain or loss of large DNA fragments. Assuming zebrafish as an outgroup, we believe that the lineage of medaka and Takifugu has decreased in genome size and that this tendency has been stronger in Takifugu.
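The size ratio among the three fishes can be recomputed directly from the sub-region sizes given in the Results (28.1 kb for Takifugu, 56.6 kb for medaka, and 91.1 kb for zebrafish). The following is a minimal sketch that normalizes each size to Takifugu; the helper name `ratios_relative_to` is illustrative only.

```python
# Sub-region sizes (kb) reported in the Results for the Clic4-Lck region.
sizes_kb = {"Takifugu": 28.1, "medaka": 56.6, "zebrafish": 91.1}


def ratios_relative_to(sizes: dict, reference: str) -> dict:
    """Divide every size by the size of the reference species."""
    base = sizes[reference]
    return {species: size / base for species, size in sizes.items()}


ratios = ratios_relative_to(sizes_kb, "Takifugu")
for species, ratio in ratios.items():
    print(f"{species}: {ratio:.2f}")
```

With these inputs the medaka ratio comes out near 2.0 and the zebrafish ratio near 3.2, so the 1 : 2 : 4 figure is probably best read as an approximation that follows the whole-genome sizes (400, 800, and 1700 Mb).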

In general, the conserved regions of chromosomes are suitable for direct comparison at the nucleotide sequence level. A quarter of the medaka DNA sequence was aligned to half of the Takifugu DNA sequence, both inside and outside ORFs. Most of the homologous sequences inside ORFs and some of the homologous sequences outside ORFs would have been subject to functional constraint during evolution. However, two thirds of the homologous sequences outside ORFs were lost in zebrafish; therefore, these lost homologous sequences would not need to be conserved among the three fishes. Some of the homologous sequences lost in zebrafish might be functional only in medaka and Takifugu; however, we believe that most of them are not functional and that their conservation may be related to the evolutionary divergence time, namely that medaka and Takifugu diverged 184 Myr ago, much more recently than the divergence between medaka and zebrafish (277 Myr ago) (Inoue et al., 2005; Yamanoue et al., 2006).

Diversity of repetitive elements
The genome sequence comparison between medaka and Takifugu revealed a large difference in the amount of repetitive elements, accounting for half of the genome size difference. We then examined the involvement of their composition in genome size evolution, as seen in other species (Boulesteix et al., 2006). The classification of repetitive elements in medaka is not comprehensive, but the composition clearly differs between medaka and Takifugu. In particular, 58.5% of the medaka repetitive elements are currently unclassified, and even the most frequent repetitive element accounts for only 2.1% (16.7 kb) of the studied region of medaka chromosome LG22. In the human genome, the most frequent repetitive element, Alu, occupies 10.6% of the total genome sequence (International Human Genome Sequencing Consortium, 2001). Moreover, most of the unclassified repeats found in medaka were not detected in Takifugu. In fact, medaka had many types of low copy unknown repetitive sequences. Also, Takifugu contained far more types of repeats, such as transposable elements, than human (Aparicio et al., 2002). Taking all these data together, fish genomes may generally be abundant in repetitive elements, and hence further analysis of these "unclassified repeats" in medaka and related species will provide insights into the evolutionary significance of their relative abundance in particular species.

Indels in the unconserved sequences MES, TES and USBAS
There was no significant difference in the total length of MESs and TESs between medaka and Takifugu, and most of the genome size difference was found within the USBAS. Half of the difference in the length of USBAS was attributed to the difference in the amounts of repetitive elements. The USBAS without repetitive elements (USBAS-nr) accounted for the remaining half of the length difference between medaka and Takifugu. We assumed that most of the sequences corresponding to USBAS-nr in both species must have been derived from the same region of a common ancestral chromosome, and that they have changed by mutations independently in each lineage. The USBAS-nr may include ancient repetitive elements that have already been subjected to many changes by various mutations over a long period of evolution, so that RepeatMasker, which relies on a library of repeats from contemporary organisms, cannot identify them. The variation of the log-transformed ratio of USBAS-nr suggested that the length of USBAS-nr in medaka changed in a way that makes each medaka USBAS-nr twice as large as the corresponding Takifugu USBAS-nr. The observed variation of the USBAS-nr ratio would fit with the idea of gradual compilation of small indels, although the length difference of USBAS-nr between medaka and Takifugu may include ancient repetitive elements. The small indels must have accumulated in the sequence that cannot be aligned between medaka and Takifugu. As discussed above, the ancestral sequence of USBAS-nr would have been degraded after divergence from the common ancestor, so that little or no homology was observed in the current USBAS-nr.

Furthermore, we found a difference in GC content between USBAS-nrs (medaka 37.2%, Takifugu 42.8%) and aligned sequences outside ORFs (medaka 44.4%, Takifugu 47.6%). This result suggests that mutations disturbed sequence homology outside ORFs and that this effect was much smaller in the GC-rich sequences (aligned sequences outside ORFs) than in the AT-rich sequences (USBAS-nrs). Therefore, the difference in evolutionary rate would be related to the heterogeneous degeneration of homology in the non-functional sequences outside ORFs, which may have resulted in higher homology in the GC-rich regions and absence of homology in the AT-rich regions. Because MES outside ORFs was AT-rich (GC% = 42.4), it was suggested that deletions have been AT-biased in Takifugu, making the Takifugu genome GC-rich. These may be the reasons why some non-coding sequences have diverged among species and others have not.

Genome size evolution
The distribution of the length ratio of each USBAS-nr between medaka and Takifugu was broader than a normal distribution. There was no bias in the location of USBAS-nr within intergenic regions; however, we identified a significant length ratio difference between USBAS-nr within introns and those within intergenic regions. These results indicated that the driving forces altering USBAS-nr length act differently on introns and intergenic regions. We deduced that a difference in indel rate, arising from this difference in driving forces, has resulted in the current length ratios of USBAS-nr within introns and intergenic regions between medaka and Takifugu. Details of the driving forces altering USBAS-nr length are difficult to deduce from this study alone. However, several other studies have shown a positive correlation between intron length and genome size (Moriyama et al., 1998; Vinogradov, 1999; McLysaght et al., 2000).

In summary, our study suggests that amplification of repetitive elements and gradual changes by indels mainly contributed to genome size evolution. The "2-fold" concordance between medaka and Takifugu does not mean that gradual changes by indels occurring in the USBAS-nr are solely responsible for variation in genome size. The contribution of repetitive elements was estimated to be 54%, and the contribution of non-coding sequences, including MESs, TESs and the length difference in USBAS-nr, was estimated to be 46% (Fig. 5). Most of the non-coding sequences must have gradually changed in two directions, gain or loss, by indels throughout the entire genome, so that the genome size could have expanded or shrunk. Further analysis of repetitive elements and indels will be necessary to better understand the relationship between amplification of repetitive elements and compilation of indels over a long period of evolutionary time.

C., Verhoef, F., Predki, P., Tay, A., Lucas, S., Richardson, P., Smith, S. F., Clark, M. S., Edwards, Y. J., Doggett, N., Zharkikh, A., Tavtigian, S. V., Pruss, D., Barnstead, M., Evans, C., Baden, H., Powell, J., Glusman, G., Rowen, L., Hood, L., Tan, Y. H., Elgar, G., Hawkins, T., Venkatesh, B., Rokhsar, D., and Brenner, S. (2002) Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297, 1301–1310.
Boulesteix, M., Weiss, M., and Biémont, C. (2006) Differences in genome size between closely related species: the Drosophila melanogaster species subgroup. Mol. Biol. Evol. 23, 162–167.
Burge, C., and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94.
Cavalier-Smith, T. (2005) Economy, speed and size matter: evolutionary forces driving nuclear genome miniaturization and expansion. Ann. Bot. (Lond) 95, 147–175.
Chapman, M. A., Donaldson, I. J., Gilbert, J., Grafham, D., Rogers, J., Green, A. R., and Göttgens, B. (2004) Analysis of multiple genomic sequence alignments: a web resource, online tools, and lessons learned from analysis of mammalian SCL loci. Genome Res. 14, 313–318.
Chipman, A. D., Khaner, O., Haas, A., and Tchernov, E. (2001) The evolution of genome size: what can be learned from anuran development? J. Exp. Zool. 291, 365–374.
Comeron, J. M., and Kreitman, M. (2000) The correlation between intron length and recombination in Drosophila: dynamic equilibrium between mutational and selective force. Genetics 156, 1175–1190.
Crollius, H. R., Jaillon, O., Dasilva, C., Ozouf-Costaz, C., Fizames, C., Fischer, C., Bouneau, L., Billault, A., Quetier, F., Saurin, W., Bernot, A., and Weissenbach, J. (2000) Characterization and repeat analysis of the compact genome of the freshwater pufferfish Tetraodon nigroviridis. Genome Res. 10, 939–949.
Ewing, B., and Green, P. (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186–194.
Ewing, B., Hillier, L., Wendl, M. C., and Green, P. (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8, 175–185.
Goode, D. K., Snell, P., and Elgar, G. (2003) Comparative analysis of vertebrate Shh genes identifies novel conserved non-

Fig. 5.
The compositions of medaka and Takifugu genome coding sequence. Mamm. Genome 14, 192–201. sequences in the studied region. The 2-fold length difference in Gordon, D., Abajian, C., and Green, P. (1998) Consed: a graphi- USBAS-nr between the two species and variation in abundance cal tool for sequence finishing. Genome Res. 8, 195–202. of repetitive elements in medaka each account for approximately Gregory, T. R. (2004) Insertion-deletion biases and the evolution half of the total length difference between the two species. of genome size. Gene 324, 15–34. Gregory, T. R. (2005) The C-value enigma in plants and : The authors thank S. K. Ishikawa for technical assistance with a review of parallels and an appeal for partnership. Ann. DNA sequencing. This work was supported by a Grant-in-Aid Bot. (Lond) 95, 133–146. for Scientific Research on the Priority Area “Study of Medaka as Griffith, O. L., Moodie, G. E., and Civetta, A. (2003) Genome size a Model for Organization and Evolution of the Nuclear Genome” and longevity in fish. Exp. Gerontol. 38, 333–337. (#813), Priority Area “Comparative ” (#015) from the Hare, M. P., and Palumbi, S. R. (2003) High intron sequence Ministry of Education, Culture, Sports, Science and Technology conservation across three mammalian orders suggests func- of Japan (MEXT). tional constraints. Mol. Biol. Evol. 20, 969–978. Hickey, A. J., and Clements, K. D. (2005) Genome size evolution in New Zealand triplefin fishes. J Hered 96, 356–362. REFERENCES Hughes, A. L., and Piontkivska, H. (2005) DNA repeat arrays in chicken and human and the adaptive evolution of Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, avian genome size. BMC Evol. Biol. 5, 12. D. J. (1990) Basic local alignment search tool. J. Mol. Biol. Inoue, J. G., Miya, M., Venkatesh, B., and Nishida, M. (2005) 215, 403–410. The mitochondrial genome of Indonesian coelacanth Latim- Aparicio, S., Chapman, J., Stupka, E., Putnam, N., Chia, J. 
M., eria menadoensis ( Sarcopterygii: Coelacanthiformes) and Dehal, P., Christoffels, A., Rash, S., Hoon, S., Smit, A., divergence time estimation between the two coelacanths. Gelpke, M. D., Roach, J., Oh, T., Ho, I. Y., Wong, M., Detter, Gene 349, 227–235. Genome size evolution of medaka and Takifugu 143

International Human Genome Sequencing Consortium. (2001) Initial sequencing and analysis of the human genome. Nature 409, 860–921.
Jaillon, O., Aury, J. M., Brunet, F., Petit, J. L., Stange-Thomann, N., Mauceli, E., Bouneau, L., Fischer, C., Ozouf-Costaz, C., Bernot, A., Nicaud, S., Jaffe, D., Fisher, S., Lutfalla, G., Dossat, C., Segurens, B., Dasilva, C., Salanoubat, M., Levy, M., Boudet, N., Castellano, S., Anthouard, V., Jubin, C., Castelli, V., Katinka, M., Vacherie, B., Biémont, C., Skalli, Z., Cattolico, L., Poulain, J., De Berardinis, V., Cruaud, C., Duprat, S., Brottier, P., Coutanceau, J. P., Gouzy, J., Parra, G., Lardier, G., Chapple, C., McKernan, K. J., McEwan, P., Bosak, S., Kellis, M., Volff, J. N., Guigó, R., Zody, M. C., Mesirov, J., Lindblad-Toh, K., Birren, B., Nusbaum, C., Kahn, D., Robinson-Rechavi, M., Laudet, V., Schachter, V., Quétier, F., Saurin, W., Scarpelli, C., Wincker, P., Lander, E. S., Weissenbach, J., and Crollius, R. H. (2004) Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 431, 946–957.
Kawasaki, K., Minoshima, S., Nakato, E., Shibuya, K., Shintani, A., Schmeits, J. L., Wang, J., and Shimizu, N. (1997) One-megabase sequence analysis of the human immunoglobulin lambda gene locus. Genome Res. 7, 250–261.
Kidwell, M. G. (2002) Transposable elements and the evolution of genome size in eukaryotes. Genetica 115, 49–63.
Koga, A., Hori, H., and Ishikawa, Y. (2002) Gamera, a family of LINE-like repetitive sequences widely distributed in medaka and related fishes. Heredity 89, 446–452.
Matsuda, M., Kawato, N., Asakawa, S., Shimizu, N., Nagahama, Y., Hamaguchi, S., Sakaizumi, M., and Hori, H. (2001) Construction of a BAC library derived from the inbred Hd-rR strain of the teleost fish, Oryzias latipes. Genes Genet. Syst. 76, 61–63.
Matsuo, M. Y., and Nonaka, M. (2004) Repetitive elements in the major histocompatibility complex (MHC) class I region of a teleost, medaka: Identification of novel transposable elements. Mech. Dev. 121, 771–777.
McLysaght, A., Enright, A. J., Skrabanek, L., and Wolfe, K. H. (2000) Estimation of synteny conservation and genome compaction between pufferfish (Fugu) and human. Yeast 17, 22–36.
Moriyama, E. N., Petrov, D. A., and Hartl, D. L. (1998) Genome size and intron size in Drosophila. Mol. Biol. Evol. 15, 770–773.
Mott, R. (1997) EST_GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Comput. Appl. Biosci. 13, 477–478.
Naruse, K., Mitani, H., and Shima, A. (1992) A highly repetitive interspersed sequence isolated from genomic DNA of the medaka, Oryzias latipes, is conserved in three other related species within the Oryzias. J. Exp. Zool. 262, 81–86.
Naruse, K., Tanaka, M., Mita, K., Shima, A., Postlethwait, J., and Mitani, H. (2004) A medaka gene map: the trace of ancestral vertebrate proto-chromosomes revealed by comparative gene mapping. Genome Res. 14, 820–828.
Neafsey, D. E., and Palumbi, S. R. (2003) Genome size evolution in pufferfish: a comparative analysis of diodontid and tetraodontid pufferfish genomes. Genome Res. 13, 821–830.
Ohtsuka, M., Kikuchi, N., Ozato, K., Inoko, H., and Kimura, M. (2004) Comparative analysis of a 229-kb medaka genomic region, containing the zic1 and zic4 genes, with Fugu, human, and mouse. Genomics 83, 1063–1071.
Petrov, D. A. (2001) Evolution of genome size: new approaches to an old problem. Trends Genet. 17, 23–28.
Petrov, D. A. (2002a) Mutational equilibrium model of genome size evolution. Theor. Popul. Biol. 61, 531–544.
Petrov, D. A. (2002b) DNA loss and evolution of genome size in Drosophila. Genetica 115, 81–91.
Petrov, D. A., Lozovskaya, E. R., and Hartl, D. L. (1996) High intrinsic rate of DNA loss in Drosophila. Nature 384, 346–349.
SanMiguel, P., Tikhonov, A., Jin, Y-K., Motchoulskaia, N., Zakharov, D., Melake-Berhan, A., Springer, P. S., Edwards, K. J., Lee, M., Avramova, Z., and Bennetzen, J. L. (1996) Nested retrotransposons in the intergenic regions of the maize genome. Science 274, 765–768.
Sasaki, T., Asakawa, S., Shimizu, A., Ishikawa, S. K., Imai, S., Himmelbauer, H., Mitani, H., Furutani-Seiki, M., Kondoh, H., Schartl, M., Hori, H., Shima, A., and Shimizu, N. (2004) Medaka Genome Mapping and Sequencing: Toward Complete Genome Sequence. Marine Biotech. 6, S445–S448.
Sasaki, T., Shimizu, A., Ishikawa, S. K., Imai, S., Asakawa, S., Murayama, Y., Khorasani, M. Z., Mitani, H., Furutani-Seiki, M., Kondoh, H., Nanda, I., Schmid, M., Schartl, M., Nonaka, M., Takeda, H., Hori, H., Himmelbauer, H., Shima, A., and Shimizu, N. (2007) The DNA sequence of medaka chromosome LG22. Genomics 89, 124–133.
Shimizu, N., Sasaki, T., Asakawa, S., Shimizu, A., Ishikawa, S. K., Imai, S., Murayama, Y., Himmelbauer, H., Mitani, H., Furutani-Seiki, M., Kondoh, H., Schartl, M., Nonaka, M., Takeda, H., Hori, H., and Shima, A. (2006) Comparative Genomics of Medaka and Fugu. Proceedings for TODAI International Symposium of Functional Genomics of Pufferfish – Recent Advances and Perspective –. Comp. Biochem. Physiol. D. 1, 6–12.
Schwartz, S., Zhang, Z., Frazer, K. A., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R., and Miller, W. (2000) PipMaker--a web server for aligning two genomic DNA sequences. Genome Res. 10, 577–586.
Sonnhammer, E. L., and Durbin, R. (1995) A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene 167, GC1–10.
Thomas, C. A. (1971) The genetic organization of chromosomes. Ann. Rev. Genet. 5, 237–256.
Thomas, J. W., Touchman, J. W., Blakesley, R. W., Bouffard, G. G., Beckstrom-Sternberg, S. M., Margulies, E. H., Blanchette, M., Siepel, A. C., Thomas, P. J., McDowell, J. C., Maskeri, B., Hansen, N. F., Schwartz, M. S., Weber, R. J., Kent, W. J., Karolchik, D., Bruen, T. C., Bevan, R., Cutler, D. J., Schwartz, S., Elnitski, L., Idol, J. R., Prasad, A. B., Lee-Lin, S. Q., Maduro, V. V., Summers, T. J., Portnoy, M. E., Dietrich, N. L., Akhter, N., Ayele, K., Benjamin, B., Cariaga, K., Brinkley, C. P., Brooks, S. Y., Granite, S., Guan, X., Gupta, J., Haghighi, P., Ho, S. L., Huang, M. C., Karlins, E., Laric, P. L., Legaspi, R., Lim, M. J., Maduro, Q. L., Masiello, C. A., Mastrian, S. D., McCloskey, J. C., Pearson, R., Stantripop, S., Tiongson, E. E., Tran, J. T., Tsurgeon, C., Vogt, J. L., Walker, M. A., Wetherby, K. D., Wiggins, L. S., Young, A. C., Zhang, L. H., Osoegawa, K., Zhu, B., Zhao, B., Shu, C. L., De Jong, P. J., Lawrence, C. E., Smit, A. F., Chakravarti, A., Haussler, D., Green, P., Miller, W., and Green, E. D. (2003) Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424, 788–793.
Vinogradov, A. E. (1999) Intron-genome size relationship on a large evolutionary scale. J. Mol. Evol. 49, 376–384.
Wendel, J. F., and Cronn, R. C. (2003) Polyploidy and the evolutionary history of cotton. Adv. Agron. 78, 139–186.
Yamanoue, Y., Miya, M., Inoue, J. G., Matsuura, K., and Nishida, M. (2006) The mitochondrial genome of spotted green pufferfish Tetraodon nigroviridis (Teleostei: Tetraodontiformes) and divergence time estimation among model organisms in fishes. Genes Genet. Syst. 81, 29–39.