Rice Transposable Elements: a Survey of 73,000 Sequence-Tagged-Connectors

Letter Rice Transposable Elements: A Survey of 73,000 Sequence-Tagged-Connectors Long Mao,1 Todd C. Wood,1 Yeisoo Yu,1 Muhammad A. Budiman,1,3 Jeff Tomkins,1 Sung-sick Woo,1,4 Maciek Sasinowski,1,5 Gernot Presting,1 David Frisch,1 Steve Goff,2 Ralph A. Dean,1,6 and Rod A. Wing1,7 1Clemson University Genomics Institute, Clemson, South Carolina 29634 USA; 2Novartis Agricultural Discovery Institute, San Diego, California 92121 USA As part of an international effort to sequence the rice genome, the Clemson University Genomics Institute is developing a sequence-tagged-connector (STC) framework. This framework includes the generation of deep-coverage BAC libraries from O. sativa ssp. japonica c.v. Nipponbare and the sequencing of both ends of the genomic DNA insert of the BAC clones. Here, we report a survey of the transposable elements (TE) in >73,000 STCs. A total of 6848 STCs were found homologous to regions of known TE sequences (E<10−5) by FASTX search of STCs against a set of 1358 TE protein sequences obtained from GenBank. Of these TE-containing STCs (TE–STCs), 88% (6027) are related to retroelements and the remaining are transposase homologs. Nearly all DNA transposons known previously in plants were present in the STCs, including maize Ac/Ds, En/Spm, Mutator, and mariner-like elements. In addition, 2746 STCs were found to contain regions homologous to known miniature inverted-repeat transposable elements (MITEs). The distribution of these MITEs in regions near genes was confirmed by EST comparisons to MITE-containing STCs, and our results showed that the association of MITEs with known EST transcripts varies by MITE type. Unlike the biased distribution of retroelements in maize, we found no evidence for the presence of gene islands when we correlated TE–STCs with a physical map of the CUGI BAC library. These analyses of TEs in nearly 50 Mb of rice genomic DNA provide an interesting and informative preview of the rice genome. Transposable elements (TEs) are ubiquitous in all or- Rice (Oryza sativa) is the main staple food for more ganisms (Burge and Howe 1989; Xiong and Eickbush than half of the world’s population and is of great eco- 1990). In plants, TEs are classified into two main nomic importance. Among the cereal grasses, rice has classes (Flavell et al. 1994). Retrotransposons comprise the smallest genome size (430 Mb) and, as revealed by Class I and transpose via an RNA intermediate. Class I comparative mapping, has substantial conservation of TEs include retrotransposons with long terminal re- synteny with other cereal crops such as maize, sor- peats (LTRs) such as Ty1/Copia-like and Ty3/Gypsy-like, ghum, and wheat (Gale and Devos 1998). Conse- as well as non-LTR retrotransposons. The class II TEs quently, rice is an ideal representative for cereal ge- transpose via a DNA intermediate and in plants nomics studies and is the focus of an international ef- have been found mainly in maize. Class II TEs in- fort to completely sequence its genome. Although clude Ac/Ds, En/Spm, and Mutator (Federoff 1989). numerous TEs have been reported in rice, no compre- MITEs, that is, miniature inverted-repeat transposable hensive investigation has been carried out on a ge- elements, such as maize Tourist and Stowaway, fall into nome-wide scale, because the majority of rice TEs were a newly described third class of TEs (Bureau and uncovered by chance or by limited assays using con- Wessler 1992, 1994a,b, 1996). The mechanism of served regions such as reverse transcriptase of retro- transposition of MITEs is still unclear, although they transposons (Hirochika et al. 1992; Motohashi et al. have received considerable attention recently due to 1996; Kumekawa et al. 1999). As part of the Interna- their high copy numbers and tendency to be associated tional Rice Genome Sequencing Project (IRGSP), a rice with genes in maize (Wessler et al. 1995; Zhang et al. BAC library was constructed from a partial HindIII di- 2000). gest of the genome of the rice variety Nipponbare (Bu- diman 1999), and the ends of BAC clone inserts have Present addresses: 3Orion Genomics, St. Louis, Missouri 63108 been sequenced. BAC end sequences will serve as se- USA; 4Department of Agronomy, Konkuk University, Seoul, quence-tagged-connectors (STCs) for selecting mini- South Korea 143-701, Korea; 5Institute for Computational Ge- nomics, 110 Clemson, South Carolina 29631 USA; 6Department mum overlapping clones for genome sequencing (Ven- of Plant Pathology, North Carolina State University, Raleigh, ter et al. 1996). North Carolina 27606 USA. 7Corresponding author. The generation of >73,000 Nipponbare STCs also E-MAIL [email protected]; FAX (864) 656–4293. provides an opportunity to preview TE content and 982 Genome Research 10:982–990 ©2000 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/00 $5.00; www.genome.org www.genome.org Rice Transposable Elements in STC Database distribution in rice genome. The current STC library contains ∼48 Mb of rice genomic DNA after vector re- moval, with an average sequence read of 707 nucleo- tides. With an average insert of 128.5 kb, the CUGI rice BAC library is expected to cover ∼10 rice genome equivalents. Preliminary efforts to confirm the coverage of the library based strictly on sequence comparison of the STCs to finished rice BACs have shown that the estimated coverage is ∼10.4 genome equivalents (data not shown). Assuming that the HindIII sites are evenly distributed, our 73,000 STCs should be distributed one STC every 9 kb across the 430-Mb rice ge- Figure 1 Classification of retrotransposons identified by FASTX nome. and TFASTX searches. Fractions shown are percentages of total TEs are one of the major sources of repetitive se- retrotransposon-containing STCs. FASTX searches were con- quences in cereal plants and have been a concern of ducted using the rice STC database as queries to search the 1358- member TE database. Classification as gypsy, copia, or non-LTR the IRGSP as a potential source of problems in com- was made on the basis of the most similar transposable element pleting the rice genome sequence. Here, we report the protein sequence. TFASTX searches were conducted using Gypsy- TE content of the STC database and show that the rice like rice RIRE2 (BAA84458, 1397 homologous STCs), Copia-like genome probably contains a small fraction of TEs in Hopscotch from maize (T02087, 528 homologous STCs), and a rice non-LTR LINE (CAA73800, 119 homologous STCs) as queries comparison with other cereal genomes, such as maize. to search the STC database. The small amount of TEs confirms rice as a well-chosen model crop genome. We note the discovery of several potentially novel TEs, and we investigate the location quences. To assess the accuracy of our retrotransposon of TE–STCs on the current physical map of the CUGI classification, we used TFASTX to search the STC data- rice BAC library. We find that the TEs appear to be base with protein sequences of representative Gypsy randomly distributed with respect to potential genes, (rice RIRE2), Copia (maize Hopscotch), and non-LTR identified by similarity to rice ESTs. (rice CAA73800) retrotransposons as query sequences. For all three searches, we found a total of 1959 STCs -Divided by retro .(5מRESULTS with significant similarity (E<10 transposon classification, the proportions of STCs TE Content of STC Library identified in each class for both the FASTX and TFASTX To analyze the number and types of TE-like elements searches were nearly identical (Fig. 1). present in the STC database, we used FASTX (Pearson et As a control, we performed an identical survey on al. 1997) to compare 73,362 BAC end sequences (STCs) 16,360 Arabidopsis STCs sequenced by Genoscope with a set of 1358 TE protein sequences. At an expec- (http://www.genoscope.cns.fr/externe/arabidopsis/ 5מ tation cut-off value of 10 or less, 6848 STCs were data/bac_ends) and compared the results from both found to contain regions of homology to known trans- species with the publicly available chromosomal se- posable elements. The vast majority of STCs (6027) are quences. In our FASTX survey of the Arabidopsis STCs, homologous to retrotransposons, whereas the remain- we found 1197 and 143 STCs homologous to retroele- ing 821 are homologous to various transposases of ments and transposases, respectively. Although the ac- class II transposons (Table 1). STCs homologous to ret- tual numbers differ, the proportions of TEs in the rice rotransposons were further classified as Gypsy-like and Arabidopsis STC databases are nearly the same, (4124), Copia-like (1401), and non-LTR (502) on the with 8.2% of the Arabidopsis STCs and 9.3% of the rice basis of classification of the most similar protein se- STCs showing homology to a TE. Within each species, retroelements account for 89.3% of Arabidopsis TE– STCs and 88.0% of rice TE–STCs (Fig. 2). The TE Table 1. Transposable Element Content of the Rice STC Database content of the chromosomal sequences from each plant shows slightly different proportions. The anno- Class Element No. tation of Arabidopsis chromosome 2 identified 563 TEs I Gypsy-like 4124 with 404 (71.7%) retroelements (Lin et al. 1999). Simi- Copia-like 1401 larly, a survey of a 1-Mb PAC contig from rice chromo- non-LTR 502 some 1 sequenced by the Rice Genome Research Pro- II Transposons 821 gram (http://www.dna.affrc.go.jp:82/genomicdata/ Other MITEs 2746 Pararetrovirus-like 3 GenomeFinished.html) revealed 68 unique regions Total 9597 homologous to TEs in TFASTX searches with the proteins of our 1358-member TE database.

Rice Transposable Elements: a Survey of 73,000 Sequence-Tagged-Connectors

Comparative Genomics of Arabidopsis and Maize: Prospects and Comment Limitations Volker Brendel*, Stefan Kurtz† and Virginia Walbot‡

Bioinformatics: a Practical Guide to the Analysis of Genes and Proteins, Second Edition Andreas D

Quality Assessment of Maize Assembled Genomic Islands (Magis) and Large-Scale Experimental Verification of Predicted Genes

An Active DNA Transposon Family in Rice

Biological Sequence Database: NCBI

5, and J. Chris Pires

Genome Survey of Misgurnus Anguillicaudatus to Identify Genomic Information, Simple Sequence Repeat (SSR) Markers and Mitochondrial Genome

SSR-HRM) Analysis for Genetic Relationship of Luffa Genotypes

Identification and Characterization of Rearrangements in the Vervet Monkey Genome

The Nuclear Genome of Brachypodium Distachyon: Analysis of BAC End Sequences

Distribution of Genes and Repetitive Elements in the Diabrotica Virgifera Virgifera Genome Estimated Using BAC Sequencing Brad S

Analysis of the Maize (Zea Mays L) Genome Using Molecular, Genetic and Computational Approaches Yan Fu Iowa State University