<<


Archived version from NCDOCKS Institutional Repository http://libres.uncg.edu/ir/asu/

Young, Intact And Nested Retrotransposons Are Abundant In The Onion And Asparagus Genomes

Authors: C. Vitte, M. C. Estep, J. Leebens-Mack and J. L. Bennetzen

Abstract

Background and Aims Although the monocotyledons comprise one of the two major groups of angiosperms and include >65 000 species, comprehensive genome analysis has been focused mainly on the Poaceae (grass) family. Due to this bias, most of the conclusions that have been drawn for monocot genome evolution are based on grasses. It is not known whether these conclusions apply to many other monocots.

Methods To extend our understanding of genome evolution in the monocots, genomic sequence data were acquired and the structural properties of asparagus and onion genomes were analysed. Specifically, several available onion and asparagus bacterial artificial chromosomes (BACs) with contig sizes >35 kb were annotated and analysed, with a particular focus on the characterization of long terminal repeat (LTR) retrotransposons.

Key Results The results reveal that LTR retrotransposons are the major components of the onion and garden asparagus genomes. These elements are mostly intact (i.e. with two LTRs), have mainly inserted within the past 6 million years and are piled up into nested structures. Analysis of shotgun genomic sequence data and the observation of two copies for some transposable elements (TEs) in annotated BACs indicate that some families have become particularly abundant, as high as 4–5 % (asparagus) or 3–4 % (onion) of the genome for the most abundant families, as also seen in large grass genomes such as wheat and maize.

Conclusions Although previous annotations of contiguous genomic sequences have suggested that LTR retrotransposons were highly fragmented in these two Asparagales genomes, the results presented here show that this was largely due to the methodology used. In contrast, this current work indicates an ensemble of genomic features similar to those observed in the Poaceae.

C. Vitte, M. C. Estep, J. Leebens-Mack and J. L. Bennetzen (2013) "Young, Intact And Nested Retrotransposons Are Abundant In The Onion And Asparagus Genomes" Annals of Botany Version of Record Available At www.academic.oup.com [doi:10.1093/aob/mct155]

INTRODUCTION

Monocotyledonous plants are divided into 11 orders: Acorales, Alismatales, Petrosaviales, Dioscoreales, Pandanales, Liliales, Asparagales, Arecales, Commelinales, Poales and Zingiberales, the last four being grouped into the Commelinid clade (Stevens et al., 2001). Most of these orders include multiple agriculturally and ornamentally important groups. Among these, the Asparagales may be the second most important for agriculture and horticulture after the Poales. The Asparagales includes the hyperdiverse Orchidaceae with >30 000 species and nearly 5000 additional species distributed across the rest of the order. These include important crops such as aloe (Aloe vera), agave (Agave tequilana), asparagus (Asparagus officinalis), garlic (Allium sativum), leek (Allium ampeloprasum), onion (Allium cepa) and vanilla (Vanilla planifolia), as well as ornamental plants such as yuccas, amarylids, daffodils, irises and orchids. With a world production of >95 Mt, it is the third most cultivated group for vegetable production in the world after the Solanales (including potato, tomato, pepper and aubergine) and the Cucurbitales (including melons, cucumbers and gourds) (world production, http://faostat.fao.org/).

Onion (A. cepa) is the most economically important member of the Asparagales. As a biological model, onion has been extensively studied at the cytological and biochemical levels. Onion is a diploid (2n = 2x = 16), but has a genome size of approx. 16 400 Mb/1C, one of the largest among all cultivated diploid species and similar in size to that of hexaploid wheat (http://data.kew.org/cvalues/). Although a large genome can facilitate cytogenetic analyses, it has hampered the development of genomic resources, thus impeding both molecular breeding and characterization of the molecular origins of its large genome. Transcriptome analyses suggest that onion transcript characteristics are quite distinct from those of Poaceae (Kuhl et al., 2004), but little is known about the transposable element (TE) composition in the onion genome. In particular, while long terminal repeat (LTR) retrotransposons are thought to be a major component of the onion genome (Pearce et al., 1996), the role that TE activity has played in shaping the current genome is poorly understood. Most of the available data on onion genome composition come from re-association kinetics (Cot) and fluorescence in situ hybridization (FISH) analyses that revealed an abundance of middle-repetitive sequences interspersed with low copy number regions (Stack and Comings, 1979; Pearce et al., 1996; Suzuki et al., 2001). Analyses of bacterial artificial chromosome (BAC)-end sequences revealed a high frequency of TEs, consistent with results obtained for other plants with large genomes (Jakše et al., 2008). Sequencing of a few onion BACs (Do et al., 2003; Jakše et al., 2008) revealed the presence of repeated sequences, including a large quantity (approx. 50 %) of TEs, but also microsatellites and direct tandem repeats. Most of the TEs found were interpreted as highly degraded, a feature that is quite different from the pattern observed in grasses, where intact and young (<5 million year old) elements are found. However, the scarcity of genomic data and the absence of any TE database from a closely related species made the detection of TEs in Asparagales particularly challenging.

Garden asparagus (A. officinalis) is the third most economically important species in the Asparagales, after onion and garlic. It is believed to be native to Europe, northern Africa and Asia, and is even more widely cultivated as a vegetable crop. Asparagus is diploid (2n = 2x = 20) and has one of the smallest genomes of the core Asparagales (approx. 1300 Mb/1C, http://data.kew.org/cvalues/). For this reason, garden asparagus has been proposed as a genomic model for the core Asparagales that would help investigate the gene content of other Asparagales with larger genome sizes, such as onion (Kuhl et al., 2005; Telgmann-Rauber et al., 2007). Synteny between asparagus and onion genomes has received preliminary investigation by mapping several onion expressed sequence tags (ESTs) on both onion chromosomes and a few studied asparagus BACs (Jakše et al., 2006). This analysis posited that many genes encoding physically linked ESTs in asparagus were located on distinct chromosomes in onion, suggesting a lack of collinearity between the two genomes. Since the ESTs analysed were not single copy [e.g. from gene duplication, perhaps via TE mobilization of genes or gene fragments (Jin and Bennetzen, 1994; Jiang et al., 2004; Lai et al., 2005; Yang and Bennetzen, 2009)], it is possible that the sequences compared were not orthologous. However, given that the divergence of the lineages leading to these species occurred approx. 87 million years ago (Mya) (Jansson and Bremer, 2005), frequent interruptions in microcollinearity are expected (Bennetzen, 2000). Hence, more detailed analyses, as well as comparison of other species within the Asparagales, are needed to investigate and extend these results.

The recent sequencing of asparagus BACs has revealed the presence of sequences showing similarities to known LTR retrotransposons from other species (Jakše et al., 2006; Telgmann-Rauber et al., 2007). However, the structure of the TEs found and their precise characterization were not assessed. A comprehensive characterization of TE sequences in the garden asparagus genome is needed, especially to investigate the origin of the large differences in genome size observed between hermaphroditic African asparagus species and dioecious species from Europe and Africa (Štajner et al., 2002).

Here we describe the acquisition of new shotgun sequence data and further annotation of available onion and asparagus BAC sequences, with a particular focus on the detailed characterization of TE sequences, to unravel the structural properties of these genomes and compare them with those of several well-studied grass genomes. Our results reveal that LTR retrotransposons are the major components of the onion and garden asparagus genomes. These elements are mostly intact (i.e. with two LTRs), have inserted <6 Mya and are piled up into nested structures, a suite of features identical to those observed in the Poaceae and in another recently analysed monocot, banana (Musa acuminata) (D'Hont et al., 2012).

MATERIALS AND METHODS

Sequence data

Contiguous sequences (contigs) from four asparagus (Asparagus officinalis) BACs (available as GenBank accessions AC183410, DQ273271 and the overlapping AC183409 and AC183411) that represent three genomic regions selected to contain a sulfite reductase gene, a sucrose transporter gene and the AOB272 cDNA (described in Jakše et al., 2006), and three full-length onion (Allium cepa) BACs (AB111058, DQ273272 and DQ273270) were retrieved from NCBI. In addition, 3235 unpaired Sanger shotgun sequences from onion were provided by C. Town, and we generated an additional 3560 unpaired Sanger sequences from garden asparagus (deposited in GenBank).

Annotation of asparagus and onion BAC sequences

Bacterial artificial chromosome sequences were used as the subject for Repeatmasker and BLASTX searches. The Repeatmasker search was performed with version 3.0 (Smit et al., 1996; http://www.repeatmasker.org) using repbase (April 2012) and redat (MIPS, May 2008) as query databases. For the BLASTX search (Altschul et al., 1990), version blastall 2.2.25 was used with the nr database (March 2012) as query and a threshold e-value ≤10⁻⁵, and without filtering out low complexity regions. In parallel, asparagus and onion shotgun reads were mapped onto the asparagus and onion BACs, respectively, in order to spot repeated regions that may have been missed by the BLASTX and Repeatmasker searches due to the absence of closely related TEs in the available repeat databases. For this, BLASTN searches using BAC sequences as subjects and shotgun sequences as queries were performed.

For each contig, results of these searches were visualized using Argo v10.0.31 (http://www.broadinstitute.org/annotation/argo) and manually evaluated to provide a first annotation. Genes were predicted using the BLASTX results. Only matches to a predicted protein from a dicotyledonous species with an e-value ≤10⁻⁵ were used, because, as onion and asparagus are monocots, genes annotated from matches to other monocot proteins with unknown function have a reasonable chance to be TE genes (Bennetzen et al., 2004). Because TE genes are much less conserved than are standard plant genes when comparing distantly related species (such as monocots with dicots), a match between a monocot genome and an unknown protein from a dicot species is more likely to be a real gene. Transposable element-coding domains were predicted using BLASTX and Repeatmasker results, and TE regions were predicted from Repeatmasker and BLASTN search results with survey sequence data. To improve the annotation of TEs, LTRs were sought in the regions flanking BLASTX-defined TE-coding domains by BLASTN2 dotplot comparison (www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi), and by running LTR_STRUC (McCarthy and McDonald, 2003).

Characterization of LTR retrotransposon types and clustering of the discovered elements into families

For all LTR retrotransposons for which a polyprotein could be detected, we looked for the order of the reverse transcriptase (RT), integrase (Int) and ribonuclease H (RNaseH) domains, a

feature that distinguishes elements of the Gypsy-like and Copia-like superfamilies. When elements were lacking some of these domains, their sub-class was determined by comparing their sequence with a database of maize LTR retrotransposons (Baucom et al., 2009) of known superfamily using tBLASTX (from blastall version 2.2.25).

For intact elements, we sought structural features such as the primer-binding site (PBS), polypurine tract (PPT) and target site duplications (TSDs). To analyse PBSs, a tRNA database (http://lowelab.ucsc.edu/GtRNAdb/) was reverse complemented and compared with the downstream region of the 5′ LTR from each element. Intact LTR retrotransposons were then clustered into families based on the overall structure of the elements (in particular the sharing of similar LTRs and PBS), and on sequence identity between copies obtained from an all vs. all retrotransposons BLASTN search with a required e-value ≤10⁻⁵.

Estimation of intact LTR retrotransposon insertion dates

The insertion date of each LTR retrotransposon with two LTRs and a TSD was estimated following the method described in SanMiguel et al. (1998). Divergence between LTRs was computed by MEGA version 4.0 (Tamura et al., 2007), using the Kimura 2-parameter distance (Kimura, 1980) that corrects for both homoplasy and differences in the rates of transition and transversion. Rough estimation of insertion dates was obtained using the substitution rate of 1.3 × 10⁻⁸ substitution per site per year that has been described for rice LTR retrotransposons (Ma and Bennetzen, 2004).

Estimation of TE copy divergence and TE genome coverage using genome survey sequences

For each intact TE characterized, nucleotide divergence between each sequence and the intact copy was estimated by running a BLASTN search using sequences of TEs as subject and shotgun sequences as query, with an e-value threshold of 10⁻³ and a minimum read coverage of 20 % (i.e. approx. 140 bp). For each element for which a minimum of 50 cases of homology was found, identity scores between the detected regions and the corresponding reference sequence were computed to build identity histograms. Results of this search were also used to compute the length occupied by each LTR retrotransposon, leading to an estimate of genome coverage for each element family. For each element, lengths of all BLAST high scoring pairs (HSPs) found were extracted and computed to estimate the amount of shotgun sequence occupied by this element and the corresponding percentage of total shotgun sequence. This percentage was then converted into a fraction of the genome using the corresponding genome size. Confidence intervals (CIs) were estimated using a bootstrap approach in which the shotgun sequence data were regenerated 100 times by resampling the same number of sequences with replacement in the original data set and performing the same analysis. Corresponding distributions were used to estimate 95 % CIs of the genome fraction occupied by each element.

RESULTS

Genome composition and structure from BAC sequences

Although a few BAC sequences from asparagus and onion have been produced over the past several years, their annotation has been hampered by the lack of TE sequences from closely related species. To better evaluate the genome structure of asparagus and onion genomes, we carefully annotated the contig sequences with size >35 000 bp from publicly available BACs of these two species, using a methodology combining searches for homology to known TEs, and structural characteristics of TEs. In addition, random shotgun reads were mapped onto the sequences of the TEs that we discovered, to assess their diversity, relative abundance and genome coverage (see the Materials and Methods). As presented in Table 1, LTR retrotransposons were found to account for the majority of both onion and asparagus contigs, while other types of TEs such as DNA transposons and LINEs (long interspersed nuclear elements) are much less abundant, contributing only a few per cent of the contig DNA. Mapping of the shotgun reads onto the contigs revealed the presence of additional repeats that could not be fully deciphered. Although these repeats could not be proven to be TEs, they may correspond to TE fragments that were missing one or both ends of the element because it was outside of the contig or had lost some segments by progressive deletion (Ma and Bennetzen, 2004) or other rearrangement. For asparagus, two

TABLE 1. BAC composition

                                                   % BAC coverage
Species    BAC name  Size of largest* contig (bp)  Gene  LTR retrotransposon  DNA transposon  LINE  Uncharacterized repeat  Microsatellite  Unknown
Asparagus  AC183410  39 875                        0.0   80.7                 0.0             0.0   0.9                     0.1             18.3
Asparagus  DQ273271  42 943                        0.5   36.9                 23.6            12.3  1.6                     0.1             25.1
Asparagus  AC183411  51 593                        0.0   90.6                 0.0             0.0   5.5                     0.0             3.9
Onion      AB111058  35 243                        6.1   4.1                  0.0             8.5   33.2                    0.1             48.0
Onion      DQ273272  84 316                        0.0   58.9                 0.0             0.0   14.2                    0.0             27.0
Onion      DQ273270  108 232                       0.0   52.4                 6.8             0.0   4.0                     0.0             36.7

LINE, long interspersed nuclear element.
* When several contigs were available, only those with size >35 000 bp were analysed. In all cases, this led to the analysis of the largest contigs only, as the second largest contig was too short to reach this threshold.

contigs (from BACs AC183410 and AC183411) are composed almost entirely of LTR retrotransposon sequences, representing approx. 81 % and approx. 91 % of the sequence analysed, respectively. On the other hand, the two onion contigs with the highest LTR retrotransposon content (from BACs DQ273272 and DQ273270) were found to include, respectively, approx. 59 % and approx. 52 % LTR retrotransposon DNA (Table 1). Analysis of contig structures (Fig. 1) revealed the presence of nested LTR retrotransposons in both species. The largest set of nested elements was observed in onion BAC DQ273272, and consists of three LTR retrotransposons. The presence of traces of other LTR retrotransposons in the flanking regions of the nests suggests that the degree of nesting could be greater than observable in these short contigs.

Type, structure and copy number of asparagus and onion LTR retrotransposons

As presented in Table 2, our annotation of large contigs enabled us to retrieve 12 and eight intact LTR retrotransposon copies in asparagus and onion, respectively. Similarity clustering of the elements revealed that some of them are members of the same family. The asparagus ART2 element and the onion ORT5 element were found at two copies each.

Of the 11 distinct LTR retrotransposons found in asparagus, five could be characterized as Copia-like (Table 2) and the rest could not be classified as either Copia-like or Gypsy-like. In onion, of the seven distinct elements found, five could be assigned to the Gypsy clade, one to the Copia clade and one could not be classified (Table 2).
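The superfamily assignments reported here rest on the order of the polyprotein domains described in the Materials and Methods. A minimal sketch of that rule follows, assuming the canonical arrangements (integrase upstream of RT in Copia-like elements, integrase downstream of RT and RNaseH in Gypsy-like elements). The function name and the domain labels are hypothetical choices for this illustration, not part of the authors' pipeline, which additionally used tBLASTX against a maize database when domains were missing.

```python
def classify_superfamily(domain_order):
    """Guess the superfamily from the 5'->3' order of pol domains.

    domain_order: list of labels among "RT", "INT" and "RNaseH"
    (hypothetical labels for this sketch), in the order in which the
    domains occur along the polyprotein.
    """
    position = {name: i for i, name in enumerate(domain_order)}
    if "RT" not in position or "INT" not in position:
        return "Unknown"  # too few domains to decide from order alone
    if position["INT"] < position["RT"]:
        return "Copia"    # integrase upstream of reverse transcriptase
    if "RNaseH" not in position or position["INT"] > position["RNaseH"]:
        return "Gypsy"    # integrase downstream of RT (and RNaseH, if seen)
    return "Unknown"      # order matches neither canonical arrangement


if __name__ == "__main__":
    print(classify_superfamily(["INT", "RT", "RNaseH"]))
    print(classify_superfamily(["RT", "RNaseH", "INT"]))
```

Elements returned as "Unknown" by such a rule correspond to the cases that would need a homology-based fallback.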

[Figure 1 image not reproduced. Labels recoverable from the drawing: (A) Onion, DQ273270 contig (108 232 bp): sulfite reductase, hypothetical protein (V. vinifera, 8 × 10⁻¹⁰), ORT1, ORT2, ORT3, ORT4, two Gypsy-like regions, putative gag/pol, incomplete En/Spm (no TIRs, no TSD); DQ273272 contig (84 316 bp): ORT5 (two copies), ORT6, ORT7, repeat, LTR retrotransposon fragments, TSD: GTCG(T/C). (B) Garden asparagus, AC183411 contig (51 593 bp): ART4, ART5 and ART6 (LTR retrotransposon polyproteins), LTR retrotransposon with partial LTRs; DQ273271 contig (42 943 bp): ART7, possible LTR retrotransposon (potential TSD but no PBS and PPT), CACTA, Hat-like transposon, phytocyanin-like, LINE, LTR retrotransposon truncated by the end of the contig. Scale bars: 10 kb.]

FIG. 1. Annotation of large contigs. Names of the BACs are highlighted below each drawing. Genes are shown in red. Regions with other colours correspond to fully characterized TEs. Grey bars show regions with homology to TEs for which the complete structure could not be detected or characterized. LTR retrotransposons are represented by lines with arrows highlighting the position of the LTRs. A 10 kb scale is given on the right of (A) onion BACs and (B) garden asparagus BACs. Abbreviations: put., putative; hyp., hypothetical; TIRs, terminal inverted repeats; TSD, target site duplication; retro, retrotransposon; tn, transposon; pp, polyprotein; trunc., truncated; e.o.c., end of contig.

TABLE 2. Properties of the LTR retrotransposons discovered

Species    BAC       Name   Type     Total size (bp)  LTR size (bp)  TSD        LTR: start ... end  PBS  PPT  Ts:Tv  LTR divergence J&C  Age J&C
Asparagus  AC183410  ART1   Copia    5207             656/505        CATCC      TGT ... ACA         Arg  Yes  2.8:1  0.055               2.1
Asparagus  AC183410  ART2   Copia    5591             588            AATTT      TGT ... ACA         Met  Yes  5.0:1  0.04                1.5
Asparagus  AC183410  ART3   Copia    5227             129            ATTTT      TGT ... ACA         Met  Yes  0:0    0                   0.0
Asparagus  AC183411  ART4   Unknown  7979             1075/1046      GTGGC      TGT ... ACA         Arg  Yes  3.8:1  0.043               1.7
Asparagus  AC183411  ART5   Unknown  7368             624/696        ATTTT      TGGT ... ACA        Arg  Yes  4.4:1  0.118               4.5
Asparagus  AC183411  ART6   Unknown  8216             1328/1260      TATAC      TGT ... ACA         Arg  Yes  2.6:1  0.009               0.3
Asparagus  DQ273271  ART7   Unknown  6756             960/958        ATCGT      TGT ... ACA         Arg  Yes  3.0:1  0.025               1.0
Asparagus  DQ273271  ART8   Unknown  6837             2122/2094      TCATC      TGT ... ATA         Met  Yes  1.2:1  0.016               0.6
Asparagus  AC183409  ART9   Unknown  2265             430/433        TTATT      TGA ... T(C/T)A     Met  Yes  5.5:1  0.031               1.2
Asparagus  AC183409  ART2   Copia    5109             693/688        TT(T/C)CT  T(G/A)T ... ACA     Met  Yes  5.4:1  0.099               3.8
Asparagus  AC183409  ART10  Unknown  7602             1185/1156      AGGAC      TGT ... ACA         Arg  Yes  1.7:1  0.017               0.7
Asparagus  AC183409  ART11  Copia    5309             617            TTGAC      TGT ... ACA         Met  Gx3  3:1    0.02                0.8
Onion      DQ273270  ORT1   Gypsy    11 347           3091/3092      TCTGT      TGT ... ACA         Met  Yes  1.7:1  0.155               6.0
Onion      DQ273270  ORT2   Gypsy    10 909           1463/1460      A(C/A)AAG  TGT ... A(–/C)A     NF   Yes  1.6:1  0.217               8.3
Onion      DQ273270  ORT3   Copia    15 510           2377           GTTTA      TGT ... ACA         Met  Yes  1.3:1  0.008               0.3
Onion      DQ273270  ORT4   Unknown  3659             962/963        No TSD     TCT ... ACA         Met  Yes  NC     NC                  NC
Onion      DQ273272  ORT5   Gypsy    10 168           2720/2718      A(C/A)TAG  T(T/G)T ... ACA     Met  Yes  1.7:1  0.032               1.2
Onion      DQ273272  ORT6   Gypsy    11 194           3217/3211      CATTC      TGA ... ACA         Met  Yes  1.2:1  0.032               1.2
Onion      DQ273272  ORT5   Gypsy    10 275           2770/2773      AATCT      TGT ... ACA         Met  Yes  0:1    0.001               0.0
Onion      DQ273272  ORT7   Gypsy    43 875*          2225/2229      ATAT(C/T)  TGT ... ACA         Met  Yes  1.8:1  0.062               2.4

TSD, target site duplication; PBS, primer-binding site; PPT, polypurine tract; Ts, transition; Tv, transversion; J&C, Jukes and Cantor; NF, not found; NC, not characterized. Arg and Met correspond to tRNA types matching the PBS.
* The size of this element may be smaller due to the possible insertion of another retrotransposon within it. A duplicated structure that could correspond to LTRs has been found, but no other LTR retrotransposon feature could be detected. The total size would be 12 566 bp if only the two repeats are removed (without what could be the internal part), or 10 633 bp if the whole structure was removed.
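The ages in Table 2 can be related to the LTR divergence values with a short calculation. The sketch below assumes the usual form of the SanMiguel et al. (1998) approach, T = K / (2r), with r = 1.3 × 10⁻⁸ substitutions per site per year as stated in the Materials and Methods; the factor of two (both LTRs accumulate substitutions independently) is an assumption of this sketch, although it does reproduce the tabulated ages. The Kimura 2-parameter formula is the standard one; the function names are hypothetical, and the helper for nested elements simply checks that ages increase from the innermost to the outermost element.

```python
import math

RICE_LTR_RATE = 1.3e-8  # substitutions per site per year (Ma and Bennetzen, 2004)


def k2p_distance(p_transitions, q_transversions):
    """Kimura 2-parameter distance from the proportions of sites that
    differ by a transition (P) and by a transversion (Q)."""
    p, q = p_transitions, q_transversions
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


def insertion_age_my(ltr_divergence, rate=RICE_LTR_RATE):
    """Insertion age in million years, assuming T = K / (2 * rate)."""
    return ltr_divergence / (2 * rate) / 1e6


def nest_is_consistent(ages_inner_to_outer):
    """True if each inserted element is no older than the one it sits in."""
    return all(a <= b for a, b in zip(ages_inner_to_outer, ages_inner_to_outer[1:]))


if __name__ == "__main__":
    # LTR divergence values taken from Table 2
    for name, divergence in [("ART1", 0.055), ("ORT1", 0.155), ("ORT2", 0.217)]:
        print(name, round(insertion_age_my(divergence), 1))
    # Ages (My) quoted in the text for the ORT5/ORT6/ORT7 nest
    print(nest_is_consistent([0.04, 1.2, 2.4]))
```

Because the rate was calibrated on rice LTR retrotransposons, the resulting ages are best read as rough estimates, as the authors themselves note.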

Analysis of the frequency of homology found for each discovered element to reads in the shotgun data set revealed great variation between element families, including some with no identified homologues identical to the analysed intact elements (i.e. ART3) and hundreds with identical sequences for ART4, ART5, ART6, ART7 and ART8 (Fig. 2, Table 3). For onion LTR retrotransposons, the number of cases of identity to the intact elements ranged from a few for ORT1, ORT2, ORT6 and ORT7 to up to 250 for ORT3 (Fig. 2, Table 3).

The total length occupied by each element was also computed using the output of the BLASTN analysis of shotgun reads compared with each TE sequence of intact elements. This led to the estimation that some elements (ART5, ART6, ART7, ART8 and ART10) represent around 10–50 Mb of the asparagus genome, while ART4 represents >50 Mb of the genome (Table 3). On the other hand, several elements (ART1, ART2, ART3, ART9 and ART11) had too few cases of homology to estimate accurately the genomic fraction they occupy (Table 3). For onion, element ORT3 was found to be very abundant, providing 560–730 Mb of the onion genome. Two other elements, ORT4 and ORT5, were estimated to contribute >80 Mb to the onion genome each, while the other three identified elements were observed to represent a very small fraction of this genome (Table 3). Altogether, these results revealed that ART4, ART7 and ART8 combine to make up 8–12 % of the asparagus genome, while ORT3, ORT4 and ORT5 together account for >5 % and up to 7 % of the onion genome.

Insertion timing of asparagus and onion LTR retrotransposons

For intact copies (i.e. copies with two LTRs), estimation of insertion dates (see the Materials and Methods) revealed that all elements have inserted within the past 6 million years, most having insertion dates <4 Mya (Fig. 3). For elements present in nested structures, insertion dates of the different elements involved in the same nest were compared (Table 2, Fig. 1), and revealed, as expected, that younger elements are always inserted into older ones: 6.0 Mya/8.3 Mya for nest ORT1/ORT2, 0.04 Mya/1.2 Mya/2.4 Mya for nest ORT5/ORT6/ORT7 and 1.6 Mya/4.5 Mya for nest ART4/ART5. Interestingly, the two oldest elements found for onion (i.e. ORT1, 6.0 Mya; and ORT2, 8.3 Mya), far older than any of the other elements found in onion (all younger than 2.5 Mya), are inserted within each other. Analysis of nucleotide pair frequencies in LTRs reveals that transitions outnumber transversions in both species, with average transition (Ts):transversion (Tv) ratios of 3.2:1 and 1.3:1 for asparagus and onion, respectively.

To get an overall idea of the timing of amplification events, we computed the distribution of nucleotide identities from pairwise alignments between each intact copy and matching random

[Figure 2 image not reproduced. Panels: (A) ART4, ART6, ART8, ART5, ART7 and ART10; (B) ORT3, ORT4 and ORT7. x-axis: percent identity (75–100); y-axis: HSP count (0–60).]

FIG. 2. Global identity scores among copies of the ART and ORT LTR retrotransposons. Pairwise identity score between each LTR retrotransposon and the random shotgun sequences with a significant BLAST homology. Only elements for which >50 high scoring pairs (HSPs) were found are shown. Vertical lines indicate average identity scores. For ORT7, the graph presented corresponds to that obtained with the full-length element as reference, but almost identical results were obtained when removing potential inserted elements. (A) Asparagus LTR retrotransposons (ART type); (B) onion LTR retrotransposons (ORT type).
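The conversion from shotgun-read homology to genome coverage, and the bootstrap confidence intervals described in the Materials and Methods, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' script: the per-read inputs (bases of each read covered by HSPs to a given element, and read lengths) are hypothetical data structures, the function names are invented for this sketch, and the interval is taken from simple percentiles of 100 resamples.

```python
import random


def genome_mb(hsp_bp_per_read, read_lengths, genome_size_mb):
    """Fraction of shotgun sequence covered by an element, scaled to the genome."""
    fraction = sum(hsp_bp_per_read) / sum(read_lengths)
    return fraction * genome_size_mb


def bootstrap_ci(hsp_bp_per_read, read_lengths, genome_size_mb, n_rep=100, seed=0):
    """95 % percentile interval from resampling reads with replacement."""
    rng = random.Random(seed)
    n = len(read_lengths)
    estimates = []
    for _ in range(n_rep):
        # Draw the same number of reads as in the original data set
        picks = [rng.randrange(n) for _ in range(n)]
        estimates.append(genome_mb([hsp_bp_per_read[i] for i in picks],
                                   [read_lengths[i] for i in picks],
                                   genome_size_mb))
    estimates.sort()
    low = estimates[int(0.025 * n_rep)]
    high = estimates[min(n_rep - 1, int(0.975 * n_rep))]
    return low, high


if __name__ == "__main__":
    # Toy data: 1000 reads of 700 bp, 47 of them fully covered by the element,
    # i.e. 4.7 % of the shotgun sequence, scaled to a 1300 Mb genome
    lengths = [700] * 1000
    covered = [700] * 47 + [0] * 953
    print(genome_mb(covered, lengths, 1300))
    print(bootstrap_ci(covered, lengths, 1300))
```

With the toy input, 4.7 % of a 1300 Mb genome corresponds to about 61 Mb. Resampling whole reads, rather than individual bases, keeps the read-to-read variation in HSP coverage in the interval.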

TABLE 3. Genomic fraction occupied by each LTR retrotransposon discovered

                Genomic shotgun sequence occupied   Portion of genome occupied (Mb)
Name   Class    Length (bp)         Percentage      From original data set  95 % CI from resampling
ART1   Copia    2802                0.1             2                       NE
ART2   Copia    10 931              0.5             6                       NE
ART3   Copia    0                   0.0             0                       NE
ART4   Unknown  109 918             4.7             61                      51–70
ART5   Unknown  27 391              1.2             15                      12–19
ART6   Unknown  36 565              1.6             20                      15–25
ART7   Unknown  67 872              2.9             37                      28–50
ART8   Unknown  49 748              2.1             27                      23–32
ART9   Unknown  1028                0.0             1                       NE
ART10  Unknown  22 434              1.0             12                      10–16
ART11  Copia    8488                0.4             5                       NE
ORT1   Gypsy    5755                0.2             33                      NE
ORT2   Gypsy    1950                0.1             11                      NE
ORT3   Copia    111 687             3.9             642                     560–730
ORT4   Unknown  23 758              0.8             137                     100–185
ORT5   Gypsy    20 798              0.7             120                     80–160
ORT6   Gypsy    1378                0.0             8                       NE

NE, not estimated.
Note that ORT7 was not analysed due to the possible presence of other elements inserted within it, thus creating the potential for an erroneous prediction of genome coverage.

[Figure 3 image not reproduced. Histogram of copy number (y-axis, 0–4) against insertion date in My (x-axis), with separate series for onion and garden asparagus.]

FIG. 3. Distribution of intact element insertion dates. Insertion dates were estimated using pairwise identity scores between the two LTRs for each copy (see Material and Methods for details).

shotgun reads for all elements showing >50 cases of high homology. For both onion and asparagus, all elements have an average identity >84 %. Using the same conversion rate as that used for estimating individual insertion dates, this indicates that the elements analysed were active within the last 6 million years (Fig. 3). Although these data only give a first estimate, because they are impacted by the age of the reference copy used, they are in accordance with the insertion dates found for individual insertions (Fig. 3, Table 2). Moreover, the detection of hits showing >95 % identity with our reference intact copy (for elements ART4, ART6, ART7, ART8, ORT3 and ORT7) reveals that, for a few elements, very recent amplification events have occurred in both genomes. ORT2, whose full-length copy was estimated to have inserted around 8 Mya (Table 2), had a copy number too low to make this pairwise alignments-based estimation.

Gene annotation

Several genes were predicted in these BACs in previous annotations (Jakše et al., 2006, 2008). In our annotation, genes were validated only when a BLASTX hit to a gene from a eudicot species could be found, as matches to proteins of other monocots with unknown function have a reasonable chance to be TE genes, while a match between a monocot genome and an unknown protein from a dicot genome is more likely to be a real gene (Bennetzen et al., 2004).

Moreover, when our extended characterization of TEs indicated overlap with a previously characterized gene, we made a concerted effort to discover whether the gene had been mis-annotated due to the fact that the element had not been detected, or if it corresponds to an insertion of the gene or gene fragment within the newly annotated element. Interestingly, on onion BAC DQ273270 originally selected for containing a sulfite reductase gene, we found that the LTR of the ORT1 element carries an insertion of unknown origin, within which is included part of a sulfite reductase gene with identity to both maize protein NP_001105302.1 and Arabidopsis thaliana protein NP_196079.1 (data not shown). Hence, the sulfite reductase homology on this BAC appears to be exclusively caused by the presence of a sulfite reductase fragment, perhaps acquired by a Pack-MULE (Jiang et al., 2004), by another transposon with gene acquisition propensities (Yang and Bennetzen, 2009) or by LTR retrotransposon ORT1 itself.

DISCUSSION

LTR retrotransposons in onion and asparagus are young, diverse and organized in nested structures

Annotation of several contigs with size >35 000 bp for onion and garden asparagus revealed the presence of many distinct LTR retrotransposons. These elements make up >50 % of the regions analysed, while only a few genes could be characterized. This structure is similar to what is observed for large grass genomes, in which small gene-rich islands are interspersed in a sea of TEs, in particular LTR retrotransposons (SanMiguel et al., 1996), a structure also observed in banana, a non-grass monocot (D'Hont et al., 2012). Because the BACs analysed were originally selected for containing agronomically important genes, the regions studied are expected to be gene rich, and this trend towards small gene islands is likely to be even more pronounced in the gene-poor regions of the genome (e.g. paracentromeric heterochromatin).

The observation of two copies for some elements in this limited BAC data set suggests that some families have become very abundant, as is seen in maize, wheat and barley (Vitte and Bennetzen, 2006). This may be particularly true for onion, as the contigs analysed correspond to only 0.001 % of the total genome size, as compared with 0.017 % for the contigs analysed from garden asparagus. This feature is reinforced by analyses of the shotgun sequence data indicating that some elements represent at least 2 % of these genomes (for instance, ART4, ART6,

ART7, ART8 and ORT3 represent 3.6–5.5, 1.1–2.1, 1.8–4.1, 1.7–2.6 and 3.4–4.5 % of the data, respectively). Because the onion genome is much larger (approx. 16 000 Mb) than that of asparagus (approx. 1300 Mb), similar proportional genome occupancy leads to much higher sequence coverage for onion. For instance, the ORT3 element is estimated to make up >550 Mb within the onion nuclear genome (a TE abundance greater than the size of the rice genome, for instance), while the ART4 element is estimated to make up only 50–70 Mb of the asparagus genome sequence. Resampling the data with replacement indicated that the accuracy of these estimations is approx. 13–35 % for the most abundant elements (Table 3). Hence, this rough comparison of genome coverage indicates >20-fold variation in abundance of different TEs in each species, indicating that some element families have amplified much more actively in recent times than have others, a feature observed for the Poaceae, but also for other more recently analysed monocot species such as banana (D'Hont et al., 2012) and date palm (Phoenix dactylifera) (Al-Dous et al., 2011).

Both LTR comparison of intact elements and pairwise analysis of shotgun data show that the different LTR retrotransposon families found have been active within the past 6 million years. This period of activity seems to be more recent for some families such as ART4, ART6, ART7, ART8, ORT3 and ORT7. This suggests that both asparagus and onion genomes have undergone amplification of LTR retrotransposons in their recent history.

The observation of Ts:Tv ratios of 3.2:1 and 1.3:1 in asparagus and onion LTR sequences, respectively, is in accordance with previous results showing that Ts:Tv rates in LTRs differ dramatically among species, from 1.6 (barley) to 3.9 (maize) (Vitte and Bennetzen, 2006). Genic regions (including introns) typically exhibit Ts:Tv ratios of 1:1. Therefore, it has been argued that a higher Ts:Tv ratio is evidence of extensive cytosine 5-methylation, because this epigenetic DNA modification increases the C to T transition rate (SanMiguel et al., 1998). Hence, as for other plant genomes that were previously analysed, most LTR retrotransposons from onion and asparagus are probably epigenetically silenced with extensive cytosine 5-methylation, at CG, CHG and CHH sites (Feng and Jacobsen, 2011).

Impact of the TE detection process on the biological features depicted

In the past few years, several contig sequences have been generated from BAC clones for onion and asparagus. Although global gene and TE content has been estimated from these BACs, detection of TEs had been made using homology searches on existing databases, as well as pairwise comparison of BAC sequences (Jakše et al., 2006, 2008; Telgmann-Rauber et al., 2007). Although this methodology gives a first approximation [...] comparisons and therefore hampers the detection of TEs based on their repetitiveness. For these reasons, we felt that the previous analyses concluding that TEs from onion and asparagus genomes were highly fragmented (Jakše et al., 2006, 2008) required some further investigation. We used a combination of searches based on both homology and the detection of TE structural features to detect the presence of TEs in these two Asparagales species. Our analyses led to the discovery of several intact LTR retrotransposons from both species. The LTR retrotransposon database that we constructed is the largest available for these two species, but needs to expand dramatically as more data are generated and similar analyses to ours are undertaken. Subsequent comparison of random shotgun reads with the newly discovered intact elements allowed characterization of the age of amplification events and provided crude approximations of relative abundance for different LTR retrotransposon families.

Our results also highlight the importance of annotation quality to provide a clear description of TEs in plant genomes. We propose that combining analysis of large contigs to discover new intact TEs and random shotgun reads mapping onto these new TEs is an efficient method to characterize TE dynamics for any newly investigated and/or large genome species. Although this study was performed using data that had been generated using the older Sanger technology, current and future high-throughput methodologies (Mardis et al., 2008) will clearly render such procedures so powerful and affordable that TE properties can be investigated in a vast number of species, therefore allowing analysis of TE dynamics across broader evolutionary times.

Finally, our analysis reveals that some of the sequences that were previously annotated as genes were erroneously predicted. This is mainly because TEs were not accurately detected, leading to false prediction of 'host genes' in the coding regions of these elements. The finding of a fragment of the sulfite reductase gene within the LTR of one LTR retrotransposon highlights the importance of precisely delimitating TE positions to better characterize genome structure and evolution. This is particularly important when comparative genome analyses are performed, as insertions of gene fragments within TEs may lead to misleading predictions of a lack of synteny between species.

ACKNOWLEDGEMENTS

We thank Chris Town for sharing onion shotgun data, Rod Wing and Dave Kurdna for performing asparagus shotgun sequencing, Eugene McCarthy for discussions regarding LTR-STRUC results, and Johann Joets for bioinformatic support and discussion on the shotgun data analysis. This work was supported by NSF grant #0607123 (to J.L.B.) and NSF grant #0841988 (to J.L.M.).
of TE content, it is not adequate to characterize TE abundance or organization within genomes, in particular for species with scarce genomic data and which are not closely related to a LITERATURE CITED model species, as is the case for onion and asparagus. First, the Al-Dous EK, George B, Al-Mahmoud ME, et al. 2011. De novo genome se- absence of known TEs from a closely related species limits quencing and comparative genomics of the date palm (Phoenix dactylifera). homology-based detection to the most conserved part of TEs. Nature Biotechnology 29: 521 – 527. Therefore, with regular annotation processes, one is likely to Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. Journal of Molecular Biology 215: 403 – 410. miss segments within elements and conclude mistakenly that Baucom RS, Estill JC, Chaparro C, et al. 2009. Exceptional diversity, non- the elements are highly truncated and have degenerated. random distribution, and rapid evolution of retroelements in the B73 Secondly, the scarcity of genomic data limits within-genome maize genome. PLoS Genetics 5: pe1000732.

Bennetzen JL. 2000. Comparative sequence analysis of plant nuclear genomes: microcollinearity and its many exceptions. The Plant Cell 12: 1021–1029.
Bennetzen JL, Coleman C, Liu R, Ma J, Ramakrishna W. 2004. Consistent over-estimation of gene number in complex plant genomes. Current Opinion in Plant Biology 7: 732–736.
D’Hont A, Denoeud F, Aury JM, et al. 2012. The banana (Musa acuminata) genome and the evolution of monocotyledonous plants. Nature 488: 213–217.
Do S, Suzuki G, Mukai Y. 2003. Genomic organization of a novel root alliinase gene, ALL1, in onion. Gene 325: 17–24.
Feng S, Jacobsen SE. 2011. Epigenetic modifications in plants: an evolutionary perspective. Current Opinion in Plant Biology 14: 179–186.
Jakše J, Telgmann A, Jung C, et al. 2006. Comparative sequence and genetic analyses of asparagus BACs reveal no microsynteny with onion or rice. Theoretical and Applied Genetics 114: 31–39.
Jakše J, Meyer JD, Suzuki G, et al. 2008. Pilot sequencing of onion genomic DNA reveals fragments of transposable elements, low gene densities, and significant gene enrichment after methyl filtration. Molecular Genetics and Genomics 280: 287–292.
Jansson T, Bremer K. 2005. The age of major monocot groups inferred from 800+ rbcL sequences. Botanical Journal of the Linnean Society 146: 395–398.
Jiang N, Bao Z, Zhang X, Eddy SR, Wessler SR. 2004. Pack-MULE transposable elements mediate gene evolution in plants. Nature 431: 569–573.
Jin YK, Bennetzen JL. 1994. Integration and nonrandom mutation of a plasma membrane proton ATPase gene fragment within the Bs1 retroelement of maize. The Plant Cell 6: 1177–1186.
Kimura M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16: 111–120.
Kuhl JC, Cheung F, Yuan Q, et al. 2004. A unique set of 11,008 onion expressed sequence tags reveals expressed sequence and genomic differences between the monocot orders Asparagales and Poales. The Plant Cell 16: 114–125.
Kuhl JC, Havey MJ, Martin WJ, et al. 2005. Comparative genomic analyses in Asparagus. Genome 48: 1052–1060.
Lai J, Li Y, Messing J, Dooner HK. 2005. Gene movement by Helitron transposons contributes to the haplotype variability of maize. Proceedings of the National Academy of Sciences, USA 102: 9068–9073.
Ma J, Bennetzen JL. 2004. Rapid recent growth and divergence of rice nuclear genomes. Proceedings of the National Academy of Sciences, USA 101: 12404–12410.
Mardis ER. 2008. Next-generation DNA sequencing methods. Annual Review of Genomics and Human Genetics 9: 387–402.
McCarthy EM, McDonald JF. 2003. LTR_STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics 19: 362–367.
Pearce SR, Pich U, Harrison G, et al. 1996. The Ty1-copia group retrotransposons of Allium cepa are distributed throughout the chromosomes but are enriched in the terminal heterochromatin. Chromosome Research 4: 357–364.
SanMiguel P, Tikhonov A, Jin YK, et al. 1996. Nested retrotransposons in the intergenic regions of the maize genome. Science 274: 765–768.
SanMiguel P, Gaut BS, Tikhonov A, Nakajima Y, Bennetzen JL. 1998. The paleontology of intergene retrotransposons of maize. Nature Genetics 20: 43–45.
Smit AFA, Hubley R, Green P. 1996–2010. RepeatMasker Open-3.0. http://www.repeatmasker.org.
Stack SM, Comings DE. 1979. The chromosomes and DNA of Allium cepa. Chromosoma 70: 161–181.
Stajner N, Bohanec B, Javornik B. 2002. Genetic variability of economically important Asparagus species as revealed by genome size analysis and rDNA ITS polymorphisms. Plant Science 162: 931–937.
Stevens PF. 2001 onwards. Angiosperm Phylogeny Website. Version 9, June 2008 [and more or less continuously updated since]. http://www.mobot.org/MOBOT/research/APweb/
Suzuki G, Ura A, Saito N, et al. 2001. BAC FISH analysis in Allium cepa. Genes and Genetic Systems 76: 251–255.
Tamura K, Dudley J, Nei M, Kumar S. 2007. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Molecular Biology and Evolution 24: 1596–1599.
Telgmann-Rauber A, Jamsari A, Kinney MS, Pires JC, Jung C. 2007. Genetic and physical maps around the sex-determining M-locus of the dioecious plant asparagus. Molecular Genetics and Genomics 278: 221–234.
Vitte C, Bennetzen JL. 2006. Analysis of retrotransposon structural diversity uncovers properties and propensities in angiosperm genome evolution. Proceedings of the National Academy of Sciences, USA 103: 17638–17643.
Yang L, Bennetzen JL. 2009. Distribution, diversity, evolution, and survival of Helitrons in the maize genome. Proceedings of the National Academy of Sciences, USA 106: 19922–19927.
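
The genome-occupancy arithmetic and the resampling step described in the Discussion can be illustrated with a short Python sketch. It is not the authors' pipeline: the function names, the per-read 0/1 hit flags, the toy read set and the 95 % percentile interval are all assumptions made for illustration. Only the genome sizes and the percentage ranges are taken from the text.

```python
import random

# Approximate genome sizes (Mb) as quoted in the text.
ONION_GENOME_MB = 16_000
ASPARAGUS_GENOME_MB = 1_300


def occupancy_mb(fraction_range, genome_mb):
    """Convert a (low, high) genome fraction into a (low, high) size in Mb."""
    low, high = fraction_range
    return low * genome_mb, high * genome_mb


def bootstrap_interval(read_hits, n_resamples=1000, seed=0):
    """Percentile interval for the fraction of reads matching one TE family.

    read_hits is a list of 0/1 flags (hypothetical input: 1 means the
    shotgun read matched the family). Reads are resampled with replacement,
    and the 2.5th and 97.5th percentiles of the resampled fractions are
    returned. The choice of a 95 % interval is an assumption.
    """
    rng = random.Random(seed)
    n = len(read_hits)
    estimates = []
    for _ in range(n_resamples):
        sample = rng.choices(read_hits, k=n)  # sampling with replacement
        estimates.append(sum(sample) / n)
    estimates.sort()
    low = estimates[int(0.025 * n_resamples)]
    high = estimates[int(0.975 * n_resamples) - 1]
    return low, high


if __name__ == "__main__":
    # Percentage ranges quoted in the text, expressed as fractions.
    print("ORT3 (onion), Mb:", occupancy_mb((0.034, 0.045), ONION_GENOME_MB))
    print("ART4 (asparagus), Mb:", occupancy_mb((0.036, 0.055), ASPARAGUS_GENOME_MB))

    # Toy data, not from the study: 40 matching reads out of 1000.
    toy_reads = [1] * 40 + [0] * 960
    print("Bootstrap interval:", bootstrap_interval(toy_reads))
```

With the quoted ranges, ORT3 works out at roughly 544–720 Mb of the onion genome and ART4 at roughly 47–72 Mb of the asparagus genome, close to the values given in the text. The same proportional occupancy therefore corresponds to about a tenfold larger absolute amount of sequence in onion.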