A GeneTrek analysis of the maize genome

Renyi Liu*, Clémentine Vitte*, Jianxin Ma*, A. Assibi Mahama†, Thanda Dhliwayo†, Michael Lee†, and Jeffrey L. Bennetzen*‡

*Department of Genetics, University of Georgia, Athens, GA 30602; and †Department of Agronomy, Iowa State University, Ames, IA 50011

Contributed by Jeffrey L. Bennetzen, May 7, 2007 (sent for review March 19, 2007)

Analysis of the sequences of 74 randomly selected BACs demonstrated that the maize nuclear genome contains ≈37,000 candidate genes with homologues in other plant species. An additional ≈5,500 predicted genes are severely truncated and probably pseudogenes. The distribution of genes is uneven, with ≈30% of BACs containing no genes. BAC gene density varies from 0 to 7.9 per 100 kb, whereas most gene islands contain only one gene. The average number of genes per gene island is 1.7. Only 72% of these genes show colinearity with the rice genome. Particular LTR retrotransposon families (e.g., Gyma) are enriched on gene-free BACs, most of which do not come from pericentromeres or other large heterochromatic regions. Gene-containing BACs are relatively enriched in different families of LTR retrotransposons (e.g., Ji). Two major bursts of LTR retrotransposon activity in the last 2 million years are responsible for the large size of the maize genome, but only the more recent of these is well represented in gene-containing BACs, suggesting that LTR retrotransposons are more efficiently removed in these domains. The results demonstrate that sample sequencing and careful annotation of a few randomly selected BACs can provide a robust description of a complex plant genome.

gene distribution | gene number | genome annotation | repetitive DNA | sample sequencing

Author contributions: R.L. and C.V. contributed equally to this work; R.L., C.V., J.M., and J.L.B. designed research; R.L., C.V., J.M., and A.A.M. performed research; R.L., C.V., J.M., A.A.M., T.D., M.L., and J.L.B. analyzed data; and R.L., C.V., M.L., and J.L.B. wrote the paper.
The authors declare no conflict of interest.
Abbreviations: TSD, target site duplication; My, million years; RJM, repeat junction marker; TE, transposable element; IBM, Intermated B73 × Mo17.
‡To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/cgi/content/full/0704258104/DC1.
© 2007 by The National Academy of Sciences of the USA

Whole-genome sequence analysis has revolutionized the field of plant genetics. A sequenced plant genome provides the full list of genetic elements and also the context in which these elements function. The near-complete sequence of Arabidopsis thaliana (1) enabled the Arabidopsis 2010 project, which proposes to characterize the function of all of the genes in the genome (2). In addition, comparative analysis of the genomes of two or more species with known evolutionary relatedness is a powerful way to identify functional elements, to transfer knowledge from well studied model organisms to related plants, and to infer the mechanisms of genome evolution. For example, comparative analysis of orthologous regions from multiple cereals, including maize, rice, sorghum, and wheat, has provided abundant information about the timing, nature, and mechanisms of small rearrangements within those genomes (3–7).

Because of the high cost of whole-genome sequencing, it is best to choose the most cost-effective sequencing method, a choice influenced by genome size, gene content, and sequence organization. For small genomes or those with genes highly separated from most repetitive DNAs, BAC-by-BAC sequencing is the obvious approach. For the larger genomes characteristic of most flowering plants, where repeats are often intermixed with genes, gene-enrichment approaches have been suggested as a more efficient sequencing strategy (8, 9). Therefore, before deciding on an effective sequencing approach, it is appropriate to first understand the general composition and structure of the target genome.

The GeneTrek approach has been proposed as an efficient way to evaluate the general properties of any genome (10, 11). The principle is to sequence and annotate a small randomly selected portion of the genome, a strategy first used by Brenner and coworkers in the analysis of the Fugu genome (12). For maize, analysis of ≈475,000 BAC end sequences led to the estimation that the maize genome contains ≈59,000 genes and is ≈58% repetitive DNA (13). Further, sequence analysis of 100 randomly selected BAC clones from maize led to the prediction of 42,000–56,000 gene models and at least 66% repetitive DNA (14). BAC sequences provide additional context information to a sample sequence analysis, thus allowing genome predictions that greatly enrich the GeneTrek approach. Here we describe a procedure for accurately predicting plant genome structure and composition with a relatively small data input and use this approach in comprehensive sequence annotation of randomly selected BACs that contain DNA from maize inbred B73. The results indicate that the maize genome contains many gene-free regions, many highly truncated gene fragments, and a nonrandom distribution of repetitive elements within different repeat-rich domains.

Results

Repeat and Mobile DNA Annotation. Repetitive elements, especially LTR retrotransposons, are a major component of the maize genome (15–17). Exhaustive identification of repetitive elements is important not only for the accurate estimation of the amount of repetitive DNA but also for the estimation of gene content. Inadequate identification of repetitive elements may lead to consistent overestimation of gene number in plants because many repeats are scored as genes by gene prediction programs, and these repeats are well represented in EST databases (18, 19).

A total of 7.56 Mb of repeats were identified on the 74 randomly selected BACs by comparing BAC sequences to the TIGR maize repeat database (20) and an in-house maize retrotransposon database (P. SanMiguel, personal communication). Intact LTR retrotransposons were then identified by a structure-based search, leading to the characterization of an additional 710 kb of LTR retrotransposons. Hence, mobile and/or repetitive DNA accounted for at least 8.3 Mb or 68% of the BAC sequences (Table 1).

Gene Annotation. When the FGENESH gene prediction program was applied to the 74 BACs, 2,137 gene models were predicted. Some models [1,790 (84%)] were parts of identified repeats (mostly LTR retrotransposons) and were removed from further analysis. Based on stringent criteria [i.e., either significant homology ($e \leq 10^{-10}$) to other species and/or colinearity with rice genes], 216 gene models were classified as verified gene candidates. After comparing to rice or Arabidopsis genes, 28 of these gene models were identified as gene fragments. This left 188 annotated genes over 74 BACs, with an average gene density of 1 gene per 65 kb (Table 1). The remaining 131 gene models were classified as "hypothetical genes," many of which will eventually be shown to be parts of heretofore undiscovered transposons (18). If these gene numbers are extrapolated to the whole maize genome, with an estimated size of 2,400 Mb, then maize contains 37,000 (verified) to

63,000 (verified plus hypothetical) genes (Table 1). Recalculating gene number using these 74 BACs, randomly sampled with replacement, indicated that the 37,000 gene number is 82% accurate with 95% confidence (data not shown).

Gene Fragments. Because maize and rice are much more closely related than are maize and Arabidopsis, it is easier to find excellent maize homology to rice sequences. As a result, 27 of the 28 gene fragments were identified through comparisons with rice genes. Among the 28 gene fragments, five are C-terminal fragments, 11 are N-terminal fragments, and 12 are fragments from the middle regions of the genes, the latter requiring at least two truncation events.

Colinearity with Rice. Because it is not possible to infer colinearity if there is only one gene on a maize BAC, 15 genes were dropped from the colinearity analysis because they are on single-gene BACs. For the remaining 173 genes, 124 (72%) were colinear with genes in the sequenced rice genome (21).

Local Gene Density and Distribution. As shown in Fig. 1, gene density, calculated as the number of genes on a BAC divided by the BAC length, varies from 0 to 7.9 per 100 kb on different BACs. This indicates that gene distribution in the maize genome is uneven, and that some regions have much higher gene density than others. Twenty-one of the 74 annotated BACs contain no verified genes at all, indicating that ≈30% of the maize genome consists of long stretches of exclusively nongenic sequence. This uneven distribution is supported by a formal χ² test (null hypothesis rejected at 99% significance level, χ² = 22.7, df = 9, P = 0.007).

Gene Islands. The gene distribution in the gene-rich regions can be evaluated as the number of genes per gene island. Here two genes are counted as on one gene island if there is <5 kb of identifiable repetitive sequences in the intergenic region between them. Genes at either end of a BAC are discarded from this analysis because one boundary of the gene island is not clear. As shown in supporting information (SI) Fig. 3, islands were found to contain one to seven (verified genes only) or eight (hypothetical genes included) genes. If hypothetical genes are included, 61% (89 of the 145) of the gene islands contain only one gene and the average number of genes per gene island is 1.8. If they are not included, the numbers are 64% (57 of 89) and 1.7, respectively.

Detailed Sequence Analysis of Six Gene-Free BACs. From this analysis of 74 BACs, 21 did not contain any verified gene (that is, either no gene candidate or only hypothetical genes were predicted). To further investigate the content of such genomic regions, 6 of these 21 BACs with complete or almost complete sequence assembly were selected and manually annotated. The result of this annotation is presented in Fig. 2 and SI Figs. 4–8. As observed for other maize genomic regions (4, 6, 14, 22–24), these regions primarily contain LTR retrotransposons that are organized in nested structures. As shown in SI Table 2, LTR retrotransposons are the largest components of all six BACs, representing 87.5% of the total sequence analyzed. The remaining sequence includes MITEs (0.1%) and other DNA transposons (1.4%). In BAC AC148161, >5% of the sequence corresponds to a tandem repeat (minisatellite) that shares >83% identity with the 180-bp knob repeat (25). Only 9.4% of the total sequence remained undefined. These sequences probably correspond to pieces of ancient repeats that are no longer recognizable because of the accumulation of deletions and substitutions (26). Among LTR retrotransposons, the most represented are Zeon (17 copies), Cinful (14 copies), Prem-1 (13 copies), Gyma (nine copies), Xilon (eight copies), Huck (seven copies), Opie (seven copies), Ji (seven copies), Shadowspawn (five copies), and Tekay (five copies). Two new gypsy-like elements, Weju and Reme, were also discovered on these BACs (SI Fig. 4 and SI Table 3).

Both parametric and nonparametric tests comparing the number of LTR retrotransposons in the gene-free vs. gene-containing regions revealed that the elements Zeon and Gyma are significantly more abundant in the gene-free than gene-containing regions when taken individually (P values of 0.029 and 0.019 and 0.003 and 0.003, respectively, for parametric and nonparametric tests). After a Bonferroni correction, only Gyma remains significant (P values after Bonferroni correction, 0.030). In contrast, the Ji and Opie elements are relatively more abundant in the gene-containing regions [individual P value <0.0001 and 0.0001 and 0.006 and 0.014, respectively, for each test, with Ji still significant after Bonferroni correction (corrected P value of 0.001)].

As presented in Fig. 2, SI Figs. 4–8, and SI Table 4, most (66) of the 103 LTR retrotransposons analyzed contain two LTRs, whereas only 37 are truncated. For 13 of these truncated elements, the truncation is at the end of the BAC sequence and thus probably does not correspond to a biological feature. Only four solo LTRs (two from Gyma and two from Ji) could be described with confidence using the presence of the target site duplication (TSD), whereas an additional LTR was complete but did not exhibit a TSD.

Among the 66 elements with two LTRs, insertion dates could be estimated for 63 (the other three did not have enough LTR sequence to be properly aligned). Results, presented in SI Table 5, show that most of these elements have inserted within the last 3 million years (My), with a first quartile, a median, a third quartile, and a maximum at 0.39, 1.35, 1.71, and 3.94 My, respectively. Histograms of the insertion dates were built, with a time scale of 1/2 My. These results, presented in SI Fig. 9, reveal two peaks of amplification, ≈1.5–2 Mya and within the last 500,000 years.

Of the 45 elements, only five (11.1%) are inserted within their own family, whereas 40 (88.9%) are within a member of another

PNAS | July 10, 2007 | vol. 104 | no. 28 | 11844–11849 | www.pnas.org/cgi/doi/10.1073/pnas.0704258104

[Fig. 1 plot: x axis, maize BACs (0–70); y axis, gene density (genes per 100 kb, 0–8); series, hypothetical genes and confirmed genes.]
Fig. 1. Gene density variation among BACs. Gene density was calculated as the gene number on a BAC divided by the BAC length. The BACs are sorted by overall gene density. The gene density is shown for both hypothetical and verified genes.

Table 1. Summary of annotation results for 74 randomly selected maize BACs
Total number of BACs analyzed                              74
Combined BAC lengths                                       12.2 Mb
Amount of identified repetitive DNA (percentage)           8.3 Mb (67%)
Number of genes with similarity or colinearity support     188
Number of severely truncated gene fragments                28
Overall gene density                                       One gene per 65 kb
Genes that show colinearity with rice (percentage)         124 (72%)
Number of hypothetical genes                               131
Number of estimated total maize genes                      37,038–62,847
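The genome-wide figures in Table 1 follow from proportional scaling of the sample counts to the estimated genome size. The sketch below reproduces that arithmetic from the rounded values quoted in the text; the function and variable names are illustrative and not taken from the authors' pipeline. Because the inputs are rounded (12.2 Mb of sampled sequence), the results only approximate the 37,038–62,847 range in Table 1, which presumably used the unrounded BAC length (an assumption).

```python
# Proportional extrapolation from the 74-BAC sample to the whole genome.
# All inputs are the rounded values quoted in the text.
SAMPLE_MB = 12.2      # combined length of the 74 BACs (Mb)
GENOME_MB = 2400.0    # estimated maize genome size (Mb)
VERIFIED = 188        # verified genes after removing 28 fragments
HYPOTHETICAL = 131    # hypothetical gene models
REPEAT_MB = 8.3       # identified repetitive DNA (Mb)


def extrapolate(count, sample_mb=SAMPLE_MB, genome_mb=GENOME_MB):
    """Scale a count observed in the sample to the whole genome."""
    return count * genome_mb / sample_mb


low = extrapolate(VERIFIED)                    # verified genes only
high = extrapolate(VERIFIED + HYPOTHETICAL)    # verified plus hypothetical
kb_per_gene = SAMPLE_MB * 1000 / VERIFIED      # average spacing of verified genes
repeat_fraction = REPEAT_MB / SAMPLE_MB        # share of sampled sequence in repeats

print(f"verified genes, genome-wide: {low:,.0f}")
print(f"verified + hypothetical:     {high:,.0f}")
print(f"gene density: one gene per {kb_per_gene:.0f} kb")
print(f"repetitive DNA: {repeat_fraction:.0%}")
```

Rounded to the nearest thousand, the two estimates match the 37,000 and 63,000 quoted in the text.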

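As a consistency check on the χ² test of uniform gene distribution reported in the Results (χ² = 22.7, df = 9), the upper-tail probability can be recomputed from the statistic and degrees of freedom alone. The sketch below is not the authors' software: `chi2_sf` is a hypothetical helper written with the Python standard library only, using the textbook power series for the regularized lower incomplete gamma function.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with df degrees of freedom."""
    a = df / 2.0
    z = x / 2.0
    if z <= 0:
        return 1.0
    # Power series: P(a, z) = z^a e^(-z) / Gamma(a) * sum_n z^n / (a (a+1) ... (a+n))
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return 1.0 - lower


p_value = chi2_sf(22.7, 9)
print(f"P = {p_value:.4f}")  # close to the reported P = 0.007
```

The recomputed value agrees with the reported P = 0.007 to the precision given, which supports rejecting the uniform-distribution null hypothesis at the 99% level.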
[Fig. 2 diagram: BAC AC147789 (169,761 bp) drawn with a 10-kb scale bar. Labelled elements: Shadowspawn-1, Xilon-1 to Xilon-4, Gyma-1, Gyma-2, Huck-1, Huck-2, Zeon-1 to Zeon-5, Cinful-1, Cinful-2, Giepum-1, Prem1-1, and unknown repeats #1 and #2. Legend: LTR retrotransposon, transposon, MITE, unknown repeat, primer for RJM (marker A6).]

Fig. 2. Graphic representation of BAC AC147789. Total size of the BAC is shown in parentheses. For LTR retrotransposons, family names are shown on top of the element, and the number after the dash represents each copy. LTR divergence is represented on top of the name. Locations of the primers used for RJMs are represented as small arrows.
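The LTR divergence values shown in Fig. 2 translate into insertion dates through the substitution rate given in Materials and Methods ($1.3 \times 10^{-8}$ substitutions per site per year). The sketch below assumes the commonly used relation $T = K/(2r)$, which is our reading of the method cited as ref. 22; the source text does not spell the formula out. The Kimura two-parameter distance is the standard textbook form, the function names are hypothetical, and the divergence of 0.040 is an illustrative value of the magnitude labelled in the figure.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_2p(ltr5, ltr3):
    """Kimura two-parameter distance between two aligned LTRs.

    Columns containing gaps or ambiguous bases are skipped.
    """
    sites = transitions = transversions = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; every other mismatch is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_time_my(divergence, rate=1.3e-8):
    """Insertion date in million years, assuming T = K / (2 r)."""
    return divergence / (2 * rate) / 1e6


# Toy alignment: one transition in 20 sites.
k_toy = kimura_2p("A" * 20, "G" + "A" * 19)
print(f"toy K2P distance: {k_toy:.4f}")
print(f"K = 0.040 -> {insertion_time_my(0.040):.2f} My")
```

Under these assumptions a divergence of 0.040 corresponds to roughly 1.5 My, which falls within the older of the two amplification peaks described in the Results.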

family (SI Table 6). The "top" elements are usually younger (based on comparison of the LTR sequences) than the elements in which they have inserted, although there are six exceptions (see Fig. 2 and SI Figs. 4–8).

Generation of Unique Markers from Repeated Sequences and Mapping of Gene-Free BACs. Gene-free genomic fragments contain mainly repetitive sequences. Hence, they may be difficult to map genetically, because they do not harbor unique sequences that can be used as molecular markers. However, the insertion of a particular LTR retrotransposon within another can be used as a unique marker, because the combination of both the nature of the two elements and the site of insertion may be unique in the genome (11). Of 29 LTR repeat junction markers (RJMs) tested, 11 (37.9%) led to the amplification of a single band of expected size in B73 maize and seven (24.1%) to the amplification of a single band of expected size, plus one or two additional faint bands (SI Table 7; see SI Table 8 for details). Five of these 18 RJMs were polymorphic between B73 and Mo17 and were used to position these BACs on the maize genetic map. These five polymorphic markers are from BACs AC147789 (marker A6), AC148081 (markers B3 and B4), and AC147809 (markers F2 and F5).

Evidence for linkage was found for all markers, leading to the positioning of BAC AC147789 on chromosome 1, bin 1.06, between markers umc1972 (15.4 cM) and asg58 (43.1 cM); BAC AC148081 on chromosome 7, bin 7.02, between markers phi034/cyp6 (0 cM) and umc1983 (18.9 cM); and BAC AC147809 on chromosome 6, bins 6.01–6.02, between markers umc1006 (4.6 cM) and bnlg1867 (4.9 cM) (SI Table 9). In cases where two markers were available for the same BAC (i.e., markers B3 and B4 on AC148081 and markers F2 and F5 on AC147809, respectively), these two markers were mapped within 0.5 cM, confirming the correlation between the bands observed and the markers designed (data not shown).

BAC fingerprint data (www.genome.arizona.edu/fpc/WebAGCoL/maize/WebFPC) were used to physically locate all six BACs. This step also confirmed the RJM mapping results and helped estimate the position of the remaining three BACs for which no RJM could be found: AC148172 in bin 4.05, AC148161 in bin 5.03, and AC148159 in bin 5.05 (SI Table 9).

Positions of the centromeres and knobs located on maize chromosomes 1, 4, 5, 6, and 7 were compared with the position of the six BACs (SI Table 9). Results of this comparison reveal that AC147789 is located >5 μm from the centromere and knob 1S described in traditional maize varieties (27), AC148172 is near (<1 μm) a centromere, AC148161 is located far (≈9 μm) from any centromere, AC148159 is located ≈2 μm from knob 5L, and AC147809 appears to be located near knob 6L1, described in traditional maize varieties and Mexican teosintes (27) (the mapping interval spans the knob region). For BAC AC148081, the Intermated B73 × Mo17 (IBM) mapping interval spans a centromere, but the interval is quite large and marker phi034/cyp6 (chromosome 7S) is tightly linked to the BAC. Therefore, it appears that BAC AC148081 is located on chromosome 7S, at least 3 μm from the centromere.

Hence, of the six gene-free BACs analyzed, one (AC148172) is located in a pericentromeric region, one (AC148159) is located close to a B73 knob, and one (AC147809) is located close to a knob that is present in traditional maize varieties and Mexican teosintes (27) but not visible in B73 by in situ hybridization (28), whereas the three other BACs are from chromosomal locations that are distant from known heterochromatic regions.

Discussion

The Structure of the Maize Genome. To sequence large plant genomes in a cost-effective manner, it is appropriate to first evaluate genome composition and structure, then choose the best sequencing strategy. Our results demonstrate that the annotation of a small set of randomly selected BACs is an effective way to evaluate the key properties of a large and complex plant genome, such as total gene number, amount of repetitive DNA, and gene distribution.

Because all large plant genomes contain numerous transposable elements (TEs) that may be difficult to identify and differentiate from genes, it is challenging to obtain an accurate estimation of gene number (18). Even in the relatively simple rice genome, the estimation of gene number from draft whole-genome sequence and finished individual chromosomes has varied from ≈32,000 to ≈70,000 (29–33). The estimation of gene number in maize is particularly challenging, because the complete genome sequence is not available, and the majority of the genome consists of nested LTR retrotransposons (15). The lower boundary of our estimation of maize gene number (37,000) is similar to the gene number estimated in the nearly completed rice genome sequence (≈32,000) (34). This is consistent with the fact that, even though maize has a fairly recent tetraploid history (35), it is approaching a diploid status because 50–90% of the duplicated copies of genes have been deleted at least partially in one of the homoeologous regions (4, 6, 36). In a recent study, Haberer et al. (14) predicted 42,000–56,000 genes in the maize genome and concluded that 22% of their 100 randomly chosen maize BACs were missing genes. The differences in our analyses, for instance the ≈30% of maize BACs that we found to be lacking in verified genes, are an outcome of the more conservative gene annotation in the current study. We found that

many of the sequences called genes by Haberer et al. (14) were actually truncated gene fragments and/or sequences within TEs.

One critical step to obtain an accurate gene number estimation in maize is to differentiate gene fragments from complete genes. In this set of BACs, 14% of the genes with homologues to other plant species are severely truncated when compared with rice or Arabidopsis genes. Gene fragments have been identified earlier in many maize regions. Some are the residual gene components retained from the loss of duplicated genes (from either segmental duplication or polyploidy) caused by the slow accumulation of small deletions associated with illegitimate recombination (4, 37). The majority, however, are associated with Helitrons, a new class of TEs that can acquire (and sometimes fuse/express) multiple gene fragments from different genes (38, 39). Although truncated genes can play a functional role because they might be expressed and translated, they are usually excluded from gene content estimation. Given our stringent criteria for prediction as a truncated gene (>30% sequence loss), it is likely that many more predicted genes in maize are actually truncated or inactivated by other types of mutations. For instance, Morgante et al. have predicted that ≈20% of annotated maize genes are actually gene fragments within Helitrons (39).

Another key step to evaluate gene content is to differentiate genes from TEs. Although a similarity search is effective in identifying known TEs, it cannot identify novel elements. It is thus important to identify TEs by structural features. In this study, 710 kb of novel LTR retrotransposon sequences were identified, and this decreased the predicted gene number by >20%.

Genomic colinearity with closely related species can be an effective way to identify real genes, because plant genomes are evolving very fast, and orthologous intergenic regions are rarely conserved (5). It has been previously demonstrated that the comparison of the homologous regions of maize and sorghum or of barley and rice was an effective way to identify genes and delineate gene structures (40, 41). However, this strategy has not been applied to large-scale genomic sequence annotation in plants. This is because the two species involved need to have appropriate evolutionary distance, which should be close enough to allow for extensive colinearity and far enough to allow for extensive divergence in the intergenic regions. For example, Arabidopsis is not an appropriate reference for rice genome analysis, because the colinearity is very limited between these two species (42–45). In this study, colinearity with rice was used to help confirm that maize genes were real, and not artifacts of annotation.

Overall gene distribution in a genome has important implications in selecting cost-effective strategies for whole-genome sequencing or map-based cloning. The annotation results show that ≈30% of maize BACs contain no verified genes. Given the gene distribution observed, the results predict that ≈50% of verified genes could be sequenced on ≈18% of maize BACs, whereas ≈90% of verified genes are present on 49% of maize BACs and ≈95% of maize genes on 60% of maize BACs. Given the general opinion that 95% gene discovery is an unacceptably low final product for a genome sequence and given the difficulties in finding all maize BACs with genes on them, these results suggest that selecting only gene-rich BACs for maize genome sequencing would be a poor genome-sequencing approach. It would be better to use gene enrichment or sequence all available BACs and accept that ≈30% of the sequenced clones will provide no gene discovery value.

Nonrandom Genome-Component Distribution Across the Maize Genome. The majority of the sequence of the six gene-free BACs corresponds to LTR retrotransposons, as expected in the LTR retrotransposon-rich maize genome. Neither LINEs nor Helitrons were found on these gene-free BACs. The absence of Helitrons in these regions was unexpected, because previous studies suggested that these elements insert mainly in repeated sequences (46). Statistical analysis of LTR retrotransposon number in gene-free vs. gene-containing regions revealed that some elements are significantly more abundant in the gene-free regions (i.e., Gyma), whereas others are significantly more abundant in the gene-containing ones (i.e., Ji). Previous studies have indicated that some LTR retrotransposons are more likely to accumulate in pericentromeric regions than in euchromatic DNA in maize and other plants (for details, see SI Text). The data presented herein suggest this bias is not limited to large heterochromatic blocks like pericentromeres or knobs but is a general property of interstitial gene-poor regions intermixed with euchromatin.

Investigation of the insertion dynamics of LTR retrotransposons in the gene-free regions indicated two peaks of amplification, around 1.5–2 Mya and within the last 500,000 years. Calculating the insertion dates of LTR retrotransposons for six gene-containing regions revealed only one peak of amplification, within the last 1 My. This striking difference suggests that LTR retrotransposons have been more efficiently removed from the gene-containing regions (47). Removal of LTR retrotransposon sequences occurs primarily by illegitimate and unequal homologous recombination (26, 48). The recombinational suppression expected in gene-poor regions should lead to a lower frequency of solo LTR generation by unequal homologous recombination, as observed in rice pericentromeric heterochromatin (49).

Mapping of gene-free BACs revealed that, among the six gene-free BACs, only two (AC148159 and AC148172) are from known heterochromatic regions such as centromeres or knobs, and the four others are located either in regions distant from the heterochromatic regions or not highly compacted in B73. Thus blocks of gene-free repetitive DNA >100 kb in size appear to be common in the maize genome and are likely to be intermixed with genic blocks.

Although BAC AC148161 does not map in a knob region (it is located on bin 5.03, whereas knob 5L is located on bin 5.07, and physical estimates place it >16 μm from the knob), it harbors a tandem repeat that shares >83% identity with the 180-bp knob repeat (50). The repeat unit is repeated 71 times on this gene-free BAC. The repeats on this BAC are mostly complete, with only 12 truncated repeats that lack from 10 to 40 bp. The copy number of the 180-bp tandem repeat usually correlates with the size of the knob (25, 51), suggesting this BAC comes from a region with a "microknob" in B73. As suggested by Viotti et al. (52), the molecular characterization of this repeat confirms that some euchromatic regions have a low concentration of knob 180-bp repeats. The other maize knob repeat, the 350-bp repeat known as TR-1 (53), is not present in BAC AC148161, in agreement with the observation that the two types of repeats tend to be isolated from each other, either in different knobs or in separate domains of the same knob (54).

Conclusions

The general properties of the maize genome can be described by sequence analysis of a small number of randomly selected large-insert clones. The overall gene number in the genome, frequency of truncated genes, and gene distribution can be unambiguously assessed. The contributions and arrangements of repetitive DNAs are also clearly delineated. The results indicate that genes are found in small islands that are unevenly distributed around the genome, and that different families of TEs preferentially associate with gene-containing or gene-free regions. Mapping of these regions suggests most of these gene-free regions are not associated with known heterochromatic features of the genome.

Materials and Methods

Randomly Selected Maize BACs. Ninety-one maize BAC sequences were downloaded from the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov) on March 4, 2004. These BACs were selected randomly from EcoRI, MboI, and HindIII BAC libraries with inserts of DNA from maize inbred B73 (55). They were sequenced by the Whitehead/Broad Institute, the University of Arizona, and Rutgers University (www.broad.mit.edu/annotation/plants/maize/randomclones.html) (14). We selected only those 80 BACs with sizes >80 kb for annotation. Six of these BACs were excluded later, because their predicted size changed dramatically upon subsequent assemblies recorded in GenBank, leaving 74 BACs covering 12.2 Mb (0.5%) of the maize genome for the analysis.

Annotation of LTR Retrotransposons and Other Repeats. Known repeats were identified by comparing BAC sequences with the TIGR maize repeat database (20) and an in-house database of maize repeats with cross-match (www.phrap.org/phredphrapconsed.html). LTR retrotransposons were manually identified by structural features such as pairs of LTRs, TSDs, a primer binding site, and a polypurine tract.

Gene Annotation and Distribution Analysis. BAC sequences were first subject to ab initio gene prediction with FGENESH (www.softberry.com) using the matrix for monocot plants. Gene models that were integral parts of known repeats were removed from further analysis. The remaining gene models were used as query sequences to search the National Center for Biotechnology Information nonredundant database, the Arabidopsis protein database (ftp://ftp.tigr.org/pub/data/a_thaliana, version 5.0), all BAC sequences from the international rice genome sequencing group (21), and the TIGR plant gene indices (56). All BLAST hits were manually evaluated to determine whether a gene model is likely to be a real gene based on e value, alignment with the query, hit annotation, and number and evolutionary range of species with sequences that exhibited homology.

A χ² test was used to test the null hypothesis that genes are uniformly distributed in the maize genome. Because one requirement of a χ² test is that cell (group) frequency must be at least 5 (57), each BAC was randomly assigned to one of the 10 groups (a randomized block design) and then the expected and observed gene number in each group was used to calculate the χ² value.

Gene Fragment Identification. Each gene model was used as a query to search Arabidopsis proteins (TIGR annotation version 5.0) and rice proteins derived from full-length cDNAs. The best match from either Arabidopsis or rice was used as query to search maize BAC DNA sequences to ensure that homologs were not shortened by annotation errors. Each alignment was manually evaluated. Maize DNA sequences upstream and downstream of the alignment were also checked to make sure that any fragment identified was not caused by sequence gaps or BAC boundaries. A gene fragment was defined by the subjective criteria that the maize gene model could cover only part (≤70%) of the homologs in Arabidopsis or rice, and the alignment between the query and hit would have a high identity that was maintained at both ends of the alignment. Hence, many gene fragments and other pseudogenes would be missed by these criteria.

Colinearity with Rice. Gene models from a single BAC were used as queries to search against rice BAC sequences. If two or more gene models had significant (BLAST e value $\leq 10^{-5}$) matches on the same rice BAC, they were judged as having colinearity with rice. Because the randomly selected maize BAC sequences were in draft stage and usually consisted of unordered pieces, it was not possible to compare the order and orientation of some genes.

Choice and Annotation of Gene-Free BACs. From the 21 BACs for which no verified gene was annotated, six were chosen for further characterization. These BAC sequences are present in GenBank under accession nos. AC147789, AC148081, AC148159, AC148172, AC148161, and AC147809 (SI Table 2). Sequence analysis was based on the sequences available in GenBank in February 2005. For all BACs except AC148161 (which contains three contigs), the sequence is represented in one contig.

Annotation of the six BAC sequences was performed in several steps, each step allowing the characterization of a certain type of repeat, such as known TEs, unknown LTR retrotransposons, tandem repeats, and previously undescribed repeats (see SI Text for details of these steps). After each step, the annotated regions were removed from the sequence to allow the characterization of other repeats in which the first annotated elements may have inserted. For each characterized TE, special care was given to detect the presence of the TSD.

Primer Design and PCR. Primers were chosen to flank the junction between two LTR retrotransposons, based on annotation of the BAC sequences from maize line B73. The TSD sequence of the inserted LTR retrotransposon was chosen to be in the middle of the PCR product, with a product size ranging from 500 to 600 bp. Primers were designed by using the Primer3 software (http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi), with a GC clamp of 2. Primer sequences are listed in SI Table 10.

DNAs were extracted from 2-week-old plants from maize lines B73 and Mo17 using the CTAB method described by M. Frohlich (http://fgp.bio.psu.edu/fgp/methods/method_CTAB.html) without conducting the CsCl purification steps. PCRs were performed in a 20-μl total volume, containing 100 ng of DNA, 2 μl of 10× Taq polymerase buffer (Roche, Indianapolis, IN), 0.16 μl of 25 mM dNTPs (Roche), 3 μl of 4 μM solutions of each primer (Invitrogen), and 0.16 μl of 5 units/μl Taq (Roche), completed with nuclease-free water (Fisher Scientific, Pittsburgh, PA). They were carried out with the following steps: denaturation at 94°C for 3 min, followed by 35 touchdown cycles with denaturation at 94°C for 30 sec, annealing with temperature decreasing from 62°C to 55°C in 30 sec, and extension of 1 min at 72°C, followed by a final extension step of 3 min at 72°C. In each cycle, the increase of the temperature between the annealing and the extension steps was carried out with a ramp of 1°C per sec.

Estimation of LTR Retrotransposon Insertion Dates. The insertion date of each LTR retrotransposon harboring two LTRs was estimated following the method described in ref. 22. The divergence between LTRs was computed by MEGA version 3.0 (58), using the Kimura 2 parameter distance (59) that corrects for both homoplasy and differences in the rates of transition and transversion. The substitution rate used was $1.3 \times 10^{-8}$ substitution/site per year, as suggested by Ma and Bennetzen (60).

Comparison of Gene-Free and Gene-Containing Regions. Six gene-containing BACs, GenBank accession nos. AF123535 (15), AF448416 (24), AY555142 (7), AY664413, AY664414, and AY664415 (23), were chosen to compare to the six gene-free BACs in this study. LTR retrotransposon copy number and insertion dates were recalculated to avoid differences caused by different analytical procedures.

Variation in copy number between the gene-free and gene-containing BACs was determined for the 10 most-abundant LTR retrotransposons. For each element, two methods were used: (i) a P value was estimated by comparing the observed copy number in the gene-free regions to a binomial distribution with n trials (n being the total number of copies found in both gene-free and gene-rich regions for this element), and a probability of a copy to be in a gene-free region estimated by $P = N_1/(N_1 + N_2)$, with $N_1$ and $N_2$ the total number of copies found for all elements in the gene-free and gene-rich regions, respectively; (ii) a P value was estimated by using a nonparametric test. For each element, a theoretical distribution of the copy number in gene-free region was built by using 999 simulations in which $N_{GF}$ copies were randomly picked without replacement from the total number of copies observed in both gene-rich and gene-free regions
For (NGF ϩ NGR), with NGF and NGR the observed number of copies

in gene-free and -rich regions, respectively. The P value was estimated by comparing the observed number of copies in the gene-free regions to this distribution. A Bonferroni correction was used to correct for the 10 individual tests that were made (one for each of the 10 most-abundant elements) out of the same sample.

BAC Mapping and Comparison to Centromere and Knob Locations. BACs with polymorphic markers were genetically mapped in a subset of 94 inbred lines from the IBM maize mapping population (61) using MapMaker 2.0. In addition, all BACs were mapped by using BAC fingerprint data retrieved from the Arizona Genomics Institute Web site (www.genome.arizona.edu/fpc/WebAGCoL/maize/WebFPC). These data were then used to locate the corresponding bin, using the location of these markers on the IBM2 2004 neighbors map.

To precisely position the six BACs on the genome, positions of the markers on the IBM2 2004 map were converted to physical locations, using the Morgan2McClintock translator (62) (available at www.lawrencelab.org/Morgan2McClintock). This allowed prediction of the physical position of the markers as a fraction of the distance on the arm from the centromere. Similarly, knob-containing bins were physically mapped on the chromosomes, allowing the estimation of the distance between the six BACs and these heterochromatic regions.

We thank P. SanMiguel for providing an unpublished maize retrotransposon data set, C. Lawrence for advice on locating maize knobs, L. Yang for Helitron analysis, and H. Zheng and J. Estill for help with statistical analyses. This research was supported by National Science Foundation Grant DBI-0501814 (to J.L.B.).
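As an illustration of the LTR insertion-date estimate described in Methods, the following minimal Python sketch computes a Kimura two-parameter distance between two pre-aligned LTR sequences and converts it to an age with the substitution rate given in the text (1.3 × 10^-8 per site per year). It is not the MEGA 3.0 implementation used in the study. The function names, the skipping of gapped or ambiguous columns, and the conversion T = K/(2r) (the relation commonly used for paired LTRs) are assumptions made for this example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
# Substitution rate quoted in the Methods (substitutions per site per year).
SUBSTITUTION_RATE = 1.3e-8


def k2p_distance(ltr5, ltr3):
    """Kimura two-parameter distance between two aligned LTR sequences.

    Hypothetical helper: columns with gaps or ambiguous bases are skipped
    (an assumption; the study relied on MEGA 3.0 for this step).
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    # K = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_age_years(distance, rate=SUBSTITUTION_RATE):
    """Age estimate T = K / (2r): both LTRs accumulate changes independently."""
    return distance / (2 * rate)


if __name__ == "__main__":
    # Toy alignment: 100 sites, two transitions and one transversion.
    left = "A" * 100
    right = "GG" + "C" + "A" * 97
    k = k2p_distance(left, right)
    print(f"K2P distance: {k:.5f}")
    print(f"Estimated insertion age: {insertion_age_years(k):,.0f} years")
```

Under these assumptions, a pair of LTRs with a distance of 0.026 corresponds to an insertion about 1 million years ago, because each LTR of an element diverges independently from the shared sequence present at the time of insertion.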

1. The Arabidopsis Genome Initiative (2000) Nature 408:796–815.
2. Ausubel FM (2002) Plant Physiol 129:394–437.
3. Ramakrishna W, Dubcovsky J, Park YJ, Busso C, Emberton J, SanMiguel P, Bennetzen JL (2002) Genetics 162:1389–1400.
4. Ilic K, SanMiguel PJ, Bennetzen JL (2003) Proc Natl Acad Sci USA 100:12265–12270.
5. Bennetzen JL, Ma J (2003) Curr Opin Plant Biol 6:128–133.
6. Lai J, Ma J, Swigonova Z, Ramakrishna W, Linton E, Llaca V, Tanyolac B, Park YJ, Jeong Y, Bennetzen JL, et al. (2004) Genome Res 14:1924–1931.
7. Ma J, SanMiguel P, Lai J, Messing J, Bennetzen JL (2005) Genetics 170:1209–1220.
8. Whitelaw CA, Barbazuk WB, Pertea G, Chan AP, Cheung F, Lee Y, Zheng L, van Heeringen S, Karamycheva S, Bennetzen JL, et al. (2003) Science 302:2118–2120.
9. Palmer LE, Rabinowicz PD, O'Shaughnessy AL, Balija VS, Nascimento LU, Dike S, de la Bastide M, Martienssen RA, McCombie WR (2003) Science 302:2115–2117.
10. Bennetzen JL (2003) Proc Tenth Int Wheat Genet Symp 1:215–220.
11. Devos KM, Ma J, Pontaroli AC, Pratt LH, Bennetzen JL (2005) Proc Natl Acad Sci USA 102:19243–19248.
12. Brenner S, Elgar G, Sandford R, Macrae A, Venkatesh B, Aparicio S (1993) Nature 366:265–268.
13. Messing J, Bharti AK, Karlowski WM, Gundlach H, Kim HR, Yu Y, Wei F, Fuks G, Soderlund CA, Mayer KFX, et al. (2004) Proc Natl Acad Sci USA 101:14349–14354.
14. Haberer G, Young S, Bharti AK, Gundlach H, Raymond C, Fuks G, Butler E, Wing RA, Rounsley S, Birren B, et al. (2005) Plant Physiol 139:1612–1624.
15. SanMiguel P, Tikhonov A, Jin YK, Motchoulskaia N, Zakharov D, Melake-Berhan A, Springer PS, Edwards KJ, Lee M, Avramova Z, et al. (1996) Science 274:765–768.
16. Meyers BC, Tingey SV, Morgante M (2001) Genome Res 11:1660–1676.
17. SanMiguel P, Bennetzen JL (1998) Ann Bot 82:37–44.
18. Bennetzen JL, Coleman C, Liu R, Ma J, Ramakrishna W (2004) Curr Opin Plant Biol 7:732–736.
19. Sabot F, Guyot R, Wicker T, Chantret N, Laubin B, Chalhoub B, Leroy P, Sourdille P, Bernard M (2005) Mol Genet 274:119–130.
20. Ouyang S, Buell CR (2004) Nucleic Acids Res 32:D360–D363.
21. Matsumoto T, Wu JZ, Kanamori H, Katayose Y, Fujisawa M, Namiki N, Mizuno H, Yamamoto K, Antonio BA, Baba T, et al. (2005) Nature 436:793–800.
22. SanMiguel P, Gaut BS, Tikhonov A, Nakajima Y, Bennetzen JL (1998) Nat Genet 20:43–45.
23. Brunner S, Fengler K, Morgante M, Tingey S, Rafalski A (2005) Plant Cell 17:343–360.
24. Fu H, Dooner HK (2002) Proc Natl Acad Sci USA 99:9573–9578.
25. Ananiev EV, Phillips RL, Rines HW (1998) Genetics 149:2025–2037.
26. Ma J, Devos KM, Bennetzen JL (2004) Genome Res 14:860–869.
27. McClintock B, Kato Y, Blumenschein A (1981) Chromosome Constitution of Races of Maize (Colegio de Postgraduados, Chapingo, Mexico).
28. Kato A, Lamb JC, Birchler JA (2004) Proc Natl Acad Sci USA 101:13554–13559.
29. Goff SA, Ricke D, Lan TH, Presting G, Wang RL, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H, et al. (2002) Science 296:92–100.
30. Yu J, Hu SN, Wang J, Wong GKS, Li SG, Liu B, Deng YJ, Dai L, Zhou Y, Zhang XQ, et al. (2002) Science 296:79–92.
31. Feng Q, Zhang YJ, Hao P, Wang SY, Fu G, Huang YC, Li Y, Zhu JJ, Liu YL, Hu X, et al. (2002) Nature 420:316–320.
32. Sasaki T, Matsumoto T, Yamamoto K, Sakata K, Baba T, Katayose Y, Wu JZ, Niimura Y, Cheng ZK, Nagamura Y, et al. (2002) Nature 420:312–316.
33. Yu Y, Rambo T, Currie J, Saski C, Kim HR, Collura K, Thompson S, Simmons J, Yang TJ, Nah G, et al. (2003) Science 300:1566–1569.
34. Itoh T, Tanaka T, Barrero RA, Yamasaki C, Fujii Y, Hilton PB, Antonio BA, Aono H, Apweiler R, Bruskiewich R, et al. (2007) Genome Res 17:175–183.
35. Swigonova Z, Lai J, Ma J, Ramakrishna W, Llaca V, Bennetzen JL, Messing J (2004) Genome Res 14:1916–1923.
36. Langham RJ, Walsh J, Dunn M, Ko C, Goff SA, Freeling M (2004) Genetics 166:935–945.
37. Ramakrishna W, Emberton J, Ogden M, SanMiguel P, Bennetzen JL (2002) Plant Cell 14:3213–3223.
38. Lai J, Li Y, Messing J, Dooner HK (2005) Proc Natl Acad Sci USA 102:9068–9073.
39. Morgante M, Brunner S, Pea G, Fengler K, Zuccolo A, Rafalski A (2005) Nat Genet 37:997–1002.
40. Dubcovsky J, Ramakrishna W, SanMiguel PJ, Busso CS, Yan LL, Shiloff BA, Bennetzen JL (2001) Plant Physiol 125:1342–1353.
41. Avramova Z, Tikhonov A, SanMiguel P, Jin YK, Liu CN, Woo SS, Wing RA, Bennetzen JL (1996) Plant J 10:1163–1168.
42. Devos KM, Beales J, Nagamura Y, Sasaki T (1999) Genome Res 9:825–829.
43. Vandepoele K, Saeys Y, Simillion C, Raes J, Van de Peer Y (2002) Genome Res 12:1792–1801.
44. Liu H, Sachidanandam R, Stein L (2001) Genome Res 11:2020–2026.
45. Bennetzen JL, SanMiguel P, Chen MS, Tikhonov A, Francki M, Avramova Z (1998) Proc Natl Acad Sci USA 95:1975–1978.
46. Brunner S, Pea G, Rafalski A (2005) Plant J 43:799–810.
47. Vitte C, Bennetzen JL (2006) Proc Natl Acad Sci USA 103:17638–17643.
48. Devos KM, Brown JKM, Bennetzen JL (2002) Genome Res 12:1075–1079.
49. Ma J, Bennetzen JL (2006) Proc Natl Acad Sci USA 103:383–388.
50. Peacock WJ, Dennis ES, Rhoades MM, Pryor AJ (1981) Proc Natl Acad Sci USA 78:4490–4494.
51. Dennis ES, Peacock WJ (1984) J Mol Evol 20:341–350.
52. Viotti A, Privitera E, Sala E, Pogna N (1985) Theor Appl Genet 70:234–239.
53. Ananiev EV, Phillips RL, Rines HW (1998) Proc Natl Acad Sci USA 95:10785–10790.
54. Dawe RK, Hiatt EN (2004) Chromosome Res 12:655–669.
55. Yim YS, Davis GL, Duru NA, Musket TA, Linton EW, Messing JW, McMullen MD, Soderlund CA, Polacco ML, Gardiner JM, et al. (2002) Plant Physiol 130:1686–1696.
56. Quackenbush J, Cho J, Lee D, Liang F, Holt I, Karamycheva S, Parvizi B, Pertea G, Sultana R, White J (2001) Nucleic Acids Res 29:159–164.
57. Bhattacharyya GK, Johnson RA (1977) Statistical Concepts and Methods (Wiley, New York).
58. Kumar S, Tamura K, Nei M (2004) Brief Bioinform 5:150–163.
59. Kimura M (1980) J Mol Evol 16:111–120.
60. Ma J, Bennetzen JL (2004) Proc Natl Acad Sci USA 101:12404–12410.
61. Lee M, Sharopova N, Beavis WD, Grant D, Katt M, Blair D, Hallauer A (2002) Plant Mol Biol 48:453–461.
62. Lawrence CJ, Seigfried TE, Bass HW, Anderson LK (2006) Genetics 172:2007–2009.

Liu et al. | PNAS | July 10, 2007 | vol. 104 | no. 28 | 11849