A Genetrek Analysis of the Maize Genome
Total Page:16
File Type:pdf, Size:1020Kb
A GeneTrek analysis of the maize genome Renyi Liu*, Cle´ mentine Vitte*, Jianxin Ma*, A. Assibi Mahama†, Thanda Dhliwayo†, Michael Lee†, and Jeffrey L. Bennetzen*‡ *Department of Genetics, University of Georgia, Athens, GA 30602; and †Department of Agronomy, Iowa State University, Ames, IA 50011 Contributed by Jeffrey L. Bennetzen, May 7, 2007 (sent for review March 19, 2007) Analysis of the sequences of 74 randomly selected BACs demon- BAC clones from maize led to the prediction of 42,000–56,000 gene strated that the maize nuclear genome contains Ϸ37,000 candidate models and at least 66% repetitive DNA (14). genes with homologues in other plant species. An additional BAC sequences provide additional context information to a Ϸ5,500 predicted genes are severely truncated and probably pseu- sample sequence analysis, thus allowing genome predictions that dogenes. The distribution of genes is uneven, with Ϸ30% of BACs greatly enrich the GeneTrek approach. Here we describe a proce- containing no genes. BAC gene density varies from 0 to 7.9 per 100 dure for accurately predicting plant genome structure and compo- kb, whereas most gene islands contain only one gene. The average sition with a relatively small data input and use this approach in number of genes per gene island is 1.7. Only 72% of these genes comprehensive sequence annotation of randomly selected BACs show collinearity with the rice genome. Particular LTR retrotrans- that contain DNA from maize inbred B73. The results indicate that poson families (e.g., Gyma) are enriched on gene-free BACs, most the maize genome contains many gene-free regions, many highly of which do not come from pericentromeres or other large het- truncated gene fragments, and a nonrandom distribution of repet- erochromatic regions. Gene-containing BACs are relatively en- itive elements within different repeat-rich domains riched in different families of LTR retrotransposons (e.g., Ji). Two major bursts of LTR retrotransposon activity in the last 2 million years Results are responsible for the large size of the maize genome, but only the Repeat and Mobile DNA Annotation. Repetitive elements, especially more recent of these is well represented in gene-containing BACs, LTR retrotransposons, are a major component of the maize suggesting that LTR retrotransposons are more efficiently removed genome (15–17). Exhaustive identification of repetitive elements is in these domains. The results demonstrate that sample sequencing important not only for the accurate estimation of the amount of and careful annotation of a few randomly selected BACs can repetitive DNA but also for the estimation of gene content. provide a robust description of a complex plant genome. Inadequate identification of repetitive elements may lead to con- sistent overestimation of gene number in plants because many gene distribution ͉ gene number ͉ genome annotation ͉ repetitive DNA ͉ repeats are scored as genes by gene prediction programs, and these sample sequencing repeats are well represented in EST databases (18, 19). A total of 7.56 Mb of repeats were identified on the 74 randomly hole-genome sequence analysis has revolutionized the selected BACs by comparing BAC sequences to the TIGR maize Wfield of plant genetics. A sequenced plant genome pro- repeat database (20) and an in-house maize retrotransposon data- vides the full list of genetic elements and also the context in base (P. SanMiguel, personal communication). Intact LTR retro- which these elements function. The near-complete sequence of transposons were then identified by a structure-based search, Arabidopsis thaliana (1) enabled the Arabidopsis 2010 project, leading to the characterization of an additional 710 kb of LTR which proposes to characterize the function of all of the genes in retrotransposons. Hence, mobile and/or repetitive DNA accounted the genome (2). In addition, comparative analysis of the genomes for at least 8.3 Mb or 68% of the BAC sequences (Table 1). of two or more species with known evolutionary relatedness is a powerful way to identify functional elements, to transfer knowl- Gene Annotation. When the FGENESH gene prediction program edge from well studied model organisms to related plants, and was applied to the 74 BACs, 2,137 gene models were predicted. to infer the mechanisms of genome evolution. For example, Some models [1,790 (84%)] were parts of identified repeats (mostly comparative analysis of orthologous regions from multiple ce- LTR retrotransposons) and were removed from further analysis. reals, including maize, rice, sorghum and wheat, has provided Based on stringent criteria (i.e., either significant homology (e Յ Ϫ10 abundant information about the timing, nature and mechanisms 10 ) to other species and/or colinearity with rice genes), 216 of small rearrangements within those genomes (3–7). gene models were classified as verified gene candidates. After Because of the high cost of whole-genome sequencing, it is best comparing to rice or Arabidopsis genes, 28 of these gene models to choose the most cost-effective sequencing method, a choice were identified as gene fragments. This left 188 annotated genes influenced by genome size, gene content, and sequence organiza- over 74 BACs, with an average gene density of 1 gene per 65 kb tion. For small genomes or those with genes highly separated from (Table 1). The remaining 131 gene models were classified as most repetitive DNAs, BAC-by-BAC sequencing is the obvious ‘‘hypothetical proteins,’’ many of which will eventually be shown to approach. For the larger genomes characteristic of most flowering be parts of heretofore undiscovered transposons (18). If these gene plants, where repeats are often intermixed with genes, gene- numbers are extrapolated to the whole maize genome, with an enrichment approaches have been suggested as a more efficient estimated size of 2,400 Mb, then maize contains 37,000 (verified) to sequencing strategy (8, 9). Therefore, before deciding on an effec- tive sequencing approach, it is appropriate to first understand the Author contributions: R.L. and C.V. contributed equally to this work; R.L., C.V., J.M., and general composition and structure of the target genome. J.L.B. designed research; R.L., C.V., J.M., and A.A.M. performed research; R.L., C.V., J.M., The GeneTrek approach has been proposed as an efficient way A.A.M., T.D., M.L., and J.L.B. analyzed data; and R.L., C.V., M.L., and J.L.B. wrote the paper. to evaluate the general properties of any genome (10, 11). The The authors declare no conflict of interest. principle is to sequence and annotate a small randomly selected Abbreviations: TSD, target site duplication; My, million years; RJM, repeat junction marker; portion of the genome, a strategy first used by Brenner and TE, transposable element; IBM, Intermated B73 ϫ Mo17. coworkers in the analysis of the Fugu genome (12). For maize, ‡To whom correspondence should be addressed. E-mail: [email protected]. analysis of Ϸ475,000 BAC end sequences led to the estimation that This article contains supporting information online at www.pnas.org/cgi/content/full/ the maize genome contains Ϸ59,000 genes and is Ϸ58% repetitive 0704258104/DC1. DNA (13). Further, sequence analysis of 100 randomly selected © 2007 by The National Academy of Sciences of the USA 11844–11849 ͉ PNAS ͉ July 10, 2007 ͉ vol. 104 ͉ no. 28 www.pnas.org͞cgi͞doi͞10.1073͞pnas.0704258104 Downloaded by guest on October 1, 2021 Table 1. Summary of annotation results for 74 randomly Gene Islands. The gene distribution in the gene-rich regions can be selected maize BACs evaluated as the number of genes per gene island. Here two genes Ͻ Total number of BACs analyzed 74 are counted as on one gene island if there is 5 kb of identifiable Combined BAC lengths 12.2 Mb repetitive sequences in the intergenic region between them. Genes Amount of identified repetitive DNA 8.3 Mb (67%) at either end of a BAC are discarded from this analysis because one (percentage) boundary of the gene island is not clear. As shown in supporting Number of genes with similarity or 188 information (SI) Fig. 3, islands were found to contain one to seven collinearity support (verified genes only) or eight (hypothetical genes included) genes. Number of severely truncated gene 28 If hypothetical genes are included, 61% (89 of the 145) of the gene fragments islands contain only one gene and the average number of genes per Overall gene density One gene per 65 kb gene island is 1.8. If they are not included, the numbers are 64% (57 Genes that show collinearity with rice 124 (72%) of 89) and 1.7, respectively. (percentage) Detailed Sequence Analysis of Six Gene-Free BACs. Number of hypothetical genes 131 From this analysis Number of estimated total maize genes 37,038–62,847 of 74 BACs, 21 did not contain any verified gene (that is, either no gene candidate or only hypothetical genes were predicted). To further investigate the content of such genomic regions, 6 of these 21 BACs with complete or almost complete sequence assembly 63,000 (verified plus hypothetical) genes (Table 1). Recalculating were selected and manually annotated. The result of this annotation gene number using these 74 BACs, randomly sampled with re- is presented in Fig. 2 and SI Figs. 4–8. As observed for other maize placement, indicated that the 37,000 gene number is 82% accurate genomic regions (4, 6, 14, 22–24), these regions primarily contain with 95% confidence (data not shown). LTR retrotransposons that are organized in nested structures. As shown in SI Table 2, LTR retrotransposons are the largest com- Gene Fragments. Because maize and rice are much more closely ponents of all six BACs, representing 87.5% of the total sequence related than are maize and Arabidopsis, it is easier to find excellent analyzed.