Available online at www.sciencedirect.com

Progress, challenges and the future of crop genomes
Todd P Michael¹ and Robert VanBuren²

The availability of plant reference genomes has ushered in a new era of crop genomics. More than 100 plant genomes have been sequenced since 2000, 63% of which are crop species. These genome sequences provide insight into architecture, evolution and novel aspects of crop genomes such as the retention of key agronomic traits after whole genome duplication events. Some crops have very large, polyploid, repeat-rich genomes, which require innovative strategies for sequencing, assembly and analysis. Even low quality reference genomes have the potential to improve crop germplasm through genome-wide molecular markers, which decrease expensive phenotyping and breeding cycles. The next stage of plant genomics will require draft genome refinement, building resources for crop wild relatives, resequencing broad diversity panels, and plant ENCODE projects to better understand the complexities of these highly diverse genomes.

Addresses
¹ Ibis Biosciences, Carlsbad, CA, United States
² Donald Danforth Plant Science Center, St. Louis, MO, United States

Corresponding author: VanBuren, Robert ([email protected])

Current Opinion in Plant Biology 2015, 24:71–81
This review comes from a themed issue on Genome studies and molecular genetics
Edited by Insuk Lee and Todd C Mockler
For a complete overview see the Issue and the Editorial
Available online 19th February 2015
http://dx.doi.org/10.1016/j.pbi.2015.02.002
1369-5266/© 2015 Elsevier Ltd. All rights reserved.

Introduction

After the release of the Arabidopsis genome in 2000 [1] and the advent of Next Generation Sequencing (NGS) technology in 2005, the number of sequenced plant genomes has rapidly increased to more than 100 ([2], List of sequenced plant genomes; URL: https://genomevolution.org/wiki/index.php/Sequenced_plant_genomes). Nearly two-thirds (63%) of the sequenced plant genomes are from crops, while model species, non-model species and crop wild relatives make up the remainder; three-fourths (76%) of the sequenced plant genomes are from eudicots and one-fifth (19%) are from monocots. Few genomes from non-flowering plants have been published thus far, with only three from the Gymnospermae, one from the Bryophyta and one from the Lycopodiophyta (Figure 1, Table 1).

The high throughput and low cost of NGS technologies have made it possible to sequence crops with lower economic value or large genomes and have paved the way for establishing new model species. The complexity and size of some crop genomes made traditional Sanger sequencing cost prohibitive. The wheat genome, for instance, is hexaploid, 90% repetitive, and 17 gigabases (Gb) in size, and the sugarcane genome ranges in ploidy up to decaploid, with 80% of its 12 Gb being repetitive. Although sequencing capacity and computational power are increasing exponentially, numerous challenges remain, and both novel methodologies and legacy techniques are important to crack these impossible genomes.

Model plant genomes such as Arabidopsis [1], Brachypodium distachyon [3], Physcomitrella patens (moss [4]) and Setaria italica [5,6] serve as an engine for research, while others like Oryza sativa (rice [7,8]), Populus trichocarpa (poplar [9]), Zea mays (maize [10]), Glycine max (soybean [11]), Solanum lycopersicum (tomato [12]), and Pinus taeda (loblolly pine [13]) serve a dual purpose, not just as crops but as functional models. Together these genomes have provided the foundation for an era of molecular genomics research that has enabled functional definition of many key genes and pathways.

Non-model and non-crop plant genomes provide important clues to plant genome architecture and the evolution of flowering plants. Although it was thought that plants have a 'one-way ticket to genome obesity' as a result of the retention of proliferating transposable elements (TEs) [14], the smallest plant genomes [15], Utricularia gibba (bladderwort) and Genlisea aurea (corkscrew plant), provided evidence that almost all intergenic space and repeat sequence can be purged [16,17]. In addition, the aquatic, highly morphologically reduced, non-grass monocot Spirodela polyrhiza (greater duckweed) has a genome similar in size to Arabidopsis yet functions with 28% fewer genes (19,623) [18]. The genomes of Selaginella moellendorffii (spikemoss [19]) and Amborella trichopoda [20] provide the evolutionary links to the vascular plants and the angiosperms, respectively, yielding key insights into the trajectory of plant-specific gene families and the radiation of flowering plants.

In this review we focus primarily on the most recently sequenced specialty and row crop genomes, with an emphasis on the challenges and limitations of current genome sequencing techniques. This segues into downstream work aimed at linking the genome to the biology, and concludes with the future of plant genomics.
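To make the wheat and sugarcane figures quoted in the introduction concrete, the short sketch below works out how much of each genome is non-repetitive. It uses only the sizes and repeat percentages given in the text; the class and function names (`CropGenome`, `non_repetitive_gb`, `summarize`) are hypothetical and chosen for illustration, and the result is back-of-the-envelope arithmetic rather than a measured quantity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CropGenome:
    """Genome size and repeat content as quoted in the text (illustrative container)."""
    name: str
    size_gb: float          # total genome size in gigabases
    repeat_fraction: float  # fraction of the genome described as repetitive

    def non_repetitive_gb(self) -> float:
        # Sequence left over once the repetitive fraction is set aside.
        return self.size_gb * (1.0 - self.repeat_fraction)


# Values taken from the introduction: wheat 17 Gb, 90% repetitive;
# sugarcane 12 Gb, 80% repetitive.
GENOMES = [
    CropGenome("wheat", 17.0, 0.90),
    CropGenome("sugarcane", 12.0, 0.80),
]


def summarize(genomes):
    """Return the non-repetitive size in Gb for each genome, rounded to 2 decimals."""
    return {g.name: round(g.non_repetitive_gb(), 2) for g in genomes}


if __name__ == "__main__":
    for name, gb in summarize(GENOMES).items():
        print(f"{name}: ~{gb} Gb non-repetitive")
```

On these figures, only about 1.7 Gb of wheat and 2.4 Gb of sugarcane lies outside repeats, which illustrates why repeat content, rather than raw size alone, dominates the assembly problem.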
Figure 1. Sequenced plant genomes arranged by lineage (Chlamydomonas and Volvox; land plants; vascular plants; seed plants; flowering plants; monocots; basal eudicots; core eudicots, including asterids and rosids I and II). Legend: sequencing technology (Sanger only; Sanger + 454/Illumina; 454 + Illumina; Illumina only); whole genome duplication; whole genome triplication; polyploid crop species.

Major challenges in crop genome sequencing projects

Genome assembly tools, which were generally designed and tested for non-plant species [21], are ill suited for handling the issues of genome size, repeat content, paralogy, and heterozygosity that are common in plant genomes. The throughput of NGS technologies has made it economical to sequence most crop genomes, but resolving plant genome complexity with 100–200 bp reads is still a challenge. Most recent mammalian genomes are assembled into chromosome-scale regions [22], but most draft plant genomes remain in thousands of highly fragmented contigs or hundreds of scaffolds with numerous embedded gaps. Even the Arabidopsis genome, which is arguably the best-assembled plant genome, is still in 102 contigs with a total gap length of at least 185,644 bp (TAIR 10 [23]).

Genome size and repeat content, which are often highly correlated, present a major problem for plant genome assembly. Genome size in plants varies by more than three orders of magnitude, from 61 Mb (Genlisea tuberosa) to over 150,000 Mb (Paris japonica) (reviewed in [24]). NGS platforms can now generate enough raw data to sequence large genomes, but assembling so much data is a major computational problem. The loblolly pine genome is the largest genome assembled to date (22 Gb), and its assembly used a preprocessed, condensed set of 'super-reads' to reduce the computational resources needed [13].

Outcrossing species like grape, clonally propagated crops like apple, and long-lived trees like Eucalyptus tend to have high levels of within-genome heterozygosity. Paralogous regions and heterozygous sites create 'bubbles' during genome assembly, where two or more highly similar regions assemble together and the adjacent dissimilar regions assemble separately but eventually merge again (Figure 2a). Assembly issues stemming from polyploidy or heterozygosity can be overcome by using diploid progenitor species ('robusta' coffee [26] and wheat [27]), closely related wild diploid species (woodland strawberry [28]), haploid/monoploid lines (citrus [29,30], banana [31] or peach [32]), or a bacterial artificial chromosome (BAC) by BAC sequencing approach (maize [10] or pear [33]) (Figure 2).

Organelle DNA contamination can be a major problem in genome sequencing projects. Plant cells can have over 100 chloroplasts with up to 10,000 plastid DNA copies per cell [34], and organelle-derived reads can constitute 5–20% of the total sequences in a whole genome sequencing (WGS) project. Modified DNA extraction protocols optimized for nuclei isolation are typically used, which can reduce organelle contamination several fold [35]. Organelle contamination can be tested before library construction using a simple qPCR protocol [35]. Plant nuclear genomes also contain numerous organelle-derived sequences that can be nearly identical to the organelle genomes themselves; accurately sequencing through these regions requires read lengths that can span the insertion junction sites.

Overcoming the challenges of sequencing plant genomes requires advances in sequencing technology. Longer read lengths provided by third generation single

Repeats are a major problem in genome assembly, and resolving repeat structures requires sequencing read lengths that exceed the 10–20 kb repeats commonly found in plant genomes. Class I 'copy and paste' long terminal repeat (LTR) retrotransposons are the most prevalent repeat in plant genome