Genome Sequence of the Progenitor of the Wheat D Genome Aegilops Tauschii Ming-Cheng Luo1*, Yong Q
Total Page:16
File Type:pdf, Size:1020Kb
OPEN LETTER doi:10.1038/nature24486 Genome sequence of the progenitor of the wheat D genome Aegilops tauschii Ming-Cheng Luo1*, Yong Q. Gu2*, Daniela Puiu3*, Hao Wang4,5,6*, Sven O. Twardziok7*, Karin R. Deal1, Naxin Huo1,2, Tingting Zhu1, Le Wang1, Yi Wang1,2, Patrick E. McGuire1, Shuyang Liu1, Hai Long1, Ramesh K. Ramasamy1, Juan C. Rodriguez1, Sonny L. Van1, Luxia Yuan1, Zhenzhong Wang1,8, Zhiqiang Xia1, Lichan Xiao1, Olin D. Anderson2, Shuhong Ouyang2,8, Yong Liang2,8, Aleksey V. Zimin3, Geo Pertea3, Peng Qi4,5, Jeffrey L. Bennetzen6, Xiongtao Dai9, Matthew W. Dawson9, Hans-Georg Müller9, Karl Kugler7, Lorena Rivarola-Duarte7, Manuel Spannagl7, Klaus F. X. Mayer7,10, Fu-Hao Lu11, Michael W. Bevan11, Philippe Leroy12, Pingchuan Li13, Frank M. You13, Qixin Sun8, Zhiyong Liu8, Eric Lyons14, Thomas Wicker15, Steven L. Salzberg3,16, Katrien M. Devos4,5 & Jan Dvořák1 Aegilops tauschii is the diploid progenitor of the D genome of We conclude therefore that the size of the Ae. tauschii genome is about hexaploid wheat1 (Triticum aestivum, genomes AABBDD) and 4.3 Gb. an important genetic resource for wheat2–4. The large size and To assess the accuracy of our assembly, sequences of 195 inde- highly repetitive nature of the Ae. tauschii genome has until now pendently sequenced and assembled AL8/78 BAC clones8, which precluded the development of a reference-quality genome sequence5. contained 25,540,177 bp in 2,405 unordered contigs, were aligned to Here we use an array of advanced technologies, including ordered- Aet v3.0. Five contigs failed to align and six extended partly into gaps, clone genome sequencing, whole-genome shotgun sequencing, accounting for 0.25% of the total length of the contigs. Seven BAC and BioNano optical genome mapping, to generate a reference- contigs, although aligned, contained internal structural differences. quality genome sequence for Ae. tauschii ssp. strangulata accession The remaining contigs aligned end-to-end with an average identity of AL8/78, which is closely related to the wheat D genome. We show 99.75%. Considering the likely possibility that errors existed in both that compared to other sequenced plant genomes, including a assemblies, this validation indicated that our assembly is remarkably much larger conifer genome, the Ae. tauschii genome contains accurate. unprecedented amounts of very similar repeated sequences. Our More work is nevertheless needed to order super-scaffolds in the genome comparisons reveal that the Ae. tauschii genome has a pericentromeric regions of the pseudomolecules (Extended Data greater number of dispersed duplicated genes than other sequenced Fig. 2a). The ends of the pseudomolecules also need attention. Arrays genomes and its chromosomes have been structurally evolving of the rice (Oryza sativa) telomeric repeat9 (TTTAGGG) were detected an order of magnitude faster than those of other grass genomes. in all pseudo molecules and in three of the unassigned scaffolds, but The decay of colinearity with other grass genomes correlates with in only four pseudomolecules was such an array terminally located recombination rates along chromosomes. We propose that the vast (Extended Data Fig. 2a) and presumably marked a telomere. amounts of very similar repeated sequences cause frequent errors The wheat and Aegilops genomes have seven chromosomes, which in recombination and lead to gene duplications and structural evolved by dysploid reductions from twelve ancestral chromosomes10. chromosome changes that drive fast genome evolution. The dominant form of dysploid reduction in the grass family is the The Ae. tauschii AL8/78 genome sequence was assembled in nested chromosome insertion (NCI), in which a chromosome is five steps (Extended Data Fig. 1a). The core was assembly Aet v1.1 inserted, usually by its termini, into the centromere-adjacent region (Extended Data Fig. 1b) based on sequences of 42,822 bacterial of another chromosome11. Ae. tauschii chromosomes 1D, 2D, 4D, artificial chromosome (BAC) clones. This assembly was merged and 7D originated by NCIs (Extended Data Fig. 2b). Chromosome with a whole-genome shotgun (WGS) assembly (Aet WGS 1.0) and 5D probably originated by an end-to-end fusion of the short arms of WGS Pacific Biosciences mega-reads6 to extend scaffolds and close ancestral chromosomes corresponding to rice chromosomes Os9 and gaps, thereby producing assembly Aet v2.0 (Extended Data Fig. 1b, c). Os12, followed by a reciprocal translocation involving the Os9 section Misassembled scaffolds were detected with the aid of an AL8/78 optical of 5DL and the Os3 section of 4DS (Extended Data Fig. 2b–d). BioNano genome (BNG) map and resolved, producing assembly Aet A nearly perfect array of telomeric repeats 141-bp long was located v3.0 (Extended Data Fig. 1d, e). Two additional BNG maps were on Ae. tauschii chromosome 2D at 33,306,010 to 33,306,151 bp, which constructed and, along with the genetic and physical maps7, used in is within a 2D NCI site (31,408,197 to 34,183,608 bp, Supplementary super-scaffolding and building pseudomolecules for the final assembly, Data 1), indicating that telomeric repeats may be directly involved in Aet v4.0 (Extended Data Fig. 1d, f, g). the NCI process. An NCI may be a special case of telomere-driven The combined length of the pseudomolecules was 4,025,304,143 bp, ectopic recombination. Of the 23 ancestral chromosome termini that and they contained 95.2% of the sequence (Extended Data Fig. 2a). we could study by investigating the colinearity of Ae. tauschii pseudo- About 200 Mb of the super-scaffolds remained unassigned and 76 Mb molecules with those of rice, 14 (61%) have been involved in one or of the AL8/78 BNG contigs were devoid of aligned super-scaffolds. more inversions or end-to-end fusions (Extended Data Fig. 2b). 1Department of Plant Sciences, University of California, Davis, California, USA. 2Crop Improvement & Genetics Research, USDA-ARS, Albany, California USA. 3Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA. 4Institute of Plant Breeding, Genetics and Genomics, Department of Crop & Soil Sciences, University of Georgia, Athens, Georgia, USA. 5Department of Plant Biology, University of Georgia, Athens, Georgia, USA. 6Department of Genetics, University of Georgia, Athens, Georgia, USA. 7Plant Genome and Systems Biology, Helmholtz Zentrum München, Neuherberg, Germany. 8China Agricultural University, Beijing, China. 9Department of Statistics, University of California, Davis, California, USA. 10School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany. 11Department Cell and Developmental Biology, John Innes Centre, Norwich Research Park, Norwich, UK. 12INRA, UBP, UMR 1095, GDEC, Clermont-Ferrand, France. 13Agriculture & Agri-Food Canada, Morden, Winnipeg, Canada. 14CyVerse, University of Arizona, Tucson, Arizona, USA. 15Department of Plant and Microbial Biology, University of Zürich, Zürich, Switzerland. 16Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, Maryland, USA. *These authors contributed equally to this work. 498 | NATURE | VOL 551 | 23 noveMBer 2017 © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. LETTER RESEARCH a Cereba b 1D Exon 2.5 Copia 0 5 5 5 2.0 Gypsy y Unclassied LTR × 10 1.5 Other class I TE 2 4 × 10 6 × 10 CACTA 1.0 Other class II TE intensit Normalized 0.5 Cereba 2D Exon 0 Copia Gypsy 00.5 1.01.5 2.02.5 Unclassied LTR Other class I TE Age (Myr) CACTA c 100 Bd Other class II TE Si Hs Cereba 80 Os 3D Exon Pt Copia Sb Gypsy 60 Aet Unclassied LTR Other class I TE CACTA 40 Other class II TE centage of unique Cereba -mers in genome 20 4D Exon k Copia Per Gypsy 0 Unclassi ed LTR 11 32 52 72 92 112 Other class I TE CACTA k-mer size (bp) Other class II TE d Distal families Cereba 5D Exon 2.5 Copia y Gypsy 2.0 Unclassi ed LTR Other class I TE CACTA 1.5 Other class II TE Cereba 1.0 6D Exon Copia obability densit Gypsy 0.5 Unclassi ed LTR Pr Other class I TE 0 CACTA Other class II TE 00.5 1.0 1.5 2.02.5 Cereba Age (Myr) 7D Exon Copia Proximal families Gypsy e Unclassi ed LTR 2.5 Other class I TE y CACTA Other class II TE 2.0 1.5 0 100 200 300 400 500 600 700 Mb 1.0 obability densit 0.5 Pr 0 00.5 1.01.5 2.02.5 Age (Myr) Figure 1 | Aegilops tauschii TEs. a, Heat maps of densities along the Ae. tauschii genome during the past 3 Myr. c, The k-mer uniqueness pseudomolecules of centromeric LTR-RT cereba, exons, Copia and Gypsy ratio for Ae. tauschii (Aet), compared to P. tae d a (Pt), O. sativa (Os), LTR-RT super-families, unclassified LTR-RTs, other types of class I (RNA) B. distachyon (Bd), Setaria italica (Si), S. bicolor (Sb), and Homo sapiens TEs, CACTA (DNA) TEs, and other types of class II (DNA) transposons. (Hs). d, e, Ages of TEs in the 11 distally located families (d) and the Red arrowheads or brackets indicate the positions of the centromeres. 11 proximally located families (e) from the 22 families in b. The age Green arrowheads indicate the sites of NCIs (1D, 2D, 4D, and 7D) and difference between the two groups is statistically significant (P = 0.003, chromosome (telomere) fusion (5D). b, Normalized insertion (intensity) two-sided t-test, n = 22). rates of the 22 most abundant Copia (red) and Gypsy (blue) families in the Transposable elements (TEs) represented 84.4% of the genome showed that LTR-RT families were subjected to sequential bursts sequence. By far the most abundant (65.9% of the sequence) were the of amplification followed by silencing over the past 3 million years long terminal repeat retrotransposons (LTR-RTs) (Extended Data (Myr) (Fig.