Telomere-To-Telomere Assembly of a Complete Human X Chromosome
https://doi.org/10.1038/s41586-020-2547-7

Accelerated Article Preview

Received: 30 July 2019
Accepted: 29 May 2020
Accelerated Article Preview Published online 14 July 2020

Karen H. Miga, Sergey Koren, Arang Rhie, Mitchell R. Vollger, Ariel Gershman, Andrey Bzikadze, Shelise Brooks, Edmund Howe, David Porubsky, Glennis A. Logsdon, Valerie A. Schneider, Tamara Potapova, Jonathan Wood, William Chow, Joel Armstrong, Jeanne Fredrickson, Evgenia Pak, Kristof Tigyi, Milinn Kremitzki, Christopher Markovic, Valerie Maduro, Amalia Dutra, Gerard G. Bouffard, Alexander M. Chang, Nancy F. Hansen, Amy B. Wilfert, Françoise Thibaud-Nissen, Anthony D. Schmitt, Jon-Matthew Belton, Siddarth Selvaraj, Megan Y. Dennis, Daniela C. Soto, Ruta Sahasrabudhe, Gulhan Kaya, Josh Quick, Nicholas J. Loman, Nadine Holmes, Matthew Loose, Urvashi Surti, Rosa Ana Risques, Tina A. Graves Lindsay, Robert Fulton, Ira Hall, Benedict Paten, Kerstin Howe, Winston Timp, Alice Young, James C. Mullikin, Pavel A. Pevzner, Jennifer L. Gerton, Beth A. Sullivan, Evan E. Eichler & Adam M. Phillippy

Cite this article as: Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature https://doi.org/10.1038/s41586-020-2547-7 (2020).

This is a PDF file of a peer-reviewed paper that has been accepted for publication. Although unedited, the content has been subjected to preliminary formatting. Nature is providing this early version of the typeset paper as a service to our authors and readers. The text and figures will undergo copyediting and a proof review before the paper is published in its final form.
Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers apply.

Karen H. Miga1,24 ✉, Sergey Koren2,24, Arang Rhie2, Mitchell R. Vollger3, Ariel Gershman4, Andrey Bzikadze5, Shelise Brooks6, Edmund Howe7, David Porubsky3, Glennis A. Logsdon3, Valerie A. Schneider8, Tamara Potapova7, Jonathan Wood9, William Chow9, Joel Armstrong1, Jeanne Fredrickson10, Evgenia Pak11, Kristof Tigyi1, Milinn Kremitzki12, Christopher Markovic12, Valerie Maduro13, Amalia Dutra11, Gerard G. Bouffard6, Alexander M. Chang2, Nancy F. Hansen14, Amy B. Wilfert3, Françoise Thibaud-Nissen8, Anthony D. Schmitt15, Jon-Matthew Belton15, Siddarth Selvaraj15, Megan Y. Dennis16, Daniela C. Soto16, Ruta Sahasrabudhe17, Gulhan Kaya16, Josh Quick18, Nicholas J. Loman18, Nadine Holmes19, Matthew Loose19, Urvashi Surti20, Rosa Ana Risques10, Tina A. Graves Lindsay12, Robert Fulton12, Ira Hall12, Benedict Paten1, Kerstin Howe9, Winston Timp4, Alice Young6, James C. Mullikin6, Pavel A. Pevzner21, Jennifer L. Gerton7, Beth A. Sullivan22, Evan E. Eichler3,23 & Adam M. Phillippy2 ✉

After two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no one chromosome has been finished end to end, and hundreds of unresolved gaps persist [1,2]. Here we present a de novo human genome assembly that surpasses the continuity of GRCh38 [2], along with the first gapless, telomere-to-telomere assembly of a human chromosome.
This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome [3], we reconstructed the ~3.1 megabase centromeric satellite DNA array and closed all 29 remaining gaps in the current reference, including new sequence from the human pseudoautosomal regions and cancer-testis ampliconic gene families (CT-X and GAGE). These novel sequences will be integrated into future human reference genome releases. Additionally, a complete chromosome X, combined with the ultra-long nanopore data, allowed us to map methylation patterns across complex tandem repeats and satellite arrays for the first time. Our results demonstrate that finishing the entire human genome is now within reach and the data presented here will enable ongoing efforts to complete the remaining human chromosomes.

1UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA. 2Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. 3Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA. 4Department of Molecular Biology & Genetics, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA. 5Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, CA, USA. 6NIH Intramural Sequencing Center, National Human Genome Research Institute, National Institutes of Health, Rockville, MD, USA. 7Stowers Institute for Medical Research, Kansas City, MO, USA. 8National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA. 9Wellcome Sanger Institute, Cambridge, UK. 10University of Washington, Department of Pathology, Seattle, WA, USA. 11Cytogenetic and Microscopy Core, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. 12McDonnell Genome Institute at Washington University, St. Louis, MO, USA. 13Undiagnosed Diseases Program, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. 14Comparative Genomics Analysis Unit, Cancer Genetics and Comparative Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. 15Arima Genomics, San Diego, CA, USA. 16Department of Biochemistry and Molecular Medicine, Genome Center, MIND Institute, University of California, Davis, CA, USA. 17DNA Technologies Core, Genome Center, University of California, Davis, CA, USA. 18Institute of Microbiology and Infection, University of Birmingham, Birmingham, UK. 19DeepSeq, School of Life Sciences, University of Nottingham, Nottingham, UK. 20Department of Pathology, University of Pittsburgh, Pittsburgh, PA, USA. 21Department of Computer Science and Engineering, University of California, San Diego, CA, USA. 22Department of Molecular Genetics and Microbiology, Division of Human Genetics, Duke University Medical Center, Durham, NC, USA. 23Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA. 24These authors contributed equally: Karen H. Miga, Sergey Koren. ✉e-mail: [email protected]; adam. [email protected]

Complete, telomere-to-telomere reference assemblies are necessary to ensure that all genomic variants are discovered and studied. Currently, unresolved regions of the human genome are defined by multi-megabase satellite arrays in the pericentromeric regions and the rDNA arrays on acrocentric short arms, as well as regions enriched in segmental duplications that are greater than hundreds of kilobases in length and greater than 98% identical between paralogs. Due to their absence from the reference, these repeat-rich sequences are often excluded from genetics and genomics studies, limiting the scope of association and functional analyses [4,5]. Unresolved repeat sequences also result in unintended consequences such as paralogous sequence variants incorrectly called as allelic variants [6] and the contamination of bacterial gene databases [7]. Completion of the entire human genome is expected to contribute to our understanding of chromosome function [8], human disease [9], and genomic variation, which will improve technologies in biomedicine that use short-read mapping to a reference genome (e.g. RNA-seq [10], ChIP-seq [11], ATAC-seq [12]).

The fundamental challenge of reconstructing a genome from many comparatively short sequencing reads, a process known as genome assembly, is distinguishing the repeated sequences from one another [13]. Resolving such repeats relies on sequencing reads that are long enough to span the entire repeat or accurate enough to distinguish each repeat copy on the basis of unique variants [14]. The difficulty of the assembly problem and limits of past technologies are highlighted by the fact that the human genome remains unfinished 20 years after its initial release in 2001 [15]. The first human reference genome released by the ...

... after Nanopore polishing and 99.99% after PacBio polishing. Illumina data were only used to correct small insertion and deletion errors in uniquely-mappable regions of the genome, which had a marginal effect on average accuracy but reduced the number of frame-shifted genes. Putative mis-assemblies were identified via analysis of the Illumina linked-read barcodes (10x Genomics) and optical mapping (Bionano Genomics) data not used in the initial assembly. The initial contigs were broken at regions of low mapping coverage and the corrected contigs were then ordered and oriented relative to one another using the optical map. Over 90% of six chromosomes are represented in two contigs and ten are represented by two scaffolds (Fig. 1a).

The final assembly consists of 2.94 Gbp in 448 contigs with a contig NG50 of 70 Mbp. A total of 98 scaffolds (173 contigs) were unambiguously assigned to a