https://doi.org/10.1038/s41586-020-2547-7

Accelerated Article Preview

Telomere-to-telomere assembly of a complete human X chromosome

Received: 30 July 2019
Accepted: 29 May 2020
Accelerated Article Preview published online 14 July 2020

Karen H. Miga, Sergey Koren, Arang Rhie, Mitchell R. Vollger, Ariel Gershman, Andrey Bzikadze, Shelise Brooks, Edmund Howe, David Porubsky, Glennis A. Logsdon, Valerie A. Schneider, Tamara Potapova, Jonathan Wood, William Chow, Joel Armstrong, Jeanne Fredrickson, Evgenia Pak, Kristof Tigyi, Milinn Kremitzki, Christopher Markovic, Valerie Maduro, Amalia Dutra, Gerard G. Bouffard, Alexander M. Chang, Nancy F. Hansen, Amy B. Wilfert, Françoise Thibaud-Nissen, Anthony D. Schmitt, Jon-Matthew Belton, Siddarth Selvaraj, Megan Y. Dennis, Daniela C. Soto, Ruta Sahasrabudhe, Gulhan Kaya, Josh Quick, Nicholas J. Loman, Nadine Holmes, Matthew Loose, Urvashi Surti, Rosa Ana Risques, Tina A. Graves Lindsay, Robert Fulton, Ira Hall, Benedict Paten, Kerstin Howe, Winston Timp, Alice Young, James C. Mullikin, Pavel A. Pevzner, Jennifer L. Gerton, Beth A. Sullivan, Evan E. Eichler & Adam M. Phillippy

Cite this article as: Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature https://doi.org/10.1038/s41586-020-2547-7 (2020).

This is a PDF file of a peer-reviewed paper that has been accepted for publication. Although unedited, the content has been subjected to preliminary formatting. Nature is providing this early version of the typeset paper as a service to our authors and readers. The text and figures will undergo copyediting and a proof review before the paper is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers apply.

Nature | www.nature.com

Article

Telomere-to-telomere assembly of a complete human X chromosome

https://doi.org/10.1038/s41586-020-2547-7
Received: 30 July 2019
Accepted: 29 May 2020
Published online: 14 July 2020

Karen H. Miga1,24 ✉, Sergey Koren2,24, Arang Rhie2, Mitchell R. Vollger3, Ariel Gershman4, Andrey Bzikadze5, Shelise Brooks6, Edmund Howe7, David Porubsky3, Glennis A. Logsdon3, Valerie A. Schneider8, Tamara Potapova7, Jonathan Wood9, William Chow9, Joel Armstrong1, Jeanne Fredrickson10, Evgenia Pak11, Kristof Tigyi1, Milinn Kremitzki12, Christopher Markovic12, Valerie Maduro13, Amalia Dutra11, Gerard G. Bouffard6, Alexander M. Chang2, Nancy F. Hansen14, Amy B. Wilfert3, Françoise Thibaud-Nissen8, Anthony D. Schmitt15, Jon-Matthew Belton15, Siddarth Selvaraj15, Megan Y. Dennis16, Daniela C. Soto16, Ruta Sahasrabudhe17, Gulhan Kaya16, Josh Quick18, Nicholas J. Loman18, Nadine Holmes19, Matthew Loose19, Urvashi Surti20, Rosa Ana Risques10, Tina A. Graves Lindsay12, Robert Fulton12, Ira Hall12, Benedict Paten1, Kerstin Howe9, Winston Timp4, Alice Young6, James C. Mullikin6, Pavel A. Pevzner21, Jennifer L. Gerton7, Beth A. Sullivan22, Evan E. Eichler3,23 & Adam M. Phillippy2 ✉

After two decades of improvements, the current human reference (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no one chromosome has been finished end to end, and hundreds of unresolved gaps persist1,2. Here we present a de novo assembly that surpasses the continuity of GRCh38 (ref. 2), along with the first gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome3, we reconstructed the ~3.1 megabase centromeric satellite DNA array and closed all 29 remaining gaps in the current reference, including new sequence from the human pseudoautosomal regions and cancer-testis ampliconic families (CT-X and GAGE). These novel sequences will be integrated into future human reference genome releases. Additionally, a complete chromosome X, combined with the ultra-long nanopore data, allowed us to map methylation patterns across complex tandem repeats and satellite arrays for the first time. Our results demonstrate that finishing the entire human genome is now within reach and the data presented here will enable ongoing efforts to complete the remaining human chromosomes.

Complete, telomere-to-telomere reference assemblies are necessary to ensure that all genomic variants are discovered and studied. Currently, unresolved regions of the human genome are defined by multi-megabase satellite arrays in the pericentromeric regions and the rDNA arrays on acrocentric short arms, as well as regions enriched in segmental duplications that are greater than hundreds of kilobases in length and greater than 98% identical between paralogs. Due to their absence from the reference, these repeat-rich sequences are often excluded from genetics and genomics studies, limiting the scope of association and functional analyses4,5. Unresolved repeat sequences also result in unintended consequences such as paralogous sequence variants incorrectly called as allelic variants6 and the contamination of

1UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA. 2Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. 3Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA. 4Department of Molecular Biology & Genetics, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA. 5Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, CA, USA. 6NIH Intramural Sequencing Center, National Human Genome Research Institute, National Institutes of Health, Rockville, MD, USA. 7Stowers Institute for Medical Research, Kansas City, MO, USA. 8National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA. 9Wellcome Sanger Institute, Cambridge, UK. 10University of Washington, Department of Pathology, Seattle, WA, USA. 11Cytogenetic and Microscopy Core, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. 12McDonnell Genome Institute at Washington University, St. Louis, MO, USA. 13Undiagnosed Diseases Program, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. 14Comparative Genomics Analysis Unit, Cancer Genetics and Comparative Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. 15Arima Genomics, San Diego, CA, USA. 16Department of Biochemistry and Molecular Medicine, Genome Center, MIND Institute, University of California, Davis, CA, USA. 17DNA Technologies Core, Genome Center, University of California, Davis, CA, USA. 18Institute of Microbiology and Infection, University of Birmingham, Birmingham, UK. 19DeepSeq, School of Life Sciences, University of Nottingham, Nottingham, UK. 20Department of Pathology, University of Pittsburgh, Pittsburgh, PA, USA. 21Department of Computer Science and Engineering, University of California, San Diego, CA, USA. 22Department of Molecular Genetics and Microbiology, Division of Human Genetics, Duke University Medical Center, Durham, NC, USA. 23Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA. 24These authors contributed equally: Karen H. Miga, Sergey Koren. ✉e-mail: [email protected]; adam. [email protected]

bacterial gene databases7. Completion of the entire human genome is expected to contribute to our understanding of chromosome function8, human disease9, and genomic variation, which will improve technologies in biomedicine that use short-read mapping to a reference genome (e.g. RNA-seq10, ChIP-seq11, ATAC-seq12).

The fundamental challenge of reconstructing a genome from many comparatively short sequencing reads (a process known as genome assembly) is distinguishing the repeated sequences from one another13. Resolving such repeats relies on sequencing reads that are long enough to span the entire repeat or accurate enough to distinguish each repeat copy on the basis of unique variants14. The difficulty of the assembly problem and limits of past technologies are highlighted by the fact that the human genome remains unfinished 20 years after its initial release in 2001 (ref. 15). The first human reference genome released by the U.S. National Center for Biotechnology Information (NCBI Build 28) was highly fragmented, with half of the genome contained in continuous sequences (contigs) of 500 kbp or greater (NG50). Dedicated finishing efforts16 and continued stewardship by the Genome Reference Consortium (GRC)2 greatly increased the continuity of the reference to an NG50 contig length of 56 Mbp in the most recent release, GRCh38, but the most repetitive regions of the genome remain unsolved and no chromosome is completely represented telomere to telomere. A recent de novo assembly of ultra-long (>100 kb) nanopore reads showed promising assembly continuity in the most difficult regions1, but this proof-of-concept project sequenced the genome to only 5× depth of coverage and failed to assemble the largest human genomic repeats. Previous modeling based on the size and distribution of large repeats in the human genome predicted that an assembly of 30× ultra-long reads would approach the continuity of the human reference1. Therefore, we hypothesized that high-coverage ultra-long-read nanopore sequencing would enable the first complete assembly of human chromosomes.

To circumvent the complexity of assembling both haplotypes of a diploid genome, we selected the effectively haploid CHM13hTERT cell line for sequencing (abbreviated CHM13)17. This cell line was derived from a complete hydatidiform mole with a 46,XX karyotype. The genomes of such uterine moles originate from a single sperm which has undergone post-meiotic chromosomal duplication and are, therefore, uniformly homozygous for one set of alleles. CHM13 has previously been used to patch gaps in the human reference2, benchmark genome assemblers and diploid variant callers18, and investigate human segmental duplications19. Karyotyping of the CHM13 line confirmed a stable 46,XX karyotype, with no observable chromosomal anomalies (Extended Data Fig. 1, SNote 1). Maximum likelihood admixture analysis20 confidently assigns the majority of haplotypes to a European origin with the potential of some Asian or Amerindian admixture (SNote 2, Extended Data Fig. 2).

Highly continuous whole-genome assembly

High molecular weight DNA from CHM13 cells was extracted and prepared for nanopore sequencing using a previously described ultra-long read protocol1. In total, we sequenced 98 MinION flow cells for a total of 155 Gb (50× coverage, 1.6 Gb/flow cell, SNote 3). Half of all sequenced bases were contained in reads of 70 kb or longer (78 Gb, 25× genome coverage) and the longest validated read was 1.04 Mb. Once we had collected sufficient sequencing coverage for de novo assembly, we combined 39× of the ultra-long reads with 70× coverage of previously generated PacBio data and assembled the CHM13 genome using Canu21. Canu selected the longest 30× ultra-long and 7× PacBio reads for correction and assembly. This initial assembly totaled 2.90 Gbp with half of the genome contained in continuous sequences (contigs) of length 75 Mbp or greater (NG50), which exceeds the continuity of the GRCh38 reference genome (75 vs. 56 Mbp NG50). The assembly was then iteratively polished by each technology in order of longest to shortest read lengths: Nanopore, PacBio, linked-read Illumina. Consensus accuracy improved from 99.46% for the initial assembly to 99.67% after Nanopore polishing and 99.99% after PacBio polishing. Illumina data was only used to correct small insertion and deletion errors in uniquely-mappable regions of the genome, which had a marginal effect on average accuracy but reduced the number of frame-shifted genes. Putative mis-assemblies were identified via analysis of the Illumina linked-read barcodes (10x Genomics) and optical mapping (Bionano Genomics) data not used in the initial assembly. The initial contigs were broken at regions of low mapping coverage and the corrected contigs were then ordered and oriented relative to one another using the optical map. Over 90% of six chromosomes are represented in two contigs and ten are represented by 2 scaffolds (Fig. 1a).

The final assembly consists of 2.94 Gbp in 448 contigs with a contig NG50 of 70 Mbp. A total of 98 scaffolds (173 contigs) were unambiguously assigned to a reference chromosome, representing 98% of the assembled bases. We estimated the median consensus accuracy of this whole-genome assembly to be at least 99.99% based on both previously finished BAC sequences22 and mapped Illumina linked reads (SNote 4). While similar to the GRCh38 ungapped length (2.95 Gbp), our assembly size is shorter than the estimated human genome size of 3.2 Gbp. We estimate approximately 170 Mbp of collapsed bases using SDA19. Compared to other recent assemblies, we resolved a greater fraction of the 341 CHM13 BAC sequences previously isolated and finished from segmentally duplicated and other difficult-to-assemble regions of the genome19 (Table 1, SNote 4). Comparative annotation of our whole-genome assembly also shows a higher agreement of mapped transcripts than prior assemblies and only a slightly elevated rate of potential frameshifts compared to GRCh38 (ref. 23). Of the 19,618 protein-coding genes annotated in the CHM13 de novo assembly, just 170 (0.86%) contain a predicted frameshift, or if measured by transcripts, only 334 of 83,332 transcripts (0.40%) contain a predicted frameshift (STbl 1). When used as a reference sequence for calling structural variants (SVs) in other genomes, CHM13 reports an even balance of insertion and deletion calls (Extended Data Fig. 3, SNote 5), as expected, whereas GRCh38 demonstrates a deletion bias, as previously reported24. Across the same long-read assemblies, using GRCh38 as the reference yields twice as many inversion calls as using CHM13 (mean 26 vs. 13 inversions per genome), suggesting that some mis-oriented sequences remain in the current human reference (SNote 5). Of these inversions, 19 are specific to GRCh38 and not found in 5 recently assembled long-read human genomes (STbl 5). We identified telomeric sequences within the assembly and the reads (SNote 4, Extended Data Fig. 4), which were highly concordant in telomere size, and our assembly includes 41 of 46 expected telomeres at contig ends. Thus, in terms of continuity, completeness, and correctness, our CHM13 assembly exceeds all prior human de novo assemblies, including the current human reference genome by some quality metrics (STbl 2).

A finished human X chromosome

Using this whole-genome assembly as a basis, we selected the X chromosome for manual finishing and validation due to its high continuity in the initial assembly, distinctive and well-characterized centromeric alpha satellite array3,8,25, unique behavior during development26, and disproportionate involvement in Mendelian disease3. The de novo assembly of the X chromosome was broken in three places, at the centromere and two >100 kbp near-identical segmental duplications (Fig. 1b). The two segmental duplications breaking the assembly were manually resolved by identifying ultra-long reads that completely spanned the repeats and were uniquely anchored on either side, allowing for a confident placement in the assembly. Assembly quality improvements of these difficult regions were evaluated by mapping an orthogonal set of PacBio high-fidelity (HiFi) long reads generated from CHM13 (ref. 22) and assessing read-depth over informative single nucleotide variant differences (Methods). In addition, experimental validation via droplet digital PCR (ddPCR) confirmed the now complete assembly correctly represents

the tandem repeats of the CHM13 genome, including seven CT47 genes (7.02 ± 0.34), six CT45 genes (6.11 ± 0.38), 19 complete and two partial GAGE genes (19.9 ± 0.745), 55 DXZ4 repeats (55.4 ± 2.09), and a 3.1 Mbp centromeric DXZ1 array (1408 ± 40.69 copies of the 2057 bp repeat) (SNote 6).

Previous high-resolution studies of the haploid centromeric satellite array on the X chromosome (DXZ1) have informed our current genomic models of human centromere organization8. The X centromere, like all normal human centromeres, is defined at the sequence level by alpha satellite DNA, an AT-rich (~171 bp) tandem repeat, or 'monomer'27. The canonical repeat of the DXZ1 array is defined by twelve divergent monomers that are ordered to form a larger ~2 kb repeating unit, known as a 'higher-order repeat' (HOR)28,29. The HORs are tandemly arranged into a large, multi-megabase sized satellite array (i.e. 2.2–3.7 Mb; mean of 3010 kb; S.D. = 429; n = 49)25 with limited nucleotide differences between repeat copies8,30,31. These previous assessments were used to guide our evaluation of the DXZ1 assembly, and offered established experimental methods to evaluate DXZ1 array structure25,32 (Extended Data Fig. 5a).

To assemble the X centromere, we constructed a catalog of structural and single-nucleotide variants within the ~2 kbp canonical DXZ1 repeat unit28,33 and used these variants as signposts8 to uniquely tile ultra-long reads across the entire centromeric satellite array (DXZ1) (Extended Data Fig. 5b–e), as was previously done for the Y centromere34. The DXZ1 array was estimated by pulsed-field gel electrophoresis (PFGE) Southern blotting to be in the range of ~2.8–3.1 Mbp (Fig. 2b, Extended Data Fig. 6), wherein the resulting restriction profiles were in agreement with the structure of the predicted array assembly (Fig. 2a, b). Copy number estimates of the DXZ1 repeat by ddPCR were benchmarked against a panel of previously sized arrays by PFGE-Southern and provided further support for a ~2.8 Mb array (1408 ± 81.38 copies of the canonical 2057 bp repeat) (Fig. 2c, STbl 3, SNote 7). Further, direct comparisons of DXZ1 structural variant frequency with PacBio HiFi data were highly concordant22 (Fig. 2d, Extended Data Fig. 5c).

Current long-read assemblies require rigorous consensus polishing to achieve maximum base call accuracy35,36. Given the placement of each read in the assembly, these polishing tools statistically model the underlying signal data to make accurate predictions for each sequenced base. Key to this process is the correct placement of each read that will contribute to the polishing. Due to ambiguous read mappings, our initial polishing attempts decreased the assembly quality within the largest X chromosome repeats (Extended Data Fig. 7a,b). To overcome this, we analyzed Illumina sequencing data to catalog short (21 bp), unique (single-copy) sequences present on the CHM13 X chromosome (Extended Data Fig. 8a). Even within the largest repeat arrays, such as DXZ1, there was enough variation between repeat copies to induce unique 21-mer markers at semi-regular intervals (Fig. 2e,f, Extended Data Fig. 8c). These markers were used to inform the correct placement of long X-chromosome reads within the assembly (Methods). Two rounds of iterative polishing were performed for each technology, first with Oxford Nanopore, then PacBio, and finally Illumina linked reads37, with consensus accuracy observed to increase after each round. The Illumina data was too short to confidently anchor using unique markers and was only used to polish the unique regions where mappings were unambiguous. This careful polishing process proved critical for accurately finishing X chromosome repeats that exceeded both Nanopore and PacBio read lengths.

Our manually finished X chromosome assembly is complete, gapless, and estimated to be 99.991% accurate based on X-specific BACs or 99.995% accurate based on the mapped Illumina data with unambiguous support for 99.9% of the assembly bases (SNote 4), which meets the original Bermuda Standards for finished genomic sequence38. Accuracy is predicted to be slightly lower (median identity 99.3%) across the largest repeats, such as the DXZ1 satellite array, but this is difficult to measure due to a lack of BAC clones from these regions. Mapped long read and optical map data show uniform coverage across the completed X chromosome and no evidence of structural errors in regions that could be mapped (Fig. 2e, SNote 4, Extended Data Fig. 8b,c), while Strand-seq data confirm the absence of any inversion errors39,40 (Extended Data Fig. 8d,e). Single nucleotide variant calling via long read mapping revealed lower initial assembly quality in the large, tandemly repeated GAGE and CT47 gene families, but these issues were resolved by polishing and validated via ultra-long read and optical mapping (Fig. 1c,d, STbl 4, Extended Data Fig. 7c–j). Mapped long-read coverage across the DXZ1 array shows uniform depth of coverage and high accuracy as measured by TandemQUAST41 (Fig. 2e,f, Extended Data Figs. 7j, 8c). We identified all HiFi reads matching the DXZ1 repeat. All reads, except one with a large, likely erroneous homopolymer, were explained by our reconstruction, confirming completeness of the DXZ1 array. Mapped coverage across the entire X chromosome was uniform, with coverage of only a small percentage of bases being more than three standard deviations from the mean (0.44% Nanopore, 0.77% CLR, 2.4% HiFi). Low coverage HiFi regions were enriched for low unique marker density, making them difficult to assign due to their relatively short length (SNote 4). Further, variant calling identified no high-frequency variants from the HiFi or CLR data and only low-complexity variants from the UL data, which likely represent errors in the UL data rather than true assembly error. Our complete telomere-to-telomere version of the X chromosome fully resolved 29 reference gaps3, totaling 1,147,861 bp of previous N-bases.

Chromosome-wide DNA methylation maps

Nanopore sequencing is sensitive to methylated bases as revealed by modulation to the raw electrical signal42. Uniquely anchored ultra-long reads provide a new method to profile patterns of methylation over repetitive regions that are often difficult to detect with short-read sequencing. The X chromosome has many epigenomic features that are unique in the human genome. X-chromosome inactivation (XCI), in which one of the X chromosomes is silenced early in development and remains inactive in somatic tissues, is expected to provide a unique methylation profile chromosome-wide. In agreement with previous studies43, we observe decreased methylation across the majority of the pseudoautosomal regions (PAR1,2) located at both tips of the X chromosome arms (Fig. 3a). The inactive X chromosome also adopts an unusual spatial conformation, and consistent with prior studies44,45, CHM13 Hi-C data support two large superdomains partitioned at the DXZ4 repeat (Extended Data Fig. 9). On closer analysis of the DXZ4 array we found distinct bands of methylation (Fig. 3c), with hypomethylation observed at the distal edge, which is generally concordant with previously described chromatin structure46. Interestingly, we also identified a region of decreased methylation within the DXZ1 centromeric array (~60 kbp, chrX:59,217,708–59,279,205, Fig. 3b). To test if this finding was specific to the X array, or also found at other centromeric satellites, we manually assembled a ~2.02 Mbp centromeric array on chromosome 8 (D8Z2)47,48 and employed the same unique marker mapping strategy to confidently anchor long reads across the array (G.A.L. et al. in preparation). In doing so, we identified another hypomethylated region within the D8Z2 array, similar to our observation on the DXZ1 array (Extended Data Fig. 10), further demonstrating the capability of our ultra-long read mapping strategy to provide base-level chromosome-wide DNA methylation maps. Future studies will be needed to validate this finding for additional chromosomes and samples, and to evaluate the potential importance, if any, of these methylation patterns.

A path for finishing the human genome

This first complete telomere-to-telomere assembly of a human chromosome demonstrates that it may now be possible to finish the entire human genome using available technologies. Although we have focused here on finishing the X chromosome, our whole-genome assembly has

reconstructed several other chromosomes with only a few remaining gaps and can serve as the basis for completing additional chromosomes. However, important challenges remain going forward. Applying these approaches, for example, to diploid samples will require phasing the underlying haplotypes to avoid mixing regions of complex structural variation. Our preliminary analysis of other chromosomes shows that regions of duplication and centromeric satellites larger than that of the X chromosome will require additional methods development49. This is especially true of the acrocentric human chromosomes whose massive satellite and segmental duplications have yet to be resolved at the sequence level. In addition, Figure 1 highlights the centromeric satellite arrays that are expected to be similar in sequence between non-homologous chromosomes. Such arrays will need to be phased both between and within chromosomes.

Finishing the human genome will proceed as these remaining challenges are overcome, beginning with the comparatively easier to assemble chromosomes (e.g. 3, 6, 8, 10, 11, 12, 17, 18, 20), and eventually concluding with the chromosomes containing large blocks of classical human satellites (1, 9, 16) and the acrocentrics (13, 14, 15, 21, 22). In the near term, reference gaps closed in the CHM13 genome will be integrated into GRCh38 using the GRC's existing "patch" infrastructure. Once all CHM13 chromosomes are completed, we plan to provide these to the GRC as the basis for a new, entirely gapless, reference genome release, which would likely be a mosaic of the current reference with CHM13 sequence in the most difficult regions. Efforts to finally complete the GRC human reference genome will help advance the necessary technology towards our ultimate goal of complete, telomere-to-telomere, diploid assemblies for all human genomes.

Online content

Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-020-2547-7.

1. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338 (2018).
2. Schneider, V. A. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849–864 (2017).
3. Ross, M. T. et al. The DNA sequence of the human X chromosome. Nature 434, 325–337 (2005).
4. Mefford, H. C. & Eichler, E. E. Duplication hotspots, rare genomic disorders, and common disease. Curr. Opin. Genet. Dev. 19, 196–204 (2009).
5. Langley, S. A., Miga, K. H., Karpen, G. H. & Langley, C. H. Haplotypes spanning centromeric regions reveal persistence of large blocks of archaic DNA. Elife 8 (2019).
6. Eichler, E. E. Masquerading repeats: paralogous pitfalls of the human genome. Genome Res. 8, 758–762 (1998).
7. Breitwieser, F. P., Pertea, M., Zimin, A. V. & Salzberg, S. L. Human contamination in bacterial genomes has created thousands of spurious proteins. Genome Res. 29, 954–960 (2019).
8. Schueler, M. G., Higgins, A. W., Rudd, M. K., Gustashaw, K. & Willard, H. F. Genomic and genetic definition of a functional human centromere. Science 294, 109–115 (2001).
9. Eichler, E. E. et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nat. Rev. Genet. 11, 446 (2010).
10. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621–628 (2008).
11. Park, P. J. ChIP-seq: advantages and challenges of a maturing technology. Nat. Rev. Genet. 10, 669–680 (2009).
12. Buenrostro, J. D., Wu, B., Chang, H. Y. & Greenleaf, W. J. ATAC-seq: a method for assaying chromatin accessibility genome-wide. Curr. Protoc. Mol. Biol. 109, 21–29 (2015).
13. Staden, R. A strategy of DNA sequencing employing computer programs. Nucleic Acids Res. 6, 2601–2610 (1979).
14. Nagarajan, N. & Pop, M. Sequence assembly demystified. Nat. Rev. Genet. 14, 157–167 (2013).
15. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
16. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004).
17. Steinberg, K. M. et al. Single haplotype assembly of the human genome from a hydatidiform mole. Genome Res. 24, 2066–2076 (2014).
18. Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).
19. Vollger, M. R. et al. Long-read sequence and assembly of segmental duplications. Nat. Methods 16, 88–94 (2019).
20. Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
21. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
22. Vollger, M. R. et al. Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads. Ann. Hum. Genet. 84, 125–140 (2020).
23. Fiddes, I. T. et al. Comparative Annotation Toolkit (CAT)-simultaneous clade and personal genome annotation. Genome Res. 28, 1029–1038 (2018).
24. Chaisson, M. J. P. et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature 517, 608–611 (2015).
25. Mahtani, M. M. & Willard, H. F. Pulsed-field gel analysis of alpha-satellite DNA at the human X chromosome centromere: high-frequency polymorphisms and array size estimate. Genomics 7, 607–613 (1990).
26. Migeon, B. R. & Kennedy, J. F. Evidence for the inactivation of an X chromosome early in the development of the human female. Am. J. Hum. Genet. 27, 233–239 (1975).
27. Manuelidis, L. & Wu, J. C. Homology between human and simian repeated DNA. Nature 276, 92–94 (1978).
28. Willard, H. F., Smith, K. D. & Sutherland, J. Isolation and characterization of a major tandem repeat family from the human X chromosome. Nucleic Acids Res. 11, 2017–2033 (1983).
29. Willard, H. F. & Waye, J. S. Hierarchical order in chromosome-specific human alpha satellite DNA. Trends Genet. 3, 192–198 (1987).
30. Miga, K. H. et al. Centromere reference models for human chromosomes X and Y satellite arrays. Genome Res. 24, 697–707 (2014).
31. Durfy, S. J. & Willard, H. F. Patterns of intra- and interarray sequence variation in alpha satellite from the human X chromosome: evidence for short-range homogenization of tandemly repeated DNA sequences. Genomics 5, 810–821 (1989).
32. Wevrick, R. & Willard, H. F. Long-range organization of tandem arrays of alpha satellite DNA at the centromeres of human chromosomes: high-frequency array-length polymorphism and meiotic stability. Proc. Natl. Acad. Sci. U. S. A. 86, 9394–9398 (1989).
33. Waye, J. S. & Willard, H. F. Nucleotide sequence heterogeneity of alpha satellite repetitive DNA: a survey of alphoid sequences from different human chromosomes. Nucleic Acids Res. 15, 7549–7569 (1987).
34. Jain, M. et al. Linear assembly of a human centromere on the Y chromosome. Nat. Biotechnol. 36, 321–323 (2018).
35. Loman, N. J., Quick, J. & Simpson, J. T. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat. Methods 12, 733–735 (2015).
36. Koren, S., Phillippy, A. M., Simpson, J. T., Loman, N. J. & Loose, M. Reply to 'Errors in long-read assemblies can critically affect protein prediction'. Nat. Biotechnol. 37, 127–128 (2019).
37. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. arXiv [q-bio.GN] (2012).
38. Schmutz, J. et al. Quality assessment of the human genome sequence. Nature 429, 365–368 (2004).
39. Falconer, E. et al. DNA template strand sequencing of single-cells maps genomic rearrangements at high resolution. Nat. Methods 9, 1107–1112 (2012).
40. Sanders, A. D. et al. Characterizing polymorphic inversions in human genomes by single-cell sequencing. Genome Res. 26, 1575–1587 (2016).
41. Mikheenko, A., Bzikadze, A. V., Gurevich, A., Miga, K. H. & Pevzner, P. A. TandemMapper and TandemQUAST: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats. bioRxiv 2019.12.23.887158 (2019) https://doi.org/10.1101/2019.12.23.887158.
42. Rand, A. C. et al. Mapping DNA methylation with high-throughput nanopore sequencing. Nat. Methods 14, 411–413 (2017).
43. Carrel, L., Cottle, A. A., Goglin, K. C. & Willard, H. F. A first-generation X-inactivation profile of the human X chromosome. Proc. Natl. Acad. Sci. U. S. A. 96, 14440–14444 (1999).
44. Giorgetti, L. et al. Structural organization of the inactive X chromosome in the mouse. Nature 535, 575–579 (2016).
45. Darrow, E. M. et al. Deletion of DXZ4 on the human inactive X chromosome alters higher-order genome architecture. Proc. Natl. Acad. Sci. U. S. A. 113, E4504–12 (2016).
46. Chadwick, B. P. DXZ4 chromatin adopts an opposing conformation to that of the surrounding chromosome and acquires a novel inactive X-specific role involving CTCF and antisense transcripts. Genome Res. 18, 1259–1269 (2008).
47. Donlon, T. A., Bruns, G. A., Latt, S. A., Mulholland, J. & Wyman, A. R. A chromosome 8-enriched alphoid repeat. in Cytogenetics and Cell Genetics vol. 46, 607 (Karger, Allschwilerstrasse 10, CH-4009 Basel, Switzerland, 1987).
48. Ge, Y., Wagner, M. J., Siciliano, M. & Wells, D. E. Sequence, higher order repeat structure, and long-range organization of alpha satellite DNA specific to human chromosome 8. Genomics 13, 585–593 (1992).
49. Bzikadze, A. V. & Pevzner, P. A. centroFlye: Assembling Centromeres with Long Error-Prone Reads. bioRxiv 772103 (2019) https://doi.org/10.1101/772103.
50. Kronenberg, Z. N. et al. High-resolution comparative analysis of great ape genomes. Science 360 (2018).

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply 2020

Fig. 1 | CHM13 whole-genome assembly and validation. (a) Gapless contigs are illustrated as blue and orange bars next to the chromosome ideograms (highlighting contig breaks). Several chromosomes are broken only in centromeric regions. Large gaps between contigs (e.g. middle of chr1) indicate sites of large heterochromatic blocks (arrays of human satellite 2 and 3 in yellow) or rDNA arrays with no GRCh38 sequence. Centromeric satellite arrays expected to be similar in sequence between non-homologous chromosomes are indicated: chr1, chr5, chr19 (green), chr4 and chr9 (light blue), chr5 and chr19 (pink), chr13 and chr21 (red), chr14 and chr22 (purple). (b) The X chromosome was selected for manual assembly, and was initially broken at three locations: the centromere (artificially collapsed in the assembly), a large segmental duplication (DMRTC1B, 120 kbp), and a second segmental duplication with a paralog on (134 kbp). Gaps in the GRCh38 reference (black) and known segmental duplications (red, paralogous to Y pink) are annotated. Repeats larger than 100 kb are named with the expected size (kbs) (blue, tandem repeats and red, segmental duplications). (c) Mis-assembly of the GAGE identified by the optical map (top), and corrected version (bottom) showing the final assembly of 19 (9.5 kbp) full length repeat units and two partial repeats. (d) Quality of the GAGE locus before and after polishing using unique (single-copy) markers to place long reads. Dots indicate coverage depth of the primary (black) and secondary (red) alleles recovered from mapped PacBio high-fidelity (HiFi) reads (SNote 4). Because the CHM13 genome is effectively haploid, regions of low coverage or increased secondary frequency indicate low-quality regions or potential repeat collapses. Marker-assisted polishing dramatically improved allele uniformity across the entire GAGE locus.


Fig. 2 | Validated structure of the 3.1 Mbp CHM13 X centromere array. (a) The array, with ~2 kbp repeat units labeled by vertical bands (grey is the canonical unit; colored are structural variants). A single LINE/L1Hs insertion in the array is marked by the arrowhead. Below, a predicted restriction map for BglI, with dashed lines indicating regions outside of the DXZ1 array. A minimum tiling path was reconstructed for illustration purposes and was not the mechanism for initial assembly (Extended Data Fig. 5b). (b) Experimental PFGE Southern blotting for a BglI digest in duplicate (band sizing indicated by triangles, BglI, 2.87 Mb +/− 0.16), that matches the in silico predicted band patterns (a) for the CHM13 array (experimentally repeated six times with similar results). (c) Array size estimates were provided by ddPCR (performed in triplicate, line and error bars show mean and standard deviation) optimized against PFGE Southerns (HAP1, n = 6; T6012, n = 4; LT690, n = 7; CHM13, n = 13). (d) Catalog of 33 DXZ1 structural variants identified relative to the 2057 bp canonical repeat unit (grey), along with the number of instances observed, frequency in the array, number of alpha satellite monomers, and size. (e) Coverage depth of mapped (gray) and uniquely anchored (black) nanopore reads to the DXZ1 array. Marker-assisted polishing (bottom) improves coverage uniformity versus the unpolished (top) assembly. Single-copy, unique markers are shown as vertical green bands, with a decreased but non-zero density across the array. One location of reduced coverage toward the right side of the array marks a possible remaining issue (asterisk). (f) Distributions show the spacing between adjacent unique markers on chromosome X and DXZ1. On average, unique markers are found every 66 bases on chrX, but only every 2.3 kbp in DXZ1, with the longest gap between any two adjacent markers being 42 kbp.
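The marker-spacing statistics quoted in panel (f) (mean distance between adjacent unique markers and the longest marker-free gap) can be illustrated with a minimal Python sketch. It assumes marker start coordinates are already available as a list of integers; the function name `marker_spacing` and the toy coordinates are hypothetical and are not part of the authors' pipeline.

```python
from statistics import mean


def marker_spacing(positions):
    """Return (mean gap, longest gap) between adjacent marker positions."""
    ordered = sorted(set(positions))  # ignore duplicates and input order
    if len(ordered) < 2:
        raise ValueError("need at least two distinct marker positions")
    # Distance from each marker to the next one along the sequence.
    gaps = [right - left for left, right in zip(ordered, ordered[1:])]
    return mean(gaps), max(gaps)


# Toy coordinates invented for illustration; not taken from the assembly.
toy_markers = [100, 160, 230, 2500, 2560]
average_gap, longest_gap = marker_spacing(toy_markers)
print(average_gap, longest_gap)
```

Applied separately to the markers inside and outside a satellite array, a summary of this kind would show the sparser marker density within the repeat.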

Fig. 3 | Chromosome-wide analysis of CpG methylation. Methylation estimates were calculated by smoothing methylation frequency data with a window size of 500 . Coverage depth and high quality methylation calls (|log-likelihood| > 2.5) for PAR1, DXZ1, and DXZ4 are shown as insets. Only reads with a confident unique anchor mapping and the presence of at least one high-quality methylation call were considered. (a) Nanopore coverage and methylation calls for the 1 (PAR1) of chromosome X (1,563–2,600,000). Bottom IGV inset shows a region of hypomethylation within PAR1 (770,545–801,293) with unmethylated bases in blue and methylated bases in red. (b) Methylation in the DXZ1 array, with bottom IGV inset showing a ~93 kbp region of hypomethylation near the centromere of chromosome X (59,213,083–59,306,271). (c) Vertical black dashed lines indicate the beginning and end coordinates of the DXZ4 array. Left IGV inset shows a methylated region of DXZ4 in chromosome X (113,870,751–113,901,499), and right IGV inset shows a transition from a methylated to unmethylated region of DXZ4 (114,015,971–114,077,699).
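The call-filtering rule stated in this caption (keep calls with |log-likelihood| > 2.5, and keep only reads with at least one such call) can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the code of Nanopolish or nanopore_methylation_utilities: the function names, the dictionary layout, and the toy read data are all hypothetical.

```python
LLR_THRESHOLD = 2.5  # cutoff quoted in the caption


def classify_call(llr, threshold=LLR_THRESHOLD):
    """Label one CpG call, or return None when it is not high quality."""
    if llr > threshold:
        return "methylated"
    if llr < -threshold:
        return "unmethylated"
    return None


def filter_reads(calls_by_read, threshold=LLR_THRESHOLD):
    """Keep confident calls per read and drop reads that have none."""
    kept = {}
    for read_id, calls in calls_by_read.items():
        confident = []
        for position, llr in calls:
            label = classify_call(llr, threshold)
            if label is not None:
                confident.append((position, label))
        if confident:
            kept[read_id] = confident
    return kept


# Hypothetical reads with (position, log-likelihood ratio) pairs.
toy_calls = {
    "read_a": [(770600, 4.1), (770650, -0.8), (770700, -3.2)],
    "read_b": [(770620, 1.9), (770680, -2.5)],
}
filtered = filter_reads(toy_calls)
```

In this toy input, `read_b` is dropped because neither of its calls exceeds the threshold in absolute value (a ratio of exactly -2.5 does not pass a strict comparison).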

Table 1 | Assembly statistics for CHM13 and the human reference sorted by continuity

| Primary Technology | Assembly | Size (Gbp) | No. Ctgs | NG50 (Mbp) | %BACs resolved | BACs %idy all | BACs %idy uni |
|---|---|---|---|---|---|---|---|
| 56× Illumina linked reads | Supernova (this paper) | 2.95 | 42,828 | 0.21 | 17.3 | 99.975 | 99.985 |
| 76× PacBio CLR | FALCON(50) | 2.88 | 1,916 | 28.2 | 36.37 | 99.981 | 99.995 |
| 24× PacBio HiFi | Canu(22) | 3.03 | 5,206 | 29.1 | 45.46 | 99.979 | 99.997 |
| Sanger BACs | GRCh38p13(2) | 3.27 | 1,590 | 56.4 | 85.63 | *99.731 | *99.768 |
| 39× Nanopore Ultra-Long | Canu (this paper) | 2.94 | 448 | 70.1 | 82.11 | 99.980 | 99.994 |

* GRCh38 is expected to have a lower identity to BACs derived from CHM13 as it represents a different human genome.
Primary Technology: sequencing technology used for contig assembly. The PacBio CLR assembly was additionally polished using Illumina linked reads. The Nanopore Ultra-Long assembly was polished with the PacBio CLR and Illumina linked reads. GRCh38 is primarily based on Sanger-sequenced BACs, but has been continually curated and patched since the completion of the .
Assembly: assembler used and reference to the published assembly.
Size: sum of bases in the assembly in Gbp including Ns. GRCh38 assembly size includes 110 Mbp of alternative (ALT) sequences.
No. Ctgs: total number of contigs in the assembly, scaffolds were split at three consecutive Ns to obtain contigs.
NG50: half of the 3.09 Gbp human genome size contained in contigs of this length or greater in Mbp. Supernova NG50 statistics were identical between the two reported pseudo-haplotypes.
%BACs resolved: percentage of 341 “challenging” CHM13 BACs found intact in the assembly. BACs unresolved by the best CHM13 assembly either map across multiple contigs or map to a single contig with large structural variation, indicating an error in either the BAC or whole-genome assembly.
BACs %idy all: median alignment accuracy versus all validation BACs.
BACs %idy uni: median alignment accuracy versus the 31 validation BACs that occur outside of segmental duplications (SNote 4).
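The two definitions in the table notes that involve computation, splitting scaffolds into contigs at runs of Ns and NG50 relative to a fixed genome size, can be made concrete with a short Python sketch. The function names are hypothetical, and treating "three consecutive Ns" as "three or more" is an assumption made here; the toy lengths are invented and are not the values behind Table 1.

```python
import re


def split_scaffold(sequence, min_gap=3):
    """Split a scaffold into contigs at runs of at least min_gap Ns.

    Assumption: "three consecutive Ns" is read as "three or more".
    """
    pieces = re.split(r"[Nn]{%d,}" % min_gap, sequence)
    return [piece for piece in pieces if piece]


def ng50(contig_lengths, genome_size):
    """Largest length L such that contigs of length >= L together
    cover at least half of genome_size; 0 if they never reach it."""
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if 2 * running_total >= genome_size:
            return length
    return 0


# Toy values for illustration only.
contigs = split_scaffold("ACGTNNNACNNGT")
toy_ng50 = ng50([50, 40, 30, 20, 10], genome_size=200)
```

Unlike N50, which uses the assembly's own total length as the denominator, NG50 uses the same genome size for every assembly (3.09 Gbp in Table 1), which keeps assemblies of different total size comparable.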

Methods

Cells from the complete hydatidiform mole CHM13 were originally cultured from one case of a hydatidiform mole at Magee-Womens Hospital (Pittsburgh, PA) as part of a research study that occurred in the early 2000s (IRB MWH-20-054). At that time, the CHM13 cells were cultured, karyotyped using Q banding, and subsequently immortalized using human (hTERT). For this study, cryopreserved CHM13 cells were thawed and cultured in complete AmnioMax C-100 Basal Medium (ThermoFisher Scientific, Carlsbad, CA) supplemented with 1% Penicillin-Streptomycin (ThermoFisher Scientific, Carlsbad, CA) and grown in a humidity-controlled environment at 37 °C, with 95% O2 and 5% CO2. Fresh medium was exchanged every 3 days and all cells used for this study did not exceed passage 10. Cells have been authenticated and tested negative for mycoplasma contamination.

Karyotyping
Metaphase slide preparations were made from human hydatidiform mole cell line CHM13, and prepared by standard air-drying technique as previously described⁵¹. DAPI banding techniques were performed to identify structural and numerical chromosome aberrations in the according to the ISCN⁵². Karyotypes were analyzed using a Zeiss M2 fluorescence microscope and Applied Spectral Imaging software (Carlsbad, CA, SNote 1).

DNA extraction, library preparation, and sequencing
High-molecular-weight (HMW) DNA was extracted from 5×10⁷ CHM13 cells using a modified Sambrook and Russell protocol¹,⁵³. Libraries were constructed using the Rapid Sequencing Kit (SQK-RAD004) from Oxford Nanopore Technologies with 15 μg of DNA. The initial reaction was typically divided into thirds for loading and FRA Buffer (104 mM Tris pH 8.0, 233 mM NaCl) was added to bring the volume to 21 μl. These reactions were incubated at 4 °C for 48 hrs to allow the buffers to equilibrate before loading. Most sequencing was performed on the Nanopore GridION with FLO-MIN106 or FLO-MIN106DD R9 flow cells, with the exception of one Flongle flow cell used for testing. Sequencing reads used in the initial assembly were first base-called on the sequencing instrument. After all data was collected, the reads were basecalled again using the more recent Guppy algorithm (version 2.3.1 with the “flip-flop” model enabled).

A 10x Genomics linked-read genomic library was prepared from 1 ng of high molecular weight genomic DNA using a 10x Genomics Chromium device and Chromium Reagent Kit v2 according to the manufacturer’s protocol. The library was sequenced on an Illumina NovaSeq 6000 DNA sequencer on an S4 flow cell, generating 586 million paired-end 151 base reads. The raw data was processed using RTA 3.3.3 and bwa 0.7.12⁵⁴. The resulting molecule size was calculated to be 130.6 kb from a Supernova⁵⁵ assembly.

DNA was prepared using the “Bionano Prep Cell Culture DNA Isolation Protocol”. After cells were harvested, they were put through a number of washes before embedding in agarose. A proteinase K digestion was performed, followed by additional washes and agarose digestion. The DNA was assessed for quantity and quality using a Qubit dsDNA BR Assay kit and CHEF gel. A 750 ng aliquot of DNA was labeled and stained following the Bionano Prep Direct Label and Stain (DLS) protocol. Once stained, the DNA was quantified using a Qubit dsDNA HS Assay kit and run on the Saphyr chip.

Hi-C libraries were generated, in replicate, by Arima Genomics using four restriction . After the modified chromatin digestion, digested ends were labelled, proximally ligated, and then proximally-ligated DNA was purified. After the Arima-HiC protocol, Illumina-compatible sequencing libraries were prepared by first shearing then size-selecting DNA fragments using SPRI beads. The size-selected fragments containing ligation junctions were enriched using Enrichment Beads provided in the Arima-HiC kit, and converted into Illumina-compatible sequencing libraries using the Swift Accel-NGS 2S Plus kit (P/N: 21024) reagents. After adapter ligation, DNA was PCR amplified and purified using SPRI beads. The purified DNA underwent standard QC (qPCR and Bioanalyzer) and was sequenced on the HiSeq X following manufacturer's protocols.

Nanopore and PacBio whole-genome assembly
Canu v1.7.1²¹ was run with all rel1 Oxford Nanopore data (on-instrument basecaller, rel1) generated on or before 2018/11/07 (totaling 39× coverage) and PacBio sequences (SRA: PRJNA269593) generated in 2014 and 2015 (totaling 70× coverage)²,⁵⁶. Several chromosomes in the assembly are broken only at centromeric regions (e.g. chr10, chr12, chr18, etc.) (Fig. 1). Despite apparent continuity across several centromeres (e.g. chr8, chr11, chrX), the assembler reported many fewer than the expected number of repeat copies.

Manual gap closure
Gaps on the X chromosome were closed by mapping all reads against the assembly and manually identifying reads joining contigs that were not included in the automated Canu assembly. This generated an initial candidate chromosome assembly, with the exception of the centromere. Four regions of the candidate assembly were found to be structurally inconsistent with the Bionano optical map and were corrected by manually selecting reads from those regions and locally reassembling with Canu²¹ and Flye v2.4⁵⁷. Low coverage long reads that confidently spanned the entire repeat region were used to guide and evaluate the final assembly where available. Evaluation of copy number and repeat organization between the re-assembled version and spanning reads was performed using HMMER (v3)⁵⁸,⁵⁹ trained on a specific tandem repeat unit, and the reported structures were manually compared. Default parameters for Minimap2⁶⁰ resulted in uneven coverage and polishing accuracy over tandemly repeated sequences. This was successfully addressed by increasing the Minimap2 -r parameter from 500 to 10000 and increasing the maximum number of reported secondary alignments (-N) from 5 to 50. Final evaluation of repeat base-level quality was determined by mapping of PacBio datasets (CLR and HiFi) (Extended Data Fig. 7, SNote 4).

The alpha satellite array in the X centromere, due to its availability as a haploid array in male genomes, is one of the best studied centromeric regions at the genomic level, with a well-defined 2 kbp repeat unit²⁸, physical/genetic maps⁸,³⁰ and an expected range of array lengths²⁵. We initially generated a database of alpha satellite containing ultra-long reads, by labeling those reads with at least one complete consensus sequence³³ of a 171 bp canonical repeat in both orientations, as previously described⁶¹. Reads containing alpha in the reverse orientation were reverse complemented, and screened with HMMER (v3) using a 2057 bp DXZ1 repeat unit. We then employed run-length encoding in which runs of the 2057 bp canonical repeat (defined as any repeat in the range of min: 1957 bp, max: 2157 bp) were stored as a single data value and count, rather than the original run. This allowed us to redefine all reads as a series of variants, or repeats, that differ in size/structure from the expected canonical repeat unit, with a defined spacing in between. Identified CHM13 DXZ1-SVs in the UL-read data were compared to a library of previously characterized rearrangements in published PacBio (CLR⁵⁰ and HiFi²²) using alpha-CENTAURI, as described⁶¹. Output annotation of SVs and canonical DXZ1 spacing for each read were manually clustered to generate six initial contigs, two of which are known to anchor into the adjacent Xp or Xq. To define the order and overlap between contigs, we identified all 21-mers that had an exact match within the high-quality DXZ1 array data obtained from CRISPR-Cas9 duplexSeq targeted resequencing⁶² (SNote 8). Overlap between the two or more 21-mers with equal spacing guided the organization of the assembly. Orthogonal validation of the spacing between contigs (and contig structure) was supported with additional ultra-long read coverage, providing high-confidence in repeat unit counts for all but three regions.

Chromosome X long-read polishing
We used a novel mapping pipeline to place reads within repeats using unique markers. Length k substrings (k-mers) were collected from the Illumina linked reads, after trimming off the barcodes (the first 23 bases of the first read in a pair). The read was placed in the location of the assembly having the most unique markers in common with the read. Alignments were further filtered to exclude short and low identity alignments. This process was repeated after each polishing round, with new unique markers and alignments recomputed after each round. Polishing proceeded with one round of Racon followed by two rounds of Nanopolish and two rounds of Arrow. Post-polishing, all previously flagged low-quality loci showed significant improvement, with the exception of 139–140.3 which still had a coverage drop and was replaced with an alternate patch assembly generated by Canu using PacBio HiFi data.

Whole-genome long-read polishing
The rest of the whole-genome assembly was polished similarly to the X chromosome, but without the use of unique k-mer anchoring. Instead, two rounds of Nanopolish, followed by two rounds of Arrow, were run using the above parameters, which rely on the mapping quality and length/identity thresholds to determine the best placements of the long reads. As no concerted effort was made to correctly assemble the large satellite arrays on chromosomes outside of the X, this default polishing method was deemed sufficient for the remainder of the genome. However, future efforts to complete these remaining chromosomes are expected to benefit from the unique k-mer anchoring mapping approach.

Whole-genome short-read polishing
The Illumina linked reads were used for a final polishing of the whole assembly, including the X chromosome, but using only unambiguous mappings and correcting only small insertion/deletion errors (SNote 4).

Methylation analysis
To measure CpG methylation in nanopore data we used Nanopolish⁶³. Nanopolish employs a Hidden Markov Model (HMM) on the nanopore current signal to distinguish 5-methylcytosine from unmethylated cytosine. The methylation caller generates a log-likelihood value for the ratio of probability of methylated to unmethylated CGs at a specific k-mer. We next filtered methylation calls using the nanopore_methylation_utilities tool (https://github.com/isaclee/nanopore-methylation-utilities), which uses a log-likelihood ratio of 2.5 as a threshold for calling methylation⁶⁴. CpG sites with log-likelihood ratios greater than 2.5 (methylated) or less than -2.5 (unmethylated) are considered high-quality and included in the analysis. Reads that did not have any high-quality CpG sites were excluded from the subsequent methylation analysis. Figure 3 shows coverage of reads with at least one high quality CpG site. Nanopore_methylation_utilities integrates methylation information into the alignment BAM file for viewing in IGV’s⁶⁵ bisulfite mode and also creates Bismark-style files which we then analyzed with the R Bioconductor package BSseq (version 1.20.0)⁶⁶. We used the BSmooth algorithm⁶⁶ within the BSseq package for smoothing the data to estimate the methylation level at specific regions of interest.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Data availability
Original data generated at SIMR that underlies this manuscript can be accessed from the Stowers Original Data Repository at http://www.stowers.org/research/publications/libpb-1453. Genome assemblies and sequencing data including raw signal files (FAST5), event-level data (FAST5), base-calls (FASTQ), and alignments (BAM/CRAM) are available as an Amazon Web Services Open Data set. Instructions for accessing the data, as well as future updates to the raw data and assembly, are available from https://github.com/nanopore-wgs-consortium/chm13. All data is additionally archived and available under NCBI BioProject accession PRJNA559484, including the whole-genome assembly (GCA_009914755.1) and completed X chromosome (CM020874.1).

Code availability
No custom code was used for the analysis in this manuscript. All utilized software is freely available:
Canu https://github.com/marbl/canu
BWA https://github.com/lh3/bwa
Minimap2 https://github.com/lh3/minimap2
Arrow https://github.com/PacificBiosciences/GenomicConsensus
Nanopolish https://github.com/jts/nanopolish
hmmer http://hmmer.org
Supernova https://support.10xgenomics.com
Long Ranger https://support.10xgenomics.com
Juicer & Juicebox https://github.com/aidenlab/Juicebox
Flye https://github.com/fenderglass/Flye
Bioconductor https://github.com/Bioconductor
Samtools http://samtools.github.io
Freebayes https://github.com/ekg/freebayes
MUMmer http://mummer.sourceforge.net
CRISPR-DS https://github.com/risqueslab

51. Dutra, A. S., Mignot, E. & Puck, J. M. Gene localization and syntenic mapping by FISH in the dog. Cytogenet. Cell Genet. 74, 113–117 (1996).
52. Willatt, L. & Morgan, S. M. Shaffer LG, Slovak ML, Campbell LJ (2009): ISCN 2009 an international system for human cytogenetic nomenclature. Hum. Genet. 126, 603 (2009).
53. Quick, J. Ultra-long read sequencing protocol for RAD004 v3 (protocols.io.mrxc57n). protocols.io https://doi.org/10.17504/protocols.io.mrxc57n.
54. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
55. Weisenfeld, N. I., Kumar, V., Shah, P., Church, D. M. & Jaffe, D. B. Direct determination of diploid genome sequences. Genome Res. 27, 757–767 (2017).
56. Huddleston, J. et al. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 27, 677–685 (2017).
57. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
58. Bateman, A. et al. Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic Acids Res. 27, 260–262 (1999).
59. Eddy, S. HMMER3: a new generation of search software. URL: http://hmmer.janelia.org (2010).
60. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
61. Sevim, V., Bashir, A., Chin, C.-S. & Miga, K. H. Alpha-CENTAURI: assessing novel centromeric repeat sequence variation with long read sequencing. Bioinformatics 32, 1921–1924 (2016).
62. Nachmanson, D. et al. Targeted genome fragmentation with CRISPR/Cas9 enables fast and efficient enrichment of small genomic regions and ultra-accurate sequencing with low DNA input (CRISPR-DS). Genome Res. 28, 1589–1599 (2018).
63. Simpson, J. T. et al. Detecting DNA cytosine methylation using nanopore sequencing. Nat. Methods 14, 407–410 (2017).
64. Lee, I. et al. Simultaneous profiling of chromatin accessibility and methylation on human cell lines with nanopore sequencing: Supplemental Materials. https://doi.org/10.1101/504993.
65. Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24 (2011).
66. Hansen, K. D., Langmead, B. & Irizarry, R. A. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol. 13, R83 (2012).
67. Sullivan, L. L., Boivin, C. D., Mravinac, B., Song, I. Y. & Sullivan, B. A. Genomic size of CENP-A domain is proportional to total alpha satellite array size at human centromeres and expands in cancer cells. Chromosome Res. 19, 457–470 (2011).

Acknowledgements We acknowledge helpful conversations with Isac Lee on methylation analysis, and a helpful review of the manuscript by Huntington F. Willard. Funding support: NIH/NHGRI R21 1R21HG010548-01 and NIH/NHGRI U01 1U01HG010971 (KHM); Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health (SK, AR, VM, AD, GGB, AMC, NFH, AY, JCM, AMP); Korea Health Technology R&D Project through the Korea Health Industry Development Institute HI17C2098 (AR); Intramural Research Program of the National Library of Medicine, National Institutes of Health (VAS, FT-N); Common Fund, Office of the Director, NIH (VM); Stowers Institute for Medical Research (EH, TP, JLG); NIH R01 GM124041 (BAS); NIH HG002385 and HG010169 (EEE); EEE is an investigator of the Howard Hughes Medical Institute; National Library of Medicine Big Data Training Grant for Genomics and Neuroscience 5T32LM012419-04 (MRV); NIH 1F32GM134558-01 (GAL); NIH/NHGRI U54 1U54HG007990 and W. M. Keck Foundation DT06172015 and NIH/NHLBI U01 1U01HL137183 and NIH/NHGRI/EMBL 2U41HG007234 (BP); NIH/NHGRI R01 HG009190 and NIGMS T32 GM007445 (WT, AG); NIH R01CA181308 (RAR); NIH/NHGRI 2R44HG008118 (ADS and SS); Wellcome Trust [212965/Z/18/Z] (NH, NJL, ML); National Institute for Health Research (NIHR) Surgical Reconstruction and Microbiology Research Centre (SRMRC) (JQ). The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care. This work utilized the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov).

Author contributions SB, GAL, KT, VM, GGB, MYD, DCS, RS, GK, NH, ML, AY, JCM, EEE performed CHM13 nanopore sequencing, cell line preparation, and primary data analysis. AY and JCM generated 10x and assembly. BAS performed PFGE-Southern blotting array size analysis. MK, CM, RF, TAGL, and IH generated Bionano data and performed data analysis. JF and RR performed CRISPR-DS analysis. EH, TP, JLG performed ddPCR and SKY analysis. EP, AD, EH, TP, JLG performed CHM13 cell line karyotyping. ABW and EEE performed the admixture analysis. KHM performed repeat characterization and satellite DNA assembly. KHM, SK, MRV, AMC, and AMP performed automated and manual assembly. KHM, SK, AR, MRV, GAL, DP, JW, WC, KH, EEE, and AMP performed assembly curation and validation. SK, AR, and AMP performed marker-based assembly polishing. AG and WT performed methylation analysis. AB and PAP generated automated satellite DNA assemblies. ADS and JMB and SS performed Hi-C CHM13 sequencing. AR performed Hi-C analysis. NFH performed SV analysis. JA and BP performed annotation analysis. VAS and FTN performed alignment versus RefSeq, repeat characterization, and frameshift analysis. US provided access to critical resources. JQ developed the initial ultra-long read protocol and updated to current chemistry. NJL provided an Amazon Web Services (AWS) account and coordinated data sharing. KHM, SK, AR, MRV, and AMP developed figures. KHM and AMP coordinated the project. KHM, SK, and AMP drafted the manuscript. All authors read and approved the final manuscript.

Competing interests EEE is on the scientific advisory board of DNAnexus, Inc. KHM, SK, and WT have received travel funds to speak at symposia organized by Oxford Nanopore. WT has two patents licensed to Oxford Nanopore (US Patent 8,748,091 and US Patent 8,394,584). ADS, JMB, and SS are employees of Arima Genomics. RAR shares equity in NanoString Technologies Inc. and is the principal investigator on an NIH SBIR subcontract research agreement with TwinStrand Biosciences Inc. All other authors have no competing interests to declare.

Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41586-020-2547-7.
Correspondence and requests for materials should be addressed to K.H.M. or A.M.P.
Peer review information Nature thanks Tomi Pastinen, Steven Salzberg and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Reprints and permissions information is available at http://www.nature.com/reprints.
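The run-length encoding step described under Manual gap closure (runs of canonical DXZ1 repeat units, 1957 to 2157 bp, stored as a single value and count, with variant units kept individually) can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure actually used: it assumes each read has already been reduced to an ordered list of repeat-unit lengths, and the function name `encode_read` and the toy lengths are hypothetical.

```python
from itertools import groupby

# Size window for a canonical 2057 bp unit, as quoted in the Methods.
CANONICAL_MIN, CANONICAL_MAX = 1957, 2157


def is_canonical(unit_length):
    """True when a repeat unit falls inside the canonical size window."""
    return CANONICAL_MIN <= unit_length <= CANONICAL_MAX


def encode_read(unit_lengths):
    """Collapse each run of canonical units into ("canonical", count)
    and keep every other unit as ("variant", length)."""
    encoded = []
    for canonical, run in groupby(unit_lengths, key=is_canonical):
        if canonical:
            encoded.append(("canonical", len(list(run))))
        else:
            encoded.extend(("variant", length) for length in run)
    return encoded


# Hypothetical unit lengths (bp) along one read; invented for illustration.
toy_read = [2057, 2060, 2055, 856, 2057, 2057, 4114, 2057]
toy_encoding = encode_read(toy_read)
```

In the encoded form, the counts of canonical units between variants give the spacing that the Methods describe using to compare and cluster reads.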

Extended Data Fig. 1 | (a) Chromosomes and karyotype of CHM13 cell line at passage 10. Mitotic metaphase spreads were prepared from cells treated with colcemid and processed as detailed in Materials and Methods. Spectral karyotyping analysis demonstrated a normal 46,XX karyotype. Representative karyotype is shown from one of ten spreads analyzed; all ten had similar results. Bar, 10 μm. (b) CHM13 G-banding karyotype. A total of 20 CHM13 metaphase spreads were independently characterized and all showed a similar normal 46,XX female karyotype, as shown.

Extended Data Fig. 2 | Inferred ancestry of CHM13. Proportion of ancestry explained by each cluster as estimated by ADMIXTURE using a) K=6 or b) K=9 for 10 randomly sampled individuals from each population and CHM13. Analysis based on 1,964 unrelated individuals from the 1KG and SGDP. CHM13 is highlighted in red font along with a black bounding rectangle.

Extended Data Fig. 3 | Results of using CHM13 as a reference when describing structural variation. Assemblytics large insertion and deletion calls for four long read assemblies with respect to CHM13 (in dark red/red) and GRCh38 (in black/gray). Using CHM13 as a reference yields balanced counts of insertions and deletions, while an excess of insertion calls is observed when using GRCh38, suggesting a probable deletion bias in GRCh38.

Extended Data Fig. 4 | Telomere length in the reads and the assembly. The assembly telomere sizes are consistent with the larger sizes observed in the reads. The shorter peak in telomere length within the reads is likely an artifact of premature read end, not the true telomere end.

Extended Data Fig. 5 | (a) The satellite array on the X chromosome (DXZ1) is defined at the sequence level as a multi-megabase size array of alpha satellite DNA. The canonical repeat of the DXZ1 array is defined by twelve divergent monomers that are ordered to form a larger ~2 kb repeating unit, known as a 'higher-order repeat' (HOR) (shown in grey, with HOR in black and circles representing each of the twelve ~171 bp monomers). The HORs are tandemly arranged into a large, multi-megabase sized satellite array (with previous published PFGE-Southern estimates suggesting a mean of 3 Mb) with a limited number of rearrangements in the HOR repeat structure (as indicated in yellow for a deletion to a 5-mer variant) and nucleotide differences between repeat copies. Our assembly strategy initially identified and annotated all uninterrupted head-to-tail tandem arrays of 'canonical' repeats and sites of structural variants in each nanopore read in our DXZ1 library (see Methods). The spacing of canonical repeats to flanking SVs informed the precise alignment between reads. Contigs were generated by taking the consensus of these uniquely placed ultra-long reads. (b) The T2T-X CHM13 array was originally segmented into seven SV-determined contigs. Ordering and overlap between the contigs was made using shared positions of Duplex-Seq DXZ1 k-mers and low-coverage (i.e., 1–2 reads) support of ultra-long data that confidently spanned contig ordering. Three regions (marked with an asterisk) were only determined by single nucleotide variant overlap. We improved the prediction of these overlaps by implementing an orthogonal method, centroFlye, which studies single variant positions in the DXZ1 nanopore reads to guide the final positioning of the overlap between the contigs (and confirm the existing overlap in the region closest to the p-arm). (c) Comparisons of DXZ1 higher-order repeat variant frequency in the nanopore UL data with high-fidelity (HiFi) long-read PacBio data were highly concordant. DXZ1 repeat unit variants were predicted in the HiFi dataset using alpha-CENTAURI (ref. 61). The DXZ1 repeat units, shown as arrows, are each composed of 12 smaller ~171 bp repeats (indicated as small circles within the arrow). In total, we identified 7316 DXZ1-containing HiFi reads. We characterized a database of 38,184 (98.2%) full-length DXZ1 canonical 12-mer repeats and 691 HORs with variant repeat structure (1.8%). Changes from the canonical repeat unit are indicated with a dashed line, each structural variant is marked with a color, and its positioning within the array assembly is indicated (ordered p-arm to q-arm) above. The majority of reads were determined to contain purely DXZ1 alpha satellite (7305/7316, or 99.85%). Of the remaining reads, ten reads provided evidence for a transition from DXZ1 into the single L1Hs insertion in our assembly. We identified only a single read that we could not assign to our assembly due to a 902 bp homopolymer ([G]n), which may represent a sequencing artifact. (d) A minimum tiling path was reconstructed for illustration purposes (as shown in Fig 2a) and was not the mechanism for initial assembly. (e) DXZ1 read overlap assembly using structural variant overlap and positioning. Read IDs and length are provided from Xp to Xq: (1) ab9c12a7-08db-4524-8332-373129eaa4fb, 442,119 bp. (2) 063fca09-81fc-4c2d-81ad-16fb2bfee76f, 364,710 bp. (3) 3d0fa869-028f-45be-be41-b2487897bb25, 380,361 bp. (4) a5cf4e19-8eff-4035-8238-ae81963b854f, 362,052 bp. (5) c6f29ca1-d84d-4881-9042-dfb37bc9f111, 482,907 bp. (6) 1ccd919f-5726-4d79-8cfe-fe2b344070a1, 275,718 bp. (7) e39308c6-0c73-45d5-9b8d-7f764af858be, 351,045 bp. (8) 86ac29ba-5a93-4c08-aa18-c07829a5b696, 393,007 bp. (9) 64d464d1-f317-4dff-a259-de6097a5cd4c, 221,510 bp. (10) 08e000a1-69dd-40fb-9fd1-942f159ec6b7, 262,585 bp. (11) 1ef64f71-9477-4a5b-bf7e-a356785cc656, 421,096 bp. (12) a1e01c13-7ca1-4dc5-85b1-6b69ec2124f9, 371,129 bp.
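The distinction in Extended Data Fig. 5 between canonical 12-mer HORs and structural variants (such as the 5-mer deletion variant) can be illustrated with a toy classifier. This is only a minimal sketch, not the annotation pipeline used for the assembly: it assumes each read has already been reduced to a list of monomer indices 1..12, and the function names `split_hors` and `classify` are hypothetical.

```python
from typing import List, Tuple

# Canonical DXZ1 HOR: monomers 1..12 in order (about 12 x 171 bp, roughly 2 kb).
CANONICAL: Tuple[int, ...] = tuple(range(1, 13))


def split_hors(monomers: List[int]) -> List[Tuple[int, ...]]:
    """Split a read's monomer labels into HOR units.

    Assumption: a new unit starts whenever the monomer index stops increasing.
    """
    units: List[Tuple[int, ...]] = []
    current: List[int] = []
    for m in monomers:
        if current and m <= current[-1]:
            units.append(tuple(current))
            current = []
        current.append(m)
    if current:
        units.append(tuple(current))
    return units


def classify(units: List[Tuple[int, ...]]) -> List[str]:
    """Label each unit as canonical or as an n-mer structural variant."""
    return [
        "canonical" if u == CANONICAL else f"variant ({len(u)}-mer)"
        for u in units
    ]
```

In this sketch, partial units at the ends of a read would also be reported as variants, so a real analysis would need to trim or flag read ends separately. The positions of the variant units along a read are what would be compared between reads to place overlaps.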

Extended Data Fig. 6 | Alpha satellite array sizes were estimated by PFGE and Southern blotting using established methods (refs. 25, 67). In silico digest of the ~3.1 Mb DXZ1 array is predicted to produce three bands with a complete BglI digest: ~659 kb, ~2153 kb, and ~294 kb, which are concordant with the replicate PFGE Southern experiments shown for BglI (~2.1, ~0.7, and ~0.3 Mb). In silico digest with BstEII provides evidence for six bands, of which three are less than ~200 kb and below the range of detection (as marked with grey band). The three remaining bands are once again concordant with observed PFGE-Southern replicates for BstEII (~1.8, ~0.7, and ~0.3 Mb). HAP1 and DLD1 are included as internal controls. This experiment was repeated seven times with similar results.
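An in silico digest of the kind described for Extended Data Fig. 6 amounts to locating every recognition site in the assembled sequence and reporting the distances between cut positions. The following is a minimal sketch under stated assumptions: the recognition patterns and cut offsets for BglI (GCCNNNN^NGGC) and BstEII (G^GTNACC) are taken from general knowledge of these enzymes rather than from the paper, only the top strand is considered, and the function name `digest` is hypothetical.

```python
import re
from typing import List

# Assumed recognition sites (top strand) and cut offsets from the site start.
BGLI_SITE, BGLI_CUT = "GCC[ACGT]{5}GGC", 7      # GCCNNNN^NGGC
BSTEII_SITE, BSTEII_CUT = "GGT[ACGT]ACC", 1     # G^GTNACC


def digest(seq: str, site_regex: str, cut_offset: int) -> List[int]:
    """Return fragment lengths from a complete digest of a linear sequence."""
    # A lookahead is used so that overlapping sites are all reported.
    pattern = re.compile("(?=(" + site_regex + "))")
    cuts = [m.start() + cut_offset for m in pattern.finditer(seq)]
    bounds = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# Sanity check on the reported BglI bands: 659 + 2153 + 294 = 3106 kb,
# which matches the ~3.1 Mb size of the assembled DXZ1 array.
reported_bgli_kb = [659, 2153, 294]
total_kb = sum(reported_bgli_kb)
```

Applied to an uppercase array sequence, `digest(array_seq, BGLI_SITE, BGLI_CUT)` would give predicted fragment sizes to compare against the PFGE-Southern bands; fragments below roughly 200 kb would fall outside the detection range noted in the caption.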

Extended Data Fig. 7 | The initial Canu assembly of the GAGE locus (a) was further corrupted due to standard long-read polishing (Arrow, Nanopolish) (b). Black dots are coverage of the primary allele and red dots are coverage of the secondary allele (PacBio CLR data). The CHM13 genome is effectively haploid so one allele is expected. Regions of low coverage or increased secondary allele frequency indicate low-quality regions or potential repeat collapses. Due to mis-mapping of reads during the polishing process, allele coverage becomes less uniform. A modified polishing process, using the unique k-mer strategy, corrects this effect. (c-f) The left-side plots are assemblies before polishing. The right-side plots show the same regions after unique k-mer assisted polishing (Racon, 2 rounds Nanopolish, 2 rounds Arrow, 2 rounds 10x). The regions are (c) GAGE locus (48.6–49 Mbp), (d) 70.8–71.3 Mbp, (e) 138.6–139.7 Mbp, and (f) cenX (57–61 Mbp). (g-j) Same loci as c-f but with PacBio HiFi rather than CLR mapped.

Extended Data Fig. 8 | (a) 21-mer distribution from the 10x Genomics reads. 21-mers were collected with Meryl and the plot was generated with GenomeScope 1.0 to visualize and confirm the haploid nature of CHM13 and genome size (len). k-mers with counts between 5 and 58 (inclusive) were used as unique markers when polishing the X chromosome. (b) Coverage histograms of PacBio CLR (black), HiFi (blue), and UL (green) reads across the complete X chromosome. Reads were filtered using the same unique marker based filtering as for polishing. (c) Mapped nanopore reads show uniform coverage across the complete X chromosome. Reads were filtered using the same unique marker based filtering as for polishing. Marker density is shown below the read alignments. (d) Strand-seq validation of the chromosome X assembly. Strand-seq sequences only single template strands from each homologous chromosome. Sequencing reads originating from such single stranded DNA possess directionality, a feature that can be used to assess the long-range contiguity of individual homologs. Based on the inheritance of single stranded DNA we distinguish 3 possible strand states: WW – both homologs inherited Watson template strand, CC – both homologs inherited Crick template strand and WC – one homolog inherited Watson and the other Crick template strand. By tracking changes in strand states along each chromosome we are able to pinpoint locations of recurrent strand state changes that are indicative of a genome mis-assembly. We have analyzed in total 57 Strand-seq libraries and mapped 28 localized strand state changes. These strand state changes are randomly distributed along the chromosome X assembly and therefore are indicative of a double-strand-break that occurred during DNA replication instead of a real genome mis-assembly. Such breaks are usually repaired by an available sister chromatid and therefore often result in a change in strand directionality. Black asterisks show small localized strand state changes. Such events are either caused by noisy reads inherent to Strand-seq library preparation or two double-strand-breaks that occurred very close to each other. (e) Because it is unlikely for a double-strand-break to occur at exactly the same position in multiple single cells, a real genome mis-assembly is visible in Strand-seq data as a recurrent change in strand state at the same position in a given contig or scaffold. None of these signatures was observed in the CHM13 chromosome X assembly.
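The unique-marker selection described in panel (a) of Extended Data Fig. 8, keeping k-mers whose counts fall between 5 and 58 inclusive, can be sketched as a simple count filter. This is a toy illustration rather than the Meryl-based workflow: it counts k-mers in plain Python, ignores reverse complements, and the function names `count_kmers` and `unique_markers` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def count_kmers(reads: Iterable[str], k: int = 21) -> Counter:
    """Count every k-mer occurrence across all reads (forward strand only)."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def unique_markers(counts: Counter, low: int = 5, high: int = 58) -> Set[str]:
    """Keep k-mers whose count lies in [low, high], inclusive.

    Counts below `low` are assumed to be mostly sequencing errors and counts
    above `high` are assumed to come from repeats, so neither is used as a
    single-copy marker.
    """
    return {kmer for kmer, c in counts.items() if low <= c <= high}
```

Reads (or alignments) can then be filtered by whether they carry these marker k-mers, which is the idea behind the marker-based read filtering mentioned in panels (b) and (c).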


Extended Data Fig. 9 | Hi-C read mapping to the chromosome X assembly. The whole X is shown on the left, and the right is zoomed on the DXZ4 locus. The heatmap shows clear boundaries around DXZ4, indicating 2 large superdomains separated by DXZ4.

Extended Data Fig. 10 | Methylation estimates across centromeric satellite array assembly on chromosome 8 (D8Z2) (chr8:43,281,085–45,333,062). Methylated values were calculated by smoothing frequency data with a window size of 500 nucleotides. Read coverage shown relies on our unique-anchor mapping and the presence of at least one high-quality methylation call on the read, |log-likelihood| > 2.5. Similar to our previous methylation analysis on chromosome X centromeric satellite array (DXZ1), we observe an unmethylated region (~75 kb) in the centromere of chromosome 8 (as shown: chr8:44,830,000–44,900,000).
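The two processing steps named in the Extended Data Fig. 10 caption, keeping only calls with |log-likelihood| > 2.5 and smoothing methylation frequency over a 500-nucleotide window, can be sketched as follows. This is an illustration under assumptions, not the authors' code: a positive log-likelihood ratio is assumed to mean "methylated", the smoothing is assumed to be a plain moving average over sites within half a window on either side, and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def methylation_frequency(
    calls: Iterable[Tuple[int, float]], threshold: float = 2.5
) -> Dict[int, float]:
    """Per-site methylated fraction from (position, log-likelihood ratio) calls.

    Calls with |ratio| <= threshold are treated as ambiguous and skipped.
    """
    methylated: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for pos, llr in calls:
        if abs(llr) <= threshold:
            continue
        total[pos] += 1
        if llr > 0:  # assumed sign convention: positive means methylated
            methylated[pos] += 1
    return {pos: methylated[pos] / total[pos] for pos in total}


def smooth(freq: Dict[int, float], window: int = 500) -> Dict[int, float]:
    """Moving average of site frequencies within window/2 nt on either side."""
    positions = sorted(freq)
    half = window // 2
    smoothed: Dict[int, float] = {}
    for p in positions:
        nearby = [freq[q] for q in positions if abs(q - p) <= half]
        smoothed[p] = sum(nearby) / len(nearby)
    return smoothed
```

The quadratic loop in `smooth` is kept for readability; a sliding window over the sorted positions would be preferable for a full array. A dip of the smoothed value toward zero over a long stretch is what would appear as an unmethylated region in such a track.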

Corresponding author(s): Karen H. Miga and Adam M. Phillippy
Last updated by author(s): Feb 2, 2020

Reporting Summary

Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.

Statistics

For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.

n/a | Confirmed

- The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
- A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
- The statistical test(s) used AND whether they are one- or two-sided. Only common tests should be described solely by name; describe more complex techniques in the Methods section.
- A description of all covariates tested
- A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
- A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
- For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted. Give P values as exact values whenever suitable.
- For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
- For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
- Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated

Our web collection on statistics for biologists contains articles on many of the points above.

Software and code

Policy information about availability of computer code

Data collection: MinKNOW (version 3.4.5) software, basecalling was performed using Guppy (flip-flop version 2.3.1), 10XG raw data was processed using RTA3.3.3,

Data analysis: Canu 1.7.1, bwa 0.7.12, Minimap2 v2.71-941, Arrow v2.2.2 from SMRTLink 6.0.0.47841, Nanopolish v0.11.0, hmmer v3, Supernova v2.1.1, LongRanger v2.2.2, Juicer v1.5.6, maps were visualized with Juicebox v1.8.8, Meryl from Canu v1.8, Flye 2.4, samtools v1.9, freebayes v1.2.0 and v1.3.1, MUMmer version 3.23, available CRISPR-DS software (https://github.com/risqueslab)

For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data

Policy information about availability of data

All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability

Original data generated at SIMR that underlies this manuscript can be accessed from the Stowers Original Data Repository at http://www.stowers.org/research/publications/libpb-1453. Genome assemblies and sequencing data including raw signal files (FAST5), event-level data (FAST5), base-calls (FASTQ), and alignments (BAM/CRAM) are available as an Amazon Web Services Open Data set. Instructions for accessing the data, as well as future updates to the raw data and assembly, are available from https://github.com/nanopore-wgs-consortium/chm13. All data is additionally archived and available under NCBI BioProject accession PRJNA559484 including the whole-genome assembly (GCA_009914755.1) and completed X chromosome (CM020874.1).

Field-specific reporting

Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

- Life sciences
- Behavioural & social sciences
- Ecological, evolutionary & environmental sciences

For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design

All studies must disclose on these points even when the disclosure is negative.

Sample size: Only one cell line (CHM13) is used in this study to reduce the complexity of repeat assembly

Data exclusions: No data was excluded from this study

Replication: ddPCR copy number estimates were performed in triplicate, cytogenetic assessment were performed over ten metaphase spreads, pulsed-field gel Southern experiments were performed with technical replicates

Randomization: This is not relevant to our work, no randomization was performed as we are using one sample

Blinding: Blinding was not relevant to this work

Reporting for specific materials, systems and methods

We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.

Materials & experimental systems (n/a | Involved in the study)
- Antibodies
- Eukaryotic cell lines
- Palaeontology
- Animals and other organisms
- Human research participants
- Clinical data

Methods (n/a | Involved in the study)
- ChIP-seq
- Flow cytometry
- MRI-based neuroimaging

Eukaryotic cell lines

Policy information about cell lines

Cell line source(s): Cells from a case of a complete hydatidiform mole CHM13 were cultured, karyotyped using banding and cryopreserved at Magee-Womens Hospital (Pittsburgh, PA).

Authentication: The CHM13 line was authenticated by cytogenetic analysis (G-banding and SKY) before use. No contamination was identified.

Mycoplasma contamination: CHM13 has been determined to be negative for Mycoplasma contamination

Commonly misidentified lines: N/A (See ICLAC register)

