Segmental duplications and their variation in a complete

Mitchell R. Vollger1, Xavi Guitart1, Philip C. Dishuck1, Ludovica Mercuri2, William T. Harvey1, Ariel Gershman3, Mark Diekhans4, Arvis Sulovari1, Katherine M. Munson1, Alexandra M. Lewis1, Kendra Hoekzema1, David Porubsky1, Ruiyang Li1, Sergey Nurk5, Sergey Koren5, Karen H. Miga4, Adam M. Phillippy5, Winston Timp3, Mario Ventura2, Evan E. Eichler1,6

1. Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA 2. Department of Biology, University of Bari, Aldo Moro, Bari 70125, Italy 3. Department of Molecular Biology and Genetics, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA 4. UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA 5. Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA 6. Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA

Supplemental Notes

Supplemental note 1: We report 81.3 Mbp of SD sequence that overlaps with the new or structurally variable sequence in T2T-CHM13 assembly. This differs from the 68.3 Mbp reported by Nurk et al. (1) because we considered all SDs that overlapped with new sequence whereas Nurk et al. uses a strict intersection. For example, if a 75 kbp SD overlapped 50 kbp of new sequence we would report 75 kbp of new SD sequence whereas Nurk et al. would report 50 kbp.

Supplemental note 2: Nurk et al. reports 140 new -coding (1), and we report 182 new genes with multiple exons and an open reading frame (ORF). This difference comes from different annotation sets and different filtering steps. Nurk et al. uses CAT (2) to annotate genes in the assembly which they then supplement with Liftoff (3) while considering only new gene copies that are 100% identical to previously annotated paralogs. We used Liftoff to generate our gene annotations and allowed new paralogs to diverge by up to 15% from previously annotated paralogs if they still had an ORF— and this difference accounts for the additional genes we identify. Of the 140 genes, 58 overlap with the 182 we report. This difference comes from an additional set of filters we used to offset the false positives that were identified by allowing 15% divergence. Our additional filters were that the gene must have an ORF of at least 200 bp and have multiple exons.

1 Supplemental Figures

Figure S1. Comparison of SD length and identity in different regions of the genome. The length (left) and identity (right) of SDs across commonly delineated regions of the genome (colors).

2

Figure S2. SD density comparison between T2T-CHM13 and GRCh38. a) Density of SDs in T2T-CHM13 (red) and GRCh38 (blue). In the ideogram highlighted in orange are the 15 regions with the largest increase in the number of SDs. b) Histogram showing the log2 fold change between the number of SDs in T2T-CHM13 and GRCh38 per non- overlapping 5 Mbp window. c) Histogram showing the log2 fold change between the number of bp in SDs for T2T-CHM13 and GRCh38 per non-overlapping 5 Mbp window.

3

Figure S3. SDs within heterochromatin on 1, 9, and 16. This figure shows where the SDs that separate the HSAT and arrays on chromosomes 1, 9, and 16 align to (left) compared to the overall distribution of that (right). Blue are intrachromosomal SDs and red are interchromosomal.

4

Figure S4. FISH support for new CHM13 duplications. This table shows the location and chromosomes of FISH signals (x) for each probe (y) across the different cell lines (color). Black shows the predicted locations of FISH signals from the T2T- CHM13 assembly.

5

Figure S5. CHM13 inversions supported by Strand-seq. a) Locations of inversions in CHM13 relative to GRCh38 as identified with Strand-seq. The color indicates whether the inversion is shared (orange) with at least one sample from HGSVC1 (4) or unique to CHM13 (cyan). b) Size distribution of inversions in CHM13. c) Comparison of inversions shared between CHM1 and CHM13. d) Bar chart showing the counts of shared and unique inversions in CHM13.

6

Figure S6. Length of CHM13 inversions. The length of inversions in CHM13 as predicted by Strand-seq stratified by the presence of flanking SDs (green) or lack thereof (orange). Inversions flanked by SDs are significantly longer than other inversions (p = 0.0065, one-sided Wilcoxon rank-sum test).

7 Figure S7. Pangenome graph of NOTCH2NL and SRGAP2. a) Variation in human haplotypes across the NOTCH2NL expansion site: a graph representation [rGFA generated using minigraph (5)] of the where colors indicate the source genome for the sequence. The graph visualization was created using the software tool Bandage (6). b) The path for each haplotype-resolved assembly through the graph. The “squashed dot plot” represents a vertically compressed dot plot comparing the haplotype-resolved sequence (horizontal) against the graph (vertical). Color represents the source haplotype for the vertical sequence. Structural variants can be identified from discontinuities in height (deletion), changes between colors (insertion), or changes in the direction of the polygon (inversion NOTCH2NL, the gene of interest, is shown with red arrows and other genic content in the region are shown with black arrows. The final line is a duplicon track showing the ancestral duplications (color) that make up the larger duplication block.

8

Figure S8. Single-nucleotide variants (SNVs) in SDs between T2T-CHM13 and GRCh38. a) Divergence of 10 kbp windows with synteny between GRCh38 and T2T-CHM13. b) Distribution of the number of SNVs per 10 kbp windows aligned from GRCh38 to T2T-CHM13 in unique and SD regions. c) Distribution of the distance between SNVs in the syntenic regions of GRCh38 and T2T-CHM13. d) SD regions with synteny between T2T-CHM13 and GRCh38

9 and their average levels of single-nucleotide variation in 1 kbp windows. The bottom row has windows of SD with 0-2 SNVs per kbp, middle row 2-5 SNVs per kbp, and top row is greater than 5 SNVs per kbp.

Figure S9. Size distribution of non-syntenic regions between GRCh38 and T2T-CHM13. Histogram showing the size of non-syntenic regions (Methods) between GRCh38 and T2T- CHM13, and a table of statistics on the lengths of the region.

10

Figure S10. Non-syntenic regions where the reference copy number reflects SGDP. Copy number (CN) of SD regions that are new or structurally different in T2T-CHM13 compared to GRCh38 and 268 individuals from the SGDP (7). The histogram shows the number in Mbp where the median sample CN from SGDP was within two standard deviations (s.d.) of the given assembly [T2T-CHM13 (red), GRCh38 (blue), neither (green), or both (equal CN)].

11

Figure S11. Genic SD expansions in T2T-CHM13 relative to chimpanzee. The blue (no genes) and orange (containing genes) peaks in the ideogram show regions of expansion in CHM13 relative to the Clint_PTR assembly within SD space. The bottom panel shows the density of genic SDs in T2T-CHM13. The genes highlighted as biomedically or evolutionarily important loci are labeled and colored green if they are part of a human expansion.

12

Figure S12. Pangenome graph of SMN. For a description of the elements within this figure, see Figure S7.

Figure S13. Pangenome graph of CYP2D6. For a description of the elements within this figure, see Figure S7.

13

Figure S14. Pangenome graph of ARHGAP11. For a description of the elements within this figure, see Figure S7.

14 Variation in TBC1D3 (1) a b Squashed dot plot CHM13 X 2 Genes Duplicons

ins: 42 kbp (2) GRCh38 X 6

ins: 144 kbp (4) HG002.mat X 3

ins: 230 kbp (6) HG02723.pat

del: 14 kbp (1) HG02080.mat ins: 60 kbp (2)

ins: 117 kbp (2) HG03125.pat

ins: 217 kbp (5) HG01109.mat

ins: 195 kbp (4) HG02080.pat

del: 14 kbp (1) HG00514.mat ins: 167 kbp (4)

ins: 233 kbp (5) HG01243.mat inv: 39 kbp (1)

ins: 309 kbp (8) HG00733.mat

ins: 26 kbp (3) HG02723.mat inv: 275 kbp (5)

ins: 214 kbp (6) NA19240.mat inv: 336 kbp (12)

ins: 140 kbp (5) NA19240.pat inv: 336 kbp (12)

0 250,000 500,000 750,000 1,000,000 1,250,000 Genomic position (bp)

Figure S15. Pangenome graph of TBC1D3 expansion site one. For a description of the elements within this figure, see Figure S7.

15

Figure S16. Pangenome graph of TBC1D3 expansion site two. For a description of the elements within this figure, see Figure S7.

16

Figure S17. Validation of assembly using ONT coverage over TBC1D3 for HG002. Ultra-long ONT coverage of HG002 across the maternal haplotype of TBC1D3 expansion site one, and the coverage across the maternal and paternal haplotypes of TBC1D3 expansion site two. Black dots show the coverage of the most frequent base at each genomic position and red dots show the coverage of the second most frequent base.

17

Figure S18. Validation of assembly using ONT coverage over TBC1D3 for HG00733. Ultra-long ONT coverage of HG00733 across the maternal and paternal haplotypes of TBC1D3 expansion site one, and the coverage across the maternal and paternal haplotypes of TBC1D3 expansion site two. Black dots show the coverage of the most frequent base at each genomic position and red dots show the coverage of the second most frequent base.

18

Figure S19. Clustering of methylation status in SD blocks. Heatmap of CpG methylation of all SD blocks with at least 50 kbp of flanking sequence clustered using the “pheatmap” package in R. The horizontal annotation shows in cyan the 50 kbp of unique flanking sequence and red the SD block.

19 CpG count GC content

[

0

,

1

TRUE )

FALSE

[

1

,

1

s t

0

TRUE )

n

u

o

c

t

p

i r

c FALSE

s

n

a

r

t

q

e

S

o

s

I

d e

[

1 n

0

n i

,

I

n B

TRUE f

]

FALSE

A

l TRUE l

FALSE

0 100 200 300 400 0.2 0.4 0.6 0.8 CpG count (left) and GC content (r ight) 1500 bp upstream and downstream of TSS n 40 80 120 160 SD gene FALSE TRUE Figure S20. CpG and GC content within 1,500 bp of the transcription start site (TSS). Shown are the density of the number of CpGs within +/-1,500 bp of the TSS (left), and the density of bases that are G or C within +/-1,500 bp of the TSS (right), both stratified by the level of Iso-Seq transcription (vertical positioning) and SD content (color).

20

Figure S21. Methylation and transcription levels across multi-copy gene families.

21

Figure S22. Better CN representation of the SGDP across super populations. This figure shows pie charts for each human super population from the Simon Genome Diversity Panel (SGDP, n = 268). Individual pie charts show the relative fraction of SGDP samples for every non-syntenic SD region where the individual’s CN more closely matches the T2T-CHM13 CN (red), GRCh38 CN (blue), or is represented equally well by both (gray).

22 References

1. S. Nurk, S. Koren, A. Rhie, M. Rautianen, A. v. Bzikadze, A. Mikheenko, M. R. Vollger, N. Altemose, L. Uralsky, A. Gershman, S. Aganezov, S. J. Hoyt, M. Diekhans, G. A. Logsdon, M. Alonge, S. E. Antonarakis, M. Borchers, G. G. Bouffard, S. Y. Brooks, G. V. Galdas, H. Cheng, C.-S. Chin, W. Chow, G. de Lima Leonardo, M. Y. Dennis, P. C. Dishuck, R. Durbin, T. Dvorkina, I. T. Fiddes, G. Formenti, R. S. Fulton, A. Fungtammasan, E. Garrison, P. G. S. Grady, T. A. Graves-Lindsay, I. M. Hall, N. F. Hansen, G. A. Hartley, M. Haukness, K. Howe, M. W. Hunkapiller, C. Jain, M. Jain, E. D. Jarvis, P. Kerpedjiev, M. Kirsche, M. Kolmogorov, J. Korlach, M. Kremitzki, H. Li, V. V. Maduro, T. Marschall, A. M. McCartney, R. C. McCoy, D. E. Miller, J. C. Mullikin, E. W. Myers, B. Paten, P. Peluso, D. Porubsky, T. Potapova, E. I. Rogaev, J. A. Rosenfeld, S. L. Salzberg, V. A. Schneider, J. Sedlazeck Fritz, K. Shafin, C. J. Shew, A. Shumate, Y. Sims, D. C. Soto, I. Sović, A. Streets, B. A. Sullivan, F. Thibaud-Nissen, J. Torrance, J. Wagner, B. P. Walenz, Wood Jonathan M. D, C. Xiao, S. M. Yan, A. C. Young, U. Surti, I. A. Alexandrov, P. A. Pevzner, J. L. Gerton, R. J. O’Neill, W. Timp, J. M. Zook, M. C. Schatz, E. E. Eichler, K. H. Miga1, A. M. Phillippy, The complete sequence of a human genome. bioRxiv (2021).

2. I. T. Fiddes, J. Armstrong, M. Diekhans, S. Nachtweide, Z. N. Kronenberg, J. G. Underwood, D. Gordon, D. Earl, T. Keane, E. E. Eichler, D. Haussler, M. Stanke, B. Paten, Comparative Annotation Toolkit (CAT)-simultaneous clade and personal genome annotation. Genome Res. 28, 1029–1038 (2018).

3. A. Shumate, S. L. Salzberg, Liftoff: accurate mapping of gene annotations. Bioinformatics (2020), doi:10.1093/bioinformatics/btaa1016.

4. M. J. P. Chaisson, A. D. Sanders, X. Zhao, A. Malhotra, D. Porubsky, T. Rausch, E. J. Gardner, O. L. Rodriguez, L. Guo, R. L. Collins, X. Fan, J. Wen, R. E. Handsaker, S. Fairley, Z. N. Kronenberg, X. Kong, F. Hormozdiari, D. Lee, A. M. Wenger, A. R. Hastie, D. Antaki, T. Anantharaman, P. A. Audano, H. Brand, S. Cantsilieris, H. Cao, E. Cerveira, C. Chen, X. Chen, C.-S. Chin, Z. Chong, N. T. Chuang, C. C. Lambert, D. M. Church, L. Clarke, A. Farrell, J. Flores, T. Galeev, D. U. Gorkin, M. Gujral, V. Guryev, W. H. Heaton, J. Korlach, S. Kumar, J. Y. Kwon, E. T. Lam, J. E. Lee, J. Lee, W.-P. Lee, S. P. Lee, S. Li, P. Marks, K. Viaud-Martinez, S. Meiers, K. M. Munson, F. C. P. Navarro, B. J. Nelson, C. Nodzak, A. Noor, S. Kyriazopoulou-Panagiotopoulou, A. W. C. Pang, Y. Qiu, G. Rosanio, M. Ryan, A. Stütz, D. C. J. Spierings, A. Ward, A. E. Welch, M. Xiao, W. Xu, C. Zhang, Q. Zhu, X. Zheng-Bradley, E. Lowy, S. Yakneen, S. McCarroll, G. Jun, L. Ding, C. L. Koh, B. Ren, P. Flicek, K. Chen, M. B. Gerstein, P.-Y. Kwok, P. M. Lansdorp, G. T. Marth, J. Sebat, X. Shi, A. Bashir, K. Ye, S. E. Devine, M. E. Talkowski, R. E. Mills, T. Marschall, J. O. Korbel, E. E. Eichler, C. Lee, Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat. Commun. 10, 1784 (2019).

5. H. Li, X. Feng, C. Chu, The design and construction of reference pangenome graphs with minigraph. Genome Biol. 21, 265 (2020).

6. W. R. R., S. M. B., Z. J., H. K. E, Bandage: interactive visualisation of de novo genome assemblies. Bioinformatics. 31, 3350–3352 (2015).

7. S. Mallick, H. Li, M. Lipson, I. Mathieson, M. Gymrek, F. Racimo, M. Zhao, N. Chennagiri, S. Nordenfelt, A. Tandon, P. Skoglund, I. Lazaridis, S. Sankararaman, Q. Fu, N. Rohland,

23 G. Renaud, Y. Erlich, T. Willems, C. Gallo, J. P. Spence, Y. S. Song, G. Poletti, F. Balloux, G. van Driem, P. de Knijff, I. G. Romero, A. R. Jha, D. M. Behar, C. M. Bravi, C. Capelli, T. Hervig, A. Moreno-Estrada, O. L. Posukh, E. Balanovska, O. Balanovsky, S. Karachanak- Yankova, H. Sahakyan, D. Toncheva, L. Yepiskoposyan, C. Tyler-Smith, Y. Xue, M. S. Abdullah, A. Ruiz-Linares, C. M. Beall, A. Di Rienzo, C. Jeong, E. B. Starikovskaya, E. Metspalu, J. Parik, R. Villems, B. M. Henn, U. Hodoglugil, R. Mahley, A. Sajantila, G. Stamatoyannopoulos, J. T. S. Wee, R. Khusainova, E. Khusnutdinova, S. Litvinov, G. Ayodo, D. Comas, M. F. Hammer, T. Kivisild, W. Klitz, C. A. Winkler, D. Labuda, M. Bamshad, L. B. Jorde, S. A. Tishkoff, W. S. Watkins, M. Metspalu, S. Dryomov, R. Sukernik, L. Singh, K. Thangaraj, S. Pääbo, J. Kelso, N. Patterson, D. Reich, The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature. 538, 201– 206 (2016).

24