Performance benchmarking for calling and phasing of single-nucleotide polymorphisms and structural variants The length, accuracy and low bias of nanopore reads makes them ideally suited to the characterisation and phasing of structural variants and single-nucleotide polymorphisms across the entire Contact: [email protected] More information at: www.nanoporetech.com and publications.nanoporetech.com

a) c) d) e) Chr20 p13 p12.3 p12.2 p12.1 p11.23 p11.21 q11.23 q12 q13.12 q13.13 q13.2 q13.33

FASTQ Visualization Evaluation Overall Split by type and size 283 bp 54,123,200 bp 54,123,300 bp (q-score filtered) (IGV & Ribbon) (truvari v2.0 & GIAB) Deletions Insertions 95.47 97.53 Depth [0 - 58] 100% 100% 41,147

1600 Haplotype 1 41,268 cuteSV 41,005 41,099 LRA Filtering reads 41,344 (genotyping) 40,943 41,071 80% 80% 41,441 41,044 220 min 15 min Haplotype 2 41,344 1200 reads 41,096 40,595

b) 60% 60% f) Chr 1 p34.3 p31.2 p31.1 q12 q32.1 q41 q43 Detecting and characterising structural variation 800 43 kb

40% SV count (bars) 40% 100% 152,560 kb 152,570 kb 152,580 kb 152,590 kb Precision / Recall (lines) Precision Depth 80% Precision 400 20% 20% Haplotype 1 60% Recall reads

32,192 32,192 32,192 32,192 32,192 40% 0% 0 0% Haplotype 2 32,192 32,192 32,192 reads 32,192 0 5 10 15 20 25 30 35 40 Precision Recall -3000 -1000-750 -500 -250 0 250 500 750 10003000 32,192 32,192 32,192 Read depth Metric Size [bp] LCE3D LCE3C LCE3B

Fig. 1 Calling structural variants (SV) with nanopore reads a) bioinformatics workflow b) and c) precision and recall d) calling different SV classes e) calling very long insertions and deletions The read lengths enabled by nanopore sequencing make it feasible to call long structural variants with high sensitivity and specificity, even in difficult regions of the Our newest version of the SV analysis pipeline benefits from LRA’s improved performance in aligning long reads in areas of structural variation and repeats. Furthermore, cuteSV improves the exact identification of break points, alternative allele sequences and variant genotypes compared to older tools. To validate this pipeline we called SVs in our publicly available 50x HG002 dataset and compared to the most recent GIAB truth-set. We found high overall performance with >95% precision and 97.5% recall (Figs. 1b and 1c). The size distribution of the detected SVs clearly shows the characteristic ALU peak around 300 bp and LINE1 peak above 5,000 bp that is expected in human samples. Low-complexity short tandem repeats are known to be under-represented in the current human reference genome and this is seen as a higher number of insertions than deletions in our data. We found that high precision is independent of read depth. Recall is more directly influenced by read depth with 30x being required for identifying >95% of all variants. However, even with as little as 15x >90% are detected. Performance is largely independent of SV size (Fig. 1d); even very large SVs are called with base-pair precision (Fig. 1e and f). These examples show SV-calling combined with read phasing: Fig. 1e shows a homozygous 40 kb insertion which has been captured in single reads from each haplotype and Fig. 1f shows a 30 kb heterozygous deletion that covers LCE3B and LCE3C. A deletion of those two is known to be a susceptibility factor for .

a) FDA Precision Challenge data, HG002 c) Overall Split by read length e) Chr 6 p22.3 p12.3 q12 q14.1 q16.1 q21 q22.31 100 10 ONT whole genome ONT MHC region 4,962 kb (2.5 Gb, 55x) (6 Mb Gb, 55x) 84.3 80 8 29,000 kb 30,000 kb 31,000 kb 32,000 kb 33,000 kb 100 100 [0-79] 98 98 60 6 Depth Recall 96 96 40 4 MHC Precision pairs (%) LINC00533 HCG15 OR14J1 LINC01015 IFITM4P HCG9 HCG17 HLA-E PPP1R18 SFTA2 C6orf15 HLA-B NFKBIL1 DDAH2 C2 C4A PPT2 TSBP1-AS1 HLA-DRB6 HLA-DMA COL11A2 KIFC1 SNP SNP 94 94 F1 20 15.7 2 Sequenced base 92 92 0 0 ~7.4 kb 0 29,941,000 bp 29,943,000 bp 29,945,000 bp 29,947,000 bp 250050007500 90 90 100001250015000175002000022500250002750030000325003500037500400004250045000 Phased [0-70] Unphased Depth b) Phase-block length N50 for different read lengths (HG002) d) Haplotypes covered with >5 phased reads 3,000 Haplotype 1 2,500 reads 2,000

1,500 Haplotype 1 1,000 length (kbp) Haplotype 2 Haplotype 2 500 Unphased reads

Expected phasing block Reference gaps 0 Control g-Tube SRE+MRIII SRE+MRIII SRE+MRIII SRE+MRIII (25 kb) (11 kb) (17 kb) (26 kb) (32 kb) (36 kb) HLA-A ~3.5 kb Method of shearing Chr 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 M X Y

Fig. 2 SNP-calling a) recall and precision using publicly available datasets b)-d) phasing e) phasing genes in the major incompatibility complex The combination of long reads with highly sensitive and specific calling of single-nucleotide polymorphisms simplifies phasing and results in exceptionally long phase-blocks To test SNP-calling performance we used DeepVariant on the publicly available FDA Precision Challenge HG002 dataset. Results show high precision (>99.7%) and recall (>99.5%) for substitutions across the whole genome, including highly repetitive regions, segmental duplications, and the MHC region (Fig. 2a). The length of nanopore reads can be leveraged to assign SNP, SVs, and methylation calls to their homologous of origin. To illustrate phasing on a typical dataset, we used whatshap on an internal 40x HG002 dataset with a read N50 of 23 kb. By using information from overlapping long reads the resulting phase blocks can span several tens of megabases of the genome, typically only interrupted by long stretches of homozygosity (Fig. 2b). Phase-block lengths are influenced by read length. For 10-15 kb reads we observe a phase-block length of between 100-200 kb, sufficient to fully phase roughly 60% of all bodies. Increasing read length to >30 kb causes median phase-block length to increase tenfold, exceeding 1.5 Mb, allowing full phasing of >90% of all genes in the human genome. A high percentage of phased reads allows phasing of small variants, SVs and methylation calls. We used whatshap haplotag to scan raw reads for the presence of phased variants. With a read N50 of 23 kb we were able to assign >85% of all sequenced nucleotides to a haplotype (Fig. 2c). We found that >98% of the entire human genome (excluding centromeres and telomeres) is covered by phased reads (Fig. 2d). As a demonstration of phasing performance, we choose the MHC region, where we were able to cover 99% of the 6 Mb locus with 5 phase blocks and fully phase >99% of all genes (Fig. 2e). Oxford Nanopore Technologies and the Wheel icon are registered trademarks of Oxford Nanopore Technologies in various countries. All other brands are the property of their respective owners. © 2020 Oxford Nanopore Technologies. All rights reserved. Oxford Nanopore Technologies’ products are for research use only. P20005 - Version 1