Population-Scale Discovery of Structural Variants with PacBio SMRT Sequencing Ralph Vogelsang, Aaron Wenger, Luke Hickey, Armin Töpfer, Yuan Li, Jonas Korlach, Mike Hunkapiller PacBio, 1305 O’Brien Drive, Menlo Park, CA 94025

Introduction Goal: Population-Scale SV Discovery Active and Future Projects

Structural variants (SVs) – genomic Existing population-scale databases of differences ≥50 base pairs – are few by small variants must be extended with count compared to single nucleotide PacBio SMRT sequencing to add sensitivity variants (SNVs) and but include for structural variants. Initial efforts to do so most of the base pairs that differ between are using the same low-coverage study two humans. design as the 1000 Project.

SNVs indels 50bp structural variants 100% 100% 75% Population-Specific Reference Structural Variant Discovery 60% 80% variant frequency (de novo assembly) (Low coverage) 50% base pairs affected 60% 0.10% 25% count cumulative 0.5% 0.50% 0% 40% 1% Numerous efforts across the globe are 1 10 100 1,000 10,000 100,000 5% 20% applying PacBio long-read sequencing to variant size (bp) %SVs discovered 10% build population-specific reference genomes 0% Figure 1. Frequency and size of variants in a human 1 10 100 1,000 and to characterize structural variation, – structural variant, , and single nucleotide # individuals sequenced (5-fold each) which is largely missed by short-read variant calls in HG00733 against GRCh38 from multiple sequencing technologies.1 60% of the total base pairs sequencing. These databases will Figure 3. Power to discover structural variants with come from the 0.5% of variants that are structural (≥50 bp). empower rare and common disease joint calling, by cohort size. In an idealized population studies for SVs, just as ExAC has done for SVs cause rare and common diseases, with no substructure, the power to detect SVs of different population frequencies depends on the number of small variants. and contribute to human traits and individuals sequenced. Common variant discovery evolution. SVs are suspected to explain saturates with few individuals. Rarer variants require PacBio Projects many still unsolved diseases. larger cohorts. (pacb.com/calculator-structural-variation) Structural variation (5-fold) De novo assembly (50-fold) Diseases Caused by SVs OMIM Mendelian Phenotypes Methods Carney complex

Poor drug metabolism prepare library sequence call variants Breast & ovarian cancer Unsolved Neurofibromatosis 3,351 SMRTbell Solved Sequel pbsv Chronic myeloid leukemia 5,236 Express System 2015 2016 2017 2018 2019 2020s Figure 4. Workflow to detect SVs from PacBio long reads.

Most human SVs are detected only by A Conclusion PacBio long read SMRT Sequencing. discover variants (joint) per sample Find SV Cluster SV Filter by Summarize Genotype - Variant databases built with short-read PacBio signatures signatures support consensus sequencing miss most SVs, and thus Illumina Deletions most of the variant base pairs, in the Bionano Insertions B human population. 0 5,000 10,000 15,000 20,000 25,000 30,000 structural variants detected in a human genome - Progress in library prep, throughput, and Figure 2. Sensitivity for structural variants in a human variant calling support the application of genome by technology. Most SVs in HG00733 were low-coverage PacBio sequencing for detected only by PacBio because many SVs involve population-scale SV discovery. repeats that are too large to be spanned by short reads but are too small to be detected by optical mapping.1 Cluster - Active projects are building databases of

2,3 common SVs to support studies of rare The applied 280 bp Variant Call and common diseases. C low-coverage (4-fold) short-read Solo Joint sequencing of many individuals to identify common small variants. This was Individual #1 - REF/ALT References effective, but studies that use short-read - REF/REF sequencing are blind to most SVs. Individual #2 1. Chaisson MJ, et al. (2017). Multi-platform discovery of haplotype-resolved structural variation in human REF/ALT REF/ALT Individual #3 genomes. bioRxiv. doi:10.1101/193144. 2. 1000 Genomes Project Consortium. (2015). A global Nature Figure 5. pbsv joint calling identifies structural variants reference for human . . simultaneously in multiple samples. (A) Joint calling 526(7571):68-74. pools reads from samples to discover variants then 3. 1000 Genomes Project Consortium. (2010). A map of genotypes per sample, increasing sensitivity at low human genome variation from population-scale coverage. (B) To call structural variants, pbsv identifies large sequencing. Nature. 467(7319):1061-73. deletion or signatures, clusters nearby signatures, 4. Telenti A, et al. (2016). Deep sequencing of 10,000 and summarizes into a call. (C) Joint calling uses support in human genomes. PNAS. 113(42):11901-6. the population to call variants in each sample. Thank you to David Scherer for poster production support.

For Research Use Only. Not for use in diagnostics procedures. © Copyright 2018 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. FEMTO Pulse and Fragment Analyzer are trademarks of Advanced Analytical Technologies. All other trademarks are the sole property of their respective owners.