Comprehensive Variant Detection in a Human Genome with Pacbio High-Fidelity Reads William J

Comprehensive Variant Detection in a Human Genome with Pacbio High-Fidelity Reads William J

Comprehensive variant detection in a human genome with PacBio high-fidelity reads William J. Rowell, Paul Peluso, John Harting, Yufeng Qian, Aaron Wenger, Richard Hall, David R. Rank PacBio, 1305 O’Brien Drive, Menlo Park, CA 94025 Abstract Mapping Clinically Relevant Genes Figure 2. High-Fidelity reads have (a) Figure 5. Structural Human genomic variations range in size from single nucleotide substitutions to large improved ‘mappability’ compared to short variant detection. High- chromosomal rearrangements. Sequencing technologies tend to be optimized for reads. (a) The majority of high-fidelity reads fidelity reads can be more detecting particular variant types and sizes. Short reads excel at detecting SNVs and can be mapped at MAQP60. (b) 15 kb high- easily aligned in repetitive small indels, while long or linked reads are typically used to detect larger structural fidelity reads are mapped with high regions, extending SV variants or phase distant loci. Long reads are more easily mapped to repetitive confidence to the 123 kb region centered on detection into previously regions, but tend to have lower per-base accuracy, making it difficult to call short CYP2D6, which has several large segmental difficult-to-map regions like variants. The PacBio Sequel System produces two main data types: long continuous duplications. simple repeats in the reads (up to 100 kbp), generated by single passes over a long template, and Circular (b) Consensus Sequence (CCS) reads, generated by calculating the consensus of many AUTS2 locus (left panel), sequencing passes over a single shorter template (500 bp to 20 kbp). The long-range implicated in autism, and information in continuous reads is useful for genome assembly and structural variant the CNTNAP2 locus (right detection. The higher base accuracy of CCS effectively detects and phases short panel), implicated in variants in single molecules. epilepsy. Recent improvements in library preparation protocols and sequencing chemistry have increased the length, accuracy, and throughput of CCS reads. For the human sample HG002, we collected 28-fold coverage 15 kbp high-fidelity CCS reads with an Figure 6. Short variant average read quality above Q20 (99% accuracy). The length and accuracy of these Structural Variant Calling detection in reads allow us to detect SNVs, indels, and structural variants not only in the Genome PMS2/PMS2CL. High- in a Bottle (GIAB) high confidence regions, but also in segmental duplications, HLA fidelity reads can be Sequencing assigned to the correct loci, and clinically relevant “difficult-to-map” genes. (a) (b) Figure 3. Structural 5 Mb 3 Mb 10 Mb High-Fidelity 15 kbp reads locus of the gene- As with continuous long reads, we call structural variants at 90.0% recall compared to variant detection. PacBio Sequel System pseudogene pair PMS2 (left the GIAB structural variant benchmark “truth” set, with the added advantages of base SNVs indels structural variants (a) Structural variants of 1-49 bp ≥50 bp PMS2CL pair resolution for variant calls and improved recall at compound heterozygous loci. Read Mapping 20 bp and larger can be panel) and (right minimap2 detected (b) with panel). Using a simple With minimap2 alignments, GATK4 HaplotypeCaller variant calls, and simple variant GATK workflow, short Variant Calling minimap2 alignment filtration, we have achieved a SNP F-Score of 99.51% and an INDEL F-Score of variants can be detected 80.10% against the GIAB short variant benchmark “truth” set, in addition to calling pbsv followed by pbsv analysis. (c) ~39,000 over the majority of the variants outside of the high confidence region established by GIAB using previous Visualization difficult-to-map PMS2 locus, technologies. With the long-range information available in 15 kbp reads, we applied deletion insertion inversion translocation IGV variants are detected in HG002, and (d) we a gene associated with the read-backed phasing tool WhatsHap to generate phase blocks with a mean length Lynch syndrome. of 65 kbp across the entire genome. Using an alignment-based approach, we typed achieve high recall and precision compared to all major MHC class I and class II genes to at least 3-field precision. (c) (a) the GIAB SV Tier1 v0.6 This new data type has the potential to expand the GIAB high confidence regions and benchmark. “truth” benchmark sets to many previously difficult-to-map genes and allow a single (d) sequencing protocol to address both short variants and large structural variants. Recall 90.0% (b) Locus Allele 1 Allele 2 Precision 94.0% HLA-A A*01:01:01:01 A*26:01:01 Sample Preparation and Primary Analysis HLA-B B*35:08:01 B*38:01:01:01 HLA-C C*04:01:01 C*12:03:01 HLA-DPA1 DPA1*01:03:01:02 DPA1*01:03:01:04 HLA-DPB1 DPB1*04:01:01:01 *DPB1*04:01:01 Short Variant Calling HLA-DQA1 DQA1*01:05:01 DQA1*03:01:01 HG003 HG004 (a) 3-4 µg DNA HLA-DQB1 DQB1*05:01:01 DQB1*03:02:01 (GIAB HG002) HLA-DRB1 DRB1*04:03:01 DRB1*10:01:01:03 HG002 Sequencing (a) (b) GIAB ‘truth’ GATK HC HLA-DRB4 DRB4*01:03:01 - 20 kbp shear (Megaruptor) High-Fidelity 15 kbp reads Benchmark on CCS PacBio Sequel System 13,438 (FN) 3,047,837 16,248 (FP) (TP) SMRTbell adapter ligation Read Mapping Figure 7. HLA typing. (a) High-fidelity reads are easily aligned to MHC class I and II minimap2 99.56% recall genes, such as HLA-DQB1, and (b) allow typing to three fields or greater. 15 kbp band size selection (SageELF) 99.47% precision Variant Calling polymerase binding GATK4 HaplotypeCaller (c) Conclusions 12 hr pre-extension Filtering GATK4 Filter Variants 24 hr movie • High-fidelity long reads can be mapped definitively to more parts of the genome Phasing than short reads, including segmental duplications, tandem repeats, structural WhatsHap variants, and highly divergent sequences. Visualization (b) (c) • Structural variant detection with PBSV is similar to that of continuous long reads IGV (CLR), with the added benefit of base pair resolution of breakpoints. ~88% accuracy • Because of the increased accuracy, high-fidelity long reads can be used with existing workflows to detect short variants with comparable SNP F1 score. The >99% accuracy (QV20) majority of indel miscalls are 1bp in length, and precision can be improved by (e) splitting indels by size before applying filters. (d) • High-fidelity long reads give users the ability to detect both short variants and (d) structural variants with the same dataset. CYP2D6 (f) References • From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Van der Auwera GA, Carneiro M, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella K, Altshuler D, Gabriel S, DePristo M, 2013 CURRENT Figure 1. Generation of high-fidelity long reads. (a) Following shearing and Figure 4. Short variant detection and phasing. (a) High-fidelity reads can be used PROTOCOLS IN BIOINFORMATICS 43:11.10.1-11.10.33 SMRTbell library creation, the SageELF system was used to select the 15 kbp band, with pre-existing short variant callers, (b) achieving SNP recall and precision • WhatsHap: fast and accurate read-based phasing. Marcel Martin, Murray which was sequenced for 20 hours on the Sequel System. (b) High-fidelity reads comparable with short read technology at similar depths. (c) Unfiltered indels are Patterson, Shilpa Garg, Sarah O. Fischer, Nadia Pisanti, Gunnar W. Klau, were generated as the consensus of multiple sequencing passes over single dominated by a low quality peak corresponding to 1 bp indels, which can be Alexander Schoenhuth, Tobias Marschall. bioRxiv 085050; doi: SMRTbell templates. (c) Q20+ CCS reads from a representative single Sequel addressed by splitting indels by size before applying filters. (d) Distribution of https://doi.org/10.1101/085050 SMRT Cell. (d) High-fidelity reads have a tight read length distribution around 14 WhatsHap read-backed phase block lengths. (e) 500 kbp phase blocks covering the • http://www.broadinstitute.org/igv kbp, and read qualities approaching Q30. difficult-to-sequence CYP2D6 locus on chromosome 22. (f) Phasing of the HLA loci. • https://www.pacb.com/support/software-downloads/ For Research Use Only. Not for use in diagnostic procedures. © Copyright 2018 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. Fragment Analyzer is a trademark of Advanced Analytical Technologies. All other trademarks are the sole property of their respective owners..

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    1 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us