UNIVERSITY of CALIFORNIA SAN DIEGO Haplotype Assembly And
Total Page:16
File Type:pdf, Size:1020Kb
UNIVERSITY OF CALIFORNIA SAN DIEGO Haplotype Assembly and Small Variant Calling using Emerging Sequencing Technologies A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science by Peter Joseph Edge Committee in charge: Professor Vikas Bansal, Chair Professor Vineet Bafna Professor Melissa Gymrek Professor Pavel Pevzner Professor Kun Zhang 2019 Copyright Peter Joseph Edge, 2019 All rights reserved. The dissertation of Peter Joseph Edge is approved, and it is acceptable in quality and form for publication on microfilm and electronically: Chair University of California San Diego 2019 iii DEDICATION To my parents, Chris and Karen, who always encouraged me to follow my passions. And to my grandpa, Ron, the original Dr. Edge. iv EPIGRAPH One never notices what has been done; one can only see what remains to be done. —Marie Curie However difficult life may seem, there is always something you can do, and succeed at. It matters that you don’t just give up. —Stephen Hawking I see this as an absolute win! —Bruce Banner v TABLE OF CONTENTS Signature Page . iii Dedication . iv Epigraph . .v Table of Contents . vi List of Figures . .x List of Tables . xii Acknowledgements . xiii Vita ............................................. xiv Abstract of the Dissertation . xv Chapter 1 Introduction . .1 1.1 The diploid human genome . .1 1.2 Advances in DNA sequencing technology . .2 1.3 Single nucleotide variant calling . .2 1.4 Limitations of second generation sequencing . .3 1.5 New technologies and new challenges . .4 1.6 Scope of the thesis . .5 Chapter 2 HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies . .7 2.1 Abstract . .7 2.2 Introduction . .8 2.3 Results . 11 2.3.1 Overview of HapCUT2 algorithm . 12 2.3.2 Comparison of runtimes on simulated data . 13 2.3.3 Comparison of methods on diverse WGS datasets for a single individual . 16 2.3.4 Comparison of haplotypes assembled using Hi-C and SMRT sequencing . 21 2.3.5 Considerations when haplotyping with Hi-C . 22 2.4 Discussion . 24 2.5 Methods . 27 2.5.1 Haplotype likelihood for sequence reads . 28 2.5.2 Likelihood-based HapCUT2 algorithm . 29 vi 2.5.3 Complexity of HapCUT2 . 31 2.5.4 Estimation of h-trans error probabilities in Hi-C data . 31 2.5.5 Post-processing of haplotypes . 32 2.5.6 Accuracy and completeness of haplotype assemblies . 33 2.5.7 Long read datasets and haplotype assembly tools . 33 2.5.8 Variant calls and haplotypes for NA12878 . 34 2.5.9 Alignment and processing of Hi-C data . 34 2.5.10 Read simulations . 35 2.6 Software availability . 35 2.7 Disclosure declaration . 35 2.8 Acknowledgments . 36 2.9 Tables . 37 2.10 Figures . 38 Chapter 3 Computational techniques for highly accurate variant calling and haplotyp- ing of single human cells . 41 3.1 Abstract . 41 3.2 Introduction . 42 3.3 Results . 44 3.3.1 SNV calling algorithm . 44 3.3.2 Haplotype Assembly . 45 3.3.3 Strand-to-strand matching for improved SNV accuracy . 47 3.4 Discussion . 48 3.5 Methods . 49 3.5.1 SNV calling algorithm . 49 3.5.2 Haplotype assembly . 56 3.5.3 Accuracy of haplotypes . 57 3.5.4 Same haplotype strand pairing . 58 3.5.5 Accuracy of SNV calling . 59 3.5.6 Workflow management . 59 3.6 Acknowledgements . 59 3.7 Figures and Tables . 60 Chapter 4 Longshot enables accurate variant calling in diploid genomes from single- molecule long read sequencing . 67 4.1 Abstract . 67 4.2 Introduction . 68 4.3 Results . 71 4.3.1 Overview of method . 71 4.3.2 Accurate SNV calling using simulated data . 72 4.3.3 Accurate SNV calling using whole-genome PacBio data . 73 4.3.4 Accuracy of Longshot haplotypes . 76 4.3.5 SNV calling using Oxford Nanopore reads . 77 vii 4.3.6 Analysis of SNV calls in repetitive regions . 77 4.4 Discussion . 79 4.5 Methods . 82 4.5.1 Identification of candidate SNVs . 82 4.5.2 Local realignment using pair-HMMs . 82 4.5.3 Haplotype-informed genotyping . 83 4.5.4 Variant filtering . 85 4.5.5 Simulations . 86 4.5.6 Whole-Genome Sequencing data . 86 4.5.7 Assessment of variant calling and phasing accuracy . 87 4.5.8 Server configuration . 88 4.6 Data Availability . 88 4.7 Code availability . 90 4.8 Author Contributions . 90 4.9 Competing Interests . 90 4.10 Acknowledgments . 90 4.11 Figures and Tables . 92 Appendix A Supplemental Material for Chapter 2 . 97 A.1 Supplemental Methods for Chapter 2 . 98 A.1.1 Maximum Likelihood cut heuristic . 98 A.1.2 Implementation of HapCUT2 . 98 A.1.3 Likelihood-based variant pruning . 99 A.1.4 Block Splitting . 100 A.1.5 Estimating t(I) for Hi-C reads . 101 A.1.6 Extraction of haplotype informative reads . 104 A.1.7 Post processing of alignments for Hi-C reads . 104 A.1.8 Experiment and Pipeline Management . 105 A.2 Acknowledgments . 105 A.3 Supplemental Figures and Tables for Chapter 2 . 106 Appendix B Supplemental Material for Chapter 4 . 115 B.1 Supplemental Methods for Chapter 4 . 116 B.1.1 Simulating a diploid genome . 116 B.1.2 Estimating coverage from aligned reads . 116 B.1.3 Identification of candidate SNVs . 116 B.1.4 Finding non-repetitive anchors . 117 B.1.5 Pair-HMM realignment for clusters of SNVs . 118 B.1.6 Priors on genotypes . 119 B.1.7 Haplotyping and measuring accuracy . 119 B.1.8 Separation of reads by haplotype . 120 B.1.9 Alignment Parameter Estimation . 121 B.1.10 Variant calling using Clairvoyante and WhatsHap . 122 viii B.1.11 Variant calling using Nanopolish . 123 B.2 Acknowledgments . 123 B.3 Supplemental Figures and Tables for Chapter 4 . 125 Bibliography . 139 ix LIST OF FIGURES Figure 2.1: Comparison of runtime (top panel) and switch+mismatch error rate (bottom panel) for HapCUT2 with four methods for haplotype assembly (HapCUT, RefHap, ProbHap, and FastHare) on simulated read data . 38 Figure 2.2: Accuracy of HapCUT2 compared to four other methods for haplotype assem- bly on diverse whole-genome sequence datasets for NA12878 . 39 Figure 2.3: Accuracy of HapCUT2 compared to four other methods for haplotype assem- bly on diverse whole-genome sequence datasets for NA12878 . 40 Figure 2.4: Improvements in the (A) completeness and (B) accuracy (switch + mismatch error rates) of the largest haplotype block with increasing Hi-C sequencing coverage for two different restriction enzymes: MboI and HindIII. 40 Figure 3.1: An overview of the experimental process of SISSOR technology. 63 Figure 3.2: Overview of the SISSOR variant calling algorithm. 64 Figure 3.3: Strategies used to remove haplotype errors from SISSOR fragments. 65 Figure 3.4: Accuracy of haplotypes assembled with SISSOR with and without fragment processing techniques. 66 Figure 4.1: Overview of the Longshot algorithm. 92 Figure 4.2: Accuracy and completeness of Longshot SNV calls on whole-genome SMS data. 93 Figure 4.3: Accurate variant calling using SMS reads and Longshot in the duplicated gene STRC................................... 94 Figure A1: An expanded version of Figure 2.1 with shaded areas added to represent the standard deviation of the 10 replicate experiments. 106 Figure A2: Comparison of the performance of HapCUT2 with other tools on NA12878 fosmid data across all chromosomes. 107 Figure A3: Comparison of the.