COMPUTATIONAL METHODS ADDRESSING GENETIC VARIATION IN NEXT-GENERATION SEQUENCING DATA

by
Charlotte A. Darby

A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy

Baltimore, Maryland
June 2020

© 2020 Charlotte A. Darby
All rights reserved

Abstract

Computational genomics involves the development and application of computational methods for whole-genome-scale datasets to gain biological insight into the composition and function of genomes, including how genetic variation mediates molecular phenotypes and disease. New biotechnologies such as next-generation sequencing generate genomic data on a massive scale and have transformed the field thanks to simultaneous advances in the analysis toolkit. In this thesis, I present three computational methods that use next-generation sequencing data, each of which addresses the genetic variations within and between human individuals in a different way.

First, Samovar is a software tool for performing single-sample mosaic single-nucleotide variant calling on whole genome sequencing linked read data. Using haplotype assembly of heterozygous germline variants, uniquely made possible by linked reads, Samovar identifies variations in different cells that make up a bulk sequencing sample. We apply it to 13 cancer samples in collaboration with researchers at Nationwide Children’s Hospital.

Second, scHLAcount is a software pipeline that computes allele-specific molecule counts for the HLA genes from single-cell gene expression data. We use a personalized reference genome based on the individual’s genotypes to reveal allele-specific and cell type-specific gene expression patterns. Even given technology-specific biases of single-cell gene expression data, we can resolve allele-specific expression for these genes since the alleles are often quite different between the two haplotypes of an individual.
Third, Vargas implements an optimal algorithm for edit distance alignment of sequencing reads to variant graph or linear reference genomes. We use these alignments to assess and improve the accuracy of several popular read alignment algorithms that use suboptimal (heuristic) algorithms.

These methodological innovations enhance the capability of biotechnologies such as linked reads, single-cell gene expression, whole-genome sequencing, and RNA sequencing to reveal biological insights.

Thesis Committee

Ben Langmead (Advisor)
Associate Professor
Department of Computer Science
Johns Hopkins Whiting School of Engineering

Michael Schatz (Advisor)
Bloomberg Associate Professor
Department of Biology
Johns Hopkins Krieger School of Arts & Sciences
Department of Computer Science
Johns Hopkins Whiting School of Engineering

Steven Salzberg
Bloomberg Professor
Department of Biomedical Engineering
Department of Computer Science
Johns Hopkins Whiting School of Engineering

Acknowledgments

I am so grateful for the opportunity to have Ben Langmead and Mike Schatz as my PhD advisors. Thank you for mentoring me as a student and a researcher, and taking interest in my life and well-being outside of those roles. Thank you for sending me to conferences - I benefited so much from these opportunities to learn about research in the wider genomics community, be inspired by the newest discoveries in the field, and network. Internship and teaching opportunities were a big part of my PhD experience, and I appreciate your support for these plans.

Thanks to Steven Salzberg for rounding out my thesis committee. Thank you also for founding and facilitating the JHU Genomics joint lab meeting, where I have had the opportunity to present my work several times and gain a greater understanding of the diverse genomics research conducted across the university.

Jonathan Pevsner, Sarah Wheelan, Ben, Mike, and Steven, my GBO committee.
Thank you for your feedback that guided the development of these projects.

Ian Fiddes and Álvaro Martínez Barrio, my mentors at 10x Genomics, and Mark Kunitomi and Kun Hu, my mentors at IBM Research Almaden, and all my coworkers at both companies. Thank you for facilitating my internships and guiding me to explore new research directions.

All my coworkers in the Langmead and Schatz labs - the #malone_comp_bio crew. I have learned so much from all of you as scientists and as people. Thank you for your support and friendship.

Table of Contents

Abstract
Acknowledgments
List of Tables
List of Figures
1 Introduction
    1.1 Fundamental questions in genomics
    1.2 Thesis overview
    1.3 Other contributions
2 Samovar: Single-sample mosaic single-nucleotide variant calling with linked reads
    2.1 Background
    2.2 Samovar pipeline
    2.3 Simulated dataset
    2.4 Pediatric cancer dataset
    2.5 Discussion
3 scHLAcount: Allele-specific HLA expression from single-cell gene expression data
    3.1 Background
    3.2 scHLAcount pipeline
    3.3 CD8+ T cell dataset
    3.4 Acute myeloid leukemia dataset
    3.5 Merkel cell carcinoma dataset
    3.6 3’ versus 5’ GEX data
    3.7 Discussion
4 Vargas: heuristic-free alignment for assessing linear and graph read aligners
    4.1 Background
    4.2 Vargas software
    4.3 Alignment accuracy
    4.4 Mapping quality
    4.5 Optimizing alignment correctness of WGS reads
    4.6 Optimizing alignment correctness of ChIP-seq reads
    4.7 Discussion
5 Conclusion
References
Candidate Biography

List of Tables

2.1 Summary of somatic mutation callers
2.2 Random forest feature importances
2.3 Short-read-only feature importances
2.4 No-phasing feature importances
2.5 Precision and recall on simulated variants
2.6 Precision and recall on simulated variants with genomic filter
2.7 Summary and estimated precision: pediatric cancer
2.8 Pediatric cancer samples
2.9 Variant counts and estimated precision in pediatric cancer samples (normal)
3.1 Allele-resolved counts in CD8+ T cell datasets
3.2 Allele-resolved counts in AML datasets
3.3 AML subject 809653 - HLA-DRB1
3.4 AML subject 809653 - HLA-C
3.5 Allele-resolved counts in MCC datasets
3.6 MCC discovery subject
3.7 MCC validation subject
4.1 Summary of exact alignment algorithms
4.2 100bp read alignments
4.3 250bp read alignments
4.4 Vargas simulation experiment
4.5 Salmon RNA-seq read alignments
4.6 100bp WGS reads optimization
4.7 ChIP-seq reads optimization

List of Figures

1.1 Some applications of sequencing reads
2.1 Linked reads
2.2 Mosaic variant signatures in phased linked reads
2.3 Standard workflow
2.4 Simulation workflow
2.5 preFilter features
2.6 Random forest model features
2.7 postFilter example
2.8 Short-read-only model features
2.9 No-phasing model features
2.10 Precision and recall on simulated variants
2.11 Precision with different Samovar models
2.12 Precision with genomic filter
2.13 Estimated precision in pediatric cancer samples (tumor)
3.1 HLA nomenclature
3.2 scHLAcount pipeline
3.3 AML subject 809653
3.4 MCC validation subject
3.5 HLA read coverage in 3’ and 5’ GEX data
4.1 Edit distance
4.2 Vectorized graph alignment
4.3 SKX weak scaling (semiglobal)
4.4 KNL weak scaling (semiglobal)
4.5 SKX weak scaling (local)
4.6 KNL weak scaling (local)
4.7 Correctness-by-score of 100bp read alignments
4.8 Correctness-by-score of 250bp read alignments
4.9 Correct-by-location and -by-score for unique and repetitive 100bp read alignments
4.10 Correctness-by-score for Salmon RNA-seq read alignments
4.11 Mapping quality
Chapter 1
Introduction

1.1 Fundamental questions in genomics

Genomics encompasses a wide variety of research efforts into the composition and function of genomes. The existence of this field in its current form has been made possible by sequencing-based biotechnologies and the development of specialized computational methods. Genomics workflows combining experimental and computational techniques have been used to build reference genomes, identify variations between individuals and their consequences, and interrogate the function of genomic elements, among other applications (Figure 1.1). Some approaches take the consensus of results from many cells; others specifically address subpopulations of cells or even single cells.

Genome assembly

One of the foundational questions of genomics is to determine the genome sequence of an organism. No existing or currently feasible technology can directly report the millions to tens-of-billions of DNA bases that comprise an organism’s genome in order without error. Available DNA sequencing approaches report short snippets of hundreds to tens of thousands of bases, each of which is referred to as a read [1]. Sequencing also introduces errors; the type and rate of error depend on the technology. The problem of accurately and completely reconstructing the original genome sequence from reads is known as genome assembly.

Many algorithms exist for genome assembly, and the strategy varies depending on the type and quantity of input reads, and the quality of output assembly desired. Assemblies are evaluated in terms of correctness (how many errors they contain), completeness (how much of the total genome they contain), and contiguity (how many pieces they are broken into, compared to the number of chromosomes expected in the genome) [2]. Even the most well-optimized algorithms can be extremely expensive in terms of runtime, memory, and disk resources when dealing with large genomes and/or large read sets.
The general paradigm is to identify reads with shared sequence, overlap them to create a new longer sequence, and repeat. Challenges arise when the genome contains repeated sequence: to accurately reconstruct the genome with all the repeated sequences, the reads must be long enough to span the repeated region and anchor to unique sequence on either side.
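
The overlap-and-merge paradigm described above can be illustrated with a toy example. The following is a minimal sketch in Python of a greedy overlap-merge strategy, not the method of any particular assembler; the function names `overlap` and `greedy_assemble`, the `min_len` parameter, and the example reads are all hypothetical and chosen for illustration. It assumes error-free reads from a single strand, which real sequencing data do not satisfy.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix
    of `b`, or 0 if no such overlap has at least `min_len` bases."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the longest overlap
    until no pair overlaps by at least `min_len` bases.

    Illustrative assumptions: reads are error-free and all come from the
    same strand. Returns the list of remaining contigs."""
    # Remove exact duplicates, then reads fully contained in another read.
    contigs = list(dict.fromkeys(reads))
    contigs = [r for r in contigs
               if not any(r != other and r in other for other in contigs)]
    while True:
        best_k, best_pair = 0, None
        for a, b in permutations(contigs, 2):
            k = overlap(a, b, min_len)
            if k > best_k:
                best_k, best_pair = k, (a, b)
        if best_pair is None:
            return contigs  # nothing left to merge
        a, b = best_pair
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[best_k:])  # glue b onto a, skipping the overlap


if __name__ == "__main__":
    # Three hypothetical 7-base reads tiling a 15-base sequence that
    # contains no repeated 3-mer, so every overlap is unambiguous.
    reads = ["ATGGCGT", "CGTACCT", "CCTTAGA"]
    print(greedy_assemble(reads))  # prints ['ATGGCGTACCTTAGA']
```

In this toy case every overlap is unique, so the greedy choice recovers the original sequence as a single contig. When the underlying sequence contains a repeat longer than the reads, several overlaps of equal length compete and a greedy merge can join the wrong pair or collapse the repeat, which is the difficulty the paragraph above describes.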