COMPUTATIONAL METHODS ADDRESSING GENETIC VARIATION IN NEXT-GENERATION SEQUENCING DATA

by

Charlotte A. Darby

A dissertation submitted to in conformity with the requirements for the degree of

Baltimore, Maryland June 2020

© 2020 Charlotte A. Darby

All rights reserved

Abstract

Computational genomics involves the development and application of computational methods for whole-genome-scale datasets to gain biological insight into the composition and function of genomes, including how genetic variation mediates molecular phenotypes and disease.

New biotechnologies such as next-generation sequencing generate genomic data on a massive scale and have transformed the field thanks to simultaneous advances in the analysis toolkit. In this thesis, I present three computational methods that use next-generation sequencing data, each of which addresses the genetic variation within and between human individuals in a different way. First, Samovar is a software tool for performing single-sample mosaic single-nucleotide variant calling on whole genome sequencing linked read data. Using haplotype assembly of heterozygous germline variants, uniquely made possible by linked reads, Samovar identifies variations in different cells that make up a bulk sequencing sample. We apply it to 13 cancer samples in collaboration with researchers at Nationwide Children's Hospital. Second, scHLAcount is a software pipeline that computes allele-specific molecule counts for the HLA genes from single-cell gene expression data. We use a personalized reference genome based on the individual’s genotypes to reveal allele-specific and cell type-specific gene expression patterns. Even given technology-specific biases of single-cell gene expression data, we can resolve allele-specific expression for these genes since the alleles are often quite different between the two haplotypes of an individual. Third, Vargas implements an optimal algorithm for edit distance alignment of sequencing reads to variant graph or linear reference genomes. We use these alignments to assess and improve the accuracy of several popular read aligners that rely on suboptimal (heuristic) algorithms. These methodological innovations enhance the capability of biotechnologies such as linked reads, single-cell gene expression, whole-genome sequencing, and RNA sequencing to reveal biological insights.

Thesis Committee

Ben Langmead (Advisor) Associate Professor Department of Computer Science Johns Hopkins Whiting School of Engineering

Michael Schatz (Advisor) Bloomberg Associate Professor Department of Biology Johns Hopkins Krieger School of Arts & Sciences Department of Computer Science Johns Hopkins Whiting School of Engineering

Steven Salzberg Bloomberg Professor Department of Biomedical Engineering Department of Computer Science Johns Hopkins Whiting School of Engineering

Acknowledgments

I am so grateful for the opportunity to have Ben Langmead and Mike Schatz as my PhD advisors. Thank you for mentoring me as a student and a researcher, and taking interest in my life and well-being outside of those roles. Thank you for sending me to conferences - I benefited so much from these opportunities to learn about research in the wider genomics community, be inspired by the newest discoveries in the field, and network. Internship and teaching opportunities were a big part of my PhD experience, and I appreciate your support for these plans.

Thanks to Steven Salzberg for rounding out my thesis committee. Thank you also for founding and facilitating the JHU Genomics joint lab meeting, where I have had the opportunity to present my work several times and gain a greater understanding of the diverse genomics research conducted across the university.

Jonathan Pevsner, Sarah Wheelan, Ben, Mike, and Steven, my GBO committee. Thank you for your feedback that guided the development of these projects.

Ian Fiddes and Álvaro Martínez Barrio, my mentors at 10x Genomics, and Mark Kunitomi and Kun Hu, my mentors at IBM Research Almaden, and all my coworkers at both companies. Thank you for facilitating my internships and guiding me to explore new research directions.

All my coworkers in the Langmead and Schatz labs - the #malone_comp_bio crew. I have learned so much from all of you as scientists and as people. Thank you for your support and friendship.

Table of Contents

Abstract

Acknowledgments

List of Tables

List of Figures

1 Introduction
    1.1 Fundamental questions in genomics
    1.2 Thesis overview
    1.3 Other contributions

2 Samovar: Single-sample mosaic single-nucleotide variant calling with linked reads
    2.1 Background
    2.2 Samovar pipeline
    2.3 Simulated dataset
    2.4 Pediatric cancer dataset
    2.5 Discussion

3 scHLAcount: Allele-specific HLA expression from single-cell gene expression data
    3.1 Background
    3.2 scHLAcount pipeline
    3.3 CD8+ T cell dataset
    3.4 Acute myeloid leukemia dataset
    3.5 Merkel cell carcinoma dataset
    3.6 3’ versus 5’ GEX data
    3.7 Discussion

4 Vargas: heuristic-free alignment for assessing linear and graph read aligners
    4.1 Background
    4.2 Vargas software
    4.3 Alignment accuracy
    4.4 Mapping quality
    4.5 Optimizing alignment correctness of WGS reads
    4.6 Optimizing alignment correctness of ChIP-seq reads
    4.7 Discussion

5 Conclusion

References

Candidate Biography

List of Tables

2.1 Summary of somatic mutation callers
2.2 Random forest feature importances
2.3 Short-read-only feature importances
2.4 No-phasing feature importances
2.5 Precision and recall on simulated variants
2.6 Precision and recall on simulated variants with genomic filter
2.7 Summary and estimated precision: pediatric cancer
2.8 Pediatric cancer samples
2.9 Variant counts and estimated precision in pediatric cancer samples (normal)

3.1 Allele-resolved counts in CD8+ T cell datasets
3.2 Allele-resolved counts in AML datasets
3.3 AML subject 809653 - HLA-DRB1
3.4 AML subject 809653 - HLA-C
3.5 Allele-resolved counts in MCC datasets
3.6 MCC discovery subject
3.7 MCC validation subject

4.1 Summary of exact alignment algorithms
4.2 100bp read alignments
4.3 250bp read alignments
4.4 Vargas simulation experiment
4.5 Salmon RNA-seq read alignments
4.6 100bp WGS reads optimization
4.7 ChIP-seq reads optimization

List of Figures

1.1 Some applications of sequencing reads

2.1 Linked reads
2.2 Mosaic variant signatures in phased linked reads
2.3 Standard workflow
2.4 Simulation workflow
2.5 preFilter features
2.6 Random forest model features
2.7 postFilter example
2.8 Short-read-only model features
2.9 No-phasing model features
2.10 Precision and recall on simulated variants
2.11 Precision with different Samovar models
2.12 Precision with genomic filter
2.13 Estimated precision in pediatric cancer samples (tumor)

3.1 HLA nomenclature
3.2 scHLAcount pipeline
3.3 AML subject 809653
3.4 MCC validation subject
3.5 HLA read coverage in 3’ and 5’ GEX data

4.1 Edit distance
4.2 Vectorized graph alignment
4.3 SKX weak scaling (semiglobal)
4.4 KNL weak scaling (semiglobal)
4.5 SKX weak scaling (local)
4.6 KNL weak scaling (local)
4.7 Correctness-by-score of 100bp read alignments
4.8 Correctness-by-score of 250bp read alignments
4.9 Correct-by-location and -by-score for unique and repetitive 100bp read alignments
4.10 Correctness-by-score for Salmon RNA-seq read alignments
4.11 Mapping quality

Chapter 1

Introduction

1.1 Fundamental questions in genomics

Genomics encompasses a wide variety of research efforts into the composition and function of genomes. The existence of this field in its current form has been made possible by sequencing-based biotechnologies and the development of specialized computational methods. Genomics workflows combining experimental and computational techniques have been used to build reference genomes, identify variations between individuals and their consequences, and interrogate the function of genomic elements, among other applications (Figure 1.1). Some approaches take the consensus of results from many cells; others specifically address subpopulations of cells or even single cells.

Genome assembly

One of the foundational questions of genomics is to determine the genome sequence of an organism. No existing or currently feasible technology can directly report the millions to tens-of-billions of DNA bases that comprise an organism’s genome in order without error. Available DNA sequencing approaches report short snippets of hundreds to tens of thousands of bases, each of which is referred to as a read [1]. Sequencing also introduces errors; the type of error and rate depends on the technology. The problem of accurately and completely reconstructing the original genome sequence from reads is known as genome assembly.

Many algorithms exist for genome assembly, and the strategy varies depending on the type and quantity of input reads, and the quality of output assembly desired. Assemblies are evaluated in terms of correctness (how many errors they contain), completeness (how much of the total genome they contain), and contiguity (how many pieces they are broken into, compared to the number of chromosomes expected in the genome) [2]. Even the most well-optimized algorithms can be extremely expensive in terms of runtime, memory, and disk resources when dealing with large genomes and/or large read sets. The general paradigm is to identify reads with shared sequence, overlap them to create a new longer sequence, and repeat. Challenges arise when the genome contains repeated sequence: to accurately reconstruct the genome with all the repeated sequences, the reads must be long enough to span the repeated region and anchor to unique sequence on either side. An assembly may be based upon multiple sequencing technologies and complementary approaches to avoid the systematic biases or weaknesses of one particular assay. See [3] for a recent review of assembly algorithms and biotechnologies.

In the case of the human genome, efforts in the 1990s culminated in the first release of the human reference genome in 2001 [4, 5], which has been updated many times in the years since, but still has missing and un-placed sequence. This reference is a patchwork of a few individuals. Recently, a resequencing effort focusing on data generated from a single cell line, the Telomere to Telomere consortium, seeks a completed human genome and has released results for the X chromosome [6].
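The overlap-and-merge paradigm described above can be illustrated with a toy greedy assembler. This is a minimal sketch under strong assumptions (error-free reads, a single strand, no repeats, and quadratic-time overlap search); the function names and the `min_overlap` parameter are hypothetical and do not correspond to any particular assembler discussed in this thesis.

```python
from itertools import permutations


def overlap_length(a: str, b: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    max_len = min(len(a), len(b))
    for length in range(max_len, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(contigs) > 1:
        best_len, best_pair = 0, None
        for a, b in permutations(contigs, 2):
            length = overlap_length(a, b, min_overlap)
            if length > best_len:
                best_len, best_pair = length, (a, b)
        if best_pair is None:
            break  # no overlaps remain, so the assembly stays fragmented
        a, b = best_pair
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[best_len:])  # merge, keeping the overlap once
    return contigs
```

Reads that share no sufficiently long overlap are returned as separate contigs, which is the toy analogue of a fragmented (low-contiguity) assembly. A repeat longer than the reads would make the choice of merge ambiguous, which is the difficulty noted in the text.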

Alignment and variant calling

A genome assembly is an important starting point to understand the organization and function of a species’ genome. Furthermore, if a standardized assembly is made widely available as the “reference genome” for a species, it dramatically lowers the time and cost needed to study more individuals. While it is possible to attempt a genome assembly from scratch (de novo) for each individual studied, it saves time and resources to use a reference, given that the broad structure and base-level sequence is near-identical among individuals in a species. In reference-guided assembly, the reference genome can be used to order non-overlapping pieces of a fragmented assembly or fill in gaps between fragments. Another approach to analyzing a genome is to forgo assembly of the new individual altogether and match the sequencing reads from the new individual to their point of origin on the reference genome.

Since the reference genome is up to billions of characters long, and there may be millions to billions of sequencing reads in an experiment, algorithms addressing this read alignment problem face a daunting task, just like assembly. Alignment strategies need to address the fact that a read may not have an exact match in the reference due to errors in the reference, sequencing errors in the reads, or genetic variation, but it should be mapped to a location with few differences if one exists. Again, there are a host of different strategies. One common theme is to first find short exact matches between the read and reference using a pre-built index of the reference, then join or extend these short matches into candidate full alignments. The general problem is reviewed in [7] and many specific algorithmic strategies are discussed in [8].

After reads are aligned to the reference, variant calling is performed to identify single bases, small regions, and large regions where the individual sequenced differs from the reference [9]. Variant calling algorithms need to discriminate genetic variation from sequencing errors in the reads or artifacts introduced by the alignment algorithm. In general, less data is needed to perform a variant calling analysis than is needed for reference-guided or de novo assembly. If the organism has two or more homologous copies of a segment of DNA (e.g. humans are diploid, having one copy of the genome from each parent), haplotype assembly can also be performed based on reads overlapping multiple variants. Haplotype assembly is the reconstruction of variants on each copy of the homologous chromosomes [10]. Long reads and linked reads (discussed further in Chapter 2) are ideal for this task.
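The index-then-extend theme mentioned above for read alignment can be sketched as follows. This is an illustrative simplification, not the method of any specific aligner: the index is a plain k-mer hash table, extension is ungapped (mismatches only, no insertions or deletions), and the names `build_kmer_index`, `align_read`, and `max_mismatches` are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read, reference, index, k, max_mismatches=2):
    """Return (start, mismatches) of the best ungapped alignment, or None."""
    # Seed: exact k-mer matches propose candidate start positions.
    candidates = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # Extend: score the whole read at each candidate and keep the best.
    best = None
    for start in sorted(candidates):
        mismatches = hamming(read, reference[start:start + len(read)])
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best
```

Because only positions proposed by an exact seed are ever scored, a read whose every k-mer contains a difference from the reference is never aligned. This is one way a seed-based heuristic can miss an alignment that an exhaustive search would find.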

Genome and variant interpretation

Given data from many individuals, population-scale statistical studies can be performed to find correlations between particular genetic variations and phenotypes [11]. The phenotype in question can be on the cellular level (e.g. expression of a particular gene), the whole organism (e.g. height), or any scale in between. Correlations can also be found between genetic variation and risks or outcomes related to disease. Variant effects can also be explored on the individual level when paired with another class of interpretive analysis which seeks to annotate the function of different parts of the genome [12]. Assays beyond whole-genome sequencing are used in concert with the genome sequence. For example, data can be collected to determine the proximity of certain parts of DNA to each other in the nucleus; whether a segment of DNA is in open or closed chromatin; what DNA binds to certain proteins; and which regions are transcribed to RNA or translated to protein. Using functional annotations of the genome, it may be possible to determine the effect of a particular variant on an individual or the variants causing a particular phenotype, especially if the variant in question affects a protein or the phenotype is very severe [13]. It is more difficult to associate variants and phenotypes without a large sample size if the variant is non-coding and mediates phenotypes through gene expression or other regulatory effects [14].

Single-cell analysis

The same questions about genome sequence, genome function and cell behavior can be assayed at higher resolution using experimental methods that interrogate individual cells. Single cell assays have been used to explore cell differentiation, tumor heterogeneity, immunology, and many other applications that benefit from capturing diversity at a cellular level. DNA sequencing of single cells, typically performed at low coverage, can reveal broad strokes of genome heterogeneity by measuring copy number variation [15]. This technique is most commonly applied to cancer samples, where different subclones of a tumor over time or in different locations in the body may have loss or gain of large segments of DNA.

The transcriptome of single cells can also be measured using a number of technologies, discussed further in Chapter 3 [16]. Different cells express different genes according to the function of that cell type in the organism. Some available technologies are easier than performing bulk sequencing on a homogeneous population isolated by cell-sorting or dissection. They can also reveal transcriptionally distinct cell types that would be difficult to identify, isolate, and study by other means.

Other functional assays such as ATAC-seq for open chromatin or cell surface protein expression have been modified to work on single cells. Research is also ongoing to develop protocols to perform multiple simultaneous assays on the same cell, known as multiomics [17]. For example, CITE-seq [18] and 10x Genomics Immune Profiling with Feature Barcoding [19] measure surface proteins and gene expression from the same cell.

[Figure 1.1 diagram. Left side: sequencing reads, assembly algorithm, genome assembly, which together with functional data passes through an annotation algorithm to give functional annotations. Right side: sequencing reads, alignment algorithm, read alignments to reference genome, variant calling algorithm, homozygous and heterozygous variant calls with respect to reference genome, haplotype assembly algorithm, phased heterozygous variant calls on haplotypes H1 and H2.]

Figure 1.1: Sequencing reads can be assembled into a genome, which coupled with functional data, can be annotated (left side). Alternatively, with the use of a reference genome, sequencing reads can be aligned and used to call and phase variants (right side).

Next-generation sequencing

Next-generation sequencing (NGS) refers to the biotechnology paradigm where short fragments of DNA are sequenced at scale for a low cost [1]. Currently, the most widely used NGS instruments are produced by Illumina. These sequencers produce reads up to 300 bases long with tens of millions to billions of reads generated per run [20].

NGS can be employed for “shotgun” whole-genome sequencing, where fragments of DNA from cells are sequenced to determine the sequence of the genome using the assembly or alignment/variant calling workflows described above. However, this is far from the only application. Other biological techniques have been developed where NGS is the final readout. In RNA-seq, RNA molecules are reverse-transcribed to DNA and then sequenced. Genomic regions can be selected for sequencing based on their DNA sequence, as employed in whole-exome sequencing (selecting genes) or other targeted panel methods. Genomic regions can also be selected based on other functional or spatial characteristics. For example, Hi-C identifies segments of DNA that are physically near each other; ATAC-seq selects open chromatin; and ChIP-seq uses immunoprecipitation to select DNA that interacts with intracellular proteins such as transcription factors and histones.

Over the past decade, cost per base sequenced has declined precipitously. As of August 2019, the National Human Genome Research Institute (NHGRI) reported that NGS costs approximately 1 cent per million bases (Mb) [21]. As a result of decreasing cost and the wide range of experimental techniques based on NGS, these instruments are now ubiquitous in laboratories and sequencing core facilities. Worldwide sequencing capacity (including NGS and other technologies) generates data at the rate of exabases per year, i.e. on the order of 10^18 bases [22]. A portion of this data is generated for research purposes and deposited in public archives, such as the NCBI Sequence Read Archive (SRA), which holds a total of 39 petabases (3.9 × 10^16 bases) as of March 2020 [23].

Computational methods

The field of genomics is continually expanding in scope. More species and individuals are being studied, and new biotechnologies emerge to measure more genomic attributes at different resolutions. However, the data generated by new assays and platforms can only be put to use thanks to simultaneous development of new computational analysis methods.

Some of the specifications of the problems discussed above have changed over time; for example, read length and sequencing error model vary among sequencing technologies. The algorithmic strategies employed for alignment, assembly, variant calling, or haplotype assembly using high-accuracy reads less than a few hundred bases are very different from those applied to reads tens of thousands of bases long with lower base-level accuracy. Tools developed for early iterations of a technology may not perform well - or work at all - for the most recent versions. There is a continual feedback cycle between the development of biotechnologies and the corresponding computational methods.

For a given problem, different tools strike different compromises between speed and resource consumption on one hand and the quality of the result on the other. For example, in genome assembly, there are cases where a high-quality assembly is required and there are also cases where a fragmented assembly that takes much less time and memory to construct is sufficient. Pseudoalignment approaches to read mapping, discussed in Chapters 3 and 4, forgo base-pair resolution alignment of read and reference but have proved useful in RNA-seq analysis, and may be considerably faster than alignment methods.

Computational resources are also evolving. New software has been developed to leverage emerging hardware and computing paradigms. Algorithms originally developed for and deployed on CPU have been adapted to other hardware such as GPU or FPGA. If the algorithm can be adapted to the constraints of the hardware programming model, substantial speedup can be gained. Genomics software is often deployed in high-performance computing environments, so algorithms that perform well when parallelized among many cores of a computing cluster are particularly important. Parallelization can also be implemented at a hardware-instruction level using the single-instruction multiple-data (SIMD) paradigm, which is described in detail in Chapter 4.

The increase in available sequencing data has led to a need for scalable applications. Some take advantage of large-scale computing resources such as clusters or are deployed in the cloud. Others employ heuristics, which are algorithms that are not guaranteed to lead to an optimal solution but are fast and accurate on commonly-encountered problem instances. Addressing large datasets by manipulating a reduced-size representation generated via compression, indexing, or sketching is another recent initiative [24, 25]. Early tools were adequate when databases of genomes or sequencing read datasets were small. However, the implementations or the fundamental strategies have seldom scaled to today’s data. This required - and continues to require - substantial methodological innovation.

1.2 Thesis overview

Developing new algorithms and computational methods remains an important initiative in genomics in order to gain biological insight from new types of data and the increasing scale of available data. In this thesis, I present three computational methods developed in response to emerging NGS-based biotechnology innovation. Each software tool described in this thesis targets specific attributes of the NGS-based assay(s) to which it can be applied and addresses - or even leverages - those features in its algorithms. We focus the development and testing of these methods on human genomics, but the principles are transferable to other species, and even to other datatypes besides those directly addressed. All three methods incorporate sequencing data and address genetic variation; depending on the method, “genetic variation” has a slightly different definition.

In Chapter 2, I describe Samovar, a software tool for performing single-sample mosaic single-nucleotide variant calling on whole genome sequencing linked read data. Using haplotype assembly of heterozygous germline variants, uniquely made possible by linked reads, Samovar identifies variations in different cells that make up a bulk sequencing sample. We applied Samovar to 13 cancer samples in collaboration with researchers at Nationwide Children's Hospital.

Chapter 3 addresses scHLAcount, which uses 10x Genomics single-cell gene expression data and HLA genotypes to compute allele-specific molecule counts. We show how it can be applied to large cancer datasets from the literature. Since the sequence of the HLA genes varies considerably in the human population, using the standard reference genome may underestimate expression. Furthermore, even given technology-specific biases of single-cell gene expression data, we can resolve allele-specific expression for these genes since the alleles are often quite different between the two haplotypes of an individual.

In Chapter 4, I present Vargas, a program for computing optimal edit distance alignments between sequencing reads and variant graph or linear reference genomes. These alignments are used to assess and improve heuristic read alignment algorithms. The variant graph reference genomes we explore incorporate small genetic variation (single-nucleotide variants and small insertions and deletions) cataloged in a survey of thousands of individuals worldwide. Read alignment algorithms employ different strategies to find inexact matches due to sequencing error and genetic variation between the read and reference, and our analysis based on Vargas optimal alignments evaluates the performance of these strategies.

These methodological innovations enhance the capability of biotechnologies such as linked reads, single-cell gene expression, whole-genome sequencing, and RNA-seq to reveal biological insights. Given the scale of current and future datasets, computational efficiency is also carefully considered.
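To make the notion of an optimal edit distance alignment concrete, the sketch below computes the minimum edit distance between a read and any substring of a linear reference using textbook dynamic programming. It is not the Vargas implementation, which Chapter 4 describes; the function name is hypothetical, all edits are assumed to cost 1, and neither variant graphs nor vectorization is attempted here.

```python
def semiglobal_edit_distance(read: str, reference: str) -> tuple[int, int]:
    """Return (distance, end), where `distance` is the minimum number of
    substitutions, insertions, and deletions needed to turn the whole read
    into some substring of the reference, and `end` is the (exclusive)
    reference coordinate where the first such best alignment ends."""
    # Row 0 is all zeros: the alignment may start anywhere in the reference.
    prev = [0] * (len(reference) + 1)
    for i in range(1, len(read) + 1):
        curr = [i] + [0] * len(reference)  # column 0: read prefix unaligned
        for j in range(1, len(reference) + 1):
            cost = 0 if read[i - 1] == reference[j - 1] else 1
            curr[j] = min(
                prev[j - 1] + cost,  # match or substitution
                prev[j] + 1,         # read base absent from the reference
                curr[j - 1] + 1,     # reference base absent from the read
            )
        prev = curr
    # The alignment may also end anywhere in the reference.
    best_end = min(range(len(reference) + 1), key=lambda j: prev[j])
    return prev[best_end], best_end
```

Every cell of the read-by-reference table is filled, so the result is optimal by construction, at a cost proportional to the product of the two lengths. Heuristic aligners avoid most of this work, which is why an exhaustive result is useful as a yardstick for them.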

1.3 Other contributions

Besides the three projects described in detail in the following chapters, I had the opportunity to contribute to several other research projects, which I will briefly describe here.

Simulation of linked read sequencing data. LRSim is software that generates simulated 10x Genomics linked read sequencing data [26]. (Linked read technology is described in detail in Chapter 2.) Simulated data is an essential component of methods development. In real data, the true sequence of a genome is unknown, so simulated sequencing reads are often employed to evaluate an algorithm’s accuracy since they are generated based on a known genome. LRSim generates read sequences based on the reference genome. However, in real datasets, reads from the individual being sequenced will have genetic variation with respect to the reference. My role in this project was to implement a feature that makes the simulated reads more realistic by adding homozygous and heterozygous single-nucleotide variants. The variants and their haplotypes are recorded for evaluation purposes. I also contributed to editing the manuscript and am a coauthor on the publication [27].
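The idea of spiking homozygous and heterozygous single-nucleotide variants into a reference, while recording the truth set, can be sketched as below. This is only an illustration of the concept and does not reproduce LRSim's code or options; the function name, the `het_fraction` parameter, and the genotype strings are assumptions made for the example.

```python
import random


def add_snvs(reference: str, n_snvs: int, het_fraction: float = 0.5, seed: int = 0):
    """Create two haplotype sequences from a reference by adding SNVs.

    Returns (hap1, hap2, truth), where each truth record is
    (position, ref_base, alt_base, genotype) and genotype is
    "1|0" or "0|1" (heterozygous) or "1|1" (homozygous).
    """
    rng = random.Random(seed)  # fixed seed so a simulation is reproducible
    hap1, hap2 = list(reference), list(reference)
    truth = []
    for pos in sorted(rng.sample(range(len(reference)), n_snvs)):
        ref_base = reference[pos]
        alt_base = rng.choice([b for b in "ACGT" if b != ref_base])
        if rng.random() < het_fraction:
            # Heterozygous: the alternate allele goes on one haplotype only.
            if rng.choice([1, 2]) == 1:
                hap1[pos] = alt_base
                genotype = "1|0"
            else:
                hap2[pos] = alt_base
                genotype = "0|1"
        else:
            # Homozygous: both haplotypes carry the alternate allele.
            hap1[pos] = hap2[pos] = alt_base
            genotype = "1|1"
        truth.append((pos, ref_base, alt_base, genotype))
    return "".join(hap1), "".join(hap2), truth
```

Reads would then be simulated from `hap1` and `hap2` rather than from the reference, and the `truth` list gives the known variants and phase against which a variant caller or phasing tool can be scored.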

Review article on the bioinformatics of third-generation sequencing. I contributed to and am a coauthor of the Nature Reviews Genetics article “Piercing the dark matter: Bioinformatics of long-range sequencing and mapping” [2]. This article describes current and future applications for several emerging sequencing technologies that go beyond short-read next generation sequencing. Collectively, these approaches are known as “third-generation” technologies. Some build upon short-read sequencing (linked reads and Hi-C); others use completely different paradigms to get much longer sequencing reads (PacBio and Oxford Nanopore). The review specifically focuses on bioinformatics algorithms developed to process these new types of data and characterize regions of the genome - the “dark matter” - that challenge other technologies.

I wrote the section “Haplotype phasing and allele-specific analysis.” This section describes and contextualizes the problem of haplotype assembly (phasing), which is the process of reconstructing the sequence of alleles at heterozygous sites on each homologous copy of a chromosome. I compare and contrast algorithms in the literature that leverage third-generation sequencing technologies to achieve this goal and discuss applications of a genome with phased variants.

Taxonomic classification of bacterial genome sequences. I completed a 3-month internship at IBM Research Almaden in summer 2018 in the Industrial and Applied Genomics department, directly supervised by Research Staff Member Mark Kunitomi. One of the ongoing efforts in the department is bacterial pathogen surveillance in the food supply chain [28]. In this context, they have collected a huge database of bacterial genome assemblies from a number of sources. Some metadata on the taxonomic classification of the genomes (e.g. genus, species, strain) is inevitably missing or incorrect. My goal was to explore possible approaches for identifying incorrect metadata and generating metadata.

Genome assemblies from bacteria with the same taxonomic classification often have similar genome sequences. I explored different methods from the literature designed for measuring the distance between two strings (in this case, bacterial genome assemblies) based on their k-mers (substrings of k characters). Based on these distances, I also experimented with many clustering algorithms. I came up with a workflow that used a weighted nearest-neighbor approach to infer taxonomic labels based on a collection of curated, labeled genome assemblies. This approach could be used to identify incorrect metadata if the existing label did not match the inferred label, or to create metadata for unlabeled genomes.
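As a rough illustration of k-mer-based distances feeding a weighted nearest-neighbor label inference, consider the sketch below. The specific choices here are assumptions made for the example rather than a record of the workflow developed during the internship: Jaccard distance on exact k-mer sets, inverse-distance vote weights, and the hypothetical names `infer_label` and `n_neighbors`.

```python
from collections import defaultdict


def kmer_set(sequence: str, k: int) -> set[str]:
    """All distinct substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 minus (shared k-mers / total distinct k-mers)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def infer_label(query: str, labeled: list[tuple[str, str]],
                k: int = 4, n_neighbors: int = 3) -> str:
    """Infer a label for `query` from (sequence, label) pairs by letting the
    nearest neighbors vote, each weighted by inverse distance."""
    query_kmers = kmer_set(query, k)
    scored = sorted(
        (jaccard_distance(query_kmers, kmer_set(seq, k)), label)
        for seq, label in labeled
    )
    votes = defaultdict(float)
    for distance, label in scored[:n_neighbors]:
        votes[label] += 1.0 / (distance + 1e-9)  # small constant avoids 1/0
    return max(votes, key=votes.get)
```

A genome whose existing label disagrees with `infer_label` would be flagged for review, and an unlabeled genome would simply receive the inferred label. At the scale of real bacterial assemblies, exact k-mer sets would likely be replaced by a compact sketch, and k would be much larger than in this toy setting.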

Polyploid haplotype assembly. I worked on a project with a Johns Hopkins undergraduate student, George Botev, where we developed a heuristic algorithm to perform haplotype assembly (phasing) for polyploid genomes. Some species (many plants, for example) naturally have genomes with more than two copies of each chromosome. Polyploidy can also occur in regions of diploid genomes - in cancer, chromosomes or parts of chromosomes can be duplicated, and each copy accrues separate mutations. Most algorithms that reconstruct the sequence of variant alleles on each chromosome copy are limited to the diploid case. There are exponentially more possible solutions to consider when there are more than two copies of each chromosome.

After reviewing numerous approaches in the literature for both diploid and polyploid haplotype phasing, we developed an algorithm of our own. We tested its accuracy by reconstructing simulated haplotypes. Developing the simulator was also an important component of the work, as it has parameters to take into account many characteristics of the genome and the sequencing reads. George presented a poster at the Biological Data Science conference at Cold Spring Harbor Laboratory in November 2018.

Chapter 2

Samovar: Single-sample mosaic single-nucleotide variant calling with linked reads

This chapter describes Samovar, a method for single-sample mosaic SNV calling with linked reads, and an application of the method to pediatric cancer samples. This work was presented at the RECOMB-Seq conference in Washington, D.C. in May 2019, where it won the Best Paper award [29] and was published in iScience in the RECOMB-Seq 2019 special issue [30].

Charlotte A. Darby, James R. Fitch, Patrick J. Brennan, Benjamin J. Kelly, Natalie Bir, Vincent Magrini, Jeffrey Leonard, Catherine E. Cottrell, Julie M. Gastier-Foster, Richard K. Wilson, Elaine R. Mardis, Peter White, Ben Langmead, and Michael C. Schatz. “Samovar: Single-Sample Mosaic Single-Nucleotide Variant Calling with Linked Reads.” In: iScience 18 (2019).

Samovar is available on GitHub under the MIT license [31].

2.1 Background

This background section is organized in the order in which experimental and computational steps would be taken in an analysis workflow. First, next-generation sequencing data is generated using a library preparation protocol for linked reads (e.g. 10x Genomics Chromium). Then, sequencing reads are aligned to the reference genome and germline variants are called and phased. Finally, the Samovar algorithm is employed to call mosaic variants.

Linked reads

Linked reads are the emerging biotechnology inspiring the development of the method presented in this chapter. Linked reads are an extension of paired-end sequencing. In paired-end (or mate-pair) sequencing configurations, both ends of a DNA molecule are sequenced and the two resulting reads are reported as a pair in the sequencing output. Molecules are usually size-selected before sequencing, and the estimated size distribution can be used to improve alignment. If one read of the pair is aligned, the other read should align nearby, within the size distribution. This locality information can resolve the alignment location for a read that has multiple candidate alignments. If the reads align confidently but not in the relative location expected by the experimental protocol, this may be a signal of structural variation between the reference and the individual being sequenced. Paired-end reads have limited utility in linking distant genomic elements, as the original molecules are typically only a few hundred bases long and there are only two short reads per pair.

A linked read is a set of next-generation (Illumina) short reads, which might themselves be paired-end reads, sparsely sampling a long DNA molecule. A DNA barcode labels short reads to indicate which long molecule (or set of molecules) they originated from. The reads making up a linked read are expected to align to a continuous segment on the reference genome. Like paired-end reads, this locality information is used to inform alignments and identify structural variants that disrupt the expected alignment positions. Furthermore, linked reads can be used for haplotype assembly, which is discussed in detail shortly, and produce haplotype-resolved de novo genome assemblies with higher quality than assemblies generated from paired-end reads alone.

Linked reads are similar to synthetic long reads, where deep short-read coverage is sequenced from a long fragment with the goal of assembling the entire original molecule. Synthetic long read experimental protocols described in the literature include Illumina TruSeq [32], LFR [33], and LRseq [34]. The group of short reads sharing a barcode in linked reads and synthetic long reads has been generally referred to as a read cloud (e.g. [35–37]). Compared to synthetic long reads, the linked read approach achieves higher physical coverage (number of long molecules spanning a genomic locus) at the expense of sparse molecule coverage. Each individual long molecule has less than 1x short-read sequencing coverage, but more molecules are sequenced. Experimental protocols include TELL-seq [38], CPTv2-seq [39], stLFR [40], and most notably the now-deprecated commercialization by 10x Genomics first known as GEMCode and later as Chromium [41, 42]. This work exclusively uses linked read data from 10x Genomics Chromium, but the principles are potentially transferable to other linked read, synthetic long read and even true long read protocols. From here, we refer to 10x Genomics Chromium as linked reads and focus on the specifications of that technology.

10x Genomics linked read library preparation

In the linked read library preparation, genomic DNA is sheared into long fragments of the desired length, which can be tens to hundreds of kilobases long. An average of 10 molecules are encapsulated in an oil droplet along with primers and enzymes using microfluidic technology. All reads from the same droplet in the resultant paired-end Illumina library share a 16bp DNA barcode. There are 4 million barcodes and, in a typical run, 1.4 million droplets are generated [43]. Exome capture can be performed after the Illumina library is generated, before next-generation sequencing [42].

The 10x Genomics experimental protocol is optimized for the human genome, but it has been successfully applied to a wide variety of organisms, including plants, insects, fish, and even metagenomic samples [44]. Key parameters in experimental design, as explored by Luo et al. [27] in the context of the LRSim linked read simulation tool, include the quantity of input DNA, sequencing coverage, and molecule size. Experimental parameters in library preparation are selected based on the genome size of the organism studied and the desired application (e.g. de novo genome assembly versus germline variant phasing). Luo et al. [27] perform simulation experiments based on the genome, which is 1/20 the size of the human genome, to illustrate that the recommended parameters are not ideal for organisms with different genome sizes than human.

Linked read aware short-read alignment

Unlike synthetic long reads, short reads in a linked read sparsely sample the original long molecule, so they cannot be assembled to reconstruct the molecule. Therefore, whole-genome or whole-exome short-read sequencing is typically followed by read alignment to the reference genome, not assembly. The reads resulting from linked read sequencing are still fundamentally short Illumina reads, so read alignment programs designed for that datatype could be used. However, barcode sharing provides additional locality information that can be used to improve read alignment in certain cases.

There are an average of 10 DNA molecules per droplet in the library preparation step. Since all reads from a droplet share a barcode, all reads with any given barcode are expected to map to an average of 10 genomic regions. The length of the input molecules determines the size of the regions on the reference genome (Figure 2.1).

Figure 2.1: Each paired-end short read (small horizontal line) has a molecular barcode, indicated by its color. Reads are aligned to the reference genome; in this figure, only some reads are shown for each barcode. Reads with the same barcode generally cluster into one or more genomic locations. Based on the alignment locations, molecules can be inferred for each barcode, diagrammed as colored boxes atop the reference genome. For example, reads in the diagram with the green barcode form two clusters.

Suppose a read has many possible alignments to the reference genome with the same alignment score, or similar alignment scores. The candidate mapping positions of other reads with the same barcode can be used to prefer some alignments over others based on the knowledge that the reads with this barcode came from several long molecules that are probably contiguous segments in the reference genome. RFA [35] was designed for synthetic long reads and is the precursor of Lariat, which is a linked-read specific aligner developed by 10x Genomics [42]. EMA [45] is another barcode-aware aligner. After all of the reads sharing a barcode are aligned to the reference, they can be grouped into linked reads based on the alignment locations (Figure 2.1). This is typically done by splitting linked reads between short reads that align more than 50kb apart.
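The grouping of barcoded alignments into inferred molecules can be sketched as follows. This is a minimal illustration, not the Long Ranger or Lariat implementation: the function name infer_molecules, the tuple layout of the input, and the choice to measure the gap from the end of the current molecule to the start of the next read are all assumptions made for the example. Only the 50kb split distance comes from the text.

```python
from collections import defaultdict


def infer_molecules(alignments, max_gap=50_000):
    """Group aligned reads into inferred molecules (hypothetical helper).

    alignments: iterable of (barcode, chrom, start, end) tuples.
    Returns a sorted list of (barcode, chrom, start, end, n_reads).
    """
    # Collect alignment spans per barcode and chromosome.
    by_key = defaultdict(list)
    for barcode, chrom, start, end in alignments:
        by_key[(barcode, chrom)].append((start, end))

    molecules = []
    for (barcode, chrom), spans in by_key.items():
        spans.sort()
        mol_start, mol_end = spans[0]
        n_reads = 1
        for start, end in spans[1:]:
            if start - mol_end > max_gap:
                # Gap too large: close the current molecule, start a new one.
                molecules.append((barcode, chrom, mol_start, mol_end, n_reads))
                mol_start, mol_end, n_reads = start, end, 1
            else:
                mol_end = max(mol_end, end)
                n_reads += 1
        molecules.append((barcode, chrom, mol_start, mol_end, n_reads))
    return sorted(molecules)


example = [
    ("AAAC", "chr1", 1_000, 1_150),
    ("AAAC", "chr1", 20_000, 20_150),
    ("AAAC", "chr1", 200_000, 200_150),
    ("GGGT", "chr1", 5_000, 5_150),
]
molecules = infer_molecules(example)
for molecule in molecules:
    print(molecule)
```

In this toy input, the first two reads with barcode AAAC are less than 50kb apart and form one inferred molecule, while the third read is far enough away to start a second one.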

Germline small-variant calling

The next step in the typical sequencing analysis workflow after read alignment is variant calling. Here we discuss only small variant calling (single-nucleotide variants and short insertions and deletions, collectively called indels), for which there are no linked-read specific algorithms. Many structural variant calling algorithms have been developed specifically for linked reads, but approaches to calling these large variants (structural variants are canonically defined as alterations affecting >50 bases) are fundamentally quite different and outside our scope here.

The goal of germline (inherited) variant calling is identifying locations where the individual sequenced differs from the reference genome in one or both alleles. While the human reference genome is haploid, meaning that it represents a single haplotype, the genome in non-germline human cells has two haplotypes (diploid). At a particular locus, if the individual has two identical alleles this is referred to as homozygous. If the alleles differ, this is heterozygous. On the order of 1/1000 of the base pairs of an individual’s genome are heterozygous. An allele matching the reference genome is the reference allele and a mismatch is referred to as the alternate allele. Germline SNV and indel variant calling uses read alignments to the reference genome and statistical models to identify homozygous alternate and heterozygous sites. Commonly used variant callers for Illumina sequencing include GATK [46], freebayes [47], and Strelka2 [48]. In their preprint on the variant caller Octopus, Cooke, Wedge, and Lunter [49] find that variant calling accuracy compared to a gold-standard variant catalog benchmark is slightly lower in 10x Chromium data compared to Illumina data, although the 10x libraries had shorter read length and less sequencing coverage compared to the Illumina libraries.

Haplotype phasing and assigning reads to haplotypes

After small variants are called, haplotype assembly can be performed. The goal of haplotype assembly, also known as variant phasing, is to reconstruct the sequence of alleles at heterozygous sites on each copy of a homologous chromosome. Reads (in this case, linked reads) that overlap two or more heterozygous loci are haplotype-informative. The chance that a single paired-end read overlaps two or more heterozygous loci is small because the reads are short (100–250bp) and heterozygous variants in the human genome are on average 1000bp apart. A linked read comprises tens to hundreds of short reads, so the chance that it overlaps two or more heterozygous loci is much higher, and most linked reads are haplotype-informative. The linked read-specific phasing algorithm in the Long Ranger pipeline is described in [41] and a more general algorithm extended to linked and long reads, HAPCUT2, is presented in [50]. These algorithms, and others of the same class not specifically compatible with linked reads, typically cast phasing as an optimization problem, such as minimizing the number of heterozygous sites on the fragments that differ from the final haplotype solution. However, the exact solution to this problem is computationally intractable for a whole genome sequencing dataset. Instead, these algorithms find an approximate solution.

It is theoretically possible that a droplet contains molecules from both haplotypes of the same genomic locus, which would violate assumptions of these algorithms, as the reads inferred to be in the linked read would come from two long molecules with different alleles. Due to the small number of molecules per droplet and large size of the genome compared to the molecule length, the frequency of this occurring in a typical sequencing run was calculated by Xia et al. [51] to be much less than 1.

The output of a haplotype phasing algorithm is haplotype blocks: sequences of alleles inferred to be on one copy of a homologous chromosome. Using the haplotype blocks, linked read molecules (and therefore all individual constituent reads, whether they overlap variants or not) can be assigned to haplotypes.

As an example, in a diploid sample with haplotypes H1 and H2, suppose a mosaic mutation occurs on haplotype H2 yielding a new haplotype, H2′ (Figure 2.2a). Linked reads from H2′ have the mosaic allele, but otherwise have the heterozygous alleles from H2. The mosaic mutation will likely be tolerated by the haplotype assembler and linked reads from H2′ will be assigned to H2 (Figure 2.2b). The fact that all the reads with the mosaic allele fall on the same haplotype is a hallmark of post-zygotic mosaicism [52] and contrasts with sequencing error, which would tend to distribute the mosaic alleles evenly across haplotypes [53]. Reads with the mosaic allele are called haplotype-discordant reads, and these are the most reliable kind of evidence we can gather in support of mosaic variants.
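The contrast between a mosaic variant and sequencing error described above can be made concrete with a small sketch. It assumes each read at a site has already been reduced to a (haplotype, base) pair, where the haplotype is 1, 2, or None for unphased reads; the function name summarize_mosaic_reads and the returned field names are hypothetical and are not part of Samovar.

```python
from collections import Counter


def summarize_mosaic_reads(reads, mosaic_allele):
    """Count phased reads carrying the mosaic allele on each haplotype.

    reads: iterable of (haplotype, base); haplotype is 1, 2, or None.
    """
    mosaic_by_hap = Counter()
    depth_by_hap = Counter()
    for hap, base in reads:
        if hap is None:
            continue  # unphased reads carry no haplotype evidence
        depth_by_hap[hap] += 1
        if base == mosaic_allele:
            mosaic_by_hap[hap] += 1
    n_mosaic = sum(mosaic_by_hap.values())
    top = max(mosaic_by_hap.values(), default=0)
    return {
        "mosaic_reads_by_haplotype": dict(mosaic_by_hap),
        "phased_depth_by_haplotype": dict(depth_by_hap),
        "fraction_on_one_haplotype": top / n_mosaic if n_mosaic else 0.0,
    }


# Mosaic-like site: every "T" read is on haplotype 2.
mosaic_like = [(1, "A")] * 10 + [(2, "A")] * 7 + [(2, "T")] * 3 + [(None, "A")] * 2
# Error-like site: "T" reads are spread across both haplotypes.
error_like = [(1, "A")] * 9 + [(1, "T")] + [(2, "A")] * 9 + [(2, "T")]

mosaic_summary = summarize_mosaic_reads(mosaic_like, "T")
error_summary = summarize_mosaic_reads(error_like, "T")
print(mosaic_summary["fraction_on_one_haplotype"])
print(error_summary["fraction_on_one_haplotype"])
```

A fraction near 1 matches the pattern expected of a post-zygotic mutation, whereas a fraction near 0.5 is what evenly distributed errors would tend to produce.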

[Figure 2.2 panels: (a) the reference genome with heterozygous SNPs, haplotypes H1 and H2, and haplotype H2′ carrying the somatic mutation; (b) linked read molecules whose short reads are phased to haplotype 1 or haplotype 2, with the haplotype-discordant reads marked.]

Figure 2.2: (a) A mosaic mutation occurs on haplotype H2. (b) Therefore, in linked read sequencing, where short reads can be phased when linked reads overlap phased heterozygous variants, mosaic mutations manifest on reads from only one haplotype, here H2. This diagram is adapted from Figure 3 of Dou et al. [54].

Somatic mutation as a biological phenomenon

Both germline variants, which are inherited from the parents, and de novo mutations, which are not inherited, are present by definition in all the cells of an organism. In contrast, mosaic mutations are not inherited and occur in some but not all cells in the organism [55, 56]. Genomic mosaicism (different genomes in different cells) can result from mutations acquired during early development, throughout life, or due to aging, which are propagated through cell division. These mutations range from single-nucleotide changes to larger structural variants and whole chromosome aneuploidy. The distribution and prevalence of cells with a mosaic mutation depend on a combination of the developmental cell lineage, the stage at which the mutation occurred, selection for or against cells with the mutation [57], and cell migration [58]. Somatic mosaicism refers to genetic heterogeneity among non-germ cells, which accrues in normally dividing cells throughout the human lifetime [59–61], as corroborated by monozygotic twin studies [62].

Mosaicism also plays an important role in many genetic diseases. Pathologically, cancer is characterized by an overall increased mutational load in tumor cells as well as a high level of intra-tumor genetic heterogeneity [63, 64]. Mosaicism has also been implicated in autism [52] and is being explored in connection to other neurological diseases [65–67]. Causal mosaic mutations have also been found for Sturge-Weber syndrome [68], McCune-Albright syndrome [69], and Proteus syndrome [70], among others.

Somatic mutation calling

Mosaic variants can be detected using whole-genome or targeted sequencing reads from affected tissue, possibly with accompanying sequencing data from unaffected (normal) tissue. The mosaic variant caller’s task is to distinguish the signature of a mosaic variant from that of a variant affecting all cells in the presence of sequencing errors, alignment errors, copy-number changes and other confounders. Most methods employ statistical tests on the counts of sequencing reads aligned to a particular site. Methods that use paired sequencing data can compare the allele frequency between the affected sample and normal sample; even methods without a normal sample can compare the observed data to the data that would be expected if the variant were present in all cells.

Table 2.1 summarizes many tools from the literature for mosaic single-nucleotide variant (SNV) and/or insertion-deletion (indel) variant calling from next-generation sequencing. The sequencing datatype and biological samples required, model or algorithmic strategy, and type of variants detected are listed. One important point of contrast is whether the approach requires sequencing data from paired affected/unaffected (tumor/normal in the cancer context) samples or calls variants on a single sample. The approach for the first case is typically based on detecting variants present in a tumor sample and absent from a normal sample, and ignoring variants present in both samples (because they are, by definition, not somatic). If every cell in the affected sample is heterozygous for a somatic mutation, it will not be mosaic and would require the extra information that it is not present in the control to identify it as somatic. The second relies on the single sample being a mixture of affected and unaffected cells, causing a minor allele fraction that deviates from germline homozygous or heterozygous. This poses a more challenging problem, which can be addressed in many ways including using information from external databases, obtaining very high sequencing coverage, or using haplotype phasing information.

Another point of comparison is whether whole-exome sequencing (WES) and/or whole-genome sequencing (WGS) is supported. Targeted or exome sequencing is typically performed to higher read coverage than WGS. Therefore, more reads are available at a given site to support the variant allele, especially when the variant has low allele frequency, but the area of the genome where variants can be called is drastically reduced.
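As a minimal illustration of a read-count test of the kind mentioned above, the sketch below asks whether an observed alternate-allele count is compatible with a germline heterozygous site, where roughly half of the reads are expected to carry each allele. It is an exact two-sided binomial test written for this example only; it is not the model used by Samovar or by any tool in Table 2.1, and it ignores sequencing error, mapping bias, and copy number.

```python
from math import comb


def binomial_two_sided(k, n, p=0.5):
    """Exact two-sided binomial p-value for k successes in n trials."""

    def pmf(x):
        return comb(n, x) * p**x * (1 - p) ** (n - x)

    p_obs = pmf(k)
    # Sum the probabilities of all outcomes no more likely than the observed one.
    total = sum(pmf(x) for x in range(n + 1) if pmf(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# 8 alternate-allele reads out of 60: far from the 50% expected at a
# heterozygous site, so a mixture of affected and unaffected cells is plausible.
p_low_fraction = binomial_two_sided(8, 60)
# 30 of 60 reads: fully consistent with a germline heterozygous site.
p_balanced = binomial_two_sided(30, 60)
print(p_low_fraction, p_balanced)
```

A small p-value only says that the allele fraction deviates from one half; distinguishing a true mosaic variant from the confounders listed above requires additional evidence such as haplotype phasing.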

Tool Name | Sequencing | Samples | Variants | Model | Notes
Cerebro [71] | WES | Paired T/N | SNV, indel | Decision tree |
EBCall [72] | WES | Paired T/N | SNV, indel | Empirical Bayesian | Uses a normal reference panel (WGS) of at least 10 samples to estimate parameters
Goby3 [73, 74] | RNA-seq; WES | Paired T/N (RNA-seq or WES); trio (WES) | SNV, indel | Deep learning (neural network) | Models are trained on semi-simulated data (artificial somatic mutations in real datasets)
HapMuC [53] | WGS; WES | Paired T/N | SNV, indel | Bayesian hierarchical | Uses local haplotype information from heterozygous germline variants
ISOWN [75] | WES; targeted | Tumor-only | SNV | Various supervised classifiers | Uses a cancer reference panel (1600 samples) and cancer variant databases
JointSNVMix [76] | WGS; WES | Paired T/N | SNV | Probabilistic graphical model |
Lancet [77] | WGS | Paired T/N | SNV, indel | Local assembly | Supports linked reads
LocHap [78] | WGS; WES | Single-sample | SNV | Bayesian hierarchical | Uses local haplotype information from heterozygous germline variants
LoFreq [79] | WGS; WES | Paired T/N; single-sample | SNV, indel | Probability model |
MosaicForecast [80] | WGS | Single-sample | SNV, indel | Random forest; requires candidate variant calls from e.g. MuTect2 | Uses local haplotype information from heterozygous germline variants
MosaicHunter [81, 82] | WGS; WES | Paired T/N; single-sample; trio | SNV | Bayesian hierarchical |
MutationSeq [83] | WGS; WES | Paired T/N | SNV | Various supervised classifiers |
MuTect [84] | WGS; WES | Paired T/N | SNV | Bayesian classifiers |
MuTect2 [85] | WGS; WES | Paired T/N; single-sample | SNV, indel | Local assembly and realignment; Bayesian model | Can filter based on germline variants seen in panel of normals
SNooPer [86] | WGS; WES | Paired T/N | SNV, indel | Random forest | Trains a dataset-specific model
SomaticSniper [87] | WGS | Paired T/N | SNV | Bayesian model, based on MAQ [88] |
Strelka [89] | WGS | Paired T/N | SNV, indel | Bayesian model |
Strelka2 [48] | WGS | Paired T/N | SNV, indel | Probability model | Also performs germline variant calling
VarDict [90] | WGS; WES; targeted | Single-sample; paired T/N | SNV, indel, SV | Realignment, local assembly |
VarScan2 [91] | WES | Paired T/N | SNV, CNV | Statistical test |
Virmid [92] | WES | Paired T/N | SNV | Bayesian model | Determines level of tumor purity

Table 2.1: Summary of somatic mutation calling algorithms available in the literature, their input datatypes, models, and relevant special features

Approaches that incorporate local haplotype phasing of germline heterozygous sites such as HapMuC [53], LocHap [78], and MosaicForecast [80] are limited by read length. Even with paired-end library configurations where 100-250bp reads are sequenced from each end of a fragment of several hundred bases, it is unlikely that a short read will be haplotype-informative and span a candidate mosaic site. The candidate mosaic site has to be close to heterozygous germline variant(s), and it has to have enough reads spanning the candidate site and the heterozygous site to provide haplotype-based evidence for the variant. HapMuC performs better on mosaic variants when incorporating local haplotype, but only ∼15% of mosaic variant candidates had nearby heterozygous germline variants. LocHap does not evaluate what fraction of the genome can be analyzed for local haplotype variants based on read length and heterozygous variant frequency, but it is likely quite limited. MosaicForecast models candidate mosaic variant sites with nearby heterozygous variants (phasable sites), which its authors report cover 10-30% of candidate variants, and applies the model to non-phasable sites.

The aim of this project is to use read-level haplotype phasing from linked reads to achieve better single-sample mosaic variant calling in whole-genome sequencing. Studies in short reads demonstrate the value of haplotype-aware approaches. Linked reads enable phasing of almost all short reads by eliminating the requirement that the short read overlap a heterozygous variant. Instead, all short reads making up a linked read can be phased if any single read, which could be quite distant from the candidate mosaic variant site, overlaps a phased variant.

Results

Samovar is the first mosaic variant caller to evaluate haplotype-discordant reads identified through linked read sequencing, thus enabling phasing and mosaic variant detection across essentially the entire genome. In contrast, even approaches that use germline variants and haplotype-specific signal are limited by the read length of paired-end short reads. Samovar also evaluates the statistical characteristics of the haplotypes, depth of coverage, and potential confounders such as alignment errors, to robustly identify mosaic variants from a single sample.

In the following sections, we describe the Samovar pipeline and its performance. First, we demonstrate the precision and recall of our method compared to competitors based on simulated mosaic variants at 60X and 30X read coverage. We also present two weaker Samovar models (one using only short-read phasing information and one using no phasing information) to quantify the performance advantage of linked reads. Finally, we called mosaic variants in linked read whole-genome sequencing of 13 pediatric cancer samples and the accompanying normal controls, and corroborated Samovar variant calls (in the capture regions) with whole-exome sequencing of the same samples.

2.2 Samovar pipeline

Workflow

The standard Samovar workflow is shown in Figure 2.3 and proceeds in six major steps. In step 1, Samovar identifies all genomic sites where there is sufficient data to apply our model. This is done by filtering based on features such as depth of coverage, fraction of reads that are phased, frequency of the candidate mosaic allele, and related data characteristics. In step 2, Samovar modifies the input BAM file to introduce synthetic mosaic variants to be used as sample-specific training data. Specifically, these variants are used as positive examples for training our model, whereas real homozygous/heterozygous variants, as called by Long Ranger, are used as negative examples. In step 3, Samovar trains a random forest model containing an ensemble of 100 individual decision trees that scores sites according to their resemblance to the synthetic-mosaic sites. In step 4, Samovar scores all sites that passed the initial filter using this model. In step 5, complex repeat regions and non-diploid copy-number regions are optionally filtered out. In step 6, a final filter removes false positives resulting from alignment errors to produce scored mosaic variant calls. Each of these steps is described now in detail.

[Figure 2.3 flowchart elements: 10x Genomics FASTQ reads; Long Ranger read alignments and phased VCF; generateVarfile variant coordinates; (1) preFilter; (2) simulate, with --simulate for mosaic-like examples and --het / --hom for germline examples; (3) train, producing the random forest classifier; (4) classify; (5) bedtools intersect with repeats and CNV regions; (6) postFilter, producing ranked variant calls.]

Figure 2.3: Standard Samovar workflow

[Figure 2.4 flowchart elements: the elements of Figure 2.3, plus target mutations from generateVarfile, bamsurgeon addsnv.py producing alignments with mutations, and an evaluate correctness step after classify.]

Figure 2.4: Simulation experiment Samovar workflow. In addition to the standard workflow, this workflow evaluates correctness of the calls based on mutations generated with bamsurgeon.

(1) preFilter Samovar first scans the genome calculating the features listed in Figure 2.5 at each site. Each feature has a numerical threshold, and if all filters are passed the site is considered in step 4 (classify) as a candidate variant site. These filters examine measurements such as depth, number of haplotype-discordant reads, quality of the alignments and credibility of the read phasing.
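The all-thresholds-must-pass logic of this step can be sketched as a table of (feature, comparison, threshold) entries. The default values below are the bracketed defaults listed in Figure 2.5; the feature names, the dictionary layout of a site, and the function failed_filters are hypothetical and chosen for the example, and the three optional read-orientation filters are omitted.

```python
import operator

# (hypothetical feature name, comparison, default threshold from Figure 2.5)
PREFILTER = [
    ("depth", operator.ge, 16),
    ("frac_phased", operator.ge, 0.5),
    ("frac_minor_haplotype", operator.ge, 0.3),
    ("frac_other_allele", operator.le, 0.05),
    ("maf", operator.ge, 0.05),
    ("n_hd_reads", operator.ge, 4),
    ("hd_on_minor_haplotype", operator.le, 0.1),
    ("avg_hd_pos_from_end", operator.ge, 10),
]


def failed_filters(site, filters=PREFILTER):
    """Return the names of the filters that a site does not pass."""
    return [name for name, op, threshold in filters if not op(site[name], threshold)]


candidate = {
    "depth": 42,
    "frac_phased": 0.9,
    "frac_minor_haplotype": 0.45,
    "frac_other_allele": 0.0,
    "maf": 0.12,
    "n_hd_reads": 5,
    "hd_on_minor_haplotype": 0.0,
    "avg_hd_pos_from_end": 48.0,
}
shallow = dict(candidate, depth=12)

print(failed_filters(candidate))  # no failures: site goes on to classify
print(failed_filters(shallow))
```

Returning the list of failed filters, rather than a single boolean, makes it easy to see which threshold removes the most sites when tuning the defaults.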

(2) simulate Simulated mosaic training examples are generated at regular intervals across the genome at a range of mosaic allele frequency (MAF) from 0.025 to 0.475 at increments of 0.025. We refer to these sites as simulation sites. Sites harboring germline variant calls can be excluded by specifying them in a VCF. For each phased aligned read with the reference allele at the simulation site, the reference allele is randomly changed to the mosaic base with probability equal to the target MAF. For an unphased alignment having the reference allele, the reference allele is randomly changed to the mosaic base with probability MAF/2, on the principle that unphased reads are equally likely to originate from either haplotype. The features listed in Figure 2.6 are computed for the simulation sites to obtain true-mosaic training examples. The same features are computed for FILTER=PASS phased heterozygous (GT=0|1 or GT=1|0) and homozygous (GT=1|1 or GT=0|0) variant sites from the VCF to get true-non-mosaic examples.
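The per-read substitution rule can be sketched as below. The sketch follows the description literally: a phased reference-allele read is changed with probability equal to the target MAF and an unphased one with half that probability. Representing a read as a (haplotype, base) pair and the names TARGET_MAFS and simulate_site are assumptions for illustration; the actual step edits alignments in a BAM file.

```python
import random

# Target mosaic allele frequencies: 0.025 to 0.475 in increments of 0.025.
TARGET_MAFS = [round(0.025 * i, 3) for i in range(1, 20)]


def simulate_site(reads, ref_base, mosaic_base, maf, rng):
    """Return a copy of reads with some reference alleles changed.

    reads: list of (haplotype, base); haplotype is 1, 2, or None if unphased.
    """
    simulated = []
    for hap, base in reads:
        if base == ref_base:
            # Phased reads change with probability MAF, unphased with MAF/2.
            p_change = maf if hap is not None else maf / 2
            if rng.random() < p_change:
                base = mosaic_base
        simulated.append((hap, base))
    return simulated


rng = random.Random(0)  # fixed seed so the example is repeatable
reads = [(1, "A")] * 20 + [(2, "A")] * 20 + [(None, "A")] * 10
simulated = simulate_site(reads, "A", "G", maf=0.2, rng=rng)
n_mosaic = sum(base == "G" for _, base in simulated)
print(f"{n_mosaic} of {len(simulated)} reads now carry the mosaic base")
```

Because each read is changed independently, the realized allele frequency at a single site varies around the target, which is one reason training examples are drawn from many sites.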

(3) train A random forest model is trained with an equal number of simulation sites and non-mosaic sites. Non-mosaic sites are selected to have equal amounts of heterozygous and homozygous calls in the VCF. We use the RandomForestClassifier class from the scikit-learn library [93] with max_leaf_nodes=50 and n_estimators=100, though Samovar allows the user to customize these hyperparameters. The random forest features described in Figure 2.6 take into account the abundance and consistency of evidence for a mosaic variant, including the number of haplotype-discordant reads, mosaic allele fraction, base quality, alignment score, amount of soft clipping, presence of indels, etc.

After cross-validation at a variety of sequencing depths, we found that using 20,000 mosaic, 10,000 heterozygous and 10,000 homozygous training examples achieved a balance of computational efficiency and accuracy. We subsampled the NA24385 BAM file used for the simulation experiment and ran the Samovar simulate and train steps. For each number of training examples, average performance statistics are reported for ten independent train/validation splits; 0.5 and 0.9 refer to the random forest probability that the example is in the mosaic class.

1. Minimum depth (excluding marked duplicates, QC fail, secondary and supplementary alignments) [at least 16]
2. Minimum fraction of reads phased [at least 0.5]
3. Minimum fraction of reads on less-prevalent haplotype [at least 0.3]
4. Maximum fraction of reads that have neither reference nor mosaic allele [at most 0.05]
5. Minimum mosaic allele frequency [at least 0.05]
6. Minimum number of haplotype-discordant reads [at least 4]
7. Maximum number of haplotype-discordant reads on the less-prevalent haplotype [at most 0.1]
8. Minimum average position from end of alignment of haplotype-discordant reads [at least 10]

The following filters can also optionally be used; the default setting is to use them:
1. At least one haplotype-discordant read, one haplotype-concordant read, one reference-allele read and one mosaic-allele read must be aligned in proper pair orientation
2. At least one haplotype-discordant read, one haplotype-concordant read, one reference-allele read and one mosaic-allele read must have an alignment that is not soft-clipped
3. At least one haplotype-discordant read, one haplotype-concordant read, one reference-allele read and one mosaic-allele read must be aligned on the plus and on the minus strand

Figure 2.5: List of Samovar preFilter features. Default value to pass filter is denoted with [brackets]

1. Depth [excluding marked duplicates, QC fail, secondary and supplementary alignments]
2. Fraction of reads phased [HP tag assigned by Long Ranger]
3. Fraction of reads on the more common haplotype [max(number of HP=1 reads, number of HP=2 reads)]
4. MAF
5. MAF of phased reads
6. Number of haplotype-discordant (HD) reads
7. Fraction of phased reads that are HD
8. Fraction of HD reads on the more common haplotype [max(number of HP=1 HD reads, number of HP=2 HD reads)]
9. MAF of HD reads
10. Average base quality of HD reads
11. Average position from the closer end of the alignment on HD reads of the site being classified
12. Average number of soft-clipped bases on HD reads
13. Average number of indels in alignment of HD reads
14. Average value of AS - XS (Lariat alignment scores) of HD reads
15-21. Features 8–14 for the set of phased reads that are not HD
22-26. Features 10–14 for the set of mosaic-allele reads
27-31. Features 10–14 for the set of reference-allele reads
32. Weighted HD read base quality: sum of HD read base quality / sum of all phased reads base quality
33. Weighted mosaic-allele read base quality: sum of mosaic-allele read base quality / sum of reference- and mosaic-allele read base quality

Figure 2.6: Features calculated at each genomic position as input to the Samovar random forest model

(4) classify Genomic sites passing the preFilter are classified by the trained random forest model, yielding the predicted probability that the site is mosaic. Sites with probability above a cutoff are reported in BED format. Based on cross-validation at a variety of sequencing depths, we found that a probability cutoff of 0.5 balances false positive rate and true positive rate, although this can be adjusted to trade between sensitivity and precision.
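The cutoff-and-report step can be sketched as follows. The text specifies only that sites above the probability cutoff are reported in BED format; the column layout used here (chromosome, 0-based start, end, probability) and the function name calls_to_bed are assumptions for the example, relying on the general BED convention of 0-based, half-open intervals.

```python
def calls_to_bed(scored_sites, cutoff=0.5):
    """Format sites whose mosaic probability exceeds the cutoff as BED lines.

    scored_sites: iterable of (chrom, pos, probability), pos is 1-based.
    """
    lines = []
    for chrom, pos, prob in scored_sites:
        if prob > cutoff:
            # BED intervals are 0-based and half-open: [pos - 1, pos).
            lines.append(f"{chrom}\t{pos - 1}\t{pos}\t{prob:.3f}")
    return lines


scored = [("chr1", 1000, 0.91), ("chr1", 2000, 0.42), ("chr2", 500, 0.5)]
bed_lines = calls_to_bed(scored)
print("\n".join(bed_lines))
```

Raising the cutoff toward 0.9 would keep fewer, higher-confidence sites, which is the sensitivity and precision trade-off the paragraph describes.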

(5) region-based filter As Illumina sequencing is known to have high error rates within microsatellites and simple repeat sequences [94], we exclude candidate mosaic variants identified in these regions. Specifically, we exclude variants within +/- 2bp of 1-, 2-, 3-, and 4-bp repeats at least 4bp long with at least 3 copies of the unit. Within hg19, 72.0% of autosomes and 71.4% of autosomes+X+Y remain after this region filter, and within GRCh38 73.8% of autosomes and 73.1% of autosomes+X+Y remain. We also exclude any CNV regions +/- 5bp identified by CNVnator [95] because polymorphism among the copies of a repeated region would be misconstrued as mosaicism.
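One way to build such a repeat mask directly from a reference sequence is sketched below. It interprets the rule as: a run that is periodic with a unit of 1 to 4 bp, spans at least 3 copies of the unit, and is at least 4bp long, padded by 2bp on each side. That interpretation, the 0-based half-open coordinates, and the function names are assumptions for the example; the source does not say how the repeat regions were actually generated.

```python
def repeat_intervals(seq, max_unit=4, min_copies=3, min_len=4):
    """Find runs of seq that are periodic with a unit of 1..max_unit bp."""
    intervals = []
    n = len(seq)
    for unit in range(1, max_unit + 1):
        i = 0
        while i + unit <= n:
            j = i + unit
            # Extend while each base equals the base one unit earlier.
            while j < n and seq[j] == seq[j - unit]:
                j += 1
            length = j - i
            if length >= unit * min_copies and length >= min_len:
                intervals.append((i, j))
            # Any run starting before j - unit + 1 would end at the same j.
            i = j - unit + 1
    return intervals


def merge_with_padding(intervals, pad, seq_len):
    """Pad each interval on both sides and merge overlaps."""
    padded = sorted((max(0, s - pad), min(seq_len, e + pad)) for s, e in intervals)
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def is_masked(pos, mask):
    """True if the 0-based position falls inside any masked interval."""
    return any(start <= pos < end for start, end in mask)


seq = "GATCGTACACACGTCAGT"  # contains (AC)x3 at positions 6..11
raw = repeat_intervals(seq)
mask = merge_with_padding(raw, pad=2, seq_len=len(seq))
print(raw, mask)
```

In practice the mask would be written to a BED file and applied with bedtools intersect, as in step 5 of the workflow, rather than queried position by position.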

(6) postFilter Our expectation is that mosaic variants are isolated events. Samovar applies a final test to distinguish an isolated, likely mosaic variant from the situation where there are many nearby variants co-occurring on the same reads. The latter pattern is usually caused by alignment errors in the presence of repetitive DNA and copy number variation.

Specifically, we examine each base within a fixed distance of the putative mosaic locus. At each base we conduct a Fisher’s exact test, testing if the alleles observed at the query base associate with the haplotype-discordant reads. This is diagrammed in Figure 2.7. If the most significant p-value among all the statistical tests is less than the threshold, the site is filtered out. Based on simulations, we find that the p-value threshold can be set to 0.005 (default) or lower based on the desired balance between precision and recall. There is an option to avoid particular sites when calculating the minimum p-value among all nearby sites, and it is recommended to use the germline VCF of variant calls here.

The final mosaic variant calls are reported in VCF format. VCF INFO tags are used to record depth, allele frequency, fraction of reads phased by Long Ranger, number of haplotype-discordant reads, the model-predicted probability, and the minimal p-value obtained by the postFilter.
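The per-base association test can be sketched with a small, dependency-free implementation of the two-sided Fisher's exact test. The 2x2 layout (rows: read has the feature or not; columns: haplotype-discordant or not) and the two example tables are taken from Figure 2.7; the function name and the use of a hand-written test rather than a statistics library are choices made for this illustration, not a description of Samovar's code.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables no more probable than the observed one.
    p_value = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p_value)


THRESHOLD = 0.005  # default postFilter p-value threshold

# Mismatch seen on 3 of 3 H-D reads and 0 of 9 other reads.
p_mismatch = fisher_exact_two_sided(3, 0, 0, 9)
# Alignment end seen on 2 of 2 H-D reads and 1 of 5 other reads.
p_align_end = fisher_exact_two_sided(2, 1, 0, 4)

print(round(p_mismatch, 4), p_mismatch < THRESHOLD)
print(round(p_align_end, 4), p_align_end < THRESHOLD)
```

With the default threshold, the first nearby position would cause the candidate site to be filtered out, while the second would not.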

Computational efficiency

Running time for each tool is reported for the 30X simulation experiment described in the next section. Samovar was run on a single machine with 48 cores for the “filter” step and 4 cores for other parallelizable steps, with pypy when possible. Maximum memory usage was 19.2 GB, and the filter step reported 4200% CPU usage when allocated 48 cores. Samovar completed in 7 hours.

MosaicHunter and MuTect2 were run on a cluster in a scatter-gather format where each chromosome was computed independently and the results were merged. MosaicHunter does not offer parallelism options, although slightly greater than 100% average CPU usage was seen. On chromosome 1, paired mode used a maximum of 25.6 GB memory; tumor-only mode used 25.3 GB; trio mode used 25.0 GB. MosaicHunter tumor-only and trio modes completed in 29 hours each and paired mode completed in 7 hours.

MuTect2 was run with 48 cores for the native pair HMM, although only 600% CPU usage was seen on average. On chromosome 1, paired mode used a maximum of 5.7 GB memory; tumor-only mode used 5.5 GB. MuTect2 paired mode completed in 136 hours.

[Figure 2.7: read pileup diagram, not reproduced. Legend: haplotype-discordant reads, phased haplotype 1 reads, phased haplotype 2 reads; markers for alignment end (clip), indel, mismatch, and the mosaic variant position. Two example contingency tables are shown:]

At this position    H-D read   Not H-D read
Has mismatch        3          0
No mismatch         0          9
p = 0.0045

At this position    H-D read   Not H-D read
Has alignment-end   2          1
No alignment-end    0          4
p = 0.1429

Figure 2.7: The postFilter step calculates statistical association between haplotype-discordant reads and alignment features such as start/end position, indel or mismatch.

2.3 Simulated dataset

To benchmark Samovar, we used a custom fork of bamsurgeon [96] to insert synthetic mosaic variants into the NA24385 10x Genomics Chromium BAM file from the Genome in a Bottle (GIAB) project [97]. Given a target MAF, a 2 × MAF fraction of the reads with tag HP=1 and a MAF fraction of the reads with no HP tag are selected to mutate. (The HP tag is added by Long Ranger to indicate whether a phased read is from haplotype 1 or 2.) The alternate allele is chosen randomly among the three non-reference bases. Simulated mosaic mutations were introduced at evenly spaced intervals every 20,000 bp on the autosomes with target MAF between 0.025 and 0.475 in increments of 0.025. Reads were realigned with BWA-MEM after mutations were introduced.

To compute precision, the denominator is sites with at least 4 alt-allele reads and 16 total reads (not marked duplicate or QC fail). This is because the parameters we chose for Samovar and MosaicHunter require at least 4 reads to call a mosaic variant, and Samovar’s depth filter threshold is 16 (MosaicHunter’s minimum depth is 25, which we keep, so technically fewer sites are visible to MosaicHunter). Training and testing used sites on the autosomal chromosomes only since NA24385 is male, and the training used a set of synthetic variants independent of those used for the evaluation. The mean inferred linked read length is 16,176 bp with standard deviation 54,387 bp.

To evaluate performance at lower coverage and in other tools’ tumor/normal or paired mode, the original BAM file (mean coverage 61.8; median 60 at bamsurgeon-modified sites, excluding reads marked duplicate) was split in half based on read group tag and we subsequently modified only one half with bamsurgeon (mean coverage 30.6, median 29 at bamsurgeon-modified sites). Splitting by read group tag ensures that an entire linked read will be placed into the derivative BAM file. Experiments with the original BAM file are referred to as 60X coverage and those with the subsample as 30X coverage. This modified workflow is diagrammed in Figure 2.4.
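The read-selection rule for a synthetic mosaic site can be sketched as follows. This is an illustration only, not the bamsurgeon fork itself: the function name, the (read name, HP tag) input format, the rounding of read counts, and the fixed random seed are all assumptions. The rule places the variant on haplotype 1: HP=1 reads make up roughly half of the phased reads at a site, so mutating a 2 × MAF fraction of them gives an overall allele fraction near the target.

```python
import random

def select_reads_to_mutate(reads, target_maf, seed=0):
    """Pick reads to carry a synthetic mosaic allele at one site.

    reads: list of (read_name, hp_tag) with hp_tag in {1, 2, None}.
    A 2 * MAF fraction of HP=1 reads and a MAF fraction of untagged reads
    are selected; HP=2 reads are never mutated.
    """
    rng = random.Random(seed)
    hp1 = [name for name, hp in reads if hp == 1]
    untagged = [name for name, hp in reads if hp is None]
    # Rounding to whole reads is an assumption of this sketch.
    n_hp1 = min(len(hp1), round(2 * target_maf * len(hp1)))
    n_untagged = min(len(untagged), round(target_maf * len(untagged)))
    return set(rng.sample(hp1, n_hp1)) | set(rng.sample(untagged, n_untagged))
```

With 20 HP=1 reads, 20 HP=2 reads, 8 untagged reads and a target MAF of 0.25, this selects 10 HP=1 reads and 2 untagged reads, that is, 12 of 48 reads.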

Training the Samovar models

We use 20,000 simulated mosaic, 10,000 heterozygous and 10,000 homozygous training examples to train each Samovar random forest model described in this section.

Default Samovar model. First, we trained a model as described in the previous workflow section. Table 2.2 shows the feature importances of this Samovar model, with abbreviation and number as in Figure 2.6.

Importance  Abbreviation  Number in Figure 2.6
0.206699    weightedMbq   33
0.136303    MAF           4
0.115912    MAFphased     5
0.101952    weightedCbq   32
0.078008    fracC         7
0.075791    CMAF          9
0.065965    nC            6
0.058420    Mavgbq        22
0.050114    Cavgbq        10
0.028026    NMAF          16
0.016379    Mavgclip      24
0.009496    MavgASXS      26
0.008695    Mavgind       25
0.007776    NavgASXS      21
0.006754    JavgASXS      31
0.006130    CavgASXS      14
0.003744    Cfrach        8
0.003665    Cavgind       13
0.003250    Cavgclip      12
0.002759    Navgbq        17
0.002578    Navgind       20
0.002276    Javgind       30
0.001835    Javgbq        27
0.001569    Javgclip      29
0.001236    fracphased    2
0.001149    depth         1
0.000996    Navgpos       18
0.000904    Mavgpos       23
0.000504    Cavgpos       11
0.000476    Javgpos       28
0.000405    Navgclip      19
0.000148    frach         3
0.000086    Nfrach        15

Table 2.2: Samovar model feature importances in the simulation experiment

Samovar short-read-only phasing model. Samovar is designed to take advantage of the long-range phasing information given by linked reads. Prior work showed that even the phasing information from short fragments can improve mosaic variant calling accuracy [53, 78]. We can simulate the paired-end strategy in Samovar, allowing us to compare to the linked-read strategy while holding the rest of the pipeline constant, by creating a short-read-only Samovar model that breaks down the linked reads into their constituent paired-end reads and considers only these shorter fragments when compiling linked-read-related features such as haplotype-discordant reads. In this model, a paired-end read is assigned to a haplotype only if one of the ends overlaps a heterozygous variant phased by Long Ranger. Supposing that we have the complete haplotype phasing from Long Ranger, we assign a haplotype to a pair of reads if either mate overlaps at least one SNP with a phased genotype in the VCF. Out of 1.91 billion reads, 9.76% of reads could be phased. Only 0.006% of reads overlapped variants but had alleles for conflicting haplotypes; these were not phased. Table 2.3 has the feature importances of this limited model, with abbreviation and number as in Figure 2.8.
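The short-read-only haplotype assignment described above can be sketched as a small function. The data structures here (a dictionary of observed bases pooled over both mates, and a dictionary of phased alleles per position) and the function name are hypothetical. The sketch assumes alleles have already been extracted from the alignments and the phased VCF.

```python
def assign_pair_haplotype(mate_alleles, phased_snps):
    """Assign a read pair to haplotype 1 or 2 using phased heterozygous SNPs.

    mate_alleles: dict {position: observed base}, pooled over both mates.
    phased_snps:  dict {position: (hap1_base, hap2_base)} from the phased VCF.
    Returns 1 or 2, or None when the pair overlaps no phased SNP or carries
    alleles supporting conflicting haplotypes.
    """
    votes = set()
    for pos, base in mate_alleles.items():
        if pos not in phased_snps:
            continue
        hap1_base, hap2_base = phased_snps[pos]
        if base == hap1_base:
            votes.add(1)
        elif base == hap2_base:
            votes.add(2)
        # A base matching neither phased allele contributes no vote.
    return votes.pop() if len(votes) == 1 else None
```

A pair is phased as soon as either mate covers one informative SNP, and pairs with evidence for both haplotypes are left unphased, which corresponds to the small conflicting fraction reported in the text.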

Samovar no-phasing model. While we do not advocate this approach, for the purposes of comparison, we removed all phasing-related features from Samovar to create a no-phasing model. Table 2.4 has the feature importances of this limited no-phasing model, with abbreviation and number as in Figure 2.9. Filters use the default parameters described in the preFilter feature list (Table 2.5).

Samovar model comparison

For the short-read-only phasing model, we find that while the precision is comparable to the Samovar full model, the number of variant calls is much lower, resulting in a genome-wide recall of 2.0% at 30X and 60X, because there are few sites for which adequate phasing information can be compiled from short reads alone (Figure 2.11, Table 2.5). For the no-phasing Samovar model, when stratified by mosaic allele frequency (MAF), precision in every bin is near zero, although genome-wide recall is 68.3%, underscoring the importance of phasing features to our approach (Figure 2.11, Table 2.5).

MosaicHunter and MuTect2 comparison

We compared Samovar to MosaicHunter v. 1.1 [82]. We ran MosaicHunter in tumor-only mode analyzing only the bamsurgeon-mutated BAM file from NA24385, as well as in trio mode where the unaltered GIAB 10x Genomics Chromium BAM files from the mother (NA24143) and father (NA24149) were also provided. The parental BAM files were similarly produced by Long Ranger but not modified by bamsurgeon. While Samovar does not use trio information, we hypothesized that its modeling of linked reads would allow it to have competitive accuracy. The modified and unmodified halves of the BAM file split by read group were provided when MosaicHunter was run in paired mode as tumor and normal, respectively. We used the default recommended parameters when possible, except we did not use the misaligned_reads_filter because it was extremely slow. In addition, because we have simulated far more mosaic sites than would be expected in a normal genome, we do not want to penalize MosaicHunter because it deliberately filters mosaic sites that are close to each other, so we changed the following parameters:

clustered_filter.inner_distance=2000 [default 20000]
clustered_filter.outer_distance=2000 [default 20000]

We also adjusted MosaicHunter’s supporting read threshold, since Samovar requires at least 4 minor (mosaic) allele reads, using

base_number_filter.min_minor_allele_number=4 [default 3]

We used liftOver to transfer the provided WGS.error_prone.b37.bed and all_repeats.b37.bed to GRCh38 coordinates and downloaded the dbsnp_human_9606_b150_GRCh38p7 BED files; these were used for the common_site_filter, repetitive_region_filter, and mosaic_filter.dbsnp_file parameters, respectively. CNVNATOR was used to predict regions of copy number variation and this BED file was provided as the indel_region_filter.bed_file parameter. Note that the homopolymers_filter, common_site and repetitive_region BED files leave visible only 32.2% of bases in the GRCh38 autosomes (34.4% including X and Y) to call mosaic variants. For comparison, Samovar considers about 73% of GRCh38 visible.

We also compared Samovar to MuTect2 from GATK v. 4.0.12.0 [84]. We ran MuTect2 in tumor-only mode and tumor/normal paired mode on the same data described above, which involved the standard GATK workflow of the Mutect2 program followed by FilterMutectCalls. Tumor-only mode calls mosaic and germline mutations simultaneously but does not differentiate between the categories; hence the number of calls is much higher and the precision suffers at higher MAF where germline heterozygous variants comprise most of the call set.

1. Depth [excluding marked duplicates, QC fail, secondary and supplementary alignments]
2. Fraction of reads phased [computed based on read or its mate overlapping phased variants]
3. Fraction of reads on the more common haplotype [max(number of HP=1 reads, number of HP=2 reads)]
4. MAF
5. MAF of phased reads
6. Number of haplotype-discordant [HD] reads
7. Fraction of phased reads that are HD
8. Fraction of HD reads on the more common haplotype [max(number of HP=1 HD reads, number of HP=2 HD reads)]
9. MAF of HD reads
10. Average base quality of HD reads
11. Average position from the closer end of the alignment on HD reads of the site being classified
12. Average number of soft-clipped bases on HD reads
13. Average number of indels in alignment of HD reads
14-19. Features 8-13 for the set of phased reads that are not HD
20-23. Features 10-13 for the set of mosaic-allele reads
24-27. Features 10-13 for the set of reference-allele reads
28. Weighted HD read base quality: sum of HD read base quality / sum of all phased reads base quality
29. Weighted mosaic-allele read base quality: sum of mosaic-allele read base quality / sum of reference- and mosaic-allele read base quality

Figure 2.8: Random forest features used in the short-read-only Samovar model

1. Depth [excluding marked duplicates, QC fail, secondary and supplementary alignments]
2. MAF
3. Average base quality of mosaic-allele reads
4. Average position from the closer end of the alignment on mosaic-allele reads of the site being classified
5. Average number of soft-clipped bases on mosaic-allele reads
6. Average number of indels in alignment of mosaic-allele reads
7-10. Features 3-6 for the set of reference-allele reads
11. Weighted mosaic-allele read base quality: sum of mosaic-allele read base quality / sum of reference- and mosaic-allele read base quality

Figure 2.9: Random forest features used in the no-phasing Samovar model

Importance  Abbreviation  Number in Figure 2.8
0.20520657  weightedMbq   33
0.16749462  MAF           4
0.13410861  weightedCbq   32
0.08401046  fracC         7
0.07369292  MAFphased     5
0.06894606  Mavgbq        22
0.06203258  nC            6
0.02828793  Cfrach        8
0.02817725  Mavgclip      24
0.02439597  Cavgbq        10
0.01783998  MavgASXS      26
0.01466872  JavgASXS      31
0.01386411  CMAF          9
0.01311324  CavgASXS      14
0.01274496  NMAF          16
0.00876721  NavgASXS      21
0.008098    Mavgind       25
0.00757443  fracphased    2
0.00331775  Navgind       20
0.00318227  Javgind       30
0.00307518  Cavgind       13
0.00297741  Mavgpos       23
0.00291929  Cavgpos       11
0.00285171  Javgbq        27
0.00209531  Javgclip      29
0.00169362  Navgbq        17
0.00123443  Javgpos       28
0.00070641  Navgclip      19
0.00070569  Nfrach        15
0.00069773  Navgpos       18
0.00069729  depth         1
0.00052806  frach         3
0.00029423  Cavgclip      12

Table 2.3: Samovar short-read-only model feature importances in the simulation experiment

Importance  Abbreviation  Number in Figure 2.9
0.42298002  weightedMbq   11
0.29348068  MAF           2
0.14507064  Mavgbq        3
0.07905771  Mavgclip      5
0.03207496  Mavgind       6
0.00746506  Javgind       10
0.00448625  Javgpos       8
0.00438944  Javgbq        7
0.00403807  depth         1
0.0038253   Mavgpos       4
0.00313186  Javgclip      9

Table 2.4: Samovar no-phasing model feature importances in the simulation experiment

[Figure 2.10: plot not reproduced. Four panels with x-axis WGS MAF (0.00 to 0.45) and y-axis precision (a, b) or recall (c, d), from 0.0 to 1.0. Legend, with number of calls at 30X / 60X: MosaicHunter Trio (27,556 / 75,629); MosaicHunter Paired (24,476 / NA); MosaicHunter Tumor-Only (15,253 / 62,060); MuTect2 Paired (152,014 / NA); MuTect2 Tumor-Only (2,912,888 / 2,732,181); Samovar (33,644 / 66,144).]

Figure 2.10: Precision and recall calculated for Samovar, MuTect2, and MosaicHunter variant calls stratified by mosaic allele fraction (MAF) in the whole genome sequencing data (WGS). (a) 30X coverage, precision (b) 60X coverage, precision (c) 30X coverage, recall (d) 60X coverage, recall

[Figure 2.11: plot not reproduced. Two panels with x-axis WGS MAF (0.00 to 0.45) and y-axis precision (0.0 to 1.0). Legend, with number of calls in panel (a) / panel (b): Samovar full model 30X (33,644 / 13,837); Samovar full model 60X (66,144 / 27,453); Samovar short 30X (2,243 / 915); Samovar short 60X (3,001 / 1,257); Samovar no phasing 30X (2,116,038 / 853,383); Samovar no phasing 60X (2,780,143 / 1,131,997).]

Figure 2.11: Precision calculated for variant calls made by Samovar’s full model and the short-read-only and no-phasing models created for illustration, stratified by mosaic allele fraction (MAF) in whole genome sequencing data (WGS). (a) Autosomes (b) Genomic region not filtered by MosaicHunter or Samovar’s region filters.

Figure 2.10 shows each tool’s precision and recall, stratified by MAF in the tumor WGS. Precision is calculated as the fraction of variant calls made that were bamsurgeon synthetic mutations, and recall is calculated as the fraction of bamsurgeon synthetic mutations that were in each tool’s variant call set. Samovar achieves consistently higher precision than the tumor-only modes of MuTect2 and MosaicHunter. Importantly, Samovar’s precision is also comparable to those tools in their trio and paired modes, with MosaicHunter’s paired and trio modes achieving slightly higher precision at MAFs ≥ 0.2 and MuTect2’s paired mode achieving higher precision at MAFs ≥ 0.3.
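The precision and recall definitions above can be written as a short sketch. It represents call sets and the synthetic truth set as sets of (chromosome, position) tuples, which is an assumption for illustration, and it computes the F score as the harmonic mean of precision and recall (F1). The F1 choice is an inference from the reported values: for example, the autosome-wide Samovar values at 30X in Table 2.5 (precision 84.0, recall 30.1, F 44.4) agree with it up to rounding.

```python
def precision_recall_f(calls, truth):
    """Precision, recall and F1 of a call set against inserted synthetic variants.

    calls: iterable of (chrom, pos) variant calls made by a tool.
    truth: iterable of (chrom, pos) synthetic mutation sites.
    """
    calls, truth = set(calls), set(truth)
    true_positives = len(calls & truth)
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

calls = {("chr1", 100), ("chr1", 200), ("chr2", 50), ("chr2", 75)}
truth = {("chr1", 100), ("chr1", 200), ("chr1", 300),
         ("chr2", 50), ("chr3", 10), ("chr3", 20)}
precision, recall, f_score = precision_recall_f(calls, truth)
```

In this toy example three of four calls are true (precision 0.75) and three of six inserted sites are recovered (recall 0.5). Stratifying by MAF would amount to applying the same function to the calls and truth sites within each MAF bin.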

Note that in all cases, the original 10x Genomics BAM file was used. This means that all three Samovar models (as well as MuTect2 and MosaicHunter) benefited from the improved alignment accuracy of the linked-read-aware Lariat aligner, giving the short-read-only and no-phasing models and the other two methods a somewhat artificial advantage.

30X coverage
Region       | Samovar        | Samovar        | Samovar        | MuTect2        | MuTect2        | MosaicHunter   | MosaicHunter   | MosaicHunter
             | Full Model     | Short          | No Phasing     | Tumor-Only     | Paired         | Tumor-Only     | Paired         | Trio
             | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F
Autosomes    | 84.0 30.1 44.4 | 83.7  2.0  3.9 |  3.4 68.3  6.4 |  3.0 83.2  5.7 | 60.8 91.4 73.0 | 31.5  5.1  8.8 | 79.2 20.7 32.8 | 70.4 20.7 32.0
Exons        | 84.0 28.3 42.4 | 85.5  1.8  3.5 |  4.6 70.7  8.6 |  3.6 85.3  7.0 | 60.1 92.0 72.7 | 35.0  7.1 11.8 | 82.1 30.8 44.8 | 73.7 30.8 43.4
Genes        | 84.9 30.1 44.4 | 84.5  1.8  3.6 |  3.9 69.2  7.5 |  3.2 84.4  6.2 | 63.0 92.0 74.8 | 32.6  5.7  9.7 | 79.9 22.7 35.4 | 71.2 22.7 34.5
Enhancer     | 88.5 31.0 45.9 | 90.9  2.1  4.1 |  4.4 61.8  8.2 |  3.9 86.7  7.5 | 72.9 92.3 81.4 | 37.8  5.9 10.1 | 85.5 29.5 43.8 | 80.2 29.5 43.1
Promoter     | 83.3 26.1 39.8 | 76.9  1.4  2.7 |  4.0 65.2  7.5 |  3.0 83.2  5.8 | 59.4 90.9 71.9 | 35.3  6.1 10.4 | 80.5 25.1 38.3 | 73.7 25.1 37.5
Alu          | 82.0 28.6 42.4 | 81.1  2.3  4.4 |  2.7 73.1  5.3 |  2.3 78.2  4.5 | 54.5 88.4 67.4 |  8.6  0.0  0.1 | 56.5  0.3  0.6 | 53.1  0.3  0.6
RepeatMasker | 84.2 29.6 43.9 | 82.3  2.0  3.9 |  2.9 67.0  5.5 |  2.8 81.5  5.3 | 58.9 90.1 71.2 | 20.2  0.3  0.6 | 72.3  1.4  2.7 | 61.3  1.4  2.7
Seg. Dup.    | 25.6 10.4 14.8 | 51.9  0.8  1.5 |  0.6 25.5  1.2 |  1.3 56.9  2.5 | 18.4 62.8 28.5 |  6.6  0.5  0.9 | 39.3  1.7  3.2 | 29.1  1.7  3.2

60X coverage (paired modes not available)
Region       | Samovar        | Samovar        | Samovar        | MuTect2        | MosaicHunter   | MosaicHunter
             | Full Model     | Short          | No Phasing     | Tumor-Only     | Tumor-Only     | Trio
             | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F
Autosomes    | 84.6 43.0 57.1 | 87.8  2.0  4.0 |  3.2 67.9  6.1 |  3.6 76.0  7.0 | 32.4 15.5 20.9 | 46.8 27.2 34.4
Exons        | 84.3 41.8 55.9 | 87.3  1.7  3.2 |  4.6 69.4  8.7 |  4.7 79.6  8.8 | 38.5 25.3 30.5 | 54.0 45.5 49.4
Genes        | 85.6 43.4 57.6 | 89.1  2.0  3.8 |  4.0 68.9  7.5 |  3.9 77.2  7.5 | 33.1 17.0 22.4 | 47.7 30.0 36.8
Enhancer     | 90.8 47.8 62.6 | 93.3  2.2  4.2 |  4.4 61.1  8.1 |  4.8 77.9  9.0 | 36.9 22.7 28.1 | 51.6 40.0 45.1
Promoter     | 85.4 40.7 55.2 | 83.1  1.5  2.9 |  4.0 64.5  7.5 |  4.0 76.8  7.6 | 38.5 21.1 27.3 | 56.4 40.5 47.2
Alu          | 81.1 42.9 56.1 | 84.6  2.5  4.8 |  2.6 72.7  5.0 |  3.0 68.0  5.7 | 16.5  0.2  0.5 | 31.7  0.5  1.0
RepeatMasker | 84.2 42.2 56.2 | 87.1  2.1  4.1 |  2.6 66.6  5.0 |  3.4 74.1  6.4 | 24.7  1.0  1.9 | 38.3  1.8  3.4
Seg. Dup.    | 28.0 13.1 17.8 | 64.3  0.7  1.3 |  0.5 23.6  1.0 |  1.6 48.5  3.1 |  9.8  1.5  2.6 | 18.5  2.7  4.7

Table 2.5: Precision (Prec), recall (Rec), and F score of each tool for the synthetic mosaic variants inserted by bamsurgeon.

30X coverage
Region       | Samovar        | Samovar        | Samovar        | MuTect2        | MuTect2        | MosaicHunter   | MosaicHunter   | MosaicHunter
             | Full Model     | Short          | No Phasing     | Tumor-Only     | Paired         | Tumor-Only     | Paired         | Trio
             | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F
Autosomes    | 89.6 42.1 57.3 | 86.9  2.7  5.2 |  3.6 94.1  7.0 |  3.2 85.6  6.2 | 66.1 93.2 77.4 | 31.6 14.8 20.2 | 79.3 59.4 67.9 | 70.5 59.4 64.5
Exons        | 93.8 39.6 55.7 | 83.7  2.1  4.2 |  5.6 94.7 10.6 |  4.1 87.2  7.9 | 64.7 93.4 76.4 | 34.9 12.5 18.4 | 82.4 54.2 65.3 | 73.9 54.2 62.5
Genes        | 90.9 42.4 57.8 | 87.9  2.5  4.8 |  4.6 94.8  8.7 |  3.4 86.7  6.6 | 67.1 93.7 78.2 | 32.7 15.1 20.6 | 80.0 60.1 68.6 | 71.2 60.2 65.2
Enhancer     | 94.2 42.0 58.1 | 91.7  2.5  4.9 |  5.5 96.1 10.4 |  4.0 87.7  7.7 | 70.1 93.6 80.2 | 36.4 11.3 17.3 | 85.9 58.7 69.7 | 80.1 58.7 67.8
Promoter     | 91.4 36.3 52.0 | 76.7  1.7  3.4 |  4.6 93.9  8.8 |  3.2 85.8  6.2 | 60.5 92.4 73.1 | 35.4 11.6 17.5 | 80.7 48.7 60.7 | 74.2 48.7 58.8
Alu          | 30.2 18.6 23.0 | 25.0  1.2  2.4 |  0.9 60.5  1.8 |  1.4 61.4  2.8 | 27.1 61.4 37.6 |  9.7  4.3  5.9 | 52.6 28.6 37.0 | 50.0 28.6 36.4
RepeatMasker | 68.5 33.1 44.7 | 73.7  2.3  4.4 |  0.7 73.9  1.3 |  2.6 72.3  5.0 | 45.6 75.8 57.0 | 24.4 10.1 14.2 | 75.5 45.0 56.4 | 65.3 45.0 53.3
Seg. Dup.    |  6.8  4.4  5.3 | 28.6  0.6  1.2 |  0.1 17.0  0.2 |  0.8 42.5  1.6 | 11.1 39.6 17.4 |  7.8  4.1  5.4 | 37.6 11.9 18.1 | 27.7 12.3 17.0

60X coverage (paired modes not available)
Region       | Samovar        | Samovar        | Samovar        | MuTect2        | MosaicHunter   | MosaicHunter
             | Full Model     | Short          | No Phasing     | Tumor-Only     | Tumor-Only     | Trio
             | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F | Prec  Rec    F
Autosomes    | 89.7 60.3 72.1 | 89.3  2.7  5.3 |  3.4 94.0  6.6 |  4.0 78.7  7.6 | 32.4 44.5 37.5 | 46.8 78.3 58.5
Exons        | 91.7 58.5 71.4 | 89.2  2.1  4.1 |  5.9 95.0 11.2 |  5.4 81.8 10.1 | 38.6 44.7 41.4 | 54.0 80.6 64.7
Genes        | 90.8 60.8 72.8 | 91.3  2.7  5.2 |  5.0 95.1  9.4 |  4.3 79.8  8.1 | 33.1 45.0 38.1 | 47.6 79.4 59.5
Enhancer     | 94.5 63.3 75.8 | 94.4  2.8  5.5 |  5.8 94.8 10.9 |  4.9 79.6  9.3 | 36.5 45.1 40.4 | 51.1 79.6 62.3
Promoter     | 91.2 57.3 70.4 | 92.2  2.3  4.5 |  4.9 94.7  9.4 |  4.2 78.2  8.0 | 38.6 40.5 39.5 | 56.5 78.1 65.6
Alu          | 45.9 30.0 36.3 | 57.1  3.1  5.8 |  0.7 53.1  1.4 |  2.1 56.2  4.1 | 17.0 20.8 18.7 | 32.4 43.8 37.3
RepeatMasker | 73.6 43.9 55.0 | 82.7  2.5  4.8 |  0.4 69.4  0.8 |  3.2 62.8  6.0 | 27.5 31.2 29.2 | 41.5 54.9 47.3
Seg. Dup.    |  9.2  6.0  7.3 | 15.4  0.4  0.7 |  0.1 13.3  0.2 |  1.0 32.9  2.0 | 10.1 10.3 10.2 | 18.8 18.5 18.7

Table 2.6: Precision (Prec), recall (Rec), and F score of each tool for the synthetic mosaic variants inserted by bamsurgeon in the region of the genome not filtered out by MosaicHunter or Samovar.

In addition to performance genome-wide, we evaluated precision and recall across different annotated genomic regions: genes, exons, all repeats, Alu repeats, segmental duplications, enhancers and promoters listed in the UCSC Genome Browser and Ensembl, shown in Table 2.5. Recall is calculated as the fraction of bamsurgeon synthetic mutations with at least four mosaic allele reads that were in the variant call set, since both Samovar and MosaicHunter require at least four reads to support a variant call. In practice, many tools including Samovar and MosaicHunter apply filters that exclude portions of the genome that lack sufficient evidence or that are inherently difficult to analyze, such as highly repetitive portions, which particularly contributes to MosaicHunter’s poor performance in these genomic regions. Furthermore, 66% of the Samovar false negative sites over which recall was evaluated in the 30X coverage experiment and 38% of false negatives in the 60X experiment had fewer than four haplotype-discordant reads, which is the default requirement for Samovar. Relaxing this parameter can boost recall, although it may also impact precision.

In Table 2.5 and Figure 2.10, Samovar and MosaicHunter use their respective default filters but we have treated the tools as though they are interrogating roughly the same portion of the genome. Table 2.6 and Figure 2.12 attempt to normalize the differences by reporting just those sites that pass both tools’ filters. In GRCh38, this is 32.8% of the autosomal sequence: the sequence remaining after applying MosaicHunter’s simple sequence repeat filter and repetitive region BED files, Samovar’s simple sequence repeat filter, and the exclusion of any CNV regions identified by CNVNATOR.

2.4 Pediatric cancer dataset

We next studied a collection of 13 pediatric cancer cases for which both tumor and normal samples were sequenced using 10x Genomics Chromium Whole-Genome Sequencing (WGS) and Whole-Exome Sequencing (WES). One of these cases was studied previously [98], and the other twelve are novel to this work. Experimental methods for sample preparation and linked read sequencing are described in [30].

Mosaic variant analysis

Cases using reference genome GRCh38 2.1.0 (1, 2, 7, 10, 11) were processed with Long Ranger 2.1.6 and GATK HaplotypeCaller 3.8-0. Samples using reference genome b37 2.1.0 (3, 4, 5, 6, 8, 9, 10, 12) were processed with Long Ranger 2.1.3 and GATK HaplotypeCaller 3.5-0. The sequencing coverage and the fraction of the genome identified by the CNVNATOR [95] calls are recorded in Table 2.8. See Table S8 of [30] for the oncology diagnosis of each case.

We ran Samovar, MosaicHunter (in both paired and tumor-only modes), and MuTect2 (in both paired and tumor-only modes) on each of the 13 tumor WGS datasets. When running MosaicHunter or MuTect2 in paired mode, we also provided the paired normal WGS.

To estimate the accuracy of the different approaches, we used the WES sequencing as a validation dataset, as it provides independent and deeper coverage over candidate variants within the exome. We first identified the calls from each tool within the exome capture region. The number and precision of the exome-coincident calls made by each tool are shown in Table 2.7.

[Figure 2.12: plot not reproduced. Two panels with x-axis WGS MAF (0.00 to 0.45) and y-axis precision (0.0 to 1.0). Legend, with number of calls at 30X / 60X: MosaicHunter Trio (24,823 / 68,395); MosaicHunter Paired (22,033 / NA); MosaicHunter Tumor-Only (13,785 / 56,126); MuTect2 Paired (44,836 / NA); MuTect2 Tumor-Only (866,637 / 814,895); Samovar (13,837 / 27,453).]

Figure 2.12: Precision calculated in the genomic region not filtered by MosaicHunter or Samovar’s region filters, calculated for Samovar, MuTect2, and MosaicHunter variant calls stratified by mosaic allele fraction (MAF) in whole genome sequencing data (WGS). (a) 30X coverage (b) 60X coverage.

Case  | Samovar     | MuTect2      | MuTect2     | MosaicHunter | MosaicHunter
      | Full Model  | Tumor-Only   | Paired      | Tumor-Only   | Paired
      | Calls  Prec | Calls   Prec | Calls  Prec | Calls  Prec  | Calls  Prec
1     |    22  0.71 |  23,216 0.03 |   406  0.45 |   202  0.63  |   144  0.62
2     |    23  0.75 |  23,960 0.02 |   341  0.20 |   258  0.25  |   124  0.27
3     |    42  0.74 |  23,866 0.02 |   359  0.34 |   177  0.45  |    68  0.66
4     |    37  0.72 |  24,317 0.02 |   285  0.28 |   159  0.46  |    81  0.59
5     |    21  0.91 |  24,036 0.01 |   321  0.33 |   170  0.45  |    69  0.70
6     |    50  0.95 |  23,978 0.01 |   265  0.36 |   234  0.41  |   108  0.56
7     |    23  0.80 |  23,905 0.02 |   245  0.29 |    88  0.63  |    58  0.78
8     |    28  0.74 |  23,949 0.02 |   322  0.24 |   187  0.44  |    86  0.47
9     |    25  0.62 |  24,893 0.02 |   276  0.31 |   185  0.46  |    78  0.56
10    |    29  0.53 |  25,290 0.01 |   313  0.28 |   344  0.33  |   144  0.49
11    |    22  0.70 |  24,043 0.02 |   284  0.41 |   105  0.75  |    83  0.80
12    |    21  0.58 |  23,875 0.02 |   278  0.48 |   178  0.58  |    72  0.81
13    |    15  0.71 |  23,663 0.02 |   268  0.35 |   112  0.76  |    66  0.80
Total |   358       | 312,991      | 3,963       | 2,399        | 1,181

Table 2.7: Number of variant calls in the exome capture regions and precision (Prec) based on supporting reads found in WES. Samovar has the highest validation rate in 10 out of the 13 cases.

We then examined the corresponding WES tumor data for evidence of the mosaic call made in the WGS data. We considered a mosaic variant call to be validated if (a) the corresponding WES tumor sample had at least 50 aligned reads at the locus with at least 4 reads supporting the mosaic allele, and (b) the mosaic variant was not found to be germline by Long Ranger in both the tumor and normal WGS data from that patient. Figure 2.13 stratifies the validation rate by MAF in the WGS data and Table 2.7 shows each tool’s overall precision for the calls in the exome capture region. The bar graph shows the number of variants in each MAF bin. MosaicHunter paired called 3 times as many variants as Samovar, and MuTect2 paired called 11 times as many variants. This is because Samovar requires phasing-based evidence to make a call, which makes it more stringent, and because tumor/normal callers can identify variants that are homozygous or heterozygous in the tumor sample but have a different genotype compared to normal. Additionally, MuTect2 does not filter out CNV regions like MosaicHunter and Samovar, allowing it to call variants in a larger region of the genome. However, Samovar’s validation rate is comparable to the paired callers across a range of MAF, indicated by the comparable precision of Samovar in Figure 2.13e compared to other tools’ paired modes in Figure 2.13a and c. Against tumor-only modes of other tools, Samovar has superior precision especially at MAF ≥ 0.15: MuTect2 tumor-only mode is not designed to differentiate heterozygous from high-MAF mosaic variants, and MosaicHunter makes few calls with a low validation rate.
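The WES validation rule can be sketched as a small predicate. The function and parameter names are hypothetical, and the germline criterion is passed in as a single precomputed flag (whether Long Ranger called the variant germline in both the tumor and normal WGS), so the sketch makes no assumption about how that flag is derived.

```python
def is_validated(wes_depth, wes_alt_reads, called_germline_in_both,
                 min_depth=50, min_alt_reads=4):
    """Apply validation criteria (a) and (b) to one exome-coincident call.

    wes_depth:               aligned WES tumor reads at the locus.
    wes_alt_reads:           WES tumor reads supporting the mosaic allele.
    called_germline_in_both: True if the variant was found to be germline in
                             both the tumor and normal WGS data.
    """
    has_wes_support = wes_depth >= min_depth and wes_alt_reads >= min_alt_reads
    return has_wes_support and not called_germline_in_both

def validation_rate(calls):
    """calls: list of (wes_depth, wes_alt_reads, called_germline_in_both)."""
    if not calls:
        return 0.0
    return sum(is_validated(*call) for call in calls) / len(calls)
```

Under this rule a call with 60 WES reads, of which 5 support the mosaic allele, is validated unless it was called germline in both samples, while a call at a locus with fewer than 50 WES reads counts as unvalidated regardless of allele support.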

As Samovar demonstrated high single sample precision in simulation, comparable to the other tools’ paired analysis, we are also able to run it on the normal control available for each of these cases. Sensitivity was measured in the same fashion using WES of the normal sample; across all 13 samples, 732 variants were in the exome capture region and the validation rate was 65% (see Table 2.9 for per-sample statistics). More mutations were found in normal samples because a larger fraction of the genome was excluded by CNVNATOR calls in tumor samples, as shown in Table 2.8. Interestingly, using ANNOVAR [99], we determined 11 of these mosaic mutations across 7 cases were nonsynonymous (amino-acid-changing) in one of the 299 cancer driver genes identified in [100]. The extent of mosaicism in normal tissue and how this may relate to pediatric cancer are interesting avenues of future study now possible with Samovar.

     | Tumor                                | Normal
Case | WES coverage  WGS coverage  CNVNATOR % | WES coverage  WGS coverage  CNVNATOR %
1    | 549           45            9.3        | 617           42            8.7
2    | 504           41            16.8       | 529           41            9.3
3*   | 271           35            23.6       | 255           34            11.0
4*   | 223           34            12.4       | 232           34            11.9
5*   | 207           34            15.1       | 268           35            10.9
6*   | 226           40            11.8       | 223           38            11.5
7    | 472           35            10.3       | 445           38            8.4
8*   | 330           35            11.1       | 319           34            10.9
9*   | 411           36            16.1       | 346 (Blood)   36 (Blood)    10.8 (Blood)
     |                                        | 400 (Tissue)  36 (Tissue)   11.0 (Tissue)
10*  | 500           40            11.0       | 392           37            22.0
11   | 669           37            10.5       | 579           35            10.5
12*  | 618           37            11.4       | 726           37            10.9
13   | 777           37            20.1       | 681           37            8.7

Table 2.8: Cases using reference genome GRCh38 2.1.0 (1, 2, 7, 10, 11) were processed with Long Ranger 2.1.6 and GATK HaplotypeCaller 3.8-0. Samples using reference genome b37 2.1.0 (3, 4, 5, 6, 8, 9, 10, 12) were processed with Long Ranger 2.1.3 and GATK HaplotypeCaller 3.5-0.

Case   Calls  Sensitivity
1      58     0.70
2      85     0.61
3      70     0.57
4      73     0.68
5      51     0.58
6      73     0.58
7      61     0.63
8      43     0.62
9      39     0.48
10     50     0.45
11     30     0.73
12     59     0.78
13     70     0.88
Total  762

Table 2.9: Samovar analysis of normal WGS dataset for pediatric cancer cases. Number of calls shown is for the WES capture region, and validation is performed as described in the preceding section.

[Figure 2.13: plot not reproduced. Five panels: (a) MuTect2 paired, (b) MuTect2 tumor-only, (c) MosaicHunter paired, (d) MosaicHunter tumor-only, (e) Samovar. X-axis: WGS Mosaic Allele Fraction (MAF), 0.0 to 0.5; left y-axis: precision, 0.0 to 1.0; right y-axis: number of variant calls.]

Figure 2.13: Fraction of variant calls in exome capture region supported by WES data (black line, left axis ticks) and number of variant calls (gray bars, right axis ticks) stratified by mosaic allele fraction (MAF), combined for the 13 pediatric cancer cases studied.

2.5 Discussion

Genomic mosaicism is an important characteristic of many human diseases and conditions. Accurately identifying mosaic variants has previously relied on paired samples or trio analysis, which increases sequencing costs and may not be possible in many situations. By taking advantage of linked-read properties, particularly the ability to accurately assemble haplotypes and assign nearly all reads to germline haplotypes, Samovar is able to call mosaic SNVs for a single sample at a level of precision that is comparable to paired and trio-based methods. Samovar also achieves substantially higher precision at low MAFs (< 15%) and higher recall in more difficult-to-analyze portions of the genome such as segmental duplications and repetitive elements. This opens the door to a wider range of discoveries than are possible with current methods. The common WGS sequencing depths we tested (30X, 60X) appear to fall below the detection limit of MosaicHunter in single-sample mode, and our performance is comparable when competitors MosaicHunter and HapMuC are run with access to paired normal or parental datasets. Thus far, ours is the only method to use read-level haplotype information available with linked reads to identify somatic variants.

Future work

Other datatypes. Samovar requires 10x Genomics linked read WGS data, which, at the time this work was completed, was commercially available and added approximately 15% to the cost of a standard paired-end Illumina sequencing experiment (not including the one-time cost of the Chromium controller). Since 10x Genomics linked reads are discontinued, other linked read technologies described in the Background section of this chapter could theoretically be used for mosaic variant calling instead. The features in the Samovar model were designed and selected based on 10x Genomics technology, so new features related to the particular specifications of another technology may be required for good model performance on sequencing data from another linked read platform. Recently, high-accuracy long reads have become available from Pacific Biosciences (PacBio) in their high-fidelity (HiFi) technology, which reports 13.5 kb reads with 99.8% accuracy [101]. Further investigation into this technology is needed to determine if phased haplotypes of germline variants are accurate enough to reliably assign individual reads to haplotypes. Additionally, it is far more expensive to generate this sequencing data compared to linked reads. Again, new features may be required to see comparable performance to the Samovar linked read model.

Paired and trio samples Though Samovar already compares favorably to tools that use matched-normal and trio data, a possible future extension is to incorporate trio and matched-normal data (if available) directly into the model. Samovar’s recall and precision might be further improved by additional data, which would lower the false-positive rate by identifying germline variants incorrectly identified as mosaic, and spurious calls due to sequencing or alignment artifacts. Based on the results collected here, we expect that a key benefit of this would be to improve recall at all MAFs and to extend the high precision achieved by the existing paired- and trio-based methods into the low end of the MAF spectrum.

Indels While Samovar currently only detects SNVs, it could be extended to small indels that display the same pattern of haplotype-discordant reads. For this analysis, additional indel-related features would also be needed in the random forest model to discriminate true indels from sequencing and alignment errors. We would also have to employ local realignment of the short reads around putative indels to eliminate false positives due to the initial read alignments and correctly report the coordinates of the indels called.

Exome sequencing In this research, we used linked-read WES for validation only and not variant calling, leveraging the increased sequencing coverage in the capture region compared to the WGS samples. Another extension to Samovar could be variant calling directly from exome linked reads. In this new model, we would have to carefully address coverage bias throughout the capture region as well as other potential differences between linked read WGS and linked read WES, such as molecule length and reads-per-molecule. While the 10x Genomics linked read publication, based on phasing results from Long Ranger, reports promising results for germline variant calling and phasing [42], further investigation into the accuracy of germline variant phasing and short-read assignment to haplotypes is needed in the context of how they affect the features we selected in Samovar for mosaic variant calling from WGS.

Chapter 3

scHLAcount: Allele-specific HLA expression from single-cell gene expression data

This chapter describes scHLAcount, a method for allele-specific molecule counting from single-cell RNA sequencing data, and an application of the method to cancer and non-disease datasets. This work was completed during my internship at 10x Genomics in June–August 2019 and subsequently appeared as an Applications Note in Bioinformatics [102].

Charlotte A. Darby, Michael J. T. Stubbington, Patrick J. Marks, Álvaro Martínez Barrio, and Ian T. Fiddes. “scHLAcount: Allele-specific HLA expression from single-cell gene expression data.” In: Bioinformatics (2020, accepted).

scHLAcount is available on GitHub under the MIT license [103].

3.1 Background

This background section is organized in the order in which experimental and computational steps would be taken in an analysis workflow. First, single-cell gene expression data is generated using a UMI-based approach (e.g. 10x Genomics). For the same individual, HLA genotypes must be obtained using a separate molecular assay. Finally, scHLAcount is run to compute allele-specific molecule counts for the HLA genes based on the individual’s genotypes.

Single-cell gene expression

Recently, several biotechnologies have emerged to enable studies of mRNA expression in single cells. The resulting data has been used to study the dynamics of gene expression in cells of different types. These datatypes have also presented computational challenges: methods developed for bulk RNA sequencing had to be adapted to their characteristics, and in some cases novel computational methods were developed.

CEL-seq2 [104], Drop-seq [105], inDrop [106], MARS-seq [107], SCRB-seq [108], and 10x Genomics [109] protocols are based on unique molecular identifiers (UMIs). In these protocols, all reads derived from the same transcript molecule carry the same DNA barcode, which is sequenced along with the read. In addition to the molecule-level barcode, some protocols incorporate a cell-level barcode for enhanced multiplexing. UMI-based protocols sequence a few reads per transcript, concentrated at one end. All listed methods capture the 3’ end of a polyadenylated transcript, apart from 10x Genomics, which also offers a 5’-end protocol. In contrast, Smart-seq2 [110, 111] does not use UMIs and sequences reads evenly distributed throughout the transcript.

As for sequestering cells, Drop-seq, inDrop, and 10x Genomics use a microfluidic device to create droplets with individual cells and reagents; CEL-seq2 and Smart-seq2 are compatible with the Fluidigm C1, another microfluidic device, which isolates cells into plate wells and is limited to 96 cells; individual cells can also be isolated into plate wells using fluorescence-activated cell sorting (FACS), as in the MARS-seq and SCRB-seq protocols. The throughput of these single-cell paradigms (number of cells assayed per experiment) is highly variable. Ziegenhain et al. [112] generated libraries for six methods (CEL-seq2, Drop-seq, MARS-seq, SCRB-seq, Smart-seq, and Smart-seq2), which assayed 29 to 80 cells per replicate. In contrast, 10x Genomics is much higher throughput and can assay thousands of cells (up to 10,000) per library. Zhang et al. [10] compared three droplet-based protocols (Drop-seq, inDrop, and 10x Genomics Chromium). The resulting libraries contained 1000-6000 cells depending on protocol and replicate.

10x Genomics single-cell gene expression library preparation

In this work, we study single-cell gene expression data from the 10x Genomics Chromium platform, a very popular library preparation protocol for single-cell gene expression data. Challenges specific to the 3’-end and 5’-end capture protocols, as well as other technology-specific nuances, may not be applicable to other protocols. However, our algorithm may be applicable to data from similar single-cell gene expression protocols. We describe the 10x Genomics library preparation in more detail below.

Using the Chromium controller, a cell and the reagents for library preparation are encapsulated in a droplet referred to as a gel bead emulsion (GEM). As more cells are loaded, the chance that a droplet contains two or more cells (known as a doublet or multiplet, respectively) increases. Doublets/multiplets are generally undesirable as the reads coming from each cell are indistinguishable.

There are two types of gene expression protocol: a 3’-end capture that uses a poly(dT) sequence on the gel bead to capture the poly-A tail of mRNA [113], and a 5’-end capture that uses a switch oligo on the gel bead to capture the 5’ end of a transcript after reverse transcription [114]. All reads from each original transcript molecule (there are typically 1 to 3 at recommended sequencing depth) share a unique molecular identifier (UMI) and all reads from a GEM share a cell barcode (CB). Reads are sequenced in a paired-end configuration where one read contains the UMI and CB, and the second read is 91 or 98 bases from the transcript and is used to identify which gene the UMI came from based on read alignment to the reference genome. The position in the transcript of the second read varies depending on the insert size of that molecule, but is generally close to the captured end of the transcript, leading to a strong coverage bias (see Figure 1b from [115] and Figure 3.5). When the analysis goal is simply counting how many transcripts from each gene are in a particular cell, the coverage-biased data is adequate. However, this coverage bias is one technical challenge we had to address in this project.

RNA-seq read alignment and molecule counting

In the Cell Ranger pipeline, which was used as preprocessing before the analysis in this work, spliced alignment is performed to the reference genome with STAR [116]. Reads are grouped by CB and UMI and each UMI is assigned to a gene based on the alignments of the constituent read(s). The output is a matrix of cell barcodes by genes, with the count of UMIs for that gene in that cell in each matrix entry.
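The grouping step described above can be illustrated with a minimal sketch. This is not Cell Ranger's implementation: the function name `count_molecules`, the toy barcodes, and the rule that a molecule is kept only when all of its reads agree on one gene are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def count_molecules(reads):
    """Build a cell-by-gene UMI count table.

    reads: iterable of (cell_barcode, umi, gene) tuples, one per aligned read.
    Returns {cell_barcode: {gene: number_of_UMIs}}.
    """
    # Collect the genes hit by the reads of each (CB, UMI) molecule.
    genes_per_molecule = defaultdict(Counter)
    for cb, umi, gene in reads:
        genes_per_molecule[(cb, umi)][gene] += 1

    matrix = defaultdict(Counter)
    for (cb, _umi), gene_hits in genes_per_molecule.items():
        # Simplifying assumption: count a molecule only if all of its
        # reads were assigned to the same gene.
        if len(gene_hits) == 1:
            gene = next(iter(gene_hits))
            matrix[cb][gene] += 1
    return {cb: dict(genes) for cb, genes in matrix.items()}

# Toy input (hypothetical barcodes and UMIs).
reads = [
    ("AAAC", "TTG", "HLA-A"),
    ("AAAC", "TTG", "HLA-A"),  # same molecule sequenced twice, counted once
    ("AAAC", "GGA", "HLA-B"),
    ("CCGT", "TTG", "HLA-A"),
    ("CCGT", "ACA", "HLA-A"),
    ("CCGT", "ACA", "HLA-B"),  # reads disagree on the gene, molecule dropped
]
counts = count_molecules(reads)
```

Counting UMIs rather than reads means that a molecule sequenced several times contributes a single count, which is the property the matrix entries rely on.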

Some GEMs do not contain a cell, just extracellular transcripts. The next step is to identify which CBs correspond to GEMs containing cells; this is usually done based on analyzing the distribution of the number of UMIs detected per CB, which is typically bimodal. Finally, molecule counts are normalized, as the CBs identified to correspond to cells have different numbers of sequencing reads due to random sampling.

Human leukocyte antigen (HLA) genes and their expression

The major histocompatibility complex (MHC) locus of human chromosome 6 contains genes for both class I and class II human leukocyte antigen (HLA). These genes have considerable sequence variability among the human population, with hundreds to thousands of cataloged alleles per gene in databases such as IMGT-HLA [117]. Alleles for a particular HLA gene (e.g. HLA-A) are named with four numeric fields. The first field indicates the allele group; the second, the protein produced; the third, particular exonic synonymous variants; the fourth, particular noncoding variants. More fields indicate a higher-resolution genotype. This nomenclature is diagrammed in Figure 3.1.

HLA-A*24:101:01:02

Gene: HLA-A. Field 1 (24): allele group. Field 2 (101): protein. Field 3 (01): exonic synonymous variants. Field 4 (02): non-coding variants.

Figure 3.1: HLA gene and allele nomenclature convention. Adapted from Bauer et al. [118] and http://hla.alleles.org/nomenclature/naming.html.
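As an illustration of the nomenclature, the following sketch splits an allele name into its gene and numeric fields. The regular expression, the `HlaAllele` type, and the helper names are hypothetical and are not taken from any HLA software; a single trailing letter (such as the G in group designations like A*03:01:01G) is accepted and discarded.

```python
import re
from typing import NamedTuple, Tuple

class HlaAllele(NamedTuple):
    gene: str                 # e.g. "A" or "DRB1"
    fields: Tuple[str, ...]   # one to four numeric fields, kept as strings

# Optional "HLA-" prefix, gene name, "*", 1-4 colon-separated numeric fields,
# optional single-letter suffix.
_ALLELE_RE = re.compile(r"^(?:HLA-)?([A-Z0-9]+)\*(\d+(?::\d+){0,3})[A-Z]?$")

def parse_allele(name: str) -> HlaAllele:
    match = _ALLELE_RE.match(name.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a recognizable HLA allele name: {name!r}")
    return HlaAllele(match.group(1), tuple(match.group(2).split(":")))

def same_protein(a: HlaAllele, b: HlaAllele) -> bool:
    """True if both alleles are resolved to at least two fields and share them."""
    return (a.gene == b.gene
            and len(a.fields) >= 2 and len(b.fields) >= 2
            and a.fields[:2] == b.fields[:2])

example = parse_allele("HLA-A*24:101:01:02")
resolution = len(example.fields)  # number of fields, i.e. genotype resolution
```

Keeping the fields as strings preserves leading zeros (01 and 1 are not interchangeable in allele names), and the number of fields gives the genotype resolution directly.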

The proteins coded for by the classical class I and class II HLA genes are responsible for neoantigen presentation. An individual’s HLA genotype has broad implications for the immune system, including in cell or organ transplantation, response to cancer immunotherapy treatment, and autoimmune disease. All alleles with the same first two fields have the same protein product, so genotyping efforts are often limited to this resolution. However, other exonic or non-coding variants may affect expression. The presence of other variants may also affect how well reads from this region align to the human reference genome, where the primary assembly contains just one allele for each gene.

HLA expression has specific implications in the context of immunotherapy treatments for cancer. Loss of HLA expression or function is likely a major driver of immunotherapy evasion. Loss of HLA class I expression has been demonstrated in relapse after immunotherapy treatment of Merkel cell carcinoma [119] and loss of HLA class II expression was observed in relapse after hematopoietic stem-cell transplantation for acute myeloid leukemia [120]. Genomic loss of heterozygosity of HLA has been detected in non-small-cell lung cancer using the LOHHLA algorithm, which uses information about the individual’s HLA genotype to determine copy number [121].

Since the two alleles present in a diploid individual may be quite divergent in DNA sequence and/or protein product, there have been efforts to determine if different alleles have systematic allele-specific expression. Aguiar et al. [122] and Lee et al. [123] found that many HLA genes demonstrate allele-specific expression in heterozygous lymphoblastoid cell lines. Bettens, Brunet, and Tiercy [124] found through qPCR that some HLA-C alleles have consistently higher expression than others; Zajacova, Kotrbova-Kozak, and Cerna [125] observed the same for alleles of the class II genes HLA-DQB1 and DQA1.

Cell type specific expression of the HLA genes has also been studied. In a study of allele-specific expression of HLA-A, -B, and -C genes in leukocyte subsets, the authors of [126] found no cell type specific allele preference in PBMC from two human subjects, but found alleles in the rhesus macaque with significant cell type specific expression. Studies in bulk RNA-seq data have shown that HLA genes are expressed at different levels among human tissues and immune cell types [127].

HLA genotyping

HLA genotyping can be performed with specialized biological assays, such as sequence-specific oligonucleotide probe PCR (PCR-SSOP), sequence-specific primed PCR (PCR-SSP), and Sanger-based sequence-based typing (SBT) [128]. These assays vary in resolution, which is the number of fields in the HLA genotype that can be resolved. Alternatively, computational methods can be employed to perform HLA genotyping from whole-genome sequencing (WGS), whole-exome sequencing (WES) or RNA sequencing (RNA-seq) reads.

Kourami performs de Bruijn graph assembly on WGS or WES data and compares assembled allele sequences to a database [129]. OptiType aligns reads to a database of allele sequences and performs integer linear programming (ILP) to find the most likely alleles in the sample [130]. It works with WGS, WES, or RNA-seq data and genotypes class I genes only. xHLA translates DNA sequencing reads from WGS or WES to protein sequences, aligns these translated queries to a database of protein allele sequences, and uses an ILP algorithm to select genotypes at two-digit resolution [131].

Methods using RNA-seq by definition are limited to two-digit resolution. seq2HLA uses a short-read unspliced aligner to map RNA-seq reads against a database of exons 2 and 3 of thousands of alleles of three class I and three class II genes in two phases that narrow down the likely alleles [132]. arcasHLA uses pseudoalignment to a de Bruijn graph of allele sequences and an expectation-maximization algorithm to determine alleles [133]. HLApers has a similar algorithm, with the option to use an RNA-seq aligner instead of pseudoalignment [122]. AltHapAlignR uses a database of eight haplotype reference sequences encompassing the entire MHC region and a graph optimization algorithm to build a personalized diploid reference by combining the reference haplotypes [123]. HLApers and AltHapAlignR include expression quantification in their pipelines. HLAProfiler uses a k-mer based taxonomic classifier with an extensive allele database to identify alleles likely present in the sample, and reads are subsequently aligned to this smaller set of sequences to refine the genotype [134].

Bauer et al. [118] evaluated many computational genotyping tools, including some described above, for WGS, WES, and RNA-seq data. Interestingly, they noted that PCR-based typing methods, often used as a gold standard, sometimes disagree with each other. Tian et al. [135] attempted to genotype individual cells for HLA class I using scRNA-seq data with seq2HLA and OptiType, but found that most cells did not have adequate read coverage. Combining reads from many cells in an scRNA-seq experiment as a pseudo-bulk dataset for genotyping is an interesting avenue for further research.

Pseudoalignment with de Bruijn graphs

The de Bruijn graph is a structure generated from a set of strings. The nodes of this graph represent all k-mers (substrings of length k) present in the strings. There is a directed edge between two nodes if there is a k-1 character overlap between the k-mers represented by the nodes. If the set of strings is sequencing reads, path-finding algorithms on the graph can be used for assembly [136]. Here, we address the use of the graph for read mapping, where the set of strings is reference sequences.
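The definition above can be made concrete with a small sketch. It assumes a DNA alphabet of A, C, G, and T and uses toy sequences; the function name `build_de_bruijn` is hypothetical, and the code is meant only to illustrate the node and edge definitions, not the data structure used by any particular tool.

```python
from collections import defaultdict

def build_de_bruijn(sequences, k):
    """Return (nodes, edges) of a node-centric de Bruijn graph.

    nodes: set of all k-mers present in the input sequences.
    edges: dict mapping a k-mer to the set of k-mers in the graph whose first
           k-1 characters equal its last k-1 characters.
    """
    nodes = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            nodes.add(seq[i:i + k])

    edges = defaultdict(set)
    for kmer in nodes:
        for base in "ACGT":  # assumption: DNA alphabet without ambiguity codes
            successor = kmer[1:] + base
            if successor in nodes:
                edges[kmer].add(successor)
    return nodes, dict(edges)

# Two toy sequences that differ only in their last base.
nodes, edges = build_de_bruijn(["ACGTA", "ACGTC"], k=3)
```

In this toy graph the shared prefix collapses into a single path and the graph branches at the k-mer CGT, where the two sequences diverge.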

Mapping of reads to the de Bruijn graph representation of reference sequences does not report base-level alignments or employ an edit distance scoring function, so it is called pseudoalignment to differentiate it from base-level, edit distance based “true” alignment algorithms. (See Chapter 4 for more details on read alignment.) k-mers from the reads are mapped to nodes in the graph to identify which reference sequences contain those k-mers. The tool described in [137] and Kallisto [138] are two popular tools that map the k-mers of a set of RNA-seq reads to the reference transcriptome de Bruijn graph and subsequently perform isoform quantification. If an exon is present in multiple isoforms of a gene, this exon’s k-mer nodes in the graph will correspond to all possible isoforms. The set of sequences (in this case, isoforms) a k-mer corresponds to is called the equivalence class.

As previously mentioned, arcasHLA [133] maps RNA-seq reads from the HLA genes to a de Bruijn graph built from a large database of HLA gene sequences. Some alleles of the HLA genes have partly shared sequences, and some HLA genes (e.g. A, B, C) also have sequence homology. After pseudoalignment, the alleles present in the sample are then determined using a similar algorithm to isoform quantification tools based on the equivalence classes of the k-mers from RNA-seq reads.

scHLAcount performs pseudoalignment of reads from scRNA-seq to a de Bruijn graph containing only the personalized HLA alleles in the sample. Since the two alleles of each gene have common sequence, some reads are not assigned to a unique allele but can be assigned to a gene if the equivalence class contains both alleles of this gene.
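A minimal sketch of the equivalence-class idea follows. It assumes that a read's equivalence class is the intersection of the reference sets of its k-mers and that k-mers absent from the index are skipped; real pseudoaligners add further rules (for example, minimum alignment length and mismatch limits). The function names and the two short allele sequences are invented for illustration.

```python
def build_kmer_index(references, k):
    """Map each k-mer to the set of reference names that contain it."""
    index = {}
    for name, seq in references.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index

def equivalence_class(read, index, k):
    """Intersect the reference sets of all read k-mers found in the index."""
    eq = None
    for i in range(len(read) - k + 1):
        refs = index.get(read[i:i + k])
        if refs is None:
            continue  # k-mer not in any reference; skipped in this sketch
        eq = set(refs) if eq is None else eq & refs
    return eq or set()

# Hypothetical toy sequences that differ at a single position.
references = {"A*02:01": "ACGTACGGAT", "A*31:01": "ACGTACGCAT"}
index = build_kmer_index(references, k=4)

shared = equivalence_class("CGTAC", index, k=4)     # covers only shared sequence
specific = equivalence_class("TACGGA", index, k=4)  # covers the differing base
unmapped = equivalence_class("TTTTT", index, k=4)   # no k-mer in the index
```

A read that falls entirely within sequence common to both alleles yields an equivalence class containing both, which is why such a read can be assigned to the gene but not to a specific allele.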

Steps toward single-cell ASE analysis

Deng et al. [139] used Smart-seq/Smart-seq2 to assay cells from mouse embryos of a hybrid strain. Germline variants of the parental strains were characterized and alignments to strain-specific references were compared to determine the haplotype of origin of reads overlapping these marker SNPs. Reinius et al. [140] performed a similar analysis on Smart-seq2 data from cells of adult hybrid mice. They extended the analysis to human T cells where germline variants were called from WES. Higher heterozygosity and more comprehensive characterization of germline variants in the mouse hybrid led to more genes that could be analyzed for ASE in their mouse experiment than in the human experiment. Jiang, Zhang, and Li [141] developed a pipeline, SCALE (Single-Cell ALlelic Expression), requiring germline variant calls and Smart-seq data, that characterizes ASE and transcriptional bursting dynamics.

In theory, application of such methods to UMI-based assays rather than full-length assays would be very limited genome-wide because the 1 to 3 reads per UMI would rarely overlap the germline variants present in each gene, depending on the heterozygosity. VarTrix [142] and the pipeline described in [115] determine which cells harbor a mosaic DNA mutation based on scRNA-seq reads from 10x Genomics 3’ GEX and 5’ GEX assays. Depending on the proximity of the mutation to be genotyped to the captured end of the transcript, the number of cells where the mutation can be genotyped can be quite limited (Petti et al. [115] Figure 1d, lower right panel). VarTrix aligns reads overlapping the variant site to be genotyped to reference-allele and alternate-allele genome snippets.

As mentioned, the HLA genes are highly polymorphic in the human population, so the two alleles present in an individual are likely to harbor many more heterozygous variants than other regions of the genome. As a result, even though each UMI has only a few reads and those reads are not evenly spread throughout the transcript, many molecules can be successfully assigned to alleles, depending on the genotypes present in the individual and the sequencing protocol.

Results

We developed the scHLAcount software, which uses a personalized reference based on the HLA alleles present in the sample to assign scRNA-seq molecules from the HLA genes to alleles. To illustrate the applications of scHLAcount, we reanalyzed three previously published datasets. First, we analyzed the dataset of CD8+ T cells from four donors for which partial HLA genotypes were available [19]. Second, we applied our method to five acute myeloid leukemia (AML) samples [115]. Using the scHLAcount allele-specific molecule counts, we detected cell type specific allele bias. Third, we reexamined data from two Merkel cell carcinoma (MCC) patients [119]. We extend the original finding that HLA class I expression is lost in tumor cells compared with non-tumor cells and use scHLAcount allele-specific molecule counts to show that this expression loss may be allele-specific.

3.2 scHLAcount pipeline

scHLAcount is a postprocessing workflow for single-cell gene expression data that produces allele-specific molecule counts for the main HLA class I and class II genes in each cell (Figure 3.2). Users provide the specific HLA alleles present in their sample of interest. These can be obtained by specialized molecular tests or by algorithms for sequence-based typing from next-generation sequencing reads of the genome, exome, or transcriptome, as previously discussed.

Molecule-counting algorithm

Based on the genotypes provided, scHLAcount extracts the coding and genomic sequences of those alleles from the IMGT/HLA database [117] and builds two colored de Bruijn graphs, one containing the CDS sequences and one containing genomic sequences. In addition, scHLAcount uses the read alignments generated by scRNA-seq analysis tools such as Cell Ranger. Reads associated with valid cell barcodes and reported as aligning to the region of the genome containing the HLA genes are extracted from the alignment file and pseudoaligned to the CDS graph. This yields the set of alleles in the reference graph that could have generated the read (equivalence class, as in [138]). If there is no significant alignment to the CDS graph, pseudoalignment is attempted to the genomic sequence graph. In 5’ GEX datasets, we observed that up to 12% of aligned reads were only aligned to the genomic sequence graph and not the CDS graph. In 3’ GEX datasets, up to 80% of aligned reads were aligned to the genomic sequence. This genomic alignment step is intended to rescue reads that may be haplotype-specific in 3’ or 5’ UTR regions. It also provides a mechanism to handle reads from pre-mRNA in single-nucleus RNA-seq libraries.

Reads are then collated by molecule, which in 10x Genomics data is identified by the 12bp cell barcode (CB) and 10bp unique molecular identifier (UMI). Reads sharing a CB and UMI are assumed to originate from the same RNA molecule. At recommended sequencing depths with modest sequence saturation, there are typically 1-3 reads per UMI. Even so, individual reads from the same molecule may have different equivalence classes according to the pseudoalignment. We ignore reads whose equivalence class contains more than one gene, which we observed was 15-45% of aligned reads in 5’ GEX datasets and 10% of reads in 3’ GEX. If more than half of the reads from a molecule are assigned to a particular gene, that molecule will be assigned to one of its input reference alleles (e.g. HLA-A*02:01), based on the constituent reads’ equivalence classes. In the case of ambiguity, it will be assigned to that gene (e.g. HLA-A) instead. The output is a sparse molecule count matrix where each column corresponds to a barcode in the provided cell barcode list, and each row corresponds to an allele or, for ambiguous molecules, a gene.
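The molecule-level decision rule described above can be sketched as follows. This is an illustrative reading of the text, not the scHLAcount source code. Two details are assumptions: the "more than half" threshold is taken over all reads of the molecule, and the allele is chosen by intersecting the equivalence classes of the reads supporting the winning gene. The names `gene_of` and `assign_molecule` are hypothetical.

```python
from collections import Counter

def gene_of(allele):
    """Gene part of an allele name, e.g. 'A*02:01' -> 'A'."""
    return allele.split("*")[0]

def assign_molecule(read_classes):
    """Assign one (CB, UMI) molecule to an allele, a gene, or None.

    read_classes: list of sets of allele names, one equivalence class per read.
    """
    n_reads = len(read_classes)
    # Ignore reads with an empty class or a class spanning more than one gene.
    single_gene = []
    for eq in read_classes:
        genes = {gene_of(a) for a in eq}
        if len(genes) == 1:
            single_gene.append((genes.pop(), set(eq)))
    if not single_gene:
        return None

    gene, support = Counter(g for g, _ in single_gene).most_common(1)[0]
    # Assumption: "more than half" is measured against all reads of the molecule.
    if 2 * support <= n_reads:
        return None

    # Assumption: the allele must be consistent with every supporting read.
    shared = set.intersection(*(eq for g, eq in single_gene if g == gene))
    if len(shared) == 1:
        return next(iter(shared))
    return gene  # ambiguous between alleles: fall back to the gene

allele_call = assign_molecule([{"A*02:01"}, {"A*02:01", "A*31:01"}])
gene_call = assign_molecule([{"A*02:01", "A*31:01"}])
conflict_call = assign_molecule([{"A*02:01"}, {"A*31:01"}])
no_call = assign_molecule([{"A*02:01"}, {"B*07:02"}])
```

With only 1 to 3 reads per UMI, a single allele-informative read is often enough to resolve a molecule, while molecules whose reads cover only shared sequence are counted at the gene level.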

Parameter considerations

For the experiments described here, we used the following parameter settings, which are customizable by users of our tool: k-mer length of 20 for the de Bruijn graph; minimum pseudoalignment length of 60 bases; and a maximum of 2 mismatches in pseudoalignment. These parameters were selected based on our test datasets with genotypes with two- or three-field resolution, where we expect the personalized reference to have very few mismatches with the allele present in the reads. scHLAcount selects an arbitrary allele from the database consistent with the provided genotypes. If the genotypes provided are lower-resolution (e.g. the one-field genotype A*02 is lower-resolution than the three-field genotype A*02:01:01), scHLAcount arbitrarily selects a representative sequence from all A*02 alleles. Therefore, when only lower-resolution genotypes are available, the pseudoalignments of reads to the personalized reference may contain more mismatches and users may want to decrease the k-mer length or decrease the minimum significant alignment length.

[Figure 3.2 diagram: Overview of scHLAcount workflow. Inputs: allele sequence database, genotypes, Cell Ranger BAM, cell barcodes. Steps: extract CDS and genomic sequences; extract reads; pseudoalignment to all allele sequences. Output: allele-specific UMI count matrix, with example rows HLA-A (0, 1), HLA-A*02:01 (3, 2), and HLA-A*31:01 (5, 6) for cell barcodes AAA… and GGG….]

Figure 3.2: scHLAcount takes as input an allele sequence database (e.g. IMGT/HLA), genotypes for the sample being evaluated, cell barcodes, and aligned reads (e.g. BAM file from Cell Ranger). Allele sequences and relevant reads are extracted, and pseudoalignment is used to produce an allele-specific molecule count matrix. A snippet of the output matrix is shown for two cell barcodes and one gene (HLA-A) with two alleles.

Missing genotypes

In all the experiments described, CDS and genomic sequences of genes HLA-A, -B, -C, -DPA1, -DPB1, -DQA1, -DQB1, and -DRB1 were acquired from the IMGT/HLA database version 3.36.0 [117]. For genes where at least one genotype was available at two-field resolution, an arbitrary sequence was chosen among the alleles in the database with this genotype for which complete (full-length) CDS and genomic sequences were available. Where no genotype is available, the genotype of the GRCh38 reference should be used for consistency with the initial reference genome based read alignment step. Because this information does not appear to be available in the literature, we performed a simple simulation-based analysis of the GRCh38 primary assembly using Kourami v0.9.6 [129] to determine which alleles were present. We simulated 2 million 200bp error-free reads from GRCh38 Chr6:28510120-33480577, which is approximately 80-fold coverage of the region. Reads were aligned to the Kourami reference panel and genotypes were inferred; all listed genotypes had 100% sequence identity with respect to the corresponding database sequence. The GRCh38 genotypes are A*03:01:01G, B*07:02:01G, C*07:02:01G, DQA1*01:02:01G, DQB1*06:02:01G, DRB1*15:01:01G, DPA1*01:03:01G, DPB1*04:01:01G.

To illustrate the importance of using true genotypes, in the tables in the following sections we also compare against a reference in which, for each gene, the sequence of the allele in the GRCh38 primary assembly is used regardless of donor genotype.
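As a quick check of the simulated coverage quoted above, the arithmetic can be reproduced directly from the stated interval, read count, and read length. Treating the interval endpoints as inclusive is an assumption; it does not change the rounded result.

```python
# Interval and simulation parameters as stated in the text.
start, end = 28_510_120, 33_480_577
n_reads, read_length = 2_000_000, 200

region_length = end - start + 1          # inclusive endpoints (assumption)
total_bases = n_reads * read_length      # 400 million simulated bases
coverage = total_bases / region_length   # mean fold coverage of the region
```

The result is consistent with the approximately 80-fold coverage reported for the simulation.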

Computational performance

scHLAcount only evaluates reads aligned to the MHC region of chromosome 6, which is about 5Mb. Donor 4 from the CD8+ T cell dataset (described in the next section) had 58 million reads in this region and analysis took 19 minutes. Donor 1 from the same dataset had 1.44 billion reads and took 3.5 hours. CPU utilization averaged 99% and memory consumption was 5 GB for the larger dataset and 1.5 GB for the smaller dataset, as data for all molecules is stored in memory before writing the final output. Our software is not parallelized, but multi-threading could be implemented to parallelize pseudoalignment.

3.3 CD8+ T cell dataset

The first dataset we analyzed consisted of CD8+ T cells from four healthy donors. Data was generated with the 10x Genomics Chromium Single Cell Immune Profiling Solution with Feature Barcoding in [19]. Only the 5’ Gene Expression data from these experiments were used in our analysis. For genes HLA-A and HLA-B, at least one genotype was available for each of the four donors (Reference Table 1 from [19]). First, scHLAcount was run using a haploid reference consisting of the GRCh38 primary alleles. Then, where genotypes were available for HLA-A and HLA-B, the program was re-run using a reference containing the true genotype(s).

Table 3.1 compares the total molecule counts from all cells for each gene from Cell Ranger to results from scHLAcount. The experiment using the GRCh38 primary alleles identifies differences between our pseudoalignment procedure and the alignments from Cell Ranger. Molecules may have been assigned to different genes using the pseudoalignment procedure compared to Cell Ranger alignment, which uses STAR. For HLA-A and HLA-B, scHLAcount reported 80%–114% of the molecules compared to Cell Ranger. When using the custom reference, we observed 93%–132%. This variability could depend on many factors, such as the overall expression of the genes or the distance between the individual’s alleles and the reference alleles.

In cases where two alleles for the gene were present in the custom reference, we also calculated the percent of molecules that were assigned to a specific allele. 89%–95% of molecules could be allele-resolved. Among the five genes in the four samples with allele-resolved molecules, we observed 41%–49% of molecules assigned to the less-prevalent allele.

                      HLA-A                                    HLA-B
Donor   GRCh38    Custom      % molecules     GRCh38    Custom      % molecules
        primary   reference   assigned to     primary   reference   assigned to
        alleles               an allele       alleles               an allele
1       0.911     1.072       93.80           1.141     1.054       n/a
2       0.857     1.061       95.01           1.013     1.018       n/a
3       1.029     0.981       92.50           0.854     1.322       94.27
4       1.009     0.931       n/a             0.798     1.253       89.54

Table 3.1: Allele-resolved molecule counts for HLA-A and HLA-B for the four donors in the CD8+ T cell dataset. Using the custom diploid reference or GRCh38 allele as denoted, for each gene we report the ratio of the molecule count from scHLAcount to the Cell Ranger molecule count. Donor 4 is homozygous for HLA-A. Only one genotype is available for donors 1 and 2 for HLA-B.

3.4 Acute myeloid leukemia dataset

Second, we used the dataset collected in Petti et al. [115]. They performed 10x Genomics Chromium 5’ GEX scRNA-seq on five bone marrow samples from patients with acute myeloid leukemia (AML), a cancer of the blood and bone marrow. The samples contained 11,000–21,000 cells with an average across all samples of 223,000 reads per cell, which is a high sequencing depth. Using scHLAcount, we reanalyzed the reads from these datasets that were aligned to the MHC region. Genotypes for HLA-A, -B, -C, -DRB1, and -DQB1 at two-field resolution were provided to us by the authors. Reference genotypes were used for HLA-DQA1, -DPA1, and -DPB1 since these genotypes were unavailable.

Raw (un-normalized) scHLAcount molecule counts are summarized in Table 3.2 as compared to the molecule counts from Cell Ranger for each gene. With a few exceptions, including 4/5 samples for HLA-C, we see more molecules from scHLAcount than from Cell Ranger. Apart from the outlier value for HLA-DQB1 in subject 548327, where only 2% of molecules could be assigned to an allele because the alleles differ by 1 mutation in the coding sequence, 60%–99% of molecules were assigned to an allele among the five genes for which genotypes were available. Molecule counts were then normalized with the following formula:

median molecule count × raw molecule count/cell molecule count
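The normalization formula can be illustrated with a short sketch. The interpretation used here is an assumption: "cell molecule count" is taken to be the total number of molecules detected in a cell, and "median molecule count" the median of those totals across cells. The function name `normalize` and the toy values are hypothetical.

```python
from statistics import median

def normalize(raw_counts, cell_totals):
    """Scale raw per-cell counts to the median cell size.

    raw_counts:  {cell_barcode: raw molecule count for one gene or allele}
    cell_totals: {cell_barcode: total molecule count in that cell}
    """
    # Assumption: the median is taken over per-cell total molecule counts.
    median_total = median(cell_totals.values())
    return {
        cell: median_total * raw / cell_totals[cell]
        for cell, raw in raw_counts.items()
    }

# Three toy cells with the same raw count but different sequencing yields.
cell_totals = {"c1": 1000, "c2": 2000, "c3": 4000}
raw_counts = {"c1": 10, "c2": 10, "c3": 10}
normalized = normalize(raw_counts, cell_totals)
```

Under this reading, the same raw count is scaled up in a cell with few molecules and scaled down in a cell with many, which makes expression values comparable across cells with different yields.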

Normalization and dimensionality reduction of the gene expression matrix generated by Cell Ranger was performed using Seurat v3.0.2 [143]. For all the biallelic genes in each subject, we calculated the average normalized expression per gene. Based on the cell types determined using marker genes by Petti et al. [115], which we reproduce in Figure 3.3e, we also calculated the fraction of the normalized expression for each allele for the nine cell types with at least 100 cells.

Some genes had more expression of one allele than the other. Results for subject 809653 with the class II gene HLA-DRB1 are listed in Table 3.3 and visualized on a t-SNE dimensionality reduction plot in Figure 3.3a,b. Depending on cell type, 42% to 54% of HLA-DRB1 molecules are assigned to the DRB1*01:03 allele. This allele preference does not show a trend with average expression. For the same subject, 27% to 41% of HLA-C molecules are assigned to the C*07:02 allele depending on cell type (Figure 3.3c,d; Table 3.4).

               HLA-A                 HLA-B                 HLA-C                 HLA-DQB1
Subject   Custom     % mol.     Custom     % mol.     Custom     % mol.     Custom     % mol.
          diploid    assigned   diploid    assigned   diploid    assigned   diploid    assigned
          reference  to allele  reference  to allele  reference  to allele  reference  to allele
508084    1.039      95.13      1.066      87.22      0.885      60.77      1.028      95.89
548327    1.165      86.26      1.061      93.09      1.032      n/a        2.721      2.27
721214    1.180      69.44      1.137      90.09      0.908      93.63      3.319      98.95
782328    1.154      n/a        0.880      63.95      0.957      89.77      1.010      99.15
809653    1.083      87.21      1.154      96.53      0.911      91.74      1.070      n/a

               HLA-DRB1              HLA-DPA1   HLA-DPB1   HLA-DQA1
Subject   Custom     % mol.     GRCh38     GRCh38     GRCh38
          diploid    assigned   allele     allele     allele
          reference  to allele
508084    1.641      74.60      1.135      1.024      1.086
548327    1.920      89.52      1.180      1.172      2.087
721214    1.745      89.05      1.217      1.050      2.058
782328    1.125      92.12      1.276      1.078      1.274
809653    1.066      95.43      1.136      1.050      1.455

Table 3.2: Using the custom diploid reference or GRCh38 allele as denoted, for each gene we report the ratio of the molecule count from scHLAcount to the Cell Ranger molecule count; "% mol. assigned to allele" is the percent of molecules assigned to an allele. Subject 548327 is homozygous for HLA-C, Subject 782328 is homozygous for HLA-A, and Subject 809653 is homozygous for HLA-DQB1.

Cell type      # cells   % of DRB1 molecules    % of DRB1 molecules    Avg. HLA-DRB1
                         assigned to 01:03      assigned to 11:01      normalized
                         allele                 allele                 expression
ERY            3,728     41.9                   58.1                   0.238
T-CELL         10,942    44.8                   55.2                   0.741
PRE-B-CELL     336       47.4                   52.6                   1.162
B-CELL         868       47.4                   52.6                   14.185
HSC            2,261     52.1                   47.9                   5.247
MEP            560       53.0                   47.0                   3.411
DEND (M)       620       53.7                   46.3                   17.602
ERY (CD34+)    432       53.9                   46.1                   2.153
MONO           1,366     54.0                   46.0                   7.390

Table 3.3: Normalized expression and allele-specific expression of HLA-DRB1 for subject 809653, stratified by cell type. Average is taken over all cells assigned to a particular cell type. 72

Cell type      # cells   % of HLA-C molecules   % of HLA-C molecules   Avg. HLA-C
                         assigned to            assigned to            normalized
                         07:02 allele           08:02 allele           expression
B-CELL         868       26.7                   73.3                   5.184
MONO           1,366     32.7                   67.3                   5.813
PRE-B-CELL     336       33.9                   66.1                   3.266
DEND (M)       620       35.1                   64.9                   3.890
T-CELL         10,942    37.0                   63.0                   8.926
HSC            2,261     38.8                   61.2                   3.281
MEP            560       40.3                   59.7                   2.578
ERY (CD34+)    432       40.9                   59.1                   2.429
ERY            3,728     41.0                   59.0                   0.386

Table 3.4: Normalized expression and allele-specific expression of HLA-C for subject 809653, stratified by cell type. Average is taken over all cells assigned to a particular cell type.

3.5 Merkel cell carcinoma dataset

Paulson et al. [119] studied two patients with Merkel cell carcinoma (MCC), a type of skin cancer usually caused by viral infection, whose disease relapsed after immunotherapy treatment. scRNA-seq was obtained from one patient (discovery) with 10x Genomics 3’ GEX at multiple time points and the other patient (validation) with 10x Genomics 5’ GEX at one time point. Genotypes for genes HLA-A, -B, and -C for the discovery and validation subjects were provided to us by the authors. Here, alleles not explicitly reported in their publication are given a placeholder name (e.g. A1/A2) for confidentiality. Using scHLAcount with a custom reference for the diploid genotype of genes HLA-A, -B, and -C (and GRCh38 primary assembly alleles for the class II genes) we calculated allele-resolved molecule counts. Raw molecule counts were normalized as described above. For the discovery subject, we used the filtered expression matrices for tumor and PBMC samples available at GEO accession GSE117988; for the validation subject, the matrix is available at GSE118056. Normalization, dimensionality reduction, and clustering were performed using Seurat v3.0.2 [143] following the original study [119].

Discovery subject

For this subject, the tumor dataset is made up of cells taken from two time points in treatment; the PBMC dataset contains blood cells taken from four time points in treatment. Within a dataset, cells from all time points were combined for analysis (tumor and PBMC cells were not combined with each other). We identified normal cells in the tumor dataset as described in Paulson et al. [119]. Unsupervised clustering of the tumor dataset resulted in 15 clusters and we identified 11 of these clusters comprising 7,131 cells as putative tumor cells using the tumor marker genes NCAM1, KRT20, CHGA, and ENO2 and the non-tumor marker genes CD3D, CD34, CD61, and Fibronectin. The remaining four clusters contained 300 putative normal cells.

Figure 3.3: (a), (c) For each cell, color indicates log2(1 + normalized expression) of the gene. (b), (d) For each cell, color indicates the fraction of molecules assigned to an allele that are assigned to HLA-DRB1*01:03 or HLA-C*07:02. Overall, 95.4% of HLA-DRB1 molecules and 91.7% of HLA-C molecules are assigned to an allele. Cells with no molecules assigned to an allele are not plotted. (e) Cell types, determined by Petti et al. [115].

Table 3.5 compares the un-normalized scHLAcount molecule counts to those from Cell Ranger. We observed fewer molecules than Cell Ranger, between 39% and 86% depending on the gene. For HLA-A, very few (5%–6%) of the molecules could be assigned to an allele, but HLA-B and HLA-C had at least 40% and 64% of molecules assigned to an allele. This is due to the 3’-end bias of the data, which is described in detail in the next section. As previously reported, HLA-B expression is markedly lower in the tumor compared to non-tumor cells and PBMC (Table 3.6). Additionally, HLA-A and HLA-C expression appears to be reduced in tumor cells.

Validation subject

For this subject, the tumor dataset and the PBMC dataset are made up of cells taken from a single time point after the patient’s relapse. Unsupervised clustering of tumor and PBMC cells together resulted in 18 clusters. As described in Paulson et al. [119], we identified seven of these clusters comprising 4,682 cells as putative tumor cells using the tumor marker genes NCAM1, KRT20, Large T Antigen, and Small T Antigen. (Only 17 of these cells originated from the PBMC dataset.) The remaining 6,209 cells were designated putative normal cells and comprised 5,731 cells from the PBMC dataset and 478 cells from the tumor dataset, which Paulson et al. [119] identified as tumor-infiltrating leukocytes and tumor-associated macrophages (Figure 3.3e). Compared to Cell Ranger molecule counts, we inferred more molecules for the PBMC dataset and fewer molecules for the tumor dataset. At least 80% of scHLAcount molecules were assigned to an allele for class I genes (Table 3.5). Dividing cells into tumor and normal as described above, we corroborate the observation from [119] that HLA-A expression is greatly reduced in tumor cells compared to infiltrating immune cells (Figure 3.4a). No marked allele-specific bias in expression is observed in cells in either category. Additionally, we observe decreased expression of HLA-B and HLA-C in tumor cells (Figure 3.4c,e). While non-tumor cells display approximately balanced expression of the two alleles of these genes, tumor cells have only 13% of allele-resolved HLA-B expression from allele 35:01 and 6% of allele-resolved HLA-C expression from allele C1 (Table 3.7).

                                  HLA-A              HLA-B              HLA-C
Subject              Assay type   Custom   % asgn.   Custom   % asgn.   Custom   % asgn.
Discovery (Tumor)    3’ GEX       0.866    5.34      0.391    40.76     0.639    64.31
Discovery (PBMC)     3’ GEX       0.855    6.42      0.449    45.98     0.767    67.94
Validation (Tumor)   5’ GEX       0.878    81.17     0.896    91.69     0.745    80.68
Validation (PBMC)    5’ GEX       1.050    87.71     1.073    94.41     1.033    89.65

Column key: Custom = custom diploid reference; % asgn. = % molecules assigned to an allele.

Table 3.5: scHLAcount analysis of discovery patient tumor (2 time points) and PBMC (4 time points) and validation patient tumor and PBMC (1 time point each) [119]. Using the custom diploid reference or GRCh38 allele as denoted, for each gene we report the ratio of the molecule count from scHLAcount to the Cell Ranger molecule count. GEX = gene expression.

         Tumor cells (n=7,131)       Non-tumor cells (n=300)     PBMC (n=12,874)
Gene     Avg. norm.   % molecules    Avg. norm.   % molecules    Avg. norm.   % molecules    Genotype
         expression   to alleles     expression   to alleles     expression   to alleles
HLA-A    0.724        24.98/75.02    3.392        43.78/56.22    1.958        40.83/59.17    A1/A2
HLA-B    0.115        76.11/23.89    3.156        61.70/38.30    1.713        63.97/36.03    35:02/B2
HLA-C    0.209        49.54/50.46    3.802        59.58/40.42    1.918        59.17/40.83    C1/C2

Column key: Avg. norm. expression = average normalized expression; % molecules to alleles = % molecules assigned to alleles.

Table 3.6: Average overall and allele-specific expression of HLA class I genes in the discovery subject of [119].

         Tumor cells (n=4862)                Non-tumor cells (n=6209)
Gene     Avg. norm.   % molecules            Avg. norm.   % molecules            Genotype
         expression   allele 1/allele 2      expression   allele 1/allele 2
HLA-A    0.060        39.7/60.3              4.154        56.8/43.2              02:01/A2
HLA-B    0.511        13.4/86.6              5.172        50.4/49.6              35:01/B2
HLA-C    0.327        6.3/93.7               4.991        46.8/53.2              C1/C2

Table 3.7: Average overall and allele-specific expression of HLA class I genes in the validation subject of [119].

3.6 3’ versus 5’ GEX data

Due to the nature of 3’ GEX data, nearly all reads are sequenced from the opposite end of the HLA-A transcript from the variable sites used to define HLA types. Figure 3.5 shows the sequencing coverage (normalized so that the maximum coverage of each dataset is 1.0) for a 3’ GEX sample (red line) and 5’ GEX sample (black line). Below the coverage distributions, the exons of the genes are shown with black boxes and the untranslated regions in white boxes.

The variable sites are mostly located in exons 2 and 3, while the 3’ ends of the transcripts are mostly homologous between the class I genes [127]. As a result of the coverage distribution of 3’ GEX data, we observed that very few HLA-A molecules could be assigned to an allele in the two 3’ GEX samples evaluated (Table 3.5). This is because UMIs that only contain reads from the 3’ end of the transcript will be pseudoaligned to an equivalence class including both HLA-A alleles. UMIs typically contain 1–3 reads, so this is not uncommon. Comparing the 3’ GEX molecule assignment percentages (Table 3.5) to the 5’ GEX results (Tables 3.2 and 3.5) indicates that 5’ GEX data is preferable to 3’ GEX data for assigning molecules to alleles, because the sequencing coverage is not as limited to one end of the transcript.
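The effect of equivalence classes on allele assignment can be illustrated with a short Python sketch. It is a simplified model and not the scHLAcount code: it assumes that each read in a UMI comes with the set of alleles it is compatible with, and that a molecule is assigned to an allele only when the intersection of these sets contains exactly one allele. The function name and the placeholder allele names A1 and A2 are hypothetical.

```python
def assign_molecule(read_classes, gene_alleles):
    """Return the single allele compatible with every read of a UMI, else None.

    read_classes : list of sets, one equivalence class (set of allele names) per read
    gene_alleles : set of the allele names in the genotype of the gene
    """
    compatible = set(gene_alleles)
    for eq_class in read_classes:
        compatible &= eq_class        # keep alleles consistent with this read
    if len(compatible) == 1:
        return next(iter(compatible))
    return None                       # ambiguous or conflicting: not allele-resolved

alleles = {"A1", "A2"}

# A UMI whose reads all come from the conserved 3' end: both alleles stay compatible.
three_prime_only = [{"A1", "A2"}, {"A1", "A2"}]

# A UMI with one read covering a variable site that distinguishes the alleles.
covers_variable_site = [{"A1", "A2"}, {"A1"}]

print(assign_molecule(three_prime_only, alleles))      # None
print(assign_molecule(covers_variable_site, alleles))  # A1
```

Under this model, a molecule is allele-resolved only if at least one of its few reads overlaps a site where the alleles differ, which is rare when coverage is concentrated at the 3’ end.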

Figure 3.4: (a)-(c) log2(1 + normalized expression) of class I genes. (d)-(f) Allele bias for HLA-A*02:01, HLA-B*35:01, and HLA-C C1 for the validation subject. Cells with no molecules assigned to an allele are not plotted; aggregate statistics are shown in Table 3.7. (g) Cell types inferred using marker genes.

Tables 3.1, 3.2 and 3.5 also compare the total (pre-normalization) scHLAcount molecule counts to the molecule counts from Cell Ranger for the HLA genes with genotypes available. Depending on the gene, assay type (5’ GEX versus 3’ GEX), and sample, we sometimes observe more molecules from scHLAcount and sometimes more from Cell Ranger. We consistently observe fewer molecules from scHLAcount in 3’ GEX data, possibly due to read distribution along the transcript (Figure 3.5) and the fact that most reads must be pseudoaligned to the genomic sequence graph. Total molecule count could also be affected by the sequence similarity of the specific alleles in the sample or expression level of the HLA genes in the sample. As more samples become available and are analyzed with scHLAcount, these potential factors contributing to the inconsistencies we observe between reference genome alignment based methods for molecule counting (e.g. Cell Ranger) and the personalized genome pseudoalignment based method of scHLAcount can be further investigated.

[Figure 3.5 panels: HLA-A Read Coverage, chr6:29942532-29945870 (+ strand); HLA-B Read Coverage, chr6:31353875-31357179 (− strand); HLA-C Read Coverage, chr6:31268749-31272092 (− strand). Horizontal axis: 0 to 3000; vertical axis: 0.0 to 1.0; lines: 3’ GEX and 5’ GEX.]

Figure 3.5: Read coverage of HLA Class I genes for 3’ GEX and 5’ GEX. Minimum and maximum coverage for each assay in the region shown is normalized to 0 and 1 respectively. The value in thousands of reads for the maximum coverage for genes A, B, C is (3’/5’) 47/192, 97/288, 68/286. The 3’ dataset is merged from SRR7722937-SRR7722942 and the 5’ dataset is SRR7692286, all from Paulson et al. [119]. GEX = gene expression.

3.7 Discussion

scHLAcount provides a simple way to assign reads from scRNA-seq experiments to HLA alleles given genotypes, and is a powerful tool for investigating allele-specific expression, loss of heterozygosity, and mutational or epigenetic suppression of HLA expression in tumor immune-evasion. Additionally, using a personalized reference and counting with scHLAcount sometimes recovers more molecules than using the standard reference and counting with Cell Ranger. Allele-specific resolution and more molecules recovered could both improve gene expression based clustering in cells where MHC genes are a major component of the expression profile.

Limitations of the study

While we were able to reproduce general findings about tumor vs. normal expression from the MCC study, we did not apply scHLAcount to a dataset where the exact number of molecules from each gene is known, so we could not determine whether the number from Cell Ranger or scHLAcount is more correct in situations where they differ. Such a ground-truth dataset would need to be generated through simulation. scHLAcount currently requires HLA genotypes for the sample to be provided by the user, although allele sequences are automatically retrieved from the IMGT/HLA database based on the alleles provided. Further research into HLA genotyping directly from scRNA-seq, either by repurposing existing methods for HLA genotyping from bulk RNA-seq or by designing novel methods, would allow scHLAcount to be applied to more datasets. As demonstrated by [135], reads from a single cell rarely contain enough information to determine the genotype. However, if reads from all cells are combined into a pseudo-bulk sample, there could be greater success. Even when reads are combined across cells, the 3’ or 5’ coverage bias may be another challenge to directly repurposing bulk genotyping methods.

Future work

More than two genotypes. The first version of the scHLAcount software assumed that there were at most two genotypes for each HLA gene (as the human genome is diploid). In response to a request from a user wishing to analyze a dataset from a transplant where the HLA genotypes of donor and recipient differed, we extended the model to accommodate many genotypes as input. Further application of this mode is another interesting direction for this research, especially as it could be applied to differentiating donor from recipient cells on the basis of HLA genotype.

Other genomic regions. The ideas from scHLAcount could be extended to also apply to any other locus where there is common structural variation present in the human population. This variation could be cataloged in a database as with IMGT/HLA, determined directly from the scRNA-seq data, or determined from other data collected from the same sample. Just as in scHLAcount, molecules could be assigned to alleles based on pseudoalignment of reads to a graph containing the allele sequences. This could rescue reads that do not align well (or at all) to the reference sequence, identify a subset of cells in a sample with a structural variant mutation, or calculate allele-specific expression.

Analysis of other datasets. Christopher et al. [120] reported that disease relapse of a different cohort of AML patients after stem cell transplant therapy was sometimes accompanied by lower expression of HLA class II genes. Acquiring single-cell GEX data from patients with this condition at time points throughout treatment could elucidate cell-type-specific and allele-specific expression effects in this relapse phenomenon, now that analysis of this sort is enabled with scHLAcount. This type of analysis could be extended to other instances where HLA gene expression is modulated in normal and disease states simultaneously in a cell type-specific and allele-specific fashion.

Chapter 4

Vargas: heuristic-free alignment for assessing linear and graph read aligners

This chapter describes Vargas, a method for computing optimal alignment of sequencing reads to a graph or linear reference genome, and its applications in assessing heuristic short-read alignment algorithms and exploring the edit distance alignment problem. This work appeared in Bioinformatics [144].

Charlotte A. Darby, Ravi Gaddipati, Michael C. Schatz, and Ben Langmead. “Vargas: heuristic-free alignment for assessing linear and graph read aligners.” In: Bioinformatics (2020, accepted).

The software (under MIT license) [145] and supplementary files/data [146] are available on GitHub.

4.1 Background

Vargas is a highly parallelized and optimized software tool that implements an exact edit distance alignment algorithm. It supports reads up to a few hundred bases long and linear or directed acyclic graph references on the order of billions of bases. In this section, we describe the computational problem of read/reference alignment under an edit distance model and mention strategies to solve this problem exactly and approximately. We also outline ways that these algorithms can be parallelized and describe approaches for evaluating their speed and accuracy.

The edit distance alignment problem

Given a text and a pattern, the alignment problem seeks to find the best match of the pattern in the text. In this case, the text is the reference genome and the pattern is the sequencing read. The “best match” is defined based on edit distance, which is a weighted function summarizing a pairwise alignment between read and reference in a single numerical value.

Each column in the pairwise alignment contains a character or gap from the read and a character or gap from the reference. The scoring function specifies a penalty for mismatching characters, a bonus for matching characters, and a function that penalizes gaps based on their length. Match and mismatch values can be scaled based on the base quality reported by the sequencer. A constant penalty can be applied to each character in the gap; alternatively, an affine [147] or more complex function [2, 148] can be used. In general, these functions penalize longer gaps less per-character than shorter gaps by setting the penalty to start a gap greater than the penalty to extend the gap. This encourages larger consecutive gaps rather than multiple smaller gaps.
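The difference between a constant per-character gap penalty and an affine gap penalty can be made concrete with a small Python sketch. The penalty values and function names below are hypothetical and chosen only for illustration, and the affine convention used (the first gap character costs the "open" penalty and each further character costs the "extend" penalty) is one of several in use.

```python
def linear_gap_penalty(length, per_char=5):
    """Constant penalty applied to every character in the gap."""
    return per_char * length

def affine_gap_penalty(length, gap_open=5, gap_extend=2):
    """Affine penalty: gap_open for the first character, gap_extend for each further one."""
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)

# One gap of length 4 versus four separate gaps of length 1.
one_long_linear = linear_gap_penalty(4)
four_short_linear = 4 * linear_gap_penalty(1)
one_long_affine = affine_gap_penalty(4)
four_short_affine = 4 * affine_gap_penalty(1)

print(one_long_linear, four_short_linear)  # 20 20
print(one_long_affine, four_short_affine)  # 11 20
```

With the constant penalty the two arrangements cost the same, whereas the affine penalty favors the single consecutive gap.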

Based on the expectations of how the text and pattern should align, there are a few alignment paradigms for edit distance alignment that have different optimization functions and corresponding algorithms. If the entire pattern is expected to align to the entire text, global alignment is used. The Needleman-Wunsch algorithm, originally developed for protein alignment, applies an algorithmic strategy known as dynamic programming to global alignment [149]. If some substring of the pattern is expected to align to a substring of the text, local alignment is employed using the Smith-Waterman algorithm [150]. The chief modification here is that gaps at the end of the pattern and text are not penalized. Semiglobal (or end-to-end) alignment is a hybrid of the two strategies in which the entirety of the pattern is expected to align to a substring of the text; thus, gaps at the end of the pattern are penalized and gaps at the ends of the text are not. Figure 4.1 shows the alignment of a pattern and text under global, semiglobal and local alignment.

The use case of edit distance alignment we address here is of a short sequencing read aligned to a long reference genome. Both local and semiglobal alignment have been proposed and usefully employed to solve this problem. Local alignment has two possible advantages over semiglobal alignment. If the read happens to align to a region of structural variation between the reference and donor genomes, multiple “split” alignments can be reported. If the sequencing technology is more prone to errors at the ends of the read, a partial-length alignment of the accurate characters may be more appropriate.
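The three alignment modes differ only in how the dynamic programming matrix is initialized, whether scores are clamped at zero, and where the final score is read. The following Python sketch illustrates this with the text, pattern, and scoring function of Figure 4.1 (match +1, mismatch -2, constant gap penalty -5). It computes scores only, without traceback, and it is a plain textbook implementation rather than the Vargas algorithm; the function name is hypothetical.

```python
def alignment_score(pattern, text, mode, match=1, mismatch=-2, gap=-5):
    """Best alignment score of pattern against text; mode is 'global', 'semiglobal' or 'local'."""
    m, n = len(pattern), len(text)
    local = mode == "local"
    free_text_ends = mode in ("semiglobal", "local")
    # Row 0: leading unaligned text is free unless the alignment is global.
    prev = [0 if free_text_ends else j * gap for j in range(n + 1)]
    best_local = 0
    for i in range(1, m + 1):
        # Column 0: leading unaligned pattern characters are free only in local mode.
        cur = [0 if local else i * gap]
        for j in range(1, n + 1):
            sub = match if pattern[i - 1] == text[j - 1] else mismatch
            score = max(prev[j - 1] + sub,   # match or mismatch
                        prev[j] + gap,       # pattern character opposite a gap
                        cur[j - 1] + gap)    # text character opposite a gap
            if local:
                score = max(score, 0)        # a local alignment may start anywhere
            cur.append(score)
        if local:
            best_local = max(best_local, max(cur))
        prev = cur
    if mode == "global":
        return prev[n]        # both strings fully consumed
    if mode == "semiglobal":
        return max(prev)      # whole pattern consumed, trailing text is free
    return best_local         # best cell anywhere in the matrix

text, pattern = "CCCACGTTTT", "CCACGTTAT"
for mode in ("global", "semiglobal", "local"):
    print(mode, alignment_score(pattern, text, mode))
# global 1, semiglobal 6, local 7
```

The scores agree with those shown in Figure 4.1.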

In addition to Vargas, the software we present in this chapter, Table 4.1 lists several other software tools and libraries that implement exact dynamic-programming-based algorithms for computing edit distance, along with key features of the implementations. Some of these algorithms support graph genomes, which are described later in this section.

Scoring function: Match +1, Mismatch -2, Gap -5

Global alignment (Score = 1; 8 matches: +8, 1 mismatch: -2, 1 gap: -5)
    text     CCCACGTTTT
              ||||||| |
    pattern  -CCACGTTAT

Semiglobal alignment (Score = 6; 8 matches: +8, 1 mismatch: -2)
    text     CCCACGTTTT
              ||||||| |
    pattern   CCACGTTAT

Local alignment (Score = 7; 7 matches: +7)
    text     CCCACGTTTT
              |||||||
    pattern   CCACGTTAT

Figure 4.1: A text (upper string) and a pattern (lower string) are aligned using an edit distance scoring function with constant gap penalty. In global alignment (top), gaps at the end of pattern and text are penalized. In semiglobal alignment (middle), gaps at the end of the pattern are not penalized, as indicated by the gray characters in the pattern. In local alignment (bottom), some characters of the pattern are unaligned to achieve a higher score, and gaps at the end of pattern and text are not penalized, as indicated by the gray characters in both strings. Matching characters in text and pattern are denoted by a vertical line.

Tool           Reference   Availability                                                          Type       Graph?
Vargas         this work   https://github.com/langmead-lab/vargas                                Software   DAG
PaSGAL         [151]       https://github.com/ParBLiSS/PaSGAL                                    Software   DAG
vg align       [152]       https://github.com/vgteam/vg                                          Software   DAG
GraphAligner   [153]       https://github.com/maickrau/GraphAligner                              Software   Any graph
SSW            [154]       https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library    Library    no
GSSW                       https://github.com/vgteam/gssw                                        Library    DAG
seqan          [155]       https://github.com/seqan/seqan3                                       Library    no
Parasail       [156]       https://github.com/jeffdaily/parasail                                 Library    no

Tool           SIMD instructions      Vectorization strategy                    Local?   Semiglobal?   Scoring function
Vargas         SSE, AVX2, AVX512BW    query-parallel                            yes      yes           match/baseq-scaled mismatch, affine insertion/affine deletion
PaSGAL         AVX512BW               query-parallel                            yes      no            match/mismatch/insertion/deletion
vg align       SSE                    striped                                   yes      no            match/mismatch/affine gap
GraphAligner   no                     bit-parallel                              no       yes           unit cost
SSW            SSE                    striped                                   yes      no            match/mismatch/affine gap
GSSW           SSE                    striped                                   yes      no            match/mismatch/affine gap
seqan          SSE, AVX2, AVX512BW    query-parallel                            yes      yes           match/mismatch/affine gap
Parasail       SSE, AVX2              diagonal, blocked, striped, prefix-scan   yes      no            match/mismatch/affine gap

Table 4.1: Summary of exact dynamic programming pairwise alignment algorithms available in the literature and some key features.

Heuristic read alignment algorithms

Numerous software programs have been developed for heuristic alignment of next-generation sequencing reads. In the context of algorithm design, a heuristic is a strategy that is not guaranteed to find the optimal solution but is much faster than an exact algorithm. It is often optimized to return near-optimal results on commonly encountered instances of the problem. Many heuristic read alignment algorithms follow a two-step paradigm: first, find short exact matches using a pre-built index of the reference genome; then, extend or join these “seed” matches into full alignments using dynamic programming. While the dynamic programming algorithms in the second step are exact algorithms, as previously discussed, they are not applied universally to compute every possible alignment of read and reference; instead, they are applied selectively, guided by seeds found in the first step. As a result of selective dynamic programming or other simplifications and restrictions, these heuristic algorithms may return an alignment with a suboptimal edit distance score or fail to align a read altogether. Figure 16 of Canzar and Salzberg [8] categorizes 29 alignment algorithms based on the algorithmic strategies, data structures, and compute hardware employed. See also Reinert et al. [7] for another review of alignment algorithms and heuristic strategies.
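The two-step seed-and-extend paradigm, and the way it can miss an alignment that exhaustive dynamic programming finds, can be sketched in Python as follows. This toy example is not the algorithm of any particular aligner: the k-mer index, the non-overlapping seeding scheme, the window padding, the unit-cost scoring, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

def build_index(ref, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index

def semiglobal_edit_distance(read, window):
    """Unit-cost edit distance of the whole read to the best substring of window."""
    prev = [0] * (len(window) + 1)          # the read may start anywhere in the window
    for i, r in enumerate(read, 1):
        cur = [i]
        for j, t in enumerate(window, 1):
            cur.append(min(prev[j - 1] + (r != t), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return min(prev)                        # the read may end anywhere in the window

def heuristic_align(read, ref, index, k, pad=2):
    """Step 1: exact k-mer seeds. Step 2: dynamic programming only near seed hits."""
    best = None
    for off in range(0, len(read) - k + 1, k):          # non-overlapping seeds
        for pos in index.get(read[off:off + k], []):
            start = max(0, pos - off - pad)
            end = min(len(ref), pos - off + len(read) + pad)
            dist = semiglobal_edit_distance(read, ref[start:end])
            if best is None or dist < best[0]:
                best = (dist, start)
    return best                                         # None: no seed hit, read unaligned

ref = "ACGTACGTTAGCCGATTGCA"
index = build_index(ref, 4)

print(heuristic_align("TAGCCGAT", ref, index, 4))   # exact substring: seeds are found
print(heuristic_align("TAGACGAA", ref, index, 4))   # one mismatch per seed: no seed hit
print(semiglobal_edit_distance("TAGACGAA", ref))    # exhaustive DP still finds an alignment
```

The second read has one mismatch in each seed, so the heuristic reports no alignment even though an alignment with at most two edits exists in the reference.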

Variant graphs

Most currently available exact and heuristic read alignment algorithms assume that the reference genome is linear, or represented by one or many strings (e.g. one string for each chromosome). With greater understanding of genetic diversity has come increasing focus on alternatives to the linear reference genome. Various solutions have been proposed that incorporate information about genetic variation in the population, including graph-shaped reference genomes [157], pan-genomes [158], and a genome that contains the most common (major) allele at each variable site [159, 160]. The most recent human reference genome assembly, GRCh38, includes alternate assemblies for hypervariable loci [161].

Dynamic programming edit distance algorithms have been extended to support local alignment of a linear pattern to a directed acyclic graph (DAG) reference [162]. Genetic variants induce “forks” and “joins” in the DAG (Figure 4.2). The classical dynamic programming recurrence is computed for each column of each vertex in a topological-sort order, taking into account cases where a column has more than one predecessor in the graph structure (e.g. column 7 in Figure 4.2).

Heuristic alignment algorithms that account for genetic variants have likewise been proposed. In this work we consider the graph aligners HISAT2 [163] and vg [152], and many other strategies are available in the literature [153, 164–166].
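The extension of the dynamic programming recurrence to a DAG can be sketched in Python using the four-node graph of Figure 4.2. To keep the example short it uses unit-cost semiglobal edit distance (lower is better) instead of a general scoring function, and it returns only the best distance. This is a minimal sketch, not the Vargas implementation; the node identifiers, data layout, and function name are hypothetical.

```python
def align_to_dag(read, nodes, preds, order):
    """Minimum unit-cost edit distance of read to any substring of any path in the DAG.

    nodes : {node id: sequence}
    preds : {node id: list of predecessor node ids}
    order : node ids in topological order
    """
    m = len(read)
    start = list(range(m + 1))   # column preceding any reference character
    last_col = {}                # final DP column of each processed node
    best = m                     # upper bound: the whole read left unaligned
    for v in order:
        incoming = [last_col[p] for p in preds.get(v, [])]
        # With several predecessors, take the best value row by row.
        prev = [min(vals) for vals in zip(*incoming)] if incoming else start
        for c in nodes[v]:
            cur = [0]            # the read may start at any reference position
            for i in range(1, m + 1):
                cur.append(min(prev[i - 1] + (read[i - 1] != c),  # match or mismatch
                               prev[i] + 1,                       # skip reference character
                               cur[i - 1] + 1))                   # skip read character
            best = min(best, cur[m])
            prev = cur
        last_col[v] = prev
    return best

# Graph for ACGTACGT, ACGTTCGT and ACGTCGT: ACGT -> {A, T, or nothing} -> CGT
nodes = {1: "ACGT", 2: "A", 3: "T", 4: "CGT"}
preds = {2: [1], 3: [1], 4: [1, 2, 3]}
order = [1, 2, 3, 4]

for read in ("GTAC", "GTTC", "GTCG", "GTGC"):
    print(read, align_to_dag(read, nodes, preds, order))
# GTAC 0, GTTC 0, GTCG 0, GTGC 1
```

The first three reads each match a different path through the graph exactly, while the last read needs one edit on every path.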

Parallelism

Many attempts have been made to parallelize the edit distance dynamic programming computation, whether it is used as the core engine of an exact edit distance algorithm or as a subroutine in a heuristic algorithm. The central data structure of this algorithm is a matrix of size pattern-length × text-length (without loss of generality, suppose the rows correspond to characters in the pattern and the columns correspond to characters in the text, as in each node of Figure 4.2). The value in each cell in the matrix is the edit distance between the prefixes of pattern and text indicated by that row and column. The value in a particular cell can be computed using the values of some neighboring cells. This causes dependencies among cells, meaning that parallelization must be done with care.

Figure 4.2: The multiple alignment of three sequences (ACGTACGT, ACGTTCGT, ACGTCGT) can be represented as a four-node graph. Suppose we want to compute the dynamic programming alignment matrix for three length-3 queries to the graph-shaped reference. The columns of the matrices correspond to characters in the reference (text) and are numbered 1 through 9 in the order they will be computed in the dynamic programming algorithm. Characters in the read (pattern) label the rows of each matrix. Matrices are shown stacked because SIMD instructions operate on the same row and column in multiple matrices simultaneously (e.g. the cells connected by the blue arrow). The optimal score for read Q1 is in column 7 (shaded cells show the alignment traceback); for Q2 the optimal score is in column 6; for Q3 two equally good alignments end in columns 4 and 9 (possible traceback is shown).

Within a single matrix (one pattern and one text), Wozniak [167] proposed that cells along the antidiagonal could be computed simultaneously since they have no interdependencies. Rognes and Seeberg [168] computed adjacent cells in the pattern in parallel, which required some postprocessing because these cells are dependent. Farrar [169] computed non-adjacent cells in the pattern in parallel in what is called a “striped” approach, which reduced the dependencies that had to be fixed.

Given multiple matrices, which is quite a common occurrence because many reads are usually aligned at once, there are other opportunities for parallelization that do not require resolving or avoiding interdependencies among cells in the same matrix. Given multiple sequences aligned to the same reference, Rognes [170] computed the same cell (row and column) in several matrices at once in the SWIPE program. SeqAn 2.4 (Rahn et al. [155]) extends this multi-sequence approach to support different references for each alignment; Jain et al. [151] use this approach when aligning to a directed acyclic graph in PaSGAL. This strategy is also known as “query-parallel.”

Both the intra-sequence and inter-sequence parallelization approaches just described are implemented using single-instruction multiple-data (SIMD) “vector” instruction sets, such as SSE and AVX, which are available on many modern processors. These are specialized hardware instructions that partition the bits of a register into several smaller units and perform the same operation (e.g. bitwise OR, addition, etc.) on all units simultaneously. For example, the AVX512BW (byte and word) instruction set performs element-wise operations on a 512-bit register partitioned into 64 8-bit or 32 16-bit units [171]. Another form of parallelization is achieved with the use of modern multi-core processors which support tens to hundreds of simultaneous threads of execution. Each thread can align batches of reads, potentially also employing instruction-level parallelism. Other hardware advances such as GPU and FPGA have also been used to accelerate dynamic programming algorithms.
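The data layout behind the query-parallel strategy can be emulated in plain Python, with a list standing in for a vector register and one list entry per read standing in for a SIMD lane. The sketch below uses unit-cost semiglobal edit distance and assumes that all reads have the same length. It only illustrates which cells are updated together; the function name is hypothetical, and real implementations such as those in Table 4.1 use hardware vector instructions rather than Python lists.

```python
def query_parallel_edit_distance(reads, ref):
    """Best semiglobal unit-cost edit distance of each read to ref, computed lane-wise."""
    m, lanes = len(reads[0]), len(reads)
    assert all(len(r) == m for r in reads), "all lanes must hold reads of equal length"
    # prev[i] is one "vector": row i of the previous column in every read's matrix.
    prev = [[i] * lanes for i in range(m + 1)]
    best = [m] * lanes
    for c in ref:                        # one reference column at a time
        cur = [[0] * lanes]              # row 0: a read may start at any reference position
        for i in range(1, m + 1):
            # The same (row, column) cell is updated in every matrix at once;
            # with SIMD hardware this comprehension would be a few vector instructions.
            cur.append([
                min(diag + (r[i - 1] != c), left + 1, up + 1)
                for r, diag, left, up in zip(reads, prev[i - 1], prev[i], cur[i - 1])
            ])
        best = [min(b, x) for b, x in zip(best, cur[m])]
        prev = cur
    return best

print(query_parallel_edit_distance(["ACGT", "ACCT", "TTTT"], "GGACGTGG"))  # [0, 1, 3]
```

Because each lane belongs to a different matrix, the cells updated together have no dependencies on one another, which is why this layout needs no postprocessing.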

Performance of a dynamic programming algorithm is measured in GCUPS (giga cell updates per second), which is the number of billions of cells in the dynamic programming matrix computed per second. This standard metric is used extensively in the literature and normalizes performance to enable comparisons between experiments conducted using reads and references of different lengths [151, 155, 156, 172, 173]. Practically, it is calculated by the following equation:

\[
\mathrm{GCUPS} = \frac{\text{read length} \times \text{number of reads} \times \text{reference length} \times 2\ (\text{if reverse complement})}{\text{runtime (seconds)} \times 10^{9}}
\]
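As a worked example, the GCUPS calculation can be written as a small Python function. The function name, parameter names, and the numbers in the example are hypothetical and are not measurements from this work.

```python
def gcups(read_length, num_reads, reference_length, runtime_seconds, both_strands=True):
    """Giga cell updates per second for a batch of equal-length reads."""
    cells = read_length * num_reads * reference_length
    if both_strands:
        cells *= 2              # reads are also aligned as their reverse complement
    return cells / runtime_seconds / 1e9

# Hypothetical run: 1,000 reads of 100 bases against a 1,000,000-base reference,
# both strands, finishing in 10 seconds.
print(gcups(100, 1000, 1_000_000, 10.0))  # 20.0
```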

Benchmarking and parameter optimization

The question of which heuristic alignment algorithm to use is quite complex to answer comprehensively and has led to many benchmarking efforts. For example, Hatem et al. [174] use several simulated datasets to comprehensively evaluate accuracy and throughput of nine aligners. Simulated reads are generated by taking substrings of the reference genome and adding sequencing errors and/or genetic variation. When using simulated reads, alignment accuracy is typically measured based on comparing the genomic coordinate from which the read was simulated to the coordinate reported by the alignment algorithm. If the coordinates are the same or within a short distance, the alignment is correct; if the read is unaligned or the coordinates differ, the alignment is incorrect. We refer to this definition of alignment correctness as “correct-by-location.” We will also consider an alternative definition of alignment correctness based on the edit distance scoring function. When the optimal alignment score for a read is equal to the score reported by the heuristic algorithm, we call this alignment “correct-by-score.”

Optimal alignments of real sequencing reads could be considered a computational gold standard for read alignment algorithms. Holtgrewe et al. [175] propose that heuristic aligners could be evaluated using such a computational gold standard generated with their tool Rabema, which enumerates all possible alignments of a read to a reference within a fixed edit distance. This procedure can be performed with real or simulated sequencing reads, as the correct location is determined by a slow but exact algorithm to enumerate all eligible alignments.

We propose a similar workflow in which heuristic-free alignments can be used to evaluate alignment algorithms using real data. To serve as a computational gold standard, the optimal alignment should be calculated with respect to the same alignment mode (local or semiglobal), scoring function, and reference genome of the algorithm being evaluated. Using our method, there is no limit on the maximum edit distance between read and reference, and a variety of edit distance scoring functions are supported.

The issue of dataset-specific parameter optimization has also received attention. Teaser [176] uses simulated sequencing reads based on the user’s dataset to quickly measure the performance of several alignment programs and parameter settings. TAPAS [177] is specifically designed for ancient DNA sequencing samples that contain contaminants and are being aligned to the reference genome of a related species. This method uses simulated sequencing reads from a variety of reference genomes to limit the alignment of contaminant reads while optimizing aligner parameters.
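The two definitions of alignment correctness introduced above can be expressed as simple predicates. The following Python sketch is illustrative only: the function names and the distance tolerance of 10 bases are assumptions, and an unaligned read is represented by None.

```python
def correct_by_location(true_pos, reported_pos, max_offset=10):
    """True if the reported coordinate is within max_offset of the simulated origin."""
    if reported_pos is None:            # read was not aligned
        return False
    return abs(reported_pos - true_pos) <= max_offset

def correct_by_score(optimal_score, reported_score):
    """True if the heuristic aligner's score equals the optimal alignment score."""
    if reported_score is None:          # read was not aligned
        return False
    return reported_score == optimal_score

# A read from a repeat may be placed at another copy with an equally good score:
# incorrect-by-location yet correct-by-score.
print(correct_by_location(true_pos=1000, reported_pos=5000))     # False
print(correct_by_score(optimal_score=-6, reported_score=-6))     # True
```

Only the second predicate can be evaluated for real reads, because it requires the optimal score and not the true origin of the read.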

Results

First, we describe the implementation, computational performance and thread scaling of the Vargas software. Second, we used Vargas to study the performance and accuracy of heuristic read aligners on real sequencing data. While most benchmarks use simulated reads, using a heuristic-free aligner allows us to determine whether the alignment for a real read is correct-by-score, as others have observed [175]. We explored how alignment settings can affect which reads are incorrect-by-location, incorrect-by-score, or completely fail to align due to heuristics. We evaluated the time-accuracy tradeoff of the Bowtie 2 and HISAT2 effort presets and propose a comparable set of parameters for BWA-MEM and BWA aln. Third, based on the correct-by-location definition, we compare BWA-MEM and Bowtie 2 mapping quality to the mathematical ideal before and after adjustment with Qtip [178]. Finally, we show how a small set of reads annotated with optimal alignment score using Vargas can be used to optimize Bowtie 2, BWA-MEM, and vg alignment parameters for whole-genome sequencing and ChIP-seq reads. Importantly, in all cases, the analyses are guided by real reads. Data and scripts to reproduce these experiments are available on GitHub [146].

4.2 Vargas software

Vargas performs dynamic programming edit distance alignment of short (up to a few hundred bases) sequencing reads to a linear or directed acyclic graph reference genome. Parallelization is achieved by using single-instruction multiple-data (SIMD) hardware instructions and multithreading. In this section, we describe the software implementation, evaluate throughput performance with different SIMD instruction sets and variant graphs of increasing complexity, and show thread scaling trends.

Graph Alignment

Before alignment, Vargas constructs the DAG from a linear reference sequence and optionally a set of genetic variants in a Variant Call Format (VCF) file. The graph can be computed once and is stored on disk in a format that includes the sequence represented by each node and links between nodes. To reconcile coordinate shifts introduced by insertions and deletions, alignments are anchored to the reference sequence. Nodes representing parallel paths with different sequence lengths are right-aligned to the reference sequence. Vargas produces read alignments in SAM format. For the best and second-best alignment scores, Vargas reports the reference position of the rightmost aligned base of the read and the count of equally-scoring alignment locations at least one read-length apart in custom SAM tags. For linear genomes, the alignment traceback can optionally be computed to populate the CIGAR and POS fields.
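As a rough illustration of score-only dynamic programming that reports the rightmost aligned reference base, the following Python sketch aligns a read to a linear reference only. It is not Vargas code: the function name `align`, the linear (non-affine) gap penalty, and the default scores are assumptions chosen for brevity, and graph references and SIMD are left out. In semiglobal mode the optimum is read from the last row; in local mode it can occur in any row.

```python
def align(read, ref, match=2, mismatch=-2, gap=-3, local=False):
    """Return (best score, 1-based reference position of the rightmost
    aligned base). Assumes a non-empty reference; illustrative scoring only."""
    n = len(ref)
    prev = [0] * (n + 1)  # row 0: the alignment may start anywhere in ref
    best_score, best_end = 0, 0
    for i in range(1, len(read) + 1):
        cur = [0] * (n + 1)
        if not local:
            cur[0] = i * gap  # semiglobal: every read base must be consumed
        for j in range(1, n + 1):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            score = max(prev[j - 1] + sub,  # match or mismatch
                        prev[j] + gap,      # read base against a gap
                        cur[j - 1] + gap)   # reference base against a gap
            if local:
                score = max(score, 0)       # local: may restart at zero
                if score > best_score:      # optimum can be in any row
                    best_score, best_end = score, j
            cur[j] = score
        prev = cur
    if not local:
        # semiglobal: the optimum is always in the last row
        best_end = max(range(1, n + 1), key=lambda j: prev[j])
        best_score = prev[best_end]
    return best_score, best_end
```

For example, `align("ACGT", "TTACGTTT")` finds the exact match ending at reference position 6, while a read with unmatched flanking bases scores higher in local mode than in semiglobal mode because the flanks can be left out of the alignment.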

Implementation

Vargas is implemented in C++ using SIMD instructions and supports the SSE4.1, AVX2, and AVX512BW instruction sets, which can be compiled for many architectures including Intel Xeon Phi (Knights Landing/KNL) and Xeon Platinum (Skylake/SKX). The KNL architecture supports 256–288 threads across 64–72 cores [179, 180] and SKX can be configured with up to 28 cores, each with two AVX (advanced vector extensions) processors [181]. In Vargas, to maximize throughput, each SIMD word (vector) is split into 8-bit operands, allowing for the simultaneous alignment of 16, 32, and 64 sequences with 128-bit, 256-bit, and 512-bit vectors for SSE4.1, AVX2, and AVX512BW respectively. If the difference between the maximum and minimum possible alignment scores exceeds 255 based on the read length and scoring function, 16-bit operands are selected at runtime.
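The relationship between vector width, operand width, and the number of reads aligned per vector can be sketched as follows. This Python fragment only mirrors the arithmetic stated above; the function names are hypothetical, and the way the maximum and minimum possible scores are derived from read length and scoring function is left to the caller because the text does not spell it out.

```python
# Vector widths in bits for the three supported instruction sets.
VECTOR_BITS = {"SSE4.1": 128, "AVX2": 256, "AVX512BW": 512}


def element_bits(max_score: int, min_score: int) -> int:
    """8-bit operands suffice unless the score range exceeds 255."""
    return 8 if (max_score - min_score) <= 255 else 16


def reads_per_vector(instruction_set: str, max_score: int,
                     min_score: int) -> int:
    """Number of reads aligned simultaneously in one SIMD vector."""
    return VECTOR_BITS[instruction_set] // element_bits(max_score, min_score)
```

With a score range that fits in 8 bits, this gives 16, 32, and 64 reads per vector for the three instruction sets; a wider score range halves each of those counts.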

Vargas memory consumption

The only sizable in-memory data structure used by Vargas is the reference genome. For the linear reference genome GRCh38, the representation consists of the string itself with one node per contig (199 total). On disk it is 2.9 GB, and alignment uses at most 3.16 GB of memory. A graph genome requires a more complex representation, including the nodes and edges that represent various choices of alleles. The graph for GRCh38 plus all 1000 Genomes variants has 230 million nodes and occupies 17 GB on disk, with an increase in total linearized length of 3.36% compared to the linear genome sequence. When the entire graph is loaded into memory, alignment uses at most 101.3 GB, although this could be reduced by changing the implementation to process the graph by chromosome or in smaller chunks. The more practical graph used in the experiments in Section 3 and Section 4 with HISAT2 and vg contains the 1000 Genomes variants with MAF > 10%. It has 18.7 million nodes and occupies 4 GB on disk (0.24% length increase), and Vargas uses at most 11 GB of memory in alignment.

Computational performance

Evaluation was performed on an Intel Xeon Phi 7250 (Knights Landing / KNL) computer with 68 cores and four threads per core, and an Intel Xeon Platinum 8160 (Skylake / SKX) computer with two 24-core processors and two threads per core. GCUPS and runtime results for semiglobal alignment are shown in Figure 4.4 and Figure 4.3; local alignment results are shown in Figure 4.6 and Figure 4.5. The local alignment algorithm is slightly slower than the semiglobal algorithm because the optimal value in the dynamic programming matrix can occur in any row for local alignment, while in semiglobal alignment, the optimal value always occurs in the last row (because the entirety of the pattern must be included in the alignment) which requires fewer numerical comparisons. These figures show results of a weak scaling experiment, where the input number of reads is scaled in proportion to the number of threads, so that each thread aligns a full vector of 100bp reads against the reference 8 times.

In the top two panels in each figure, we measured performance using the GCUPS (giga cell updates per second) statistic, described in the previous section. In these panels, ideal scaling would be a linear increase. Instead we see less-than-ideal (sublinear) scaling owing to hyperthreading and, more generally, to contention for shared resources. SKX employs two-way hyperthreading on 48 physical cores, and speedup is observed up to 64 threads with AVX512BW instructions. KNL employs four-way hyperthreading on each of its 68 physical cores, allowing a maximum of 272 simultaneous threads. With AVX2 instructions, the GCUPS performance doubles from 68 threads (one per core) to 136 threads (two per core) but does not continue to double when there are three or four threads per core.
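The input sizing of the weak scaling experiment and the GCUPS statistic can be made concrete with a short Python sketch. It assumes the usual definition of a cell update (one dynamic programming cell per read base per reference base), which is how we read the statistic here; the function names and the example reference length and runtime are illustrative values, not measurements from this chapter.

```python
def weak_scaling_reads(threads: int, lanes: int,
                       vectors_per_thread: int = 8) -> int:
    """Reads in a weak scaling run: each thread aligns a full vector
    of reads `vectors_per_thread` times."""
    return threads * lanes * vectors_per_thread


def gcups(num_reads: int, read_len: int, ref_len: int,
          seconds: float) -> float:
    """Giga cell updates per second, assuming read_len * ref_len cells
    are filled for every read."""
    return num_reads * read_len * ref_len / seconds / 1e9


# 48 threads with 64-way vectors: 48 * 64 * 8 reads in total.
n_reads = weak_scaling_reads(threads=48, lanes=64)
# Hypothetical run: 1,000 reads of 100bp against a 50 Mbp reference in 10 s.
example = gcups(num_reads=1000, read_len=100, ref_len=50_000_000, seconds=10.0)
```

Under weak scaling the wall time would ideally stay flat as threads are added, so GCUPS computed this way would grow linearly with the thread count.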

In the bottom two panels in each figure, we plot wall time. Ideally, this measurement would remain constant, but it increases with more threads for the same reasons preventing ideal scaling in the GCUPS plots.

The left two panels compare different reference genomes. Reads were aligned to chromosome 19 with no variants (Linear, 1 node), 1000 Genomes Project Phase 3 variants with minor allele frequency >10% (MAF >10% 1KGP, 436K nodes), or all 2504 individuals’ variants from 1000 Genomes Project Phase 3 (All 1KGP, 5.1M nodes). The 1000 Genomes Project’s variant calls for the GRCh38 reference are limited to biallelic SNVs and short indels and are described in Lowy-Gallego et al. [182]. While the linearized genome size increases by 0.3% and 3.66% in the MAF >10% and All 1KGP graphs respectively, the GCUPS performance decreases and the wall time increases by more than this percentage. This is due to additional overhead associated with processing graph nodes.

The right two panels compare different vector instruction sets. KNL supports SSE4.1 and AVX2 vector instruction sets, which allow for 16 and 32 reads per vector, respectively (with 8-bit elements in the vectors). SKX additionally supports AVX512BW, which supports 64 reads. On KNL, the wall time does not increase when using the AVX2 instruction set with twice the capacity; thus, performance doubles compared to SSE4.1 instructions. In contrast, wall time increases when using AVX2 and AVX512BW instruction sets compared to SSE4.1 on SKX. However, because the increase in runtime is less than 2x when the vector size doubles, we still see a performance increase using vectors with larger capacity.

The best speed we observed using Vargas is 456 GCUPS, which was observed for semiglobal alignment to the linear reference of chromosome 19 using AVX512BW instructions with 64-way vectorization on 48 threads of the SKX computer (Figure 4.3). With the All 1KGP graph genome, we observed 237 GCUPS with the same configuration. When aligning 150bp reads to chromosome 10, SeqAn [155] reported 420 GCUPS using 40 threads on SKX with AVX512BW; on the same dataset, Parasail [156] recorded 74 GCUPS, but used AVX2 instructions that offer only half the throughput per instruction. PaSGAL [151] reported 317 GCUPS with 48 threads on SKX with AVX512 instructions when aligning 100bp reads to a graph genome of the 1Mbp Leukocyte Receptor Complex locus including all variants from 1000 Genomes Project Phase 3, not including traceback.

Vargas semiglobal alignment to the linear genome on Xeon Phi (KNL) with AVX2 instructions and 271 threads achieved a maximum speed of 194 GCUPS. Alignment to the all-variant graph described above achieved 110 GCUPS with the same configuration. This compares favorably to previous Xeon Phi-based efforts such as SWAPHI (58.8 GCUPS) [172], as well as to GPU-based aligners such as CUDASW++3.0 (119 GCUPS, GTX680) [173].

While several exact dynamic programming pairwise alignment algorithms are available in the literature (summarized in 4.1), Vargas offers the most flexibility in terms of scoring function and options for local, semiglobal, linear, and DAG alignment with comparable speed and scaling to the state of the art.

4.3 Alignment accuracy

Since Vargas has the features required to calculate optimal alignments for semiglobal and local alignment to linear and graph genomes, with affine gap penalties and base-quality-dependent mismatch penalties, we can systematically evaluate the behavior of many heuristic aligners with respect to the correct-by-score definition. We align the same real sequencing reads, naturally containing sequencing errors and genetic variation, with the heuristic and then with Vargas, using the same configuration (scoring function, reference genome, semiglobal/local optimization function). Then we compare the alignment scores. Since Vargas also reports the coordinate of one possible optimal-scoring alignment, we can also compare the heuristic alignment position and the Vargas coordinate and calculate correctness with the correct-by-location definition.

First, we describe results using 100bp and 250bp WGS sequencing reads. 100bp WGS reads are from the 1000 Genomes Project sample NA18505, SRA accession ERR239486, and 250bp WGS reads are from the 1000 Genomes Project sample NA19017, SRA accession SRR1295544. The first 100,000 (unpaired) reads were used from each dataset. For the 100,000 100bp read set, we evaluated Bowtie 2 [183] in semiglobal and local alignment modes;

[Figure 4.3 image omitted. Left panels: "SKX AVX512-BW Weak Scaling" for the Linear, MAF>10% 1KGP, and All 1KGP references. Right panels: "SKX Instruction Sets - Linear Genome" for AVX512-BW (64), AVX2 (32), and SSE4.1 (16). Top row: GCUPS versus threads; bottom row: wall time (seconds) versus threads.]

Figure 4.3: Weak scaling for Skylake (SKX), semiglobal alignment. Vector size is shown in parentheses after the instruction set name in the top right panel. GCUPS = giga cell updates per second.

[Figure 4.4 image omitted. Left panels: "KNL AVX2 Weak Scaling" for the Linear, MAF>10% 1KGP, and All 1KGP references. Right panels: "KNL Instruction Sets - Linear Genome" for AVX2 (32) and SSE4.1 (16). Top row: GCUPS versus threads; bottom row: wall time (seconds) versus threads.]

Figure 4.4: Weak scaling for Knights Landing (KNL), semiglobal alignment. Vector size is shown in parentheses after the instruction set name in the top right panel. Note, KNL does not support the AVX512BW instruction set. GCUPS = giga cell updates per second.

[Figure 4.5 image omitted. Left panels: "SKX AVX512-BW Weak Scaling" for the Linear, MAF>10% 1KGP, and All 1KGP references. Right panels: "SKX Instruction Sets - Linear Genome" for AVX512-BW (64), AVX2 (32), and SSE4.1 (16). Top row: GCUPS versus threads; bottom row: wall time (seconds) versus threads.]

Figure 4.5: Weak scaling for Skylake (SKX), local alignment. Vector size is shown in parentheses after the instruction set name in the top right panel. GCUPS = giga cell updates per second.

[Figure 4.6 image omitted. Left panels: "KNL AVX2 Weak Scaling" for the Linear, MAF>10% 1KGP, and All 1KGP references. Right panels: "KNL Instruction Sets - Linear Genome" for AVX2 (32) and SSE4.1 (16). Top row: GCUPS versus threads; bottom row: wall time (seconds) versus threads.]

Figure 4.6: Weak scaling for Knights Landing (KNL), local alignment. Vector size is shown in parentheses after the instruction set name in the top right panel. GCUPS = giga cell updates per second.

BWA aln [88]; BWA-MEM [184]; HISAT2 [163] with linear and graph references, and vg with linear and graph references [152]. We also performed Vargas alignments of 100,000 250bp reads for all aligners except BWA aln. Figure 4.7 and Table 4.2 show the performance of all aligners evaluated on the 100bp read set with respect to the correct-by-score definition.

Bowtie 2 and HISAT2 were run multiple times using their “preset” parameters that trade between runtime and alignment accuracy. We also determined a sequence of settings that create similar tradeoffs for BWA-MEM and BWA aln. Figure 4.8 and Table 4.3 show results for the 250bp read set.
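The per-aligner summaries reported in the tables (percent unaligned, incorrect-by-score, and correct-by-score, overall and for the lowest-scoring and highest-scoring reads) could be tabulated along the following lines. This Python sketch is illustrative only: the record layout, the function names, and the handling of ties at the 5% boundary are assumptions, not a description of the scripts used for this chapter.

```python
def uic_percentages(records):
    """records: list of (optimal_score, reported_score), where
    reported_score is None for an unaligned read.
    Returns (%U, %I, %C) rounded to two decimals."""
    n = len(records)
    unaligned = sum(1 for _, rep in records if rep is None)
    correct = sum(1 for opt, rep in records if rep is not None and rep == opt)
    incorrect = n - unaligned - correct
    return tuple(round(100.0 * x / n, 2)
                 for x in (unaligned, incorrect, correct))


def split_by_optimal_score(records, low_fraction=0.05):
    """Split reads into the lowest-scoring fraction and the rest,
    ranking by optimal score (ties broken arbitrarily by sort order)."""
    ranked = sorted(records, key=lambda r: r[0])
    cut = int(round(len(ranked) * low_fraction))
    return ranked[:cut], ranked[cut:]


# Toy data: 20 reads, two unaligned and one aligned with a suboptimal score.
toy = [(0, 0)] * 16 + [(-10, -10), (-10, -16), (-20, None), (-60, None)]
lowest, highest = split_by_optimal_score(toy)
```

On the toy data, `uic_percentages(toy)` gives 10% unaligned, 5% incorrect-by-score, and 85% correct-by-score, and the single lowest-scoring read is unaligned.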

Bowtie 2 semiglobal Bowtie 2’s default alignment mode performs semiglobal alignment. It offers four “presets,” which are settings of values for the -D (extension effort), -R (re-seeding), -L (seed length), and -i (seed spacing) parameters. We found that these presets effectively trade between time and accuracy when evaluated using correct-by-score. For the settings that are faster but less accurate, the total number of reads failing to align increases and the number of reads aligned correctly decreases. Correctness decreases approximately linearly with optimal alignment score for 100bp and 250bp reads.

Bowtie 2 local Bowtie 2 also supports local alignment with the same four preset parameters. For 100bp reads, correctness generally decreases with optimal alignment score, except it remains constant at ∼50% for reads with alignment score 100–130. The relationship between alignment score and correctness is approximately linear for 250bp reads.

BWA aln Tradeoff between accuracy and performance for BWA aln was achieved by changing the parameters for maximum number of mismatches -n and number of gaps -o in the alignment. Therefore, alignments where the optimal alignment score includes more than the maximum number of mismatches or gaps will not be found. For example, this leads to no correct-by-score alignments below -30 in the o1,n5 parameter setting.

BWA-MEM While BWA-MEM does not offer presets like Bowtie 2 and HISAT2, we selected four settings for -k (seed length) and -r (re-seeding) that trade between speed and accuracy. For 100bp reads, with the default parameter settings (k19,r1.5) and our proposed presets, correctness decreases with optimal score for reads with score greater than 50, but then increases with optimal score for reads with optimal score between 50 and 30, the minimum alignment score reported. This appears to be because most of the aligned reads with optimal score below 50 have an exact or near-exact alignment that does not include all the bases in the read, sometimes referred to as “soft clipping.” The BWA-MEM heuristic algorithm for local alignment is often able to correctly identify optimal alignments of this sort, even though they have low alignment score. In contrast, illustrating a fundamental difference between the local and semiglobal alignment problems, the Bowtie 2 and HISAT2 strategies for semiglobal alignment seldom identify the optimal low-scoring alignments which have many mismatches and gaps because every base of the read must be included in the alignment. For the 250bp reads, alignments with optimal score greater than ∼150 are nearly all correct-by-score; and alignments with a lower optimal score decrease linearly, but with different slope depending on the parameter settings.

BWA-MEM2 The tables also report results for BWA-MEM2 [185], an accelerated version of the BWA-MEM algorithm that returns identical alignments. Since the alignments are identical, the correctness statistics are also the same as for BWA-MEM. Using release Bwa-mem2-2.0pre2, we noticed that memory consumption of BWA-MEM2 is nearly ten times higher than BWA-MEM, and runtime is longer in experiments with shorter reads and/or faster presets. This is due to a significant amount of time spent loading data structures into memory. Perhaps because our datasets are small (100,000 reads), the index loading time outweighs any speedup and we do not see an advantage in using BWA-MEM2. We also note that the greatly increased memory footprint is an important consideration.

HISAT2 The latest version of HISAT2 (version 2.2.0-beta [186]) provides two modes that change the --score-min (minimum score for reporting alignments) and --bowtie2-dp parameters to align more reads. very-sensitive uses “unconditional dynamic programming” and sensitive uses “conditional dynamic programming.” For both 100bp and 250bp read sets, compared to the default ‘fast’ mode, more reads are correct-by-score using the very-sensitive and sensitive modes, but more reads are also incorrect-by-score. This is possible because more reads are aligned. In contrast, the slowest Bowtie 2 and BWA-MEM presets have the fewest incorrect reads. More incorrect alignments are likely reported because the HISAT2 presets also alter the minimum score of alignments reported, which is an independent parameter in Bowtie 2/BWA-MEM. HISAT2 was run in semiglobal alignment mode, so there is a linear relationship between alignment score and correctness as with Bowtie 2 semiglobal alignment. Vargas alignments show that 7,398 100bp reads and 10,032 250bp reads had a higher optimal alignment score with the graph genome compared to the linear genome.

vg Like BWA-MEM, vg performs local alignment, so for 100bp reads, correctness decreases with optimal score for reads with score greater than 50 but increases with optimal score for reads with optimal score between 50 and 20. Again, this appears to be due to low-scoring local alignments frequently being soft-clipped to involve a short region of high similarity and underscores the differences between local and semiglobal alignment. For 250bp reads, the trend is also like BWA-MEM: near-perfect correctness until alignment score 150, then a linear decrease. Vargas alignments show that 7,439 100bp reads and 9,652 250bp reads had a higher optimal alignment score with the graph genome compared to the linear genome.

[Figure 4.7 image omitted. Panels: Bowtie 2 semiglobal and Bowtie 2 local (very-sensitive, sensitive (default), fast, very-fast); BWA aln (o5,n15; o3,n10; o1,n5); BWA-MEM (k16,r1.2; k19,r1.5 (default); k22,r3; k25,r4); HISAT2 graph and HISAT2 linear (very-sensitive, sensitive, fast (default)); vg graph; vg linear. Vertical axis: fraction correct (score); horizontal axis: optimal alignment score.]

Figure 4.7: Correct-by-score plots for all aligners tested on the 100,000 100bp read set. Alignments are binned by the optimal alignment score calculated by Vargas, shown on the horizontal axis, which is truncated at the point after which no alignments are reported by the heuristic. A line is fitted to the scatterplot of fraction of reads that are correct-by-score. Points are more transparent when representing fewer reads with a given optimal alignment score. Only primary alignments were evaluated for HISAT2 and BWA-MEM.

[Figure 4.8 image omitted. Panels: Bowtie 2 semiglobal and Bowtie 2 local (very-sensitive, sensitive (default), fast, very-fast); BWA-MEM (k16,r1.2; k19,r1.5 (default); k22,r3; k25,r4); HISAT2 graph and HISAT2 linear (very-sensitive, sensitive, fast (default)); vg graph; vg linear. Vertical axis: fraction correct (score); horizontal axis: optimal alignment score.]

Figure 4.8: Correct-by-score plots for all aligners tested on the 250bp read dataset. Points are more transparent when representing fewer reads with a given optimal alignment score.

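| Tool | Mode | Time (s) | Mem. (GB) | All reads %U | All reads %I | All reads %C | Highest-scoring 95% %U | Highest-scoring 95% %I | Highest-scoring 95% %C | Lowest-scoring 5% %U | Lowest-scoring 5% %I | Lowest-scoring 5% %C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bowtie 2 [183] semiglobal | very-sensitive | 100.89 | 3.38 | 1.03 | 0.63 | 98.35 | 0.01 | 0.15 | 99.84 | 20.59 | 12.09 | 67.32 |
| Bowtie 2 [183] semiglobal | sensitive (default) | 48.93 | 3.37 | 1.17 | 0.71 | 98.12 | 0.01 | 0.21 | 99.78 | 23.38 | 13.24 | 63.38 |
| Bowtie 2 [183] semiglobal | fast | 43.35 | 3.37 | 1.37 | 0.84 | 97.80 | 0.05 | 0.27 | 99.69 | 26.75 | 15.78 | 57.47 |
| Bowtie 2 [183] semiglobal | very-fast | 34.81 | 3.37 | 1.74 | 0.89 | 97.37 | 0.14 | 0.34 | 99.52 | 32.59 | 16.40 | 51.01 |
| Bowtie 2 [183] local | very-sensitive | 85.42 | 3.38 | 0.32 | 1.08 | 98.60 | 0.00 | 0.18 | 99.82 | 6.51 | 19.64 | 73.85 |
| Bowtie 2 [183] local | sensitive (default) | 64.28 | 3.38 | 0.33 | 1.26 | 98.41 | 0.00 | 0.26 | 99.74 | 6.71 | 21.84 | 71.44 |
| Bowtie 2 [183] local | fast | 38.32 | 3.37 | 0.38 | 1.52 | 98.11 | 0.00 | 0.32 | 99.68 | 7.65 | 26.51 | 65.84 |
| Bowtie 2 [183] local | very-fast | 32.74 | 3.37 | 0.49 | 1.85 | 97.66 | 0.01 | 0.48 | 99.52 | 9.84 | 31.29 | 58.87 |
| BWA-MEM [184] | k16 r1.2 | 87.51 | 5.51 | 0.35 | 0.43 | 99.23 | 0.00 | 0.04 | 99.96 | 7.05 | 8.52 | 84.43 |
| BWA-MEM [184] | k19 r1.5 (default) | 59.73 | 5.49 | 0.35 | 0.46 | 99.19 | 0.00 | 0.03 | 99.97 | 7.11 | 9.27 | 83.63 |
| BWA-MEM [184] | k22 r3 | 42.81 | 5.47 | 0.36 | 0.60 | 99.04 | 0.00 | 0.04 | 99.96 | 7.31 | 12.17 | 80.52 |
| BWA-MEM [184] | k25 r4 | 39.40 | 5.46 | 0.38 | 0.67 | 98.95 | 0.00 | 0.05 | 99.95 | 7.61 | 13.50 | 78.88 |
| BWA-MEM2 (AVX2) [185] | k16 r1.2 | 65.43 | 49.77 | 0.35 | 0.43 | 99.23 | 0.00 | 0.04 | 99.96 | 7.05 | 8.52 | 84.43 |
| BWA-MEM2 (AVX2) [185] | k19 r1.5 (default) | 75.29 | 49.66 | 0.35 | 0.46 | 99.19 | 0.00 | 0.03 | 99.97 | 7.11 | 9.27 | 83.63 |
| BWA-MEM2 (AVX2) [185] | k22 r3 | 69.51 | 49.62 | 0.36 | 0.60 | 99.04 | 0.00 | 0.04 | 99.96 | 7.31 | 12.17 | 80.52 |
| BWA-MEM2 (AVX2) [185] | k25 r4 | 67.78 | 49.60 | 0.38 | 0.67 | 98.95 | 0.00 | 0.05 | 99.95 | 7.61 | 13.50 | 78.88 |
| BWA aln [187] | o5 n15 | 245.67 | 4.25 | 1.38 | 0.42 | 98.21 | 0.01 | 0.03 | 99.97 | 28.24 | 11.11 | 60.65 |
| BWA aln [187] | o3 n10 | 253.87 | 3.87 | 1.63 | 0.33 | 98.03 | 0.01 | 0.02 | 99.97 | 33.48 | 9.50 | 57.01 |
| BWA aln [187] | o1 n5 (default) | 62.69 | 3.19 | 2.70 | 0.15 | 97.15 | 0.02 | 0.02 | 99.97 | 55.42 | 6.21 | 38.37 |
| HISAT2 [163] linear | very-sensitive | 44.09 | 4.47 | 0.64 | 1.38 | 97.98 | 0.69 | 1.48 | 97.83 | 12.16 | 24.82 | 63.01 |
| HISAT2 [163] linear | sensitive | 25.43 | 4.47 | 1.56 | 0.96 | 97.48 | 1.56 | 1.04 | 97.40 | 29.66 | 17.55 | 52.79 |
| HISAT2 [163] linear | fast (default) | 17.50 | 4.47 | 3.66 | 0.41 | 95.93 | 3.46 | 0.51 | 96.03 | 67.99 | 3.47 | 28.53 |
| HISAT2 [163] graph | very-sensitive | 52.73 | 6.73 | 0.69 | 1.37 | 97.95 | 0.08 | 0.26 | 99.66 | 12.58 | 22.81 | 64.60 |
| HISAT2 [163] graph | sensitive | 58.06 | 6.73 | 1.56 | 0.93 | 97.51 | 0.12 | 0.29 | 99.59 | 29.64 | 13.40 | 56.96 |
| HISAT2 [163] graph | fast (default) | 32.97 | 6.72 | 3.46 | 0.40 | 96.14 | 0.25 | 0.31 | 99.44 | 66.02 | 2.23 | 31.75 |
| vg [152] | linear | 397.20 | 24.16 | 0.19 | 0.56 | 99.25 | 0.00 | 0.04 | 99.96 | 3.81 | 10.59 | 85.60 |
| vg [152] | graph | 381.63 | 26.58 | 0.19 | 0.51 | 99.31 | 0.00 | 0.03 | 99.97 | 4.24 | 10.89 | 84.87 |

Table 4.2: Alignment and correctness for the 100,000 100bp reads. Reported runtime is the median of three consecutive trials, and reported memory usage is the maximum memory footprint during alignment. U = unaligned; I = incorrect-by-score; C = correct-by-score. Time for bwa aln is reported for ‘aln’ only, not ‘samse’ which converts the intermediate output into SAM format.

| Tool | Mode | Time (s) | Mem. (GB) | All reads %U | All reads %I | All reads %C | Highest-scoring 95% %U | Highest-scoring 95% %I | Highest-scoring 95% %C | Lowest-scoring 5% %U | Lowest-scoring 5% %I | Lowest-scoring 5% %C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bowtie 2 [183] semiglobal | very-sensitive | 193.59 | 3.38 | 5.01 | 1.18 | 93.81 | 1.22 | 1.09 | 97.69 | 77.13 | 3.01 | 19.86 |
| Bowtie 2 [183] semiglobal | sensitive (default) | 97.23 | 3.37 | 6.11 | 1.39 | 92.50 | 2.25 | 1.29 | 96.46 | 79.56 | 3.47 | 16.97 |
| Bowtie 2 [183] semiglobal | fast | 69.15 | 3.37 | 7.42 | 1.89 | 90.70 | 3.50 | 1.79 | 94.71 | 81.92 | 3.73 | 14.35 |
| Bowtie 2 [183] semiglobal | very-fast | 47.98 | 3.37 | 8.38 | 2.45 | 89.17 | 4.40 | 2.40 | 93.20 | 84.17 | 3.49 | 12.35 |
| Bowtie 2 [183] local | very-sensitive | 237.80 | 3.40 | 2.14 | 2.26 | 95.60 | 0.54 | 1.15 | 98.31 | 32.93 | 23.71 | 43.35 |
| Bowtie 2 [183] local | sensitive (default) | 177.94 | 3.39 | 2.44 | 2.60 | 94.96 | 0.73 | 1.44 | 97.83 | 35.45 | 24.87 | 39.69 |
| Bowtie 2 [183] local | fast | 126.44 | 3.38 | 3.83 | 3.03 | 93.14 | 1.72 | 1.98 | 96.30 | 44.66 | 23.25 | 32.09 |
| Bowtie 2 [183] local | very-fast | 109.29 | 3.38 | 4.79 | 3.84 | 91.37 | 2.46 | 2.83 | 94.71 | 49.62 | 23.29 | 27.08 |
| BWA-MEM [184] | k16 r1.2 | 237.04 | 5.55 | 1.67 | 0.72 | 97.61 | 0.00 | 0.41 | 99.59 | 33.73 | 6.84 | 59.43 |
| BWA-MEM [184] | k19 r1.5 (default) | 119.74 | 5.45 | 1.84 | 0.86 | 97.30 | 0.01 | 0.46 | 99.53 | 36.92 | 8.64 | 54.44 |
| BWA-MEM [184] | k22 r3 | 97.70 | 5.43 | 2.27 | 1.11 | 96.63 | 0.60 | 0.60 | 98.80 | 43.88 | 10.82 | 45.30 |
| BWA-MEM [184] | k25 r4 | 89.74 | 5.43 | 2.78 | 1.23 | 95.99 | 0.26 | 0.70 | 99.04 | 51.12 | 11.32 | 37.56 |
| BWA-MEM2 (AVX2) [185] | k16 r1.2 | 187.52 | 50.04 | 1.67 | 0.72 | 97.61 | 0.00 | 0.41 | 99.59 | 33.73 | 6.84 | 59.43 |
| BWA-MEM2 (AVX2) [185] | k19 r1.5 (default) | 133.10 | 49.78 | 1.84 | 0.86 | 97.30 | 0.01 | 0.46 | 99.53 | 36.92 | 8.64 | 54.44 |
| BWA-MEM2 (AVX2) [185] | k22 r3 | 116.98 | 49.70 | 2.27 | 1.11 | 96.63 | 0.60 | 0.60 | 98.80 | 43.88 | 10.82 | 45.30 |
| BWA-MEM2 (AVX2) [185] | k25 r4 | 109.56 | 49.66 | 2.78 | 1.23 | 95.99 | 0.26 | 0.70 | 99.04 | 51.12 | 11.32 | 37.56 |
| HISAT2 [163] linear | very-sensitive | 124.43 | 4.48 | 3.46 | 3.53 | 93.01 | 1.36 | 2.46 | 96.18 | 43.33 | 23.99 | 32.69 |
| HISAT2 [163] linear | sensitive | 61.76 | 4.48 | 6.95 | 2.82 | 90.23 | 2.37 | 2.92 | 94.71 | 94.05 | 0.88 | 5.07 |
| HISAT2 [163] linear | fast (default) | 20.98 | 4.47 | 17.40 | 1.52 | 81.08 | 13.06 | 1.60 | 85.34 | 100.00 | 0.00 | 0.00 |
| HISAT2 [163] graph | very-sensitive | 171.59 | 6.74 | 3.47 | 4.09 | 92.43 | 1.37 | 3.00 | 95.63 | 43.58 | 24.88 | 31.54 |
| HISAT2 [163] graph | sensitive | 96.25 | 6.73 | 6.96 | 3.24 | 89.80 | 2.40 | 3.37 | 94.23 | 94.23 | 0.87 | 5.00 |
| HISAT2 [163] graph | fast (default) | 35.03 | 6.72 | 17.22 | 1.71 | 81.08 | 12.89 | 1.79 | 85.31 | 100.00 | 0.00 | 0.00 |
| vg [152] | linear | 633.29 | 24.15 | 1.31 | 1.29 | 97.41 | 0.06 | 0.44 | 99.50 | 25.34 | 17.47 | 57.19 |
| vg [152] | graph | 692.07 | 26.55 | 1.29 | 1.30 | 97.41 | 0.06 | 0.44 | 99.50 | 24.78 | 17.68 | 57.54 |

Table 4.3: Alignment and correctness for the 100,000 250bp reads. Reported runtime is the median of three consecutive trials, and reported memory usage is the maximum memory footprint during alignment. U = unaligned; I = incorrect-by-score; C = correct-by-score.

Correctness definitions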

We can compare the simulation-based definition of correct-by-location, which is calculated by comparing the genomic coordinate where the read is simulated from to the coordinate where it is aligned, and the correct-by-score definition, which is calculated using Vargas matched to the scoring function of the alignment algorithm, using a simulation experiment. Three sets of 100,000 unpaired reads of length 50, 100, and 150 bases respectively, were simulated with Mason2 [188] using the Illumina error model, human chromosome 19, with variants from the 1000 Genomes Project individual NA18505 [182]. Reads were aligned with Bowtie 2 and BWA-MEM to the chromosome 19 reference. The optimal alignment score for each aligner’s scoring function was calculated by Vargas to compute correct-by-score.

The simulated coordinate was compared to the aligned coordinate to calculate correct-by-location. For alignments with Bowtie 2 to the chromosome 19 reference, we saw that 7.79%, 2.80%, and 1.31% of reads of length 50, 100, and 150 respectively, were correct-by-score but incorrect-by-location. BWA-MEM produces similar results. Table 4.4 lists the number of reads in each category. Our results illustrate that most of the time the definitions agree for simulated reads, but repetitive alignments can deflate the correct-by-location statistic.
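The percentages quoted above follow directly from the Bowtie 2 counts in Table 4.4, taken as a fraction of the 100,000 simulated reads per read length. The short Python check below reproduces that arithmetic; the variable and function names are illustrative only.

```python
TOTAL_READS = 100_000  # simulated reads per read length

# Bowtie 2 counts from Table 4.4, keyed by read length. Tuple order:
# (correct-by-score + correct-by-location,
#  incorrect-by-score + correct-by-location,
#  correct-by-score + incorrect-by-location,
#  incorrect-by-score + incorrect-by-location)
bowtie2_counts = {
    50: (91502, 34, 7790, 235),
    100: (96961, 23, 2800, 164),
    150: (98513, 17, 1313, 119),
}


def pct_score_only(counts):
    """Percent of all simulated reads that are correct-by-score
    but incorrect-by-location."""
    return round(100 * counts[2] / TOTAL_READS, 2)


for read_len, counts in bowtie2_counts.items():
    print(read_len, sum(counts), pct_score_only(counts))
```

The four categories sum to the number of aligned reads in each row (for example, 99,561 for the 50bp reads), and the third category alone accounts for the disagreement between the two definitions.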

Very rarely, there are alignments that are incorrect-by-score but correct-by-location. For BWA-MEM, we determined that the optimal alignment computed by Vargas is shorter, using soft-clipping to achieve a better score. For Bowtie 2, we determined that the optimal alignment computed by Vargas either has gaps near the end of the alignment which are disallowed by the Bowtie 2 --gbar parameter, or has mismatches at lower-base-quality bases than the mismatching bases in the simulated location, which led to a better score due to Bowtie 2’s base quality scaled mismatch penalty.

We can also compare these correctness definitions using Vargas alignments. Vargas reports the location of one optimal-scoring alignment, as well as the number of optimal alignments. Alignments must be at least 1 read-length apart to be counted as separate optimal alignments. Figure 4.9 compares the correct-by-score and correct-by-location definitions for the genomic alignments of the 100bp real sequencing reads with unique and repetitive optimal alignments for Bowtie 2 semiglobal, Bowtie 2 local, and BWA-MEM. When we consider only the reads that align uniquely (right column), there is little difference between the score and location based definitions of correctness. When we consider reads that align repetitively (left column), which comprise about 7.8% of alignments for all the aligners, noticeably more reads are correct-by-score than are correct-by-location, as expected based on the simulation results. Of note is that BWA-MEM has more correct-by-location reads when using a 30bp buffer size compared to a 5bp buffer size. This could be due to BWA-MEM selecting an alternative soft-clipping location with the same alignment score as the location Vargas reports.

| Read length | Algorithm | Aligned reads | Correct-by-score + Correct-by-location | Incorrect-by-score + Correct-by-location | Correct-by-score + Incorrect-by-location | Incorrect-by-score + Incorrect-by-location |
|---|---|---|---|---|---|---|
| 50 | Bowtie2 | 99561 | 91502 | 34 | 7790 | 235 |
| 50 | BWA-MEM | 99947 | 91734 | 2 | 8080 | 131 |
| 100 | Bowtie2 | 99948 | 96961 | 23 | 2800 | 164 |
| 100 | BWA-MEM | 100000 | 97096 | 1 | 2871 | 32 |
| 150 | Bowtie2 | 99962 | 98513 | 17 | 1313 | 119 |
| 150 | BWA-MEM | 100000 | 98565 | 1 | 1431 | 3 |

Table 4.4: For simulated reads from human chromosome 19, comparison of correct-by-score and correct-by-location definitions. Optimal alignment score is calculated by Vargas; correct-by-location is calculated by checking whether the simulated read location is exactly equal to the aligned location.

Salmon

The Salmon pseudoaligner [137] version 1.1.0 will optionally perform semiglobal dynamic programming alignment and produce a SAM file with alignment scores based on an edit distance scoring function when run with the command-line flags --writeMappings and --validateMappings. While the alignments themselves are not designed to be used in downstream analysis (e.g. the CIGAR string does not correspond to the reported alignment score), the alignment scores are used in this mode to select which alignments will be used in expression quantification. Using an RNA-seq dataset and Vargas alignments to the reference transcriptome, we evaluated the alignment scores reported by Salmon with the default algorithm, which performs alignment only between and beyond the maximal exact matches (MEMs), and with the --fullLengthAlignment flag, which performs alignment along the whole read.

We used 100,000 75bp RNA-seq reads from the GEUVADIS project, from 1000 Genomes Project cell line NA18505, SRA accession ERR204952. Vargas alignments of the reads to the transcriptome reference (comprising 227,912 sequences totaling 359,080,835 bases) took 17.5 hours using 48 cores on a computer with AVX2 instructions. AVX2 allows for 32 reads per vector with 8-bit lanes, but due to the scoring function and read length, 16-bit lanes were required, which allowed for 16 reads per vector. We ran Salmon (a custom version provided to us by the authors, which corrected a bug in alignment scoring present in release version 1.1.0) with and without the --fullLengthAlignment command-line parameter, which adjusts the heuristic strategy employed. The --skipQuant flag was also used, which skips the quantification algorithm. The default algorithm had more correct-by-score alignments compared to the full-length algorithm (Table 4.5, Figure 4.10).
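The relationship between lane width and reads per vector mentioned above is simple arithmetic, sketched below. The 256-bit register width of AVX2 is a known hardware fact; the helper `min_lane_bits` is a hypothetical illustration of why a wider lane may be needed when scores can exceed the signed 8-bit range, and it is not a description of how Vargas actually chooses its lane width.

```python
AVX2_VECTOR_BITS = 256  # width of one AVX2 register

def reads_per_vector(lane_bits: int, vector_bits: int = AVX2_VECTOR_BITS) -> int:
    """Number of reads processed in parallel when each read occupies one lane."""
    return vector_bits // lane_bits

def min_lane_bits(max_abs_score: int) -> int:
    """Smallest signed lane width (hypothetical rule) that can hold the score range."""
    for bits in (8, 16, 32):
        if max_abs_score <= 2 ** (bits - 1) - 1:
            return bits
    raise ValueError("score magnitude does not fit in a 32-bit lane")

print(reads_per_vector(8))    # 8-bit lanes
print(reads_per_vector(16))   # 16-bit lanes
```

Under this sketch, halving the throughput per vector is the cost of any scoring function and read length combination whose score magnitudes no longer fit in 8 bits.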

[Figure 4.9: six panels. Rows: Bowtie 2 semiglobal, Bowtie 2 local, BWA-MEM. Left column: non-unique alignments; right column: unique alignments. Horizontal axis: optimal alignment score. Vertical axis: fraction correct. Series: by_score, by_loc5, by_loc30.]

Figure 4.9: For the 100bp read dataset, comparing the correct-by-score and correct-by-location measurements, with different buffer sizes (5, 30) for location of the left alignment coordinate. Since Vargas indicates whether the optimal alignment is unique within one read-length, the right column considers only reads that have a unique optimal alignment. Points are more transparent when representing fewer reads with a given optimal alignment score.

Figure 4.10: Correct-by-score plots for Salmon validation mappings. Points are more transparent when representing fewer reads with a given optimal alignment score.

                                                          All reads              Highest-scoring 85%    Lowest-scoring 15%
Tool          Mode                   Time (s)  Mem. (GB)  %U     %I    %C        %U    %I    %C         %U     %I     %C
Salmon [137]  default                5.82      1.50       12.12  5.43  82.46     0.05  5.77  94.19      81.00  18.39  15.51
              --fullLengthAlignment  6.32      1.50       12.22  5.97  81.81     0.05  6.08  93.88      81.71  29.37  12.92

Table 4.5: Alignment and correctness for the 100,000 75bp RNA-seq reads aligned to the transcriptome. Reported runtime is the median of three consecutive trials with the --skipQuant flag, and reported memory usage is the maximum memory footprint during program execution. U = unaligned; I = incorrect-by-score; C = correct-by-score.

4.4 Mapping quality

The mapping quality (MAPQ) of a read alignment is defined by Li, Ruan, and Durbin [187] as:

MAPQ = −10 · log10 Pr[read is incorrectly mapped]

An accurate prediction for MAPQ requires an accurate prediction for the probability the alignment is incorrect. Heuristics make this difficult by effectively “censoring” the space of alignments the aligner can find. Because of this, heuristic aligners work with only partial information when making a MAPQ prediction. This leads to errors, such as predicting a high MAPQ for an incorrect alignment or for an alignment that truly deserves a low one.
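The definition above can be made concrete with a small conversion sketch. The function names and the cap of 60 are assumptions chosen for illustration; aligners differ in the maximum MAPQ they report.

```python
import math

def mapq_from_error_prob(p_wrong: float, cap: int = 60) -> int:
    """MAPQ = -10 * log10(Pr[read is incorrectly mapped]), rounded and capped.

    The cap is an illustrative assumption, not a property of the definition.
    """
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))

def error_prob_from_mapq(mapq: float) -> float:
    """Inverse of the definition: probability that the alignment is incorrect."""
    return 10 ** (-mapq / 10)

for q in (0, 10, 20, 30):
    print(q, error_prob_from_mapq(q))
```

Under the definition, a MAPQ of 10 corresponds to a 1 in 10 chance that the alignment is wrong, and each additional 10 points reduces that probability tenfold.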

Since downstream tools such as variant callers depend on MAPQs to make decisions about how to weigh and filter evidence, it is important to predict MAPQ accurately. Aligners predict MAPQ based on features such as the alignment score of the best and second-best alignments found or how repetitive the seed hits are. The prediction may also depend on the number of hits with the same score as the reported alignment [187]. To condense such features into a single score, BWA-MEM and vg use a formula, whereas Bowtie 2 and HISAT2 use a decision tree-like approach. Some aligners do not attempt to estimate mapping quality at all. Qtip [178] adjusts mapping quality using tandem simulation and extra output from the heuristic during the alignment algorithm.

Using Vargas, we can assess how well mapping quality reflects alignment correctness by grouping reads by their aligner-assigned or Qtip-adjusted mapping quality and calculating average correctness. In this case, it is important to use the correct-by-location definition (within a 5 bp buffer), to match the definition of MAPQ. Figure 4.11 shows results for Bowtie 2 with semiglobal or local alignment and BWA-MEM (red lines), and Qtip-adjusted (blue line), along with a black line indicating where the points would lie if they conformed perfectly to the mathematical definition of MAPQ, for the 100bp and 250bp read sets. Consistent with past experiments, Qtip-adjusted mapping qualities fall into a smaller numerical range than aligner-calculated MAPQ and are generally more monotonic and closer to the ideal.
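The binning procedure described above can be sketched as follows, assuming each read is reduced to a pair of its reported MAPQ and a Boolean correct-by-location flag. The function name and record layout are hypothetical. The expected fraction correct for a bin is taken directly from the MAPQ definition, 1 - 10^(-MAPQ/10).

```python
from collections import defaultdict

def calibration_by_mapq(records):
    """Group reads by MAPQ and compare observed with expected correctness.

    records: iterable of (mapq, is_correct_by_location).
    Returns {mapq: (n_reads, observed_fraction, expected_fraction)}.
    """
    totals = defaultdict(lambda: [0, 0])  # mapq -> [n_reads, n_correct]
    for mapq, is_correct in records:
        totals[mapq][0] += 1
        totals[mapq][1] += int(is_correct)
    result = {}
    for mapq in sorted(totals):
        n_reads, n_correct = totals[mapq]
        expected = 1 - 10 ** (-mapq / 10)  # from the MAPQ definition
        result[mapq] = (n_reads, n_correct / n_reads, expected)
    return result

# Made-up records: a well-calibrated MAPQ 10 bin and an over-performing MAPQ 0 bin.
demo = [(10, True)] * 9 + [(10, False)] + [(0, True), (0, False)]
for mapq, row in calibration_by_mapq(demo).items():
    print(mapq, row)
```

Points where the observed fraction falls below the expected fraction correspond to over-confident MAPQ values, and points above it to under-confident ones.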

[Figure 4.11 panels: (a) Bowtie 2 semiglobal, (b) Bowtie 2 local, (c) BWA-MEM. Horizontal axis: MAPQ. Vertical axis: fraction correct-by-location within +/- 5 bp. Series: sensitive (default) and sensitive + Qtip for Bowtie 2; k19 (default) and k19 + Qtip for BWA-MEM.]

Figure 4.11: The 100bp read set (a-c) and 250bp read set (d-f) are binned by mapping quality, shown on the horizontal axis, and correctness is measured using the correct-by-location definition within 5bp. The black line reflects the mathematical definition of mapping quality.

4.5 Optimizing alignment correctness of WGS reads

As an example of how Vargas alignments can be used to improve the heuristic alignment workflow by a user who is not necessarily a tool developer, we further examined the 100bp reads described previously, specifically the 8,365 reads that were unaligned by Bowtie 2 or had an alignment with >1 mismatch or at least 1 gap, with default parameters. All further analysis was performed on this subset of the original dataset, which we refer to as ‘difficult reads’. We wanted to determine whether more accurate alignments of difficult reads could be obtained by tuning command-line parameters without much increase in runtime, so we varied the seed length parameter of each aligner from 10 to 32 (see Supplementary Excel File 1 of [144]). In Table 4.6 we report results, as ratios relative to the default seed length, for the seed length that minimized the average difference between the optimal alignment score and the heuristic alignment score for aligned reads (first column for each aligner). Notably, for Bowtie 2 in both semiglobal (SG) and local (L) alignment modes, alignment with the optimal seed length was faster than with the default seed length. For BWA-MEM and vg, alignment with the optimal seed length was slower than with the default seed length, so we also included results for the best seed length whose runtime was less than 1.5 times that of the default parameters (second column).

This case study demonstrates how a set of real reads annotated with the optimal alignment score can be used to tune heuristic alignment parameters that are exposed to the user on the command line. While a 1-2% increase in correctly-scored alignments may seem marginal, this would have a significant impact on a dataset with millions or billions of reads and on use cases with a low signal-to-noise ratio such as cell-free DNA analysis or somatic variant calling. The parameters enabling this increase in accuracy can be identified with Vargas alignments of just a few thousand reads.
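The selection rule used in this case study can be sketched as follows. The sketch assumes that, for each candidate seed length, we have the heuristic alignment score of every read (or None when unaligned) and a measured runtime. All names and the data layout are hypothetical, and the scores and runtimes in the example are made up.

```python
def mean_as_difference(optimal, heuristic):
    """Mean (optimal - heuristic) alignment score over reads the heuristic aligned.

    optimal: {read_id: optimal_score}; heuristic: {read_id: score or None}.
    """
    diffs = [optimal[r] - s for r, s in heuristic.items() if s is not None]
    return sum(diffs) / len(diffs) if diffs else float("inf")

def pick_seed_length(optimal, runs, default_len, max_slowdown=None):
    """Choose the seed length minimizing mean AS difference.

    runs: {seed_len: (heuristic_scores, runtime_seconds)}.
    max_slowdown: if given, only consider settings whose runtime is at most
    max_slowdown times the runtime of the default seed length.
    """
    base_time = runs[default_len][1]
    candidates = {
        seed_len: mean_as_difference(optimal, scores)
        for seed_len, (scores, runtime) in runs.items()
        if max_slowdown is None or runtime <= max_slowdown * base_time
    }
    return min(candidates, key=candidates.get)

optimal = {"r1": -6, "r2": -12, "r3": 0}
runs = {
    22: ({"r1": -6, "r2": -18, "r3": None}, 10.0),  # default (made-up numbers)
    18: ({"r1": -6, "r2": -14, "r3": 0}, 12.0),
    14: ({"r1": -6, "r2": -12, "r3": 0}, 25.0),
}
print(pick_seed_length(optimal, runs, default_len=22))
print(pick_seed_length(optimal, runs, default_len=22, max_slowdown=1.5))
```

The unconstrained call corresponds to the first column reported for each aligner, and the call with `max_slowdown=1.5` to the runtime-limited second column.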

                                               Bowtie2 SG  Bowtie2 L  BWA-MEM       vg
Seed length                                    28          32         14     17     10     12
% aligned (ratio)                              1.00        0.95       1.00   1.00   1.02   1.02
% correct score (ratio)                        1.01        1.02       1.00   1.00   1.01   1.01
Mean AS difference, all aligned reads (ratio)  0.82        0.90       0.93   0.98   0.61   0.66
Mean AS difference, incorrect-by-score
  aligned reads (ratio)                        0.89        1.02       0.97   1.00   0.81   0.85
Alignment time (ratio)                         0.76        0.56       1.97   1.20   1.99   1.41

Table 4.6: Bowtie 2 semiglobal alignment (SG), Bowtie 2 local alignment (L), BWA-MEM, and vg using MAF > 10% graph genome, were run on 8,365 difficult reads from the 100bp dataset with seed length varying from 10 to 35. The value in the table equals the ratio of the measurement (row) under the stated parameter setting (column) to the same measurement under default parameters. For BWA-MEM and vg, the second parameter setting is the one that gave the lowest mean AS difference without taking more than 1.5 times as long as the default.

4.6 Optimizing alignment correctness of ChIP-seq reads

We also performed a similar optimization experiment for Bowtie 2, BWA-MEM and vg with graph genome using ChIP-seq reads. By varying command-line parameters on a small test set of 10,000 reads, we observed increased alignment rate and correctness-by-score on 570,000 difficult reads from the dataset, at the cost of increased runtime (Table 4.7; see also Supplementary Excel File 2 of [144]).

As an example of how Vargas alignments can be used to improve the heuristic alignment workflow for Bowtie 2, BWA-MEM and vg, we examined a set of 41 million unpaired 36-bp ChIP-seq reads from SRA accession SRR901802 [189]. In the study of [190], these reads were aligned with Bowtie 2 in an attempt to find novel QTLs. For initial alignment, we also used Bowtie 2 in the very-sensitive mode (runtime 6.5 minutes with 16 threads) and identified 570,000 (1.4%) reads that were unaligned or had an alignment with >1 mismatch or at least 1 gap. All further analysis was performed on this subset of the original dataset, which we refer to as difficult reads.
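The "difficult read" criterion can be sketched as a filter over SAM records. This is a minimal illustration and not the script used in the study. It relies on the SAM unmapped flag (0x4), on insertion and deletion operations in the CIGAR string, and on the XM:i mismatch-count tag that Bowtie 2 writes for aligned reads; treating a missing XM tag as zero mismatches is an assumption of this sketch.

```python
def is_difficult(sam_line: str) -> bool:
    """True if the read is unaligned, has >1 mismatch, or has at least 1 gap."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:                      # SAM flag bit: segment unmapped
        return True
    cigar = fields[5]
    has_gap = any(op in cigar for op in "ID")   # insertion or deletion
    mismatches = 0                      # assumption: absent XM tag means 0
    for tag in fields[11:]:
        name, value_type, value = tag.split(":", 2)
        if name == "XM" and value_type == "i":
            mismatches = int(value)
    return has_gap or mismatches > 1

def make_record(flag, cigar, xm):
    """Build a minimal synthetic SAM line for demonstration purposes."""
    return "\t".join(["r", str(flag), "chr1", "100", "42", cigar,
                      "*", "0", "0", "ACGT", "IIII", f"XM:i:{xm}"])

print(is_difficult(make_record(0, "36M", 0)))      # clean alignment
print(is_difficult(make_record(0, "36M", 2)))      # two mismatches
print(is_difficult(make_record(0, "20M1D16M", 0))) # one gap
print(is_difficult(make_record(4, "*", 0)))        # unaligned
```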

All of the difficult reads were realigned three times with Vargas, which was configured to match the default Bowtie 2 (semiglobal alignment), BWA-MEM, and vg scoring functions. vg alignment was performed to the graph of GRCh38 containing all 1000 Genomes Project variants with minor allele frequency > 10%; other alignments were performed to the linear GRCh38 reference. Alignment with Bowtie 2 parameters took 81 hours on 48 Skylake cores. From these, we then selected 10,000 reads and aligned them with a selection of parameter settings for each of the three heuristic aligners. For Bowtie 2, we changed the seed length (-L) and dynamic programming extensions (-D) parameters; for BWA-MEM and vg we changed the seed length and reseeding parameters. Based on examining summary statistics comparing each parameter setting to the Vargas scores, we selected the combination of parameters for each tool that aligned at least as many reads as the default parameters and minimized the average difference in alignment score of all aligned reads. For BWA-MEM and Bowtie 2, we also selected a parameter setting that took at most 1.5 times as long to run as the default setting. For vg, no parameter setting we explored met the additional runtime criterion. We realigned all the difficult reads with the new parameter settings. Results are summarized in Table 4.7.

                                               Bowtie2 SG            BWA-MEM               vg
Parameters                                     L14,D100   L22,D100   k11,r1.2   k15,r1.5   k10,r1.2
% aligned (ratio)                              1.27       1.01       1.06       1.05       1.25
% correct score (ratio)                        1.32       1.04       1.12       1.09       1.28
Mean AS difference, all aligned reads (ratio)  0.80       0.80       0.31       0.51       0.67
Mean AS difference, incorrect-by-score
  aligned reads (ratio)                        0.99       0.93       0.76       0.87       1.12
Alignment time (ratio)                         14.63      1.76       8.14       2.33       14.59

Table 4.7: Bowtie 2 semiglobal alignment, BWA-MEM, and vg using MAF > 10% graph genome, were run on 570,000 difficult reads from the ChIP-seq dataset with various parameter settings. The value in the table denotes the ratio between the parameter setting shown and the default parameters. For Bowtie 2 and BWA-MEM, the second parameter setting is the one that gave the lowest mean AS difference without aligning fewer reads or taking more than 1.5 times as long as the default setting in the 10,000-read experiment (see Supplementary Excel File 2 of [144]).

4.7 Discussion

We presented Vargas, a heuristic-free read alignment tool achieving extremely high multithreaded throughput using SIMD instructions for query-parallel vectorization. Vargas works with flexible alignment scoring functions (e.g. affine gap penalty) and parameters (e.g. local and semi-global alignment), and with both linear and graph references. Read alignments produced by Vargas can be used as a computational gold standard for evaluating short-read alignment algorithms, including with real sequencing datasets.

The Rabema study of Holtgrewe et al. [175] highlighted the disadvantages of evaluating aligners using only simulated reads and the correct-by-location definition. They developed the concept of the “trace tree” to enumerate mapping locations and a tool, Rabema, for computing all mappings of real or simulated sequencing reads within a certain Hamming or edit distance. Aligners were evaluated on their ability to return all matches within the distance threshold, all best matches, or any best match. However, using a single truth set of the optimal locations of all matches leaves alignment heuristics and scoring functions as confounding factors. Also, fixing a maximum distance k disregards reads where the optimal match is further than k from the reference, which we show are the most error-prone in semiglobal alignment algorithms.

Future benchmarking efforts comparing alignment algorithms, parameter settings, and graph versus linear reference genome paradigms can now be based on aligner-specific, real data computational gold standards generated using Vargas. Alignment methods developers can use Vargas alignments to identify particular reads where the heuristic fails to find the optimal solution to the optimization problem posed, and can revise the heuristic strategies accordingly. Knowing optimal alignments for a subset of the input reads can serve as training data for identifying optimal alignment parameters, as tools like Teaser [176] do using simulated reads. Such customization of parameters should be particularly effective for short (e.g. ChIP-seq) and/or error-prone (e.g. ancient DNA) reads.

Limitations of the study

Many short-read datasets use paired-end reads, where a DNA fragment is sequenced from both ends, typically with a few hundred bases between the read pairs. Heuristic aligners account for the fact that pairs should align concordantly to the reference, i.e. in a particular expected configuration based on the library preparation. Since concordance is not defined by the scoring function per se, and since checking for concordance of paired-end alignments can be implemented as a post-pass after each end has been aligned individually, we left paired-end alignment to future work.

We evaluated Salmon’s RNA-seq alignments to the transcriptome, but Vargas’ optimization functions do not extend to spliced short-read alignment, such as aligning an RNA-seq read to a genome. Possible optimization functions for penalizing splicing events could depend on intron length, genomic nucleotides at the donor and acceptor sites, and a transcriptome annotation. Extending the dynamic programming model to find optimal solutions would require that every position in the read could be spliced to any pair of coordinates in the genome with a corresponding alignment penalty, exponentially increasing the possibilities to be explored.

Vargas does not currently support the minimum-of-two-affine-functions gap scoring function used by Minimap2 [191] or variation graphs that are not DAGs, which would limit its evaluation of an aligner that worked with variation graphs containing cycles, for example.
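As an illustration of the paired-end point above, a concordance check can be written as a post-pass over two independently computed alignments. The sketch below is not part of Vargas. It assumes a forward-reverse library orientation, ungapped and unclipped alignments (so each read spans exactly `read_len` reference bases), and illustrative fragment length bounds; all names and defaults are hypothetical.

```python
def is_concordant(pos1, is_reverse1, pos2, is_reverse2, read_len,
                  min_frag=0, max_frag=500):
    """Check forward-reverse concordance of two mates aligned separately.

    pos1, pos2: leftmost reference coordinates of each mate.
    is_reverse1, is_reverse2: True if the mate aligned to the reverse strand.
    """
    if is_reverse1 == is_reverse2:
        return False                       # mates must be on opposite strands
    fwd, rev = (pos1, pos2) if not is_reverse1 else (pos2, pos1)
    if rev < fwd:
        return False                       # reverse mate must not start upstream
    fragment_len = rev + read_len - fwd    # span from forward start to reverse end
    return min_frag <= fragment_len <= max_frag

print(is_concordant(100, False, 300, True, read_len=100))  # 300bp fragment
print(is_concordant(100, False, 900, True, read_len=100))  # fragment too long
```

Because the check uses only the reported coordinates and strands, it can be applied after each end has been aligned optimally on its own, which is the post-pass arrangement described above.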

Future work

Scoring function Vargas opens the door to comprehensive study of the effects of alignment heuristics and, distinct from that, the effects of alignment scoring functions. Though the default scoring functions of tools like BWA-MEM and Bowtie 2 are widely used, they are not very well studied, and this is in large part because it is difficult to separate the effect of the scoring function from the closely related effects of the heuristics. To investigate the impact of scoring functions, we could calculate correctness-by-score or correctness-by-location of a heuristic aligner using many different scoring functions. We would compute Vargas alignments for each new scoring function. However, even if a heuristic alignment algorithm can consistently compute the optimal read alignment scores for a particular scoring function, this may not be the “best” scoring function. An alternate experiment could use simulated sequencing reads and Vargas optimal alignments to test which scoring function most often has the optimal alignment equal to the simulated location. It would also be useful to study how many optimal alignments are possible for each scoring function; for example, scoring functions that scale the mismatch penalty based on the base quality may have more reads with unique optimal alignments. Alternatively, we could use real sequencing reads. While real reads do not have a known location of origin, we could use technologies with some locality constraints; for example, paired-end reads align in a certain orientation and distance based on the library insert size distribution, and linked reads or synthetic long reads are made up of short reads from a few long molecules. These Vargas-only experiments would show which scoring function aligns reads to the correct genomic location, independent of a particular heuristic alignment algorithm.

Graph references Vargas alignments could also be used to evaluate the effects of different reference genomes on alignment accuracy. As variants are added to a variant graph reference, mismatches between the read and the standard linear reference at the true point of origin due to genetic variation are eliminated. In our experiments with HISAT2 and vg, we observed thousands of reads with a higher optimal alignment score to the graph genome compared to the linear genome. However, the number of optimally-scoring or high-scoring alignments may also increase when more variants are added. The graph also becomes larger and more complex with more variants, leading to increased computational demands to build and align to it (and index it, in the case of heuristic algorithms that use an index). We could use Vargas alignments to investigate these tradeoffs by comparing graph genomes containing different variant sets to each other and to linear references, as investigated using simulation in the FORGe study [159].

Transcriptome alignment In addition to the experiment we performed evaluating Salmon alignment scores, Vargas could be used to evaluate other (unspliced) alignment of RNA-seq reads to the reference transcriptome. Furthermore, Vargas’ graph alignment capabilities could be used to assess the impact of variant graph references on transcriptome alignment.

Enumerating all mappings Because Vargas calculates all possible alignments of read to reference in the course of filling the dynamic programming matrix, every match above a certain minimum score could be reported. This could be useful in CRISPR guide RNA off-target analysis: current approaches are limited to a few mismatches; using Vargas would allow for a full edit distance scoring function and enumeration of distant alignments, including in the presence of genetic variation. Fully profiling all possible alignments of a read, or a substring extracted from the genome, could also be applied to the problem of characterizing mappability as explored in Lee and Schatz [192] and Wilson et al. [193].
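The idea of reporting every match from a filled dynamic programming matrix can be illustrated with a small semiglobal edit distance sketch. This is a textbook-style approximate matching recurrence on a linear reference with unit costs. It is not Vargas' vectorized implementation and does not handle graphs or affine gaps; the function name and the example sequences are made up.

```python
def all_matches(read: str, ref: str, max_edits: int):
    """List (end, distance) for each reference end coordinate where the read
    aligns with edit distance <= max_edits.

    end counts the reference bases consumed, so the alignment ends at ref[end - 1].
    The read must align in full; it may start anywhere in the reference for free.
    """
    prev = [0] * (len(ref) + 1)            # row 0: free start at any position
    for i, read_base in enumerate(read, start=1):
        cur = [i] + [0] * len(ref)         # column 0: read prefix as insertions
        for j, ref_base in enumerate(ref, start=1):
            cur[j] = min(prev[j - 1] + (read_base != ref_base),  # (mis)match
                         prev[j] + 1,                            # insertion in read
                         cur[j - 1] + 1)                         # deletion from read
        prev = cur
    return [(j, d) for j, d in enumerate(prev) if j > 0 and d <= max_edits]

ref = "TTACGTTTACCTTT"
print(all_matches("ACGT", ref, max_edits=0))   # exact occurrences only
print(all_matches("ACGT", ref, max_edits=1))   # includes the ACCT site
```

Raising `max_edits` enumerates progressively more distant sites, which is the behavior that would matter for an off-target style analysis; neighboring end coordinates of the same site also appear once gaps are allowed.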

Chapter 5

Conclusion

Summary

This thesis presented three computational methods for established and emerging genome-analysis datatypes based on next-generation sequencing. Each of these bioinformatics tools addresses the genetic variations within and between human individuals in a different way. In Chapter 2, we described the mosaic variant caller Samovar [30]. Our novel method finds single-nucleotide variants present in some but not all cells in a bulk sequencing sample, harnessing germline variant phasing from linked reads to achieve high accuracy. Chapter 3 describes scHLAcount, a software pipeline that computes allele-specific molecule counts for the HLA genes from single-cell gene expression data [102]. Our method uses a personalized reference genome based on the individual’s genotypes to provide data that can reveal allele-specific and cell type-specific gene expression patterns. In Chapter 4, we analyzed and optimized short-read heuristic alignment accuracy using Vargas, a heuristic-free graph genome aligner we developed [144]. Read alignment is the first step in most next-generation sequencing workflows, so our analysis of popular alignment algorithms, especially those that use a variant graph reference genome, has broad impact.

Future directions

Samovar Although the linked read technology Samovar is designed to work with (10x Genomics) has been deprecated, many samples were sequenced using this technology during its lifetime. Samovar could be applied to the 10x Genomics linked read samples available in repositories in a comprehensive study of samples with potential heterogeneity. Samples where the biological material comes from cell lines are assumed to have very little genetic variation between cells. Mosaic variants detected by Samovar in these samples could be used to inform the filtering steps we take to eliminate false positives due to sequencing artifacts or read alignment errors. Cells within a healthy individual may be genetically different due to somatic mutations during development and cell differentiation [54, 194]. Applying Samovar to linked read samples taken from healthy but potentially genetically heterogeneous tissues could continue to elucidate the extent of somatic mutation in development and aging. Another promising area of future study with Samovar is in cancer, where the genetic differences between cancer and normal cells and even among different cancer cells in the same affected individual are known to affect disease progression and treatment [58]. In contrast to some other available methods, Samovar does not require normal control sequencing data, so it can be applied to studies where this was not available.

scHLAcount The HLA genes are involved in antigen presentation to the immune system. These genes are extremely variation-dense with thousands of cataloged alleles [117]. In bulk RNA sequencing studies, the expression of these genes has been shown to vary depending on cell type and genotype. Further, loss of expression and loss of heterozygosity of these genes has been demonstrated in cancer [119–121]. scHLAcount could be used to simultaneously explore allele-specific expression and cell type-specific expression in the growing body of single-cell gene expression data that is publicly available. Past studies have characterized HLA gene expression based on allele [124, 125] and tissue [127]. Applying scHLAcount to single-cell gene expression datasets that contain different cell types and that come from individuals with different HLA genotypes could reveal simultaneous trends in allele-specific and cell type-specific expression of these genes. This could offer insight into how expression is modulated normally, in diseases, and by therapeutic interventions, and the role of a person’s specific pair of alleles in this dynamic. One limitation of scHLAcount is that genotypes are required. The de Bruijn graph pseudoalignment strategy we use for molecule-counting could be applied to HLA genotyping directly from single-cell RNA-seq, as done for bulk RNA-seq by Orenbuch et al. [133]. If genotyping can be done reliably directly from the single-cell dataset, scHLAcount could be applied more broadly. If targeted single-cell gene expression assays were designed that captured more HLA transcripts, this could be an especially useful development for genotyping.

Vargas The performance of many other algorithms could be assessed using the strategies we presented for evaluating alignment correctness based on alignment score and alignment location. The study could also be expanded to include cases where the genetic divergence or sequencing error model between read and reference is different than our experiment with the human genome and Illumina short reads. Computing and distributing a “computational gold standard” set of pre-aligned reads from common benchmarking datasets would be a valuable resource to alignment tool authors. Our investigation of graph genomes with Vargas was limited to a graph using SNPs and indels with greater than 10% minor allele frequency from the 1000 Genomes Project. The question of which variants to include in a graph is an important one, as graph reference genomes are gaining traction in the genomics community and more catalogs of genetic variation in different global populations are continually being collected. This topic was explored using simulated reads in [159], but could be explored further using heuristic-free graph alignment. Adding more variants to a graph may increase the number of optimal alignments, but it can also increase the alignment score of the optimal alignment because mismatches due to genetic variation are eliminated. We can measure both of these phenomena with Vargas.

Long reads Reads of tens of thousands of bases or more are an important trend in genomics that could be applied to the future directions of these three research areas. While available technologies offer reads of tens of thousands of bases or longer, the error rates are still significantly higher than those of short reads. However, methods are emerging for germline variant calling from these reads [195], and haplotype assembly has also been performed [50]. These developments suggest that a Samovar-like model incorporating phasing-based features extracted from long reads could be successful at mosaic variant calling. Mosaic variants could also be phased into those occurring in the same cell and those in different cells, if they are close enough to be spanned by a read. Single cell analysis is also expanding into long reads, with methods such as ScISOr-Seq [196] combining 10x Genomics with PacBio Iso-seq, and a recent preprint combining 10x Genomics with Oxford Nanopore [197]. While long-read sequencing of the genome or transcriptome (for bulk samples) is less common than short-read sequencing, it would be useful to develop HLA genotyping methods specifically for long reads using databases such as IMGT-HLA, as currently available database-guided methods are limited to short reads and the long-read initiatives have been based on assembly. Combining database-guided and assembly methods, entire haplotypes can be resolved to unravel cis versus trans effects, large (structural) variants, and the evolutionary history of the MHC region, which includes numerous pseudogenes. Long read heuristic alignment applies some of the same strategies as short-read alignment, but these algorithms additionally have to deal with an increased sequencing error rate and fragmented alignments due to structural variation. Assessing the accuracy of long-read alignments using Vargas would require addressing several logistical challenges, but could help analyze and improve alignment heuristics in the same way we demonstrated for short reads.

Sequencing data from more individuals, large-scale catalogs of genetic variations in populations, and different types of sequencing reads will continue to increase the discoveries that can be made using genomics. Methods like those presented here, which consider the characteristics and scale of the data at hand, are necessary to gain those insights.

References

[1] W. Richard McCombie, John D. McPherson, and Elaine R. Mardis. “Next-generation sequencing technologies”. In: Cold Spring Harbor Perspectives in Medicine 9.11 (2019).
[2] Fritz J. Sedlazeck, Hayan Lee, Charlotte A. Darby, and Michael C. Schatz. “Piercing the dark matter: Bioinformatics of long-range sequencing and mapping”. In: Nature Reviews Genetics 19.6 (2018).
[3] Edward S. Rice and Richard E. Green. “New Approaches for Genome Assembly and Scaffolding”. In: Annual Review of Animal Biosciences 7.1 (2019).
[4] Eric S. Lander et al. “Initial sequencing and analysis of the human genome”. In: Nature 409.6822 (2001).
[5] J. et al. “The sequence of the human genome”. In: Science 291.5507 (2001).
[6] Karen H. Miga et al. “Telomere-to-telomere assembly of a complete human X chromosome”. In: bioRxiv (2019).
[7] Knut Reinert, Ben Langmead, David Weese, and Dirk J. Evers. “Alignment of Next-Generation Sequencing Reads”. In: Annual Review of Genomics and Human Genetics 16.1 (2015).
[8] Stefan Canzar and Steven L. Salzberg. “Short Read Mapping: An Algorithmic Tour”. In: Proceedings of the IEEE. Vol. 105. 3. Institute of Electrical and Electronics Engineers Inc., 2017.
[9] Dale Muzzey, Eric A. Evans, and Caroline Lieber. “Understanding the Basics of NGS: From Mechanism to Variant Calling”. In: Current Genetic Medicine Reports 3.4 (2015).
[10] Xiannian Zhang et al. “Comparative Analysis of Droplet-Based Ultra-High-Throughput Single-Cell RNA-Seq Systems”. In: Molecular Cell 73.1 (2019).
[11] Ben Hayes. “Overview of Statistical Methods for Genome-Wide Association Studies (GWAS)”. In: Methods in molecular biology (Clifton, N.J.) Ed. by C Gondro, J van der Werf, and B Hayes. Vol. 1019. Humana Press, 2013.
[12] Imad Abugessaisa, Takeya Kasukawa, and Hideya Kawaji. “Genome annotation”. In: Methods in Molecular Biology. Ed. by J Keith. Vol. 1525. Humana Press, 2017.
[13] Lea M. Starita et al. “Variant Interpretation: Functional Assays to the Rescue”. In: American Journal of Human Genetics 101.3 (2017).
[14] Feng Zhang and James R Lupski. “Non-coding genetic variants in human disease.” In: Human Molecular Genetics 24.R1 (2015).
[15] Charles Gawad, Winston Koh, and Stephen R. Quake. “Single-cell genome sequencing: Current state of the science”. In: Nature Reviews Genetics 17.3 (2016).
[16] Valentine Svensson, Roser Vento-Tormo, and Sarah A. Teichmann. “Exponential scaling of single-cell RNA-seq in the past decade”. In: Nature Protocols 13.4 (2018).
[17] Lia Chappell, Andrew J.C. Russell, and Thierry Voet. “Single-Cell (Multi)omics Technologies”. In: Annual Review of Genomics and Human Genetics 19.1 (2018).
[18] Marlon Stoeckius et al. “Simultaneous epitope and transcriptome measurement in single cells”. In: Nature Methods 14.9 (2017).
[19] 10X Genomics. A New Way of Exploring Immunity - Linking Highly Multiplexed Antigen Recognition to Immune Repertoire and Phenotype. Tech. rep. 2019.
[20] Illumina Inc. Illumina sequencing platforms. url: https://www.illumina.com/systems/sequencing-platforms.html (visited on 03/16/2020).
[21] Kris A. Wetterstrand. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). url: https://www.genome.gov/sequencingcostsdata (visited on 03/16/2020).
[22] Zachary D. Stephens et al. “Big Data: Astronomical or Genomical?” In: PLoS Biology 13.7 (2015).
[23] National Center for Biotechnology Information. Sequence Read Archive Overview. url: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi? (visited on 03/16/2020).
[24] Camille Marchet et al. “Data structures based on k-mers for querying large collections of sequencing datasets”. In: bioRxiv (2019).
[25] Will P.M. Rowe. “When the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data”. In: Genome Biology 20.1 (2019).
[26] Ruibang Luo. LRSIM. url: https://github.com/aquaskyline/LRSIM (visited on 05/28/2020).
[27] Ruibang Luo et al. “LRSim: A Linked-Reads Simulator Generating Insights for Better Genome Partitioning.” In: Computational and Structural Biotechnology Journal 15 (2017).
[28] IBM Research. Consortium for Sequencing the Food Supply Chain. url: https://researcher.watson.ibm.com/researcher/view_group.php?id=9635 (visited on 05/28/2020).
[29] RECOMB-SEQ 2019. Awards - RECOMB-SEQ 2019. url: https://recombseq.recomb2019.org/awards/ (visited on 03/22/2020).
[30] Charlotte A. Darby et al. “Samovar: Single-Sample Mosaic Single-Nucleotide Variant Calling with Linked Reads”. In: iScience 18 (2019).
[31] Charlotte Darby. samovar. url: https://github.com/cdarby/samovar (visited on 03/30/2020).
[32] Volodymyr Kuleshov et al. “Whole-genome haplotyping using long reads and statistical methods”. In: Nature Biotechnology 32.3 (2014).
[33] Brock A. Peters et al. “Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells”. In: Nature 487.7406 (2012).
[34] Ayelet Voskoboynik et al. “The genome sequence of the colonial chordate, Botryllus schlosseri”. In: eLife 2013.2 (2013).
[35] Alex Bishara et al. “Read clouds uncover variation in complex regions of the human genome.” In: Genome Research 25.10 (2015).
[36] Volodymyr Kuleshov, Michael P. Snyder, and . “Genome assembly from synthetic long read clouds”. In: Bioinformatics 32.12 (2016).
[37] Noah Spies et al. “Genome-wide reconstruction of complex structural variants using read clouds”. In: Nature Methods 14.9 (2017).
[38] Zhoutao Chen et al. “Ultra-low input single tube linked-read library method enables short-read NGS systems to generate highly accurate and economical long-range sequencing information for de novo genome assembly and haplotype phasing”. In: bioRxiv (2019).
[39] Fan Zhang et al. “Haplotype phasing of whole human genomes using bead-based barcode partitioning in a single tube”. In: Nature Biotechnology 35.9 (2017).
[40] Ou Wang et al. “Efficient and unique cobarcoding of second-generation sequencing reads from long DNA molecules enabling cost-effective and accurate sequencing, haplotyping, and de novo assembly”. In: Genome Research 29.5 (2019).
[41] Grace X. Y. Zheng et al. “Haplotyping germline and cancer genomes with high-throughput linked-read sequencing”. In: Nature Biotechnology 34.3 (2016).
[42] Patrick Marks et al. 
“Resolving the full spectrum of human genome variation using Linked-Reads”. In: Genome Research 29.4 (2019). [43] 10x Genomics. An Introduction to Linked-Read Technology for a More Comprehensive Genome and Exome Analysis. Tech. rep. 2016. [44] David C. Danko et al. “Minerva: An alignment- and reference-free approach to de- convolve Linked-Reads for metagenomics”. In: Genome Research 29.1 (2019).

135 [45] Atiya Shajii, Ibrahim Numanagić, Christopher Whelan, and . “Statis- tical Binning for Barcoded Reads Improves Downstream Analyses”. In: Cell Systems 7.2 (2018). [46] Aaron McKenna et al. “The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data”. In: Genome Research 20.9 (2010). [47] Erik Garrison and Gabor Marth. “Haplotype-based variant detection from short-read sequencing”. In: (2012). arXiv: 1207.3907. [48] Sangtae Kim et al. “Strelka2: fast and accurate calling of germline and somatic vari- ants”. In: Nature Methods 15.8 (2018). [49] Daniel P. Cooke, David C. Wedge, and Gerton Lunter. “A unified haplotype-based method for accurate and comprehensive variant calling”. In: bioRxiv (2018). [50] Peter Edge, , and Vikas Bansal. “HapCUT2: Robust and accurate haplo- type assembly for diverse sequencing technologies”. In: Genome Research 27.5 (2017). [51] Li C Xia et al. “Identification of large rearrangements in cancer genomes with barcode linked reads”. In: Nucleic Acids Research 46.4 (2018). [52] Donald Freed et al. “The Contribution of Mosaic Variants to Autism Spectrum Dis- order”. In: PLoS Genetics 12.9 (2016). [53] Naoto Usuyama et al. “HapMuC: somatic mutation calling using heterozygous germ line variants near candidate mutations.” In: Bioinformatics 30.23 (2014). [54] Yanmei Dou, Heather D. Gold, Lovelace J. Luquette, and Peter J. Park. “Detecting Somatic Mutations in Normal Cells”. In: Trends in Genetics 34.7 (2018). [55] Leslie G. Biesecker and Nancy B. Spinner. “A genomic view of mosaicism and human disease”. In: Nature Reviews Genetics 14.5 (2013). [56] A.S.A. Cohen, S.L. Wilson, J. Trinh, and X.C. Ye. “Detecting somatic mosaicism: considerations and clinical implications”. In: Clinical Genetics 87.6 (2015). [57] Hagop Youssoufian and Reed E. Pyeritz. “Mechanisms and consequences of somatic mosaicism in humans”. In: Nature Reviews Genetics 3.10 (2002). [58] Donald Freed, Eric L. 
Stevens, and Jonathan Pevsner. “Somatic Mosaicism in the Human Genome”. In: Genes 5.4 (2014). [59] Marzena Gajecka. “Unrevealed mosaicism in the next-generation sequencing era.” In: Molecular Genetics and Genomics 291.2 (2016). [60] Cathy C. Laurie et al. “Detectable clonal mosaicism from birth to old age and its relationship to cancer.” In: Nature Genetics 44.6 (2012). [61] Scott R. Kennedy, Lawrence A. Loeb, and Alan J. Herr. “Somatic mutations in aging, cancer and neurodegeneration”. In: Mechanisms of Ageing and Development 133.4 (2012).

136 [62] Klaasjan G. Ouwens et al. “A characterization of postzygotic mutations identified in monozygotic twins”. In: Human Mutation 39.10 (2018). [63] Bert Vogelstein et al. “Cancer genome landscapes.” In: Science 339.6127 (2013). [64] Ian R. Watson, Koichi Takahashi, P. Andrew Futreal, and Lynda Chin. “Emerging patterns of somatic mutations in cancer”. In: Nature Reviews Genetics 14.10 (2013). [65] Annapurna Poduri, Gilad D. Evrony, Xuyu Cai, and Christopher A. Walsh. “Somatic mutation, genomic variation, and neurological disease.” In: Science 341.6141 (2013). [66] Michael J. McConnell et al. “Intersection of diverse neuronal genomes and neuropsy- chiatric disease: The Brain Somatic Mosaicism Network.” In: Science (New York, N.Y.) 356.6336 (2017). [67] Alissa M. D’Gama and Christopher A. Walsh. “Somatic mosaicism and neurodevel- opmental disease”. In: Nature Neuroscience 21.11 (2018). [68] Matthew D. Shirley et al. “Sturge-Weber Syndrome and Port-Wine Stains Caused by Somatic Mutation in GNAQ”. In: New England Journal of Medicine 368.21 (2013). [69] Lee S. Weinstein et al. “Activating Mutations of the Stimulatory G Protein in the McCune-Albright Syndrome”. In: New England Journal of Medicine 325.24 (1991). [70] Marjorie J. Lindhurst et al. “A Mosaic Activating Mutation in AKT1 Associated with the Proteus Syndrome”. In: New England Journal of Medicine 365.7 (2011). [71] Derrick E. Wood et al. “A machine learning approach for somatic mutation discovery”. In: Science Translational Medicine 10.457 (2018). [72] Yuichi Shiraishi et al. “An empirical Bayesian framework for somatic mutation detec- tion from cancer genome sequencing data.” In: Nucleic Acids Research 41.7 (2013). [73] Remi Torracinta et al. “Adaptive Somatic Mutations Calls with Deep Learning and Semi-Simulated Data”. In: bioRxiv (2016). [74] Fabien Campagne. “http://dx.doi.org/10.1101/079087 CONTINUATION: Evalua- tion of adaptive somatic models in a gold standard whole genome somatic dataset”. 
In: bioRxiv (2016). [75] Irina Kalatskaya et al. “ISOWN: accurate somatic mutation identification in the ab- sence of normal tissue controls”. In: Genome Medicine 9.1 (2017). [76] Andrew Roth et al. “JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data”. In: Bioinformatics 28.7 (2012). [77] Giuseppe Narzisi et al. “Genome-wide somatic variant calling using localized colored de Bruijn graphs”. In: Communications Biology 1.1 (2018).

137 [78] Subhajit Sengupta et al. “Ultra-fast local-haplotype variant calling using paired-end DNA-sequencing data reveals somatic mosaicism in tumor and normal blood sam- ples.” In: Nucleic Acids Research 44.3 (2016). [79] Andreas Wilm et al. “LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets”. In: Nucleic Acids Research 40.22 (2012). [80] Yanmei Dou et al. “Accurate detection of mosaic variants in sequencing data without matched controls”. In: Nature Biotechnology (2020). [81] August Y. Huang et al. “Postzygotic single-nucleotide mosaicisms in whole-genome sequences of clinically unremarkable individuals”. In: Cell Research 24.11 (2014). [82] August Yue Huang et al. “MosaicHunter: Accurate detection of postzygotic single- nucleotide mosaicism through next-generation sequencing of unpaired, trio, and paired samples”. In: Nucleic Acids Research 45.10 (2017). [83] Jiarui Ding et al. “Feature-based classifiers for somatic mutation detection in tumour- normal paired sequencing data.” In: Bioinformatics 28.2 (2012). [84] Kristian Cibulskis et al. “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples”. In: Nature Biotechnology 31.3 (2013). [85] David Benjamin et al. “Calling Somatic SNVs and Indels with Mutect2”. In: bioRxiv (2019). [86] Jean-François Spinella et al. “SNooPer: a machine learning-based method for somatic variant identification from low-pass next-generation sequencing.” In: BMC Genomics 17.1 (2016). [87] David E. Larson et al. “SomaticSniper: identification of somatic point mutations in whole genome sequencing data”. In: Bioinformatics 28.3 (2012). [88] Heng Li and Richard Durbin. “Fast and accurate long-read alignment with Burrows- Wheeler transform.” In: Bioinformatics) 26.5 (2010). [89] Christopher T. Saunders et al. “Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs”. 
In: Bioinformatics 28.14 (2012). [90] Zhongwu Lai et al. “VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research.” In: Nucleic Acids Research 44.11 (2016). [91] Daniel C. Koboldt et al. “VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing”. In: Genome Research 22.3 (2012). [92] Sangwoo Kim et al. “Virmid: accurate detection of somatic mutations with sample impurity inference”. In: Genome Biology 14.8 (2013). [93] F. Pedregosa et al. “Scikit-learn: Machine Learning in Python”. In: Journal of Ma- chine Learning Research 12 (2011).

138 [94] Han Fang et al. “Reducing INDEL calling errors in whole genome and exome sequenc- ing data”. In: Genome Medicine 6.10 (2014). [95] Alexej Abyzov, Alexander E Urban, Michael Snyder, and Mark Gerstein. “CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing.” In: Genome Research 21.6 (2011). [96] Adam D. Ewing et al. “Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection”. In: Nature Methods 12.7 (2015). [97] Justin M. Zook et al. “Extensive sequencing of seven human genomes to characterize benchmark reference materials”. In: Scientific Data 3 (2016). [98] Katherine E. Miller et al. “Genome sequencing identifies somatic BRAF duplica- tion c.1794_1796dupTAC;p.Thr599dup in pediatric patient with low-grade gangli- oglioma”. In: Molecular Case Studies 4.2 (2018). [99] Kai Wang, Mingyao Li, and Hakon Hakonarson. “ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data”. In: Nucleic Acids Research 38.16 (2010). [100] Matthew H. Bailey et al. “Comprehensive Characterization of Cancer Driver Genes and Mutations.” In: Cell 173.2 (2018). [101] Aaron M. Wenger et al. “Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome”. In: Nature Biotechnology 37.10 (2019). [102] Charlotte A. Darby et al. “scHLAcount: Allele-specific HLA expression from single- cell gene expression data”. In: Bioinformatics (2020). btaa264. [103] Charlotte Darby, Ian Fiddes, and Patrick Marks. scHLAcount. url: https://github. com/10XGenomics/scHLAcount (visited on 03/30/2020). [104] Tamar Hashimshony et al. “CEL-Seq2: Sensitive highly-multiplexed single-cell RNA- Seq”. In: Genome Biology 17.1 (2016). [105] Evan Z. Macosko et al. “Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets”. In: Cell 161.5 (2015). [106] Allon M. Klein et al. 
“Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells”. In: Cell 161.5 (2015). [107] Diego Adhemar Jaitin et al. “Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types”. In: Science 343.6172 (2014). [108] Magali Soumillon et al. “Characterization of directed differentiation by high-throughput single-cell RNA-Seq”. In: bioRxiv (2014).

139 [109] Grace X. Y. Zheng et al. “Massively parallel digital transcriptional profiling of single cells.” In: Nature Communications 8 (2017). [110] Simone Picelli et al. “Smart-seq2 for sensitive full-length transcriptome profiling in single cells”. In: Nature Methods 10.11 (2013). [111] Simone Picelli et al. “Full-length RNA-seq from single cells using Smart-seq2”. In: Nature Protocols 9.1 (2014). [112] Christoph Ziegenhain et al. “Comparative Analysis of Single-Cell RNA Sequencing Methods”. In: Molecular Cell 65.4 (2017). [113] 10x Genomics. Technical Note - Assay Scheme and Configuration of Chromium™Single Cell 3’ v2 Libraries. Tech. rep. 2017. url: https://support.10xgenomics.com/ permalink/oMSpEjxU0SAwiqM8cu4us. [114] 10x Genomics. Chromium™Single Cell V(D)J Reagent Kits User Guide. Tech. rep. 2017. url: https://support.10xgenomics.com/permalink/S2GnjMyBEGm6eQyiceGYQ. [115] Allegra A. Petti et al. “A general approach for detecting expressed mutations in AML cells using single cell RNA-sequencing”. In: Nature Communications 10.1 (2019). [116] Alexander Dobin et al. “STAR: ultrafast universal RNA-seq aligner”. In: Bioinfor- matics 29.1 (2013). [117] James Robinson et al. “The IPD and IMGT/HLA database: allele variant databases.” In: Nucleic Acids Research 43.Database issue (2015). [118] Denis C. Bauer, Armella Zadoorian, Laurence O. W. Wilson, and Natalie P. Thorne. “Evaluation of computational programs to predict HLA genotypes from genomic se- quencing data”. In: Briefings in Bioinformatics 19.2 (2018). [119] K. G. Paulson et al. “Acquired cancer resistance to combination immunotherapy from transcriptional loss of class I HLA”. In: Nature Communications 9.1 (2018). [120] Matthew J. Christopher et al. “Immune Escape of Relapsed AML Cells after Allo- geneic Transplantation”. In: New England Journal of Medicine 379.24 (2018). [121] Nicholas McGranahan et al. “Allele-Specific HLA Loss and Immune Escape in Lung Cancer ”. In: Cell 171.6 (2017). [122] Vitor R. C. 
Aguiar et al. “Expression estimation and eQTL mapping for HLA genes with a personalized pipeline”. In: PLoS Genetics 15.4 (2019). [123] Wanseon Lee, Katharine Plant, Peter Humburg, and Julian C Knight. “AltHapAlignR: improved accuracy of RNA-seq analyses through the use of alternative haplotypes”. In: Bioinformatics 34.14 (2018). [124] F. Bettens, L. Brunet, and J.-M. Tiercy. “High-allelic variability in HLA-C mRNA expression: association with HLA-extended haplotypes”. In: Genes & Immunity 15.3 (2014).

140 [125] M. Zajacova, A. Kotrbova-Kozak, and M. Cerna. “Expression of HLA-DQA1 and HLA-DQB1 genes in B lymphocytes, monocytes and whole blood”. In: International Journal of Immunogenetics 45.3 (2018). [126] Justin M Greene et al. “Differential MHC class I expression in distinct leukocyte subsets”. In: BMC Immunology 12.1 (2011). [127] Sebastian Boegel et al. “HLA and proteasome expression body map”. In: BMC Med- ical Genomics 11.1 (2018). [128] H. Erlich. “HLA DNA typing: past, present, and future”. In: Tissue Antigens 80.1 (2012). [129] Heewook Lee and Carl Kingsford. “Kourami: graph-guided assembly for novel human leukocyte antigen allele discovery”. In: Genome Biology 19.1 (2018). [130] András Szolek et al. “OptiType: precision HLA typing from next-generation sequenc- ing data”. In: Bioinformatics 30.23 (2014). [131] Chao Xie et al. “Fast and accurate HLA typing from short-read next-generation sequence data with xHLA”. In: Proceedings of the National Academy of Sciences of the of America 114.30 (2017). [132] Sebastian Boegel et al. “HLA typing from RNA-Seq sequence reads”. In: Genome Medicine 4.12 (2012). [133] Rose Orenbuch et al. “arcasHLA: high resolution HLA typing from RNAseq”. In: Bioinformatics (2019). [134] Martin L. Buchkovich et al. “HLAProfiler utilizes k-mer profiles to improve HLA calling accuracy for rare and common alleles in RNA-seq data”. In: Genome Medicine 9.1 (2017). [135] Rui Tian et al. “Extraordinary diversity of HLA class I gene expression in single cells contribute to the plasticity and adaptability of human immune system”. In: bioRxiv (2019). [136] Phillip E.C. Compeau, Pavel A. Pevzner, and Glenn Tesler. “How to apply de Bruijn graphs to genome assembly”. In: Nature Biotechnology 29.11 (2011). [137] Rob Patro et al. “Salmon provides fast and bias-aware quantification of transcript expression”. In: Nature Methods 14.4 (2017). [138] Nicolas L Bray, Harold Pimentel, Páll Melsted, and . 
“Near-optimal probabilistic RNA-seq quantification.” In: Nature biotechnology 34.5 (2016). [139] Qiaolin Deng, Daniel Ramsköld, Björn Reinius, and Rickard Sandberg. “Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells”. In: Science 343.6167 (2014).

141 [140] Björn Reinius et al. “Analysis of allelic expression patterns in clonal somatic cells by single-cell RNA-seq”. In: Nature Genetics 48.11 (2016). [141] Yuchao Jiang, Nancy R. Zhang, and Mingyao Li. “SCALE: modeling allele-specific gene expression by single-cell RNA sequencing”. In: Genome Biology 18.1 (2017). [142] Ian Fiddes and Patrick Marks. vartrix. url: https://github.com/10XGenomics/ vartrix (visited on 03/30/2020). [143] Tim Stuart et al. “Comprehensive Integration of Single-Cell Data.” In: Cell 177.7 (2019). [144] Charlotte A. Darby, Ravi Gaddipati, Michael C. Schatz, and Ben Langmead. “Vargas: heuristic-free alignment for assessing linear and graph read aligners”. In: Bioinformat- ics (2020). btaa265. [145] Charlotte Darby, Ravi Gaddipati, Daniel Baker, and Ben Langmead. vargas. url: https://github.com/langmead-lab/vargas (visited on 03/30/2020). [146] Charlotte Darby. vargas-experiments. url: https://github.com/cdarby/vargas- experiments (visited on 03/30/2020). [147] Osamu Gotoh. “An improved algorithm for matching biological sequences”. In: Jour- nal of Molecular Biology 162.3 (1982). [148] Heng Li et al. “A synthetic-diploid benchmark for accurate variant-calling evaluation”. In: Nature Methods 15.8 (2018). [149] Saul B. Needleman and Christian D. Wunsch. “A general method applicable to the search for similarities in the amino acid sequence of two proteins”. In: Journal of Molecular Biology 48.3 (1970). [150] T.F. Smith and M.S. Waterman. “Identification of common molecular subsequences”. In: Journal of Molecular Biology 147.1 (1981). [151] Chirag Jain et al. “Accelerating Sequence Alignment to Graphs”. In: bioRxiv (2019). [152] Erik Garrison et al. “Variation graph toolkit improves read mapping by representing genetic variation in the reference”. In: Nature Biotechnology 36.9 (2018). [153] Mikko Rautiainen, Veli Mäkinen, and Tobias Marschall. “Bit-parallel sequence-to- graph alignment”. In: Bioinformatics (2019). 
[154] Mengyao Zhao, Wan-Ping Lee, Erik P. Garrison, and Gabor T. Marth. “SSW Library: An SIMD Smith-Waterman C/C++ Library for Use in Genomic Applications”. In: PLoS ONE 8.12 (2013). [155] René Rahn et al. “Generic accelerated sequence alignment in SeqAn using vectoriza- tion and multi-threading”. In: Bioinformatics 34.20 (2018). [156] Jeff Daily. “Parasail: SIMD C library for global, semi-global, and local pairwise se- quence alignments”. In: BMC Bioinformatics 17.1 (2016).

142 [157] Benedict Paten, Adam M. Novak, Jordan M. Eizenga, and Erik Garrison. “Genome graphs and the evolution of genome inference”. In: Genome Research 27.5 (2017). [158] Xiaofei Yang, Wan-Ping Lee, Kai Ye, and Charles Lee. “One reference genome is not enough”. In: Genome Biology 20.1 (2019). [159] Jacob Pritt, Nae-Chyun Chen, and Ben Langmead. “FORGe: prioritizing variants for graph genomes”. In: Genome Biology 19.1 (2018). [160] Sara Ballouz, Alexander Dobin, and Jesse A. Gillis. “Is it time to change the reference genome?” In: Genome Biology 20.1 (2019). [161] Deanna M Church et al. “Extending reference assembly models”. In: Genome Biology 16.1 (2015). [162] C. Lee, C. Grasso, and M. F. Sharlow. “Multiple sequence alignment using partial order graphs”. In: Bioinformatics 18.3 (2002). [163] Daehwan Kim et al. “Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype”. In: Nature Biotechnology 37.8 (2019). [164] Korbinian Schneeberger et al. “Simultaneous alignment of short reads against multiple genomes”. In: Genome Biology 10.9 (2009). [165] Ravi Vijaya Satya, Nela Zavaljevski, and Jaques Reifman. “A new strategy to reduce allelic bias in RNA-Seq readmapping”. In: Nucleic Acids Research 40.16 (2012). [166] L. Huang, V. Popic, and S. Batzoglou. “Short read alignment with populations of genomes”. In: Bioinformatics 29.13 (2013). [167] A Wozniak. “Using video-oriented instructions to speed up sequence comparison.” In: Computer applications in the biosciences : CABIOS 13.2 (1997). [168] Torbjørn Rognes and Erling Seeberg. “Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors.” In: Bioin- formatics 16.8 (2000). [169] Michael Farrar. “Striped Smith-Waterman speeds database searches six times over other SIMD implementations”. In: Bioinformatics 23.2 (2007). [170] Torbjørn Rognes. 
“Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation.” In: BMC bioinformatics 12 (2011). [171] Intel Corporation. Intel architecture instruction set extensions programming reference. 2015. url: https://software.intel.com/sites/default/files/managed/07/ b7/319433-023.pdf. [172] Yongchao Liu and Bertil Schmidt. “SWAPHI: Smith-waterman protein database search on Xeon Phi coprocessors”. In: 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors. IEEE, 2014. url: http: //ieeexplore.ieee.org/document/6868657/.

143 [173] Yongchao Liu, Adrianto Wirawan, and Bertil Schmidt. “CUDASW++ 3.0: acceler- ating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions”. In: BMC Bioinformatics 14.1 (2013). [174] Ayat Hatem, Doruk Bozdaˇg,Amanda E Toland, and Ümit V Çatalyürek. “Bench- marking short sequence mapping tools”. In: BMC Bioinformatics 14.1 (2013). [175] Manuel Holtgrewe, Anne-Katrin Emde, David Weese, and Knut Reinert. “A novel and well-defined benchmarking method for second generation read mapping”. In: BMC Bioinformatics 12.1 (2011). [176] Moritz Smolka et al. “Teaser: Individualized benchmarking and optimization of read mapping results for NGS data”. In: Genome Biology 16.1 (2015). [177] Ulrike H. Taron, Moritz Lell, Axel Barlow, and Johanna L.A. Paijmans. “Testing of alignment parameters for ancient samples: Evaluating and optimizing mapping parameters for ancient samples using the TAPAS tool”. In: Genes 9.3 (2018). [178] Ben Langmead. “A tandem simulation framework for predicting mapping quality”. In: Genome Biology 18.1 (2017). [179] Avinash Sodani. “Knights landing (KNL): 2nd Generation Intel® Xeon Phi proces- sor”. In: 2015 IEEE Hot Chips 27 Symposium (HCS). IEEE, 2015. [180] James Jeffers, James Reinders, and Avinash Sodani. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann, 2016. [181] Simon M. Tam et al. “SkyLake-SP: A 14nm 28-Core xeon® processor”. In: 2018 IEEE International Solid - State Circuits Conference - (ISSCC). IEEE, 2018. [182] Ernesto Lowy-Gallego et al. “Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project”. In: Wellcome Open Research 4 (2019). [183] Ben Langmead and Steven L. Salzberg. “Fast gapped-read alignment with Bowtie 2”. In: Nature Methods 9.4 (2012). [184] Heng Li. “Aligning sequence reads, clone sequences and assembly contigs with BWA- MEM”. In: (2013). arXiv: 1303.3997. [185] Md. 
Vasimuddin, Sanchit Misra, Heng Li, and Srinivas Aluru. “Efficient architecture- aware acceleration of BWA-MEM for multicore systems”. In: Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019. In- stitute of Electrical and Electronics Engineers Inc., 2019. [186] Daehwan Kim. hisat2. url: https://github.com/DaehwanKimLab/hisat2/tree/ hisat2_v2.2.0_beta (visited on 03/30/2020). [187] Heng Li, Jue Ruan, and Richard Durbin. “Mapping short DNA sequencing reads and calling variants using mapping quality scores.” In: Genome research 18.11 (2008).

144 [188] Manuel Holtgrewe. Mason-A Read Simulator for Second Generation Sequencing Data. Tech. rep. 2010. [189] Graham McVicker et al. “Identification of Genetic Variants That Affect Histone Mod- ifications in Human Cells”. In: Science 342.6159 (2013). [190] Bryce Van De Geijn, Graham Mcvicker, Yoav Gilad, and Jonathan K. Pritchard. “WASP: Allele-specific software for robust molecular quantitative trait locus discov- ery”. In: Nature Methods 12.11 (2015). [191] Heng Li. “Minimap2 : pairwise alignment for nucleotide sequences”. In: Bioinformatics 34.May (2018). [192] Hayan Lee and Michael C. Schatz. “Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score”. In: Bioinformatics 28.16 (2012). [193] Laurence O.W. Wilson et al. “VARSCOT: Variant-aware detection and scoring en- ables sensitive and personalized off-target detection for CRISPR-Cas9”. In: BMC Biotechnology 19.1 (2019). [194] Henne Holstege et al. “Somatic mutations found in the healthy blood compartment of a 115-yr-old woman demonstrate oligoclonal hematopoiesis.” In: Genome research 24.5 (2014). [195] Ruibang Luo et al. “Exploring the limit of using a deep neural network on pileup data for germline variant calling”. In: Nature Machine Intelligence 2.4 (2020). [196] Ishaan Gupta et al. “Single-cell isoform RNA sequencing characterizes isoforms in thousands of cerebellar cells”. In: Nature Biotechnology 36.12 (2018). [197] Kevin Lebrigand, Virginie Magnone, Pascal Barbry, and Rainer Waldmann. “High throughput, error corrected Nanopore single cell transcriptome sequencing”. In: bioRxiv (2019).

Candidate Biography

Charlotte Ay Darby was born in 1994 in Pennsylvania, USA. She received a Bachelor of Science in with University Honors from Carnegie Mellon University (Pittsburgh, PA) in 2015, and a Master of Science in Computational Biology with Research Honors from Carnegie Mellon University in 2016. Her Master’s thesis, “New nomenclature for horizontal gene transfer”, was advised by Prof. Dannie Durand and led to the publication “Xenolog Classification” (Bioinformatics, 2016). During those years, she completed research internships at Carnegie Mellon University with the HHMI Summer Research Program in 2014 and at Cold Spring Harbor Laboratory with the Undergraduate Research Program in 2015. Charlotte entered the Computer Science PhD program in 2016, co-advised by Prof. Ben Langmead and Prof. Michael Schatz. She earned the Master of Science in Computer Science in 2019. She completed industry internships at IBM Research Almaden (San Jose, CA) in 2018 and 10x Genomics (Pleasanton, CA) in 2019.
