GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis

Damla Senol Cali†⋈, Gurpreet S. Kalsi⋈, Zülal Bingöl▽, Can Firtina, Lavanya Subramanian‡, Jeremie S. Kim†, Rachata Ausavarungnirun, Mohammed Alser, Juan Gomez-Luna, Amirali Boroumand†, Anant Nori⋈, Allison Scibisz†, Sreenivas Subramoney⋈, Can Alkan▽, Saugata Ghose⋆†, Onur Mutlu†▽

†Carnegie Mellon University, ⋈Processor Architecture Research Lab, Intel Labs, ▽Bilkent University, ETH Zürich, ‡Facebook, King Mongkut’s University of Technology North Bangkok, ⋆University of Illinois at Urbana–Champaign

Genome sequence analysis has enabled significant advancements in medical and scientific areas such as personalized medicine, outbreak tracing, and the understanding of evolution. To perform genome sequencing, devices extract small random fragments of an organism’s DNA sequence (known as reads). The first step of genome sequence analysis is a computational process known as read mapping. In read mapping, each fragment is matched to its potential location in the reference genome with the goal of identifying the original location of each read in the genome. Unfortunately, rapid genome sequencing is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems, as many of the steps in genome sequence analysis must process a large amount of data. A major contributor to this bottleneck is approximate string matching (ASM), which is used at multiple points during the mapping process. ASM enables read mapping to account for sequencing errors and genetic variations in the reads.

We propose GenASM, the first ASM acceleration framework for genome sequence analysis. GenASM performs bitvector-based ASM, which can efficiently accelerate multiple steps of genome sequence analysis. We modify the underlying ASM algorithm (Bitap) to significantly increase its parallelism and reduce its memory footprint. Using this modified algorithm, we design the first hardware accelerator for Bitap. Our hardware accelerator consists of specialized systolic-array-based compute units and on-chip SRAMs that are designed to match the rate of computation with memory capacity and bandwidth, resulting in an efficient design whose performance scales linearly as we increase the number of compute units working in parallel.

We demonstrate that GenASM provides significant performance and power benefits for three different use cases in genome sequence analysis. First, GenASM accelerates read alignment for both long reads and short reads. For long reads, GenASM outperforms state-of-the-art software and hardware accelerators by 116× and 3.9×, respectively, while reducing power consumption by 37× and 2.7×. For short reads, GenASM outperforms state-of-the-art software and hardware accelerators by 111× and 1.9×. Second, GenASM accelerates pre-alignment filtering for short reads, with 3.7× the performance of a state-of-the-art pre-alignment filter, while reducing power consumption by 1.7× and significantly improving the filtering accuracy. Third, GenASM accelerates edit distance calculation, with 22–12501× and 9.3–400× speedups over the state-of-the-art software library and FPGA-based accelerator, respectively, while reducing power consumption by 548–582× and 67×. We conclude that GenASM is a flexible, high-performance, and low-power framework, and we briefly discuss four other use cases that can benefit from GenASM.

1. Introduction

Genome sequencing, which determines the DNA sequence of an organism, plays a pivotal role in enabling many medical and scientific advancements in personalized medicine [6, 20, 34, 53, 59], evolutionary theory [46, 139, 140], and forensics [17, 25, 179]. Modern genome sequencing machines [77–79, 132–135, 152] can rapidly generate massive amounts of genomics data at low cost [8, 118, 153], but are unable to extract an organism’s complete DNA in one piece. Instead, these machines extract smaller random fragments of the original DNA sequence, known as reads. These reads then pass through a computational process known as read mapping, which takes each read, aligns it to one or more possible locations within the reference genome, and finds the matches and differences (i.e., distance) between the read and the reference genome segment at that location [6, 177]. Read mapping is the first key step in genome sequence analysis.

State-of-the-art sequencing machines produce broadly one of two kinds of reads. Short reads (consisting of no more than a few hundred DNA base pairs [30, 158]) are generated using short-read sequencing (SRS) technologies [144, 164], which have been on the market for more than a decade. Because each read fragment is so short compared to the entire DNA (e.g., a human’s DNA consists of over 3 billion base pairs [166]), short reads incur a number of reproducibility (e.g., non-deterministic mapping) and computational challenges [7, 10, 12, 52, 118, 159, 176–178]. Long reads (consisting of thousands to millions of DNA base pairs) are generated using long-read sequencing (LRS) technologies, of which Oxford Nanopore Technologies’ (ONT) nanopore sequencing [26, 35, 40, 82, 83, 89, 97, 112, 113, 116, 143, 152] and Pacific Biosciences’ (PacBio) single-molecule real-time (SMRT) sequencing [18, 47, 114, 123, 145, 146, 165, 171] are the most widely used ones. LRS technologies are relatively new, and they avoid many of the challenges faced by short reads.

LRS technologies have three key advantages compared to SRS technologies. First, LRS devices can generate very long reads, which (1) reduces the non-deterministic mapping problem faced by short reads, as long reads are significantly more likely to be unique and therefore have fewer potential mapping locations in the reference genome; and (2) span larger parts of the repeated or complex regions of a genome, enabling detection of genetic variations that might exist in these regions [165]. Second, LRS devices perform real-time sequencing, and can enable concurrent sequencing and analysis [111, 142, 146]. Third, ONT’s pocket-sized device (MinION [133]) provides portability, making sequencing possible at remote places using laptops or mobile devices. This enables a number of new applications, such as rapid infection diagnosis and outbreak tracing (e.g., COVID-19, Ebola, Zika, swine flu [37, 48, 64, 68, 85, 142, 167, 173]). Unfortunately, LRS devices are much more error-prone in sequencing (with a typical error rate of 10–15% [19, 83, 165, 170]) compared to SRS devices (typically 0.1% [60, 61, 141]), which leads to new computational challenges [152].

For both short and long reads, multiple steps of read mapping must account for the sequencing errors, and for the differences caused by genetic mutations and variations. These errors and differences take the form of base insertions, deletions, and/or substitutions [121, 125, 154, 163, 169, 174]. As a result, read mapping must perform approximate (or fuzzy) string matching (ASM). Several algorithms exist for ASM, but state-of-the-art read mapping tools typically make use of an expensive dynamic programming based algorithm [100, 126, 154] that scales quadratically in both execution time and required storage. This ASM algorithm has been shown to be the major bottleneck in read mapping [8, 10, 55, 66, 75, 122, 162]. Unfortunately, as sequencing technologies advance, the growth in the rate that sequencing devices generate reads is far outpacing the corresponding growth in computational power [8, 32], placing greater pressure on the ASM bottleneck. Beyond read mapping, ASM is a key technique for other bioinformatics problems such as whole genome alignment (WGA) [27, 28, 41, 42, 70, 95, 102, 106, 115, 151, 160] and multiple sequence alignment (MSA) [29, 45, 69, 98, 107, 127, 128, 136, 150], where two or more whole genomes, or regions of multiple genomes (from the same or different species), are compared to determine their similarity for predicting evolutionary relationships or finding common regions (e.g., genes). Thus, there is a pressing need to develop techniques for genome sequence analysis that provide fast and efficient ASM.

In this work, we propose GenASM, an ASM acceleration framework for genome sequence analysis. Our goal is to design a fast, efficient, and flexible framework for both short and long reads, which can be used to accelerate multiple steps of the genome sequence analysis pipeline. To avoid implementing more complex hardware for the dynamic programming based algorithm [22, 33, 49, 65, 87, 88, 147, 162], we base GenASM upon the Bitap algorithm [21, 174].

effectively accelerate the read alignment step of read mapping (Section 10.2). Second, we illustrate that GenASM can be employed as the most efficient (to date) pre-alignment filter [9, 10] for short reads (Section 10.3). Third, we demonstrate how GenASM can efficiently find the edit distance (i.e., Levenshtein distance [100]) between two sequences of arbitrary lengths (Section 10.4). In addition, GenASM can be utilized in several other parts of genome sequence analysis as well as in text analysis, which we briefly discuss in Section 11.

Results Summary. We evaluate GenASM for three different use cases of ASM in genome sequence analysis using a combination of the synthesized SystemVerilog model of our hardware accelerators and detailed simulation-based performance modeling. (1) For read alignment, we compare GenASM to state-of-the-art software (Minimap2 [102] and BWA-MEM [101]) and hardware approaches (GACT in Darwin [162] and SillaX in GenAx [55]), and find that GenASM is significantly more efficient in terms of both speed and power consumption. For this use case, we compare GenASM only with the read alignment steps of the baseline tools and accelerators. For long reads, GenASM achieves 116× and 648× speedup over 12-thread runs of the alignment steps of Minimap2 and BWA-MEM, respectively, while reducing power consumption by 37× and 34×. Compared to GACT, GenASM provides 6.6× the throughput per unit area and 10.5× the
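To make the two algorithm families named in the introduction concrete, the following Python sketch contrasts a dynamic-programming edit distance, which fills a table whose size is the product of the two sequence lengths, with a bitvector-based Bitap search that tolerates up to k edits. It is only an illustration under stated assumptions: the Bitap routine follows the standard textbook (Wu–Manber style) formulation with 1-bits marking active states, not the modified algorithm or the hardware design that GenASM proposes, and the function names `edit_distance_dp` and `bitap_search` are hypothetical.

```python
def edit_distance_dp(a, b):
    """Levenshtein distance via a full (len(a)+1) x (len(b)+1) table.

    Time and storage both grow with len(a) * len(b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        table[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # deletion
                table[i][j - 1] + 1,         # insertion
                table[i - 1][j - 1] + cost,  # match or substitution
            )
    return table[-1][-1]


def bitap_search(text, pattern, k):
    """Textbook Bitap with up to k edits (illustrative, not GenASM's variant).

    Returns (end_index, edits) for every text position where the pattern
    matches a substring ending there with at most k edits; `edits` is the
    smallest such count.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    full = (1 << m) - 1
    accept = 1 << (m - 1)

    # masks[c] has bit i set when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    # state[d] bit i is set when pattern[:i+1] matches some suffix of the
    # text read so far with at most d edits.
    state = [((1 << d) - 1) & full for d in range(k + 1)]

    hits = []
    for j, ch in enumerate(text):
        mask = masks.get(ch, 0)
        prev_old = state[0]
        state[0] = ((state[0] << 1) | 1) & mask
        for d in range(1, k + 1):
            cur_old = state[d]
            match = ((cur_old << 1) | 1) & mask
            insertion = prev_old  # extra character in the text
            # Substitution uses the old state[d-1]; deletion of a pattern
            # character uses the already updated state[d-1].
            subst_or_del = ((prev_old | state[d - 1]) << 1) | 1
            state[d] = (match | insertion | subst_or_del) & full
            prev_old = cur_old
        for d in range(k + 1):
            if state[d] & accept:
                hits.append((j, d))
                break
    return hits
```

For example, `edit_distance_dp("kitten", "sitting")` evaluates to 3, and `bitap_search("ACGTACGT", "ACGT", 0)` reports exact matches ending at text indices 3 and 7. The Bitap loop performs a fixed handful of shift, AND, and OR operations per text character and per error level, which is the kind of regular, bit-parallel work that maps well onto simple hardware.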