
Optimization of SAMtools Sorting Using OpenMP Tasks

Nathan T. Weeks · Glenn R. Luecke

Published 28 May 2017. The final publication is available at Springer via https://dx.doi.org/10.1007/s10586-017-0874-8

Abstract SAMtools is a widely-used genomics application for post-processing high-throughput sequence alignment data. Such data are commonly sorted to make downstream analysis more efficient. However, this sorting process itself can be computationally- and I/O-intensive: high-throughput sequence alignment files in the de facto standard Binary Alignment/Map (BAM) format can be many gigabytes in size, and may need to be decompressed before sorting and compressed afterwards. As a result, BAM-file sorting can be a bottleneck in genomics workflows. This paper describes a case study on the performance analysis and optimization of SAMtools for sorting large BAM files. OpenMP task parallelism and memory optimization techniques resulted in a speedup of 5.9X versus the upstream SAMtools 1.3.1 for an internal (in-memory) sort of 24.6 GiB of compressed BAM data (102.6 GiB uncompressed) with 32 processor cores, while a 1.98X speedup was achieved for an external (out-of-core) sort of a 271.4 GiB BAM file.

Keywords High-Throughput Sequencing, OpenMP, Sorting, Burst Buffer

N. Weeks
Department of Computer Science, Iowa State University, Ames, IA, USA
E-mail: [email protected]

G. Luecke
Department of Mathematics, Iowa State University, Ames, IA, USA
E-mail: [email protected]

1 Introduction

The advent of high-throughput sequencing (HTS) has resulted in a rapid decline in DNA sequencing costs, outpacing the growth in transistor density from Moore's Law [23]. As a result, genome sequencing is increasing at a rapid pace. Genomics data generation is predicted to potentially dwarf Twitter, YouTube, and astrophysics data combined by the year 2025 [19]. However, performance-sensitive genomics applications have generally not even scaled with Moore's Law, mainly because many such applications are unprepared to fully utilize increasingly-parallel processors. Consequently, algorithmic improvements, specialized computing hardware, and new storage technologies are needed to cope with storing, processing, and analyzing increasing amounts of genomics data.

HTS workflows often include sequence alignment to a reference genome. SAMtools [12] is a utility for performing operations on the resulting sequence alignment data such as sorting, indexing, selecting subsets, compressing, and reporting various statistics. These sequence alignments can be represented in one of three industry-standard formats: the Sequence Alignment/Map (SAM) text format; its binary equivalent, the Binary Alignment/Map (BAM) format; or the more recent CRAM format, which utilizes reference-sequence-based compression and optionally lossy compression of quality scores.

SAMtools utilizes the codeveloped HTSlib library for reading, parsing, and compressing/decompressing SAM/BAM/CRAM data. As SAMtools and HTSlib are developed in lockstep with the SAM/BAM/CRAM specifications, many consider these to be the reference implementations among similar software that work with these formats.

Sequence alignments are typically sorted to make downstream analysis more efficient. However, the sorting operation can itself be a bottleneck in HTS workflows [8, 18]. Sorting large data sets is well-known to be a resource-intensive problem that can benefit from parallelism [7]. The SAMtools 1.3.1 sorting pipeline is already parallelized using the POSIX threads (pthreads) API. But, as the performance analysis in Section 3 illustrates, this parallelization is incomplete, and exhibits inefficiencies. Because of the widespread adoption of SAMtools—including by HTS workflows that run on large-scale public cloud resources, such as Churchill [11], and massively parallel supercomputers, such as MegaSeq [16]—performance enhancements to it would have broad impact.

This work¹ extends our previous effort to improve the performance of SAMtools for the purpose of sorting BAM files [22], primarily by 1) utilizing an alternative algorithm employing fine-grained tasking for compression during BAM encoding to allow adequate load balancing with a larger number of threads, 2) utilizing circular buffers to reduce calls to memcpy() and enhance concurrency, 3) improving the performance of external sorting by keeping sorted BAM records in memory where possible instead of writing them to secondary storage before merging, and 4) benchmarking the external sort with a much larger data set, leveraging Cori's newly-available Burst Buffer to store input, output, and intermediate BAM files.

The rest of this paper is organized as follows. Section 2 lists other software tools that aim to efficiently implement performance-critical SAMtools functionality. Section 3 characterizes the performance of SAMtools for internal sorting of a large BAM file. Section 4 describes optimizations implemented in this work to address the identified performance bottlenecks. The impact of this work on performance is measured in Section 5. Section 6 explores other possible optimizations. Section 7 summarizes this effort to address the performance limitations of SAMtools.

2 Related Work

Picard [1] from the Broad Institute is a Java analog to SAMtools, providing similar HTS-processing functionality. Picard currently supports only serial SAM/BAM sorting.

Similar software exists with the primary goal of outperforming SAMtools. One such software, Sambamba [20], aims to be a high-performance replacement for a subset of performance-critical SAMtools functionality. Written from the ground up in the D programming language to implement parallelism, Sambamba strives to fully exploit multi-core CPUs.

elPrep [9] is a Common Lisp/Python application for multi-threaded, in-memory execution of the subset of SAMtools functionality that focuses on preparing sequence alignment data for variant calling. While elPrep's in-memory processing benefits performance, it requires a lot of memory for sorting: the elPrep 2.5 documentation states "As a rule of thumb, elPrep requires 6x times more RAM memory than the size of the input file in .sam format when it is used for sorting". For comparison, SAMtools can accomplish an internal sort of the 24.6 GiB BAM (∼85.5 GiB SAM) data set described in Sect. 3 on a 128 GiB-memory compute node. elPrep can sort such BAM files within an accessible memory limit at the cost of extra disk space and I/O overhead: the protocol involves splitting the input into multiple BAM files (each containing alignments with respect to a different reference sequence/chromosome), sorting each BAM file independently, and merging the resulting sorted BAM files to produce a single sorted BAM file. Since elPrep uses SAMtools as a library for decoding/encoding BAM files, the optimizations described in this paper could benefit elPrep as well.

Similar to elPrep, biobambam2 [21] focuses on alignment preprocessing in preparation for variant calling. Coded in C++, biobambam2 can perform a multi-threaded SAM/BAM/CRAM sort by coordinate under the condition that the input is already sorted by query (read) name, or sort by query name when the input is in any order.

Some sequence aligners, such as Isaac [17], sort alignments in-memory before output. This strategy avoids the I/O overhead associated with emitting alignments from a sequence alignment process and reading them into a separate sort process.

DNANexus has proposed a fork of SAMtools that leverages RocksDB for improved sorting/merging performance [13], as well as a patch to HTSlib that improves concurrency in BGZF encoding².

Both Intel and CloudFlare have created optimized versions of the zlib compression library. These have been shown to benefit SAMtools compression performance³.

The HTSlib library utilized by SAMtools for SAM/BAM/CRAM I/O supports multi-threaded encoding/decoding of CRAM data using a custom pthreads-based work-stealing thread pool (adapted from the implementation in the Scramble [4] library).

¹ Source code for the SAMtools optimizations is available at https://doi.org/10.5281/zenodo.262169, and the HTSlib optimizations at https://doi.org/10.5281/zenodo.262161
² https://github.com/samtools/htslib/pull/51
³ http://www.htslib.org/benchmarks/zlib.html

As of this writing, support for multi-threaded BAM encoding/decoding using this thread pool implementation has been committed to the development branch of HTSlib⁴. The alternative approach presented in this paper uses OpenMP 4.0 constructs to implement concurrency. The OpenMP API is supported by almost all major C compilers today [15], and is therefore portable and understood by a comparatively-wide swath of developers. OpenMP runtimes are widely deployed, and therefore relatively well tuned and tested. The authors thus believe that OpenMP can facilitate a more robust, performant, and maintainable solution.

3 Performance Analysis

Performance analysis was conducted on one Haswell compute node of the NERSC Cori supercomputer [6], a Cray XC40. Each such compute node contains two 16-core / 32-thread 2.3 GHz Intel "Haswell" Xeon processors and 128 GiB of 2133 MHz DDR4 memory. There is no direct-attached storage on a compute node; rather, a Lustre parallel file system provides a maximum aggregate bandwidth of >700 GB/second to the entire Cori system. To analyze the performance of SAMtools sort, unsorted BAM input was read from, and sorted BAM output was written to, this Lustre file system, mirroring typical real-world usage.

SAMtools 1.3.1 and HTSlib 1.3.2 were compiled with the gcc 6.2.0 compiler. The default compiler options specified in the HTSlib and SAMtools makefiles were used, with the exception that gcc was invoked with the Cray compiler driver, overridden to use dynamic linking (cc -dynamic). Huge pages (2 MiB, the smallest supported in the Cray Linux Environment with an Aries interconnect) were utilized by loading the craype-hugepages2M environment module before compiling/linking and running; this resulted in a significant performance improvement (frequently a 10-15% decrease in run time) vs. the default 4 KiB page size for the internal sort. The resulting executable was run using the NERSC-recommended SLURM options for a single-node multi-threaded executable (srun -n 1 -c 64 --cpu_bind=cores).

The BAM file used to analyze the performance of the internal sort, HG00109.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam from the 1000 Genomes Project [5], contained approximately 207 million alignments of 100 bp paired-end Illumina reads. The BAM file was already sorted by position; to "shuffle" it, it was re-sorted by query name. Interestingly, the compression ratio of the resulting BAM file worsened considerably: originally 19.9 GiB (102.6 GiB uncompressed), the file size increased to 24.6 GiB. This could be attributed at least partially to the reduced likelihood of overlapping sequences appearing within the same 64 KiB BGZF block, providing less repetition for the compression algorithm to leverage.

The SAMtools 1.3.1 sort was run with 32 threads, specifying 3648 MiB per thread (114 GiB total) via the -m option-argument to allow the entire uncompressed BAM data set to be sorted in memory. The internal SAMtools sort was run using HPCToolkit [2] (commit 80bc010 from Nov. 8, 2016) to collect call path trace data sampled at 5 millisecond intervals (Fig. 1). The results, which were visualized using the associated hpctraceviewer GUI, revealed that program execution time (∼817 seconds) is divided into four phases, in which 1) compressed BAM records are read and decompressed (or "decoded") (∼41% of run time), 2) the BAM records are sorted by a single thread (∼24% of run time), 3) the sorted BAM records are compressed (or "encoded") and output (∼21.5% of run time), and (perhaps most surprisingly) 4) data structures that stored the BAM records are deallocated (∼13.5% of run time). For the internal sort, only the encoding phase is performed by multiple threads.

Armed with insight gleaned from the preceding performance analysis, a plan was formulated to address the performance bottlenecks posed by each distinct phase. The resulting optimizations, described in Section 4, were applied to a fork of SAMtools 1.3.1 henceforth called SAMtools OpenMP. SAMtools OpenMP yielded greatly different performance and CPU utilization characteristics, as visualized in Fig. 2.

4 Optimizations

4.1 Decoding

As noted in Section 3, a significant portion (41%) of the ∼13.6-minute internal sort run time was spent reading and decoding the 24.6 GiB compressed BAM file. A lower bound on the run time of this phase is related to the rate at which the BAM file can be sequentially read. To test single-threaded read performance from the Lustre file system, the dd command was used to read the benchmark BAM file. Using a 32 KiB block size (corresponding to the read transfer size that HTSlib 1.3.2 is capped to), a throughput of approximately 1 GiB/sec was achieved, indicating that the file system itself was not a significant bottleneck.

The HTSlib 1.3.2 routines for reading compressed BAM are sequential.

⁴ https://github.com/samtools/htslib/pull/397

Fig. 1 HPCToolkit performance summary of SAMtools 1.3.1 for an internal sort of a 24.6 GiB BAM file (102.6 GiB uncompressed) using 32 threads. The different colors represent the procedure that a thread is executing: green – read/decode, red – sort, purple – encode (primarily zlib deflate()), gray – waiting, white – no thread exists. Much of the execution is serial, and there is significant thread waiting time in the encode/output phase.

Fig. 2 The performance optimizations described in Section 4 are reflected in the HPCToolkit performance summary of SAMtools OpenMP for the internal sort of a 24.6 GiB BAM file with 32 threads. The different colors represent the procedure that a thread is executing: green – read/decode (primarily zlib inflate()), red – sort, purple – encode/output (primarily zlib deflate()), gray – waiting. There is still significant thread waiting time in the input/decode/sort phase.

While reading an input stream is an inherently-sequential operation, compressed BAM consists of BGZF blocks. Each BGZF block (essentially an up-to-64-KiB gzip file with a user-defined field in the gzip header containing the uncompressed length) can be decompressed independently (via the zlib inflate() function).

Using a prior effort to parallelize the BAM decoding phase with pthreads⁵ as a guide for how to safely add concurrency to the relevant routines, HTSlib was modified so that the master thread reads a group of (default 256) BGZF blocks at a time, generating an inflate() task for each. While a coarser level of granularity (i.e., coalescing multiple adjacent blocks into a single payload for each task) would reduce synchronization overhead, the finer level of granularity exposes additional concurrency to enhance load balancing. To help gauge the potential performance impact of fine-grained tasking, a microbenchmark was constructed (Online Resource 1). The microbenchmark revealed that when using 64 threads, it took on average less than 1.5 seconds to create, schedule, and execute 505,455 tasks (corresponding to the number of BGZF blocks in the HG00109 BAM file used to benchmark the internal sort). This result suggested that the finer level of granularity could benefit load balancing without introducing excessive overhead that would unduly impact performance. Looking forward, it will become even more important to express a high level of concurrency to effectively utilize current and future many-core architectures, such as the Intel Xeon Phi.

⁵ https://github.com/smowton/htslib/compare/parallel_read
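The following sketch illustrates this task-per-block decoding pattern in C with OpenMP. It is a simplified illustration, not the actual HTSlib code: the helper functions read_bgzf_block(), inflate_block(), and process_records(), as well as the block_t structure, are hypothetical stand-ins for the corresponding HTSlib routines, and error handling is omitted.

#include <omp.h>
#include <stddef.h>

#define GROUP_SIZE 256  /* BGZF blocks read per batch, as described above */

typedef struct {
    unsigned char comp[65536];    /* compressed BGZF block (<= 64 KiB) */
    unsigned char uncomp[65536];  /* inflated BAM data (<= 64 KiB) */
    size_t comp_len, uncomp_len;
} block_t;

/* hypothetical helpers standing in for HTSlib internals */
int  read_bgzf_block(block_t *b);               /* returns 0 at EOF */
void inflate_block(block_t *b);                 /* wraps zlib inflate() */
void process_records(block_t *blocks, int n);   /* consume decoded data */

void decode_bam(void)
{
    static block_t blocks[GROUP_SIZE];
    #pragma omp parallel
    #pragma omp master
    {
        int eof = 0;
        while (!eof) {
            /* The master thread reads a group of blocks sequentially... */
            int n = 0;
            while (n < GROUP_SIZE && !eof) {
                if (read_bgzf_block(&blocks[n]))
                    n++;
                else
                    eof = 1;
            }
            /* ...and generates one inflate() task per block; idle
             * threads in the parallel region execute the tasks. */
            for (int i = 0; i < n; i++) {
                block_t *b = &blocks[i];
                #pragma omp task firstprivate(b)
                inflate_block(b);
            }
            #pragma omp taskwait  /* all blocks in the group decoded */
            process_records(blocks, n);
        }
    }
}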

4.2 Memory Allocation and Deallocation

Over 10% of the internal sort run time was spent allocating buffers for each of the ∼207 million BAM records. Two allocations were performed for each BAM record: one for a fixed-length (56-byte) data structure (bam1_t), and one for the variable-length member of this data structure (data). The corresponding ∼414 million calls to free() at the end of execution to deallocate these buffers consumed approximately 13.5% of the run time.

The memory allocation and deallocation overhead was effectively eliminated by allocating a single, contiguous buffer of approximately the maximum memory size requested by the user to store the BAM records. The fixed- and variable-length components of a BAM record are placed adjacently within this buffer, so that each BAM record is contiguous in memory. To facilitate this contiguous storage of a BAM record, a flexible array member (fam[]) was added to bam1_t. For backwards compatibility, the data member points to fam[]. Reworking other parts of the code to remove dependence on the data member, allowing for its removal, could save 8 bytes per bam1_t BAM record (unlike a pointer, a flexible array member consumes no storage itself).

BAM records begin at the first 8-byte-aligned offset after the end of the previous BAM record to ensure proper memory alignment of all structure members. Each element of the ancillary buf[] array points to the start of a BAM record within the contiguous buffer (Fig. 3).

Fig. 3 (Left) SAMtools 1.3.1 stores each BAM record in a dynamically-allocated bam1_t struct, each containing a separately-allocated data member for variable-length data. The array of pointers (buf[]) to these bam1_t structs is sorted during the sort phase. (Right) A flexible array member (fam[]) facilitates contiguous storage of the fixed- and variable-length data in a BAM record (bam1_t). Each element of buf[] points to an 8-byte-aligned bam1_t record within a single contiguous memory region.

By default, SAMtools OpenMP first stages an input BAM record into a separate buffer using an HTSlib routine that, if necessary, accommodates a larger variable-length data member via realloc(). If the BAM record (56-byte fixed-length bam1_t + variable-length data) will fit in the remaining unused part of the contiguous buffer, it is copied into the beginning of that unused part. If the BAM record will not fit, the buf[] pointers to the BAM records are sorted, the sorted BAM records are output, and the staged BAM record is copied to the beginning of the contiguous buffer.

SAMtools OpenMP supports an environment variable (SAMTOOLS_BAM_MAX) that, when set, assumes BAM records cannot exceed the specified length in bytes, and reads the input data directly into the contiguous buffer if there are at least that many bytes remaining. This avoids the memcpy() that is otherwise required to copy the BAM record from the "staging" buffer.

An HTSlib API change that could obviate the use of the SAMTOOLS_BAM_MAX environment variable would be to add a routine that reads the initial fixed-length part of a BAM record (bam1_t), and a routine that reads the variable-length data, given its length (as recorded in the bam1_t l_data member) as a parameter. This would allow determining whether there is enough space left in the contiguous buffer for the variable-length data; if not, the BAM records in the buffer could be sorted and output (and the already-read bam1_t copied from its current location near the end of the contiguous buffer to the beginning) before reading the variable-length data.

Finally, the allocation of the buf[] array was modified to ensure that SAMtools OpenMP sort tasks (described in Section 4.3), which execute concurrently with subsequent BAM-record input by the master thread, may safely reference segments of the buf[] array. SAMtools 1.3.1 initially sizes the buf[] array to store 65,536 (2^16) bam1_t pointers. If the number of BAM records read exceeds this limit, SAMtools 1.3.1 will realloc() buf[] to the next larger power of 2. However, if there is not enough contiguous space in virtual memory to accommodate the larger size, realloc() will cause the array to be moved to another location in virtual memory, invalidating any pointers into the array. SAMtools OpenMP fixes the size of the buf[] array so that any pointers into it will not become invalidated during its lifetime. A trade-off is that the run-time inflexibility of the buf[] array size imposes a second constraint on the number of BAM records that may be sorted in memory, in addition to the size of the contiguous buffer that stores the BAM records. The buf[] array is sized to accommodate one pointer-to-BAM-record for every 300 bytes (loosely corresponding to the size of a BAM record for a 100 bp read) allocated to the contiguous buffer for BAM records.
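A minimal sketch of this layout follows. The struct below is a simplified stand-in for the real bam1_t (which has additional fixed-length members), and place_record() is a hypothetical bump allocator illustrating the 8-byte-aligned placement described above.

#include <stdint.h>
#include <string.h>
#include <stddef.h>

typedef struct {
    int32_t l_data;   /* length of the variable-length data */
    /* ... other fixed-length members (56 bytes total in the real bam1_t) ... */
    uint8_t *data;    /* retained for backwards compatibility; points to fam */
    uint8_t fam[];    /* flexible array member: variable-length data stored
                         contiguously after the fixed-length part */
} bam1_fam_t;

/* Bump-allocate the next record at the first 8-byte-aligned offset in the
 * contiguous buffer; returns NULL when the buffer is full (i.e., time to
 * sort and output the records accumulated so far). */
static bam1_fam_t *place_record(uint8_t *buffer, size_t *off, size_t capacity,
                                const uint8_t *var_data, int32_t l_data)
{
    size_t need;
    bam1_fam_t *b;

    *off = (*off + 7) & ~(size_t)7;   /* round up to an 8-byte boundary */
    need = sizeof(bam1_fam_t) + (size_t)l_data;
    if (*off + need > capacity)
        return NULL;
    b = (bam1_fam_t *)(buffer + *off);
    b->l_data = l_data;
    b->data = b->fam;                 /* old code using data still works */
    memcpy(b->fam, var_data, (size_t)l_data);
    *off += need;
    return b;                         /* caller appends b to buf[] */
}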
input/decode and sort phases to overlap, and thus tak- SAMtools OpenMP supports an environment vari- ing advantage of spare computing capacity that may be able (SAMTOOLS BAM MAX) that, when set, assumes BAM available (see Fig. 2). The resulting sorted BAM sub- records cannot exceed the specified length in bytes, and lists are merged in-memory into a single output stream. reads the input data directly into the contiguous buffer (External) If the input BAM records exceed the if there are at least that many bytes remaining. This user-specified memory limit (SAMtools sort -m option- avoids the memcpy() that is otherwise required copy argument), then the resulting SAMtools 1.3.1 sort is the BAM record from the ”staging” buffer. parallelized. The BAM records in memory are parti- An HTSlib API change that could obviate the use of tioned into N approximately-equal sized sublists, where the SAMTOOLS BAM MAX environment variable would be N is the number of threads. Each thread sorts its sub- to add a routine that reads the initial fixed-length part list, then writes it to a temporary file. This process re- 6 Nathan T. Weeks, Glenn R. Luecke peats until all input has been processed in this manner. Listing 1 C pseudocode routine to concurrently compress Finally, the initial thread merges the sorted temporary ≤ 64 KiB blocks of BAM records and output the resulting BAM files (using a min-heap data structure) to produce compressed BGZF blocks in input order the final sorted BAM output stream. 1 void encode(U,C,U len , C len,ul,ticket){ A drawback of the aforementioned approach is that 2 i = t i c k e t % NBLOCKS for each ”chunk” of input, another N files are created— 3 i n e x t = ( i + 1) % NBLOCKS so if M input chunks are processed, N ×M files must be 4 do { 5 #pragma omp atomic read merged in the end. In general, fewer, larger I/O streams 6 u l e n p r i v = U len [ i n e x t ] ; to a single process are more efficient than many, smaller 7 i f ( u l e n p r i v == 0) break ; I/O streams. To account for this, two optimizations 8 #pragma omp taskyield were implemented in SAMtools OpenMP for the ex- 9 } while ( t r u e ) ; 10 U len [ i ] = u l ; // slot in use ternal sort: 11 #pragma omp task firstprivate(ticket) { 20 12 i = t i c k e t % NBLOCKS; 1. Instead of writing each sorted sublist (2 BAM 13 o f f = i ∗ BLOCK SIZE records for SAMtools OpenMP) to a separate file, 14 compress(&C[off],&cl,&U[off],U len [ i ] ) ; the sublists are merged in-memory to output a single 15 #pragma omp atomic write seq c s t sorted temporary BAM file per chunk (as opposed 16 C len [ i ] = c l ; // compressed data ready 17 #pragma omp atomic read to N files). This reduces the number of temporary 18 t i c k e t private = ticket s h a r e d ; files by a factor of N. 19 i f ( t i c k e t private == ticket) 2. The last chunk is not output before the final merge; 20 #pragma omp c r i t i c a l rather, the merge is done with the data in memory. 21 { 22 #pragma omp atomic read This allows the final N −1-way file (plus in-memory 23 t i c k e t private = ticket s h a r e d ; sublists) merge to begin sooner, without having to 24 i f ( t i c k e t private == ticket) wait for the sorted sublists of the last input chunk 25 do { to be written to secondary storage. 
4.4 Encoding

SAMtools 1.3.1 supports multi-threaded compression of BAM records. The initial thread memcpy()s up to 256 blocks of ≤ 64 KiB uncompressed BAM records into a buffer, while other threads are blocked on a condition variable. Once the buffer has been filled, the initial thread then issues a pthread_cond_broadcast() to unblock the other threads. Blocks are assigned to workers in a cyclic fashion. Each thread compresses its block into a temporary buffer, copies the compressed block back to the original buffer, and proceeds with its next assigned block. After the initial thread finishes compressing its assigned blocks, it spin-waits until all other threads have finished, then outputs the blocks in order, copies the next up-to-256 uncompressed blocks into the buffer, and the process repeats.

The aforementioned method, which essentially relies on a barrier and serial execution (output of the BGZF blocks by the master thread) in between rounds of parallel work (compression), fails to keep threads busy for much of the encode phase (as illustrated in Fig. 1). SAMtools OpenMP uses an algorithm that expresses more concurrency, providing the OpenMP runtime an opportunity to load balance and decrease the likelihood that a thread will be waiting for work.

The general idea of the approach is that separate circular buffers are used for input (to the compression tasks) and output. The master thread generates tasks that each compress a block of BAM reads from one circular buffer (U), writing the result directly into the corresponding slot of the other circular buffer (C)—this differs from SAMtools 1.3.1, which compresses the data into a per-thread buffer and then memcpy()s it back to the original buffer, overwriting the uncompressed data.

Output is serialized and ordered. A task that has BGZF-compressed its block, but whose turn it is not to output, can leave the block for the task that outputs the immediately-preceding block in the circular buffer. The task whose turn it is to output its BGZF block can also output subsequent consecutive BGZF blocks that are ready.

Listing 1 C pseudocode routine to concurrently compress ≤ 64 KiB blocks of BAM records and output the resulting compressed BGZF blocks in input order

 1  void encode(U, C, U_len, C_len, ul, ticket) {
 2    i = ticket % NBLOCKS;
 3    i_next = (i + 1) % NBLOCKS;
 4    do {
 5      #pragma omp atomic read
 6      ulen_priv = U_len[i_next];
 7      if (ulen_priv == 0) break;
 8      #pragma omp taskyield
 9    } while (true);
10    U_len[i] = ul;                      // slot in use
11    #pragma omp task firstprivate(ticket) {
12      i = ticket % NBLOCKS;
13      off = i * BLOCK_SIZE;
14      compress(&C[off], &cl, &U[off], U_len[i]);
15      #pragma omp atomic write seq_cst
16      C_len[i] = cl;                    // compressed data ready
17      #pragma omp atomic read
18      ticket_private = ticket_shared;
19      if (ticket_private == ticket)
20      #pragma omp critical
21      {
22        #pragma omp atomic read
23        ticket_private = ticket_shared;
24        if (ticket_private == ticket)
25        do {
26          output(&C[off], cl);
27          C_len[i] = 0;
28          #pragma omp atomic write
29          U_len[i] = 0;                 // slot unused
30          i = (i + 1) % NBLOCKS; off = i * BLOCK_SIZE;
31          #pragma omp atomic update
32          ticket_shared++;
33          #pragma omp atomic read seq_cst
34          cl = C_len[i];                // data ready in next slot?
35        } while (cl > 0);
36      }  // critical
37    }  // task
38  }  // encode()

The resulting routine, which is syntactically amenable to an OpenMP implementation (Listing 1), is described as follows. To avoid a data race, the master thread verifies (lines 5-7) that the next block slot in U is free (i.e., the length of the uncompressed block, stored in U_len[i_next], is 0) so that upon return from the routine, it may be safely written to. If the next slot is not open, the master thread invokes a taskyield so that it may process another pending task⁶, then checks again in a loop until the slot is available⁷. After verifying that the next slot is free, the master thread marks the current slot in use by assigning it the length of the uncompressed block (ul) in bytes (line 10).

Next, the master thread generates a task to compress and (possibly) output the block of BAM records in the current slot. The data that is copied into the task must at least comprise the ordinal-numbered ticket used for output ordering, from which the slot can be reconstituted if not part of the firstprivate "payload". An offset into C and U is calculated (line 13) based on the block slot number (i) and maximum block size (BLOCK_SIZE). The indexed block in U is compressed into C, saving the resulting compressed block length in cl (line 14). The task then signals that the compressed block is ready for output by assigning its length to C_len[i] (lines 15-16). This must be done not only atomically, but with sequential consistency (OpenMP seq_cst clause) so that if it is not this task's turn to output (per the ticket lock check in lines 17-19), changes made to other data (specifically, C) are reflected in memory for the task that will output this task's compressed block.

A variation of the ticket lock [14] is used to enforce output ordering. In lines 17-19, the task checks the ticket that is currently being served (ticket_shared) to see if it matches the task's ticket. If not, then the task ends, leaving the thread to process another task, and the compressed block beginning at &C[off] for another task to output. This pre-critical-region ticket check is an important optimization that prevents unnecessary attempts to enter the critical region, during which time the executing task would be blocked and not doing useful work. If it is this task's (say, tj's) turn per the ticket value, then tj enters the critical region and checks the ticket value again (lines 22-24). This second check is used because the task that last incremented the shared ticket value (say, tk) at lines 31-32 may have already output tj's block while it was in the critical region.

If it is the current task's turn to output (i.e., line 24 evaluates to true), then the task outputs its compressed BGZF block (line 26) and sets the C_len[i] value to 0 (line 27) to indicate that there is now no compressed data in the corresponding slot of C to output. Next, the subsequent slot of the C_len circular buffer is checked to see if there is compressed data in the corresponding slot of C (lines 33-35). The atomic read of C_len must be sequentially consistent to ensure the executing thread's view of the corresponding data in C is consistent with the data flushed to memory by the sequentially-consistent write to C_len at lines 15-16. The task and critical directives, which "flush" the executing thread's memory on entrance and exit, provide sufficient memory consistency for other data accesses.

To verify correctness, we must show that every block is output once in the correct order. Intuitively, the described ticket lock mechanism enforces output ordering, and the fact that the check is executed inside the critical region ensures no block is output more than once. To informally and inductively show that every block is output, note that the task (say, t0) with the first ticket value (ticket == 0) will enter the critical region and output its block. If t0 atomically executes line 34 (cl = C_len[i]) before another task (say, t1) atomically executes line 16 (C_len[i] = cl), then t0 will not output t1's block—however, as t0 previously incremented the shared ticket value (line 32), and t1 subsequently checks the shared ticket value (lines 18-19), t1 will output its own block. Otherwise, if t0 executes line 34 after t1 executes line 16, then t0 will output t1's block, and so on. The fact that the block is marked ready for output (line 16) by a task outside the critical region before it checks the ticket value at line 17 (and possibly does not enter the critical region), combined with the opposite ordering inside the critical region (atomically increment the ticket at line 32, then check if the subsequent block in the circular buffer is ready for output), ensures that all blocks will eventually be output.

⁶ Note that taskyield is a no-op as of gcc 6.2.0.
⁷ This check could occur after task generation and before returning from the routine; however, the implementation did not consistently perform as well in practice, possibly due to an undetermined effect on task scheduling.
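The seq_cst publish/consume pattern at the heart of Listing 1 (the write at lines 15-16 and the read at lines 33-34) can be demonstrated with the following self-contained sketch; the two-thread setup and variable names are illustrative only.

/* Minimal sketch of the seq_cst handoff used in Listing 1: the
 * sequentially-consistent atomic write to the length "publishes" the
 * payload; the matching seq_cst atomic read then observes both the
 * length and the payload written before it. */
#include <omp.h>
#include <stdio.h>

static unsigned char C[64];  /* shared payload slot   */
static int C_len = 0;        /* 0 = empty, >0 = ready */

int main(void)
{
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {      /* producer */
            C[0] = 42;                        /* fill the payload... */
            #pragma omp atomic write seq_cst  /* ...then publish it */
            C_len = 1;
        } else {                              /* consumer */
            int len = 0;
            while (len == 0) {                /* spin until published */
                #pragma omp atomic read seq_cst
                len = C_len;
            }
            printf("got %d byte(s), C[0]=%d\n", len, C[0]);
        }
    }
    return 0;
}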
An unexpected overhead from using fine-grained tasking was uncovered during subsequent performance analysis. By default, the GNU OpenMP runtime library (libgomp) causes a waiting thread (e.g., waiting for the next task to execute, or at the entrance to a critical region) to initially wait actively by spin-waiting for 300,000 spins before passively waiting (e.g., with the futex_wait() system call on Linux). Actively waiting reduces the latency for the thread to continue making progress once it is able to, while backing off to a passive waiting state reduces contention for CPU resources.

During the external merge, the master thread reads the temporary sorted BAM files, calling the zlib inflate() routine to decompress compressed BGZF blocks, and placing the uncompressed BAM records in a min-heap data structure for merging. Other threads execute computationally-intensive tasks that compress and output the resulting merged BAM. The fine-grained approach used by SAMtools OpenMP of compressing one BGZF block per task results in "smaller" tasks to facilitate better load balancing; however, a side effect is that threads are more likely to be actively waiting (i.e., in a spin-wait) when they are not actively executing tasks. This extra contention caused the unanticipated side effect of making the zlib inflate() routine, which during merging is executed by only the master thread (concurrently with any encoding tasks), take significantly longer (>33%). The slowdown of this serial task caused a corresponding increase in total run time.

Setting the OMP_WAIT_POLICY environment variable to passive (which causes libgomp to set the spin count to 0, resulting in a thread skipping active waiting and instead waiting only passively when unable to immediately execute a task) alleviated the contention, and restored the performance of inflate() in the master thread. More granular control over the spin count could be achieved with the libgomp-specific GOMP_SPINCOUNT environment variable. An alternative approach could be to leverage multi-threaded decoding during the external-sort merge phase (as is used during the initial decoding phase).

As an aside, the encoding/decoding optimizations benefited the standalone HTSlib bgzip utility, though the performance improvements are not quantified in this paper.

5 Benchmarks

The effects of the optimizations to SAMtools OpenMP were characterized using both an internal sort of the BAM file described in Section 3, and an external sort of a much larger BAM file (HG01565.wgs.ILLUMINA.bwa.PEL.high_cov_pcr_free.20140203.bam from the 1000 Genomes Project), representing alignments of almost 570 million 250 bp Illumina reads. As with the smaller BAM file used to benchmark the internal sort, the file size of the larger BAM used to benchmark the external sort (180.2 GiB, sorted by position) increased dramatically (to 271.4 GiB) when "shuffled" via sorting by query name. Multi-threaded sorting was performed with 1, 2, 4, 8, 16, 32, and 63 or 64 threads. It was assumed that consistent file sizes of the resulting BAM file indicated correct output.

With ≤ 32 threads, threads were pinned to processor cores to maximize cache reuse. For SAMtools 1.3.1 (and SAMtools OpenMP with only a single thread, resulting in an inactive OpenMP parallel region), this was accomplished using the SLURM environment variable SLURM_CPU_BIND=cores. For SAMtools OpenMP with between 2 and 32 threads, the NERSC-recommended OpenMP environment variables OMP_PLACES=threads OMP_PROC_BIND=spread were initially used in an attempt to bind OpenMP threads to hardware threads distributed evenly between both sockets, to ensure the memory bandwidth and last-level cache of both sockets were utilized. However, it was discovered that, due to an apparent bug in the GNU OpenMP runtime (verified with NERSC support staff), this achieved the opposite effect: OpenMP threads were packed in consecutive hardware threads on the same core/socket as if OMP_PROC_BIND=close were specified. Substituting OMP_PLACES=cores for OMP_PLACES=threads was observed to result in the intended spread thread affinity to processor cores.

To test the effect of Hyper-Threading (utilizing both hardware threads on each processor core), SAMtools 1.3.1 was run with 64 threads, setting the environment variable SLURM_CPU_BIND=threads to pin application threads to hardware threads. With SAMtools OpenMP, a performance regression (∼15% slowdown) was observed when increasing the number of threads from 32 to 64 and controlling OpenMP thread affinity using OMP_PLACES=threads OMP_PROC_BIND=spread. Performance profiling revealed that much of the performance regression was due to an increase in the amount of time the master thread spent performing the newly-implemented in-memory merge of sorted BAM records during output (described in Section 4.3). As this merge was performed solely by the master thread, it was hypothesized that the cause of the reduced performance introduced by using 2 OpenMP threads per processor core was contention from another OpenMP thread executing on the same processor core as the master thread. This hypothesis was tested by using 63 threads and setting OMP_PLACES={0,32},{1}:31,{33}:31, causing the master thread to be pinned to the first processor core (logical processors 0 and 32 on Cori), and the remaining OpenMP threads pinned to the hardware threads of the other processor cores. The ∼15% slowdown turned into a ∼5% speedup, providing evidence to support this hypothesis.

For SAMtools OpenMP, the OMP_WAIT_POLICY environment variable was set to passive to reduce contention with the master thread during decompression (described in Section 4.4), while the SAMTOOLS_BAM_MAX environment variable (described in Section 4.2) was set to 65536—a safe maximum BAM record size in bytes for Illumina read alignments—to eliminate a memcpy() for each input BGZF block.

5.1 Internal Sort

Ten trials were performed with SAMtools OpenMP and SAMtools 1.3.1 at each thread count. The per-thread memory (-m) option-argument was specified so as to keep the total memory allocated for BAM records at ∼114 GiB for SAMtools 1.3.1 (the approximate minimum that allowed SAMtools 1.3.1 to perform the sort entirely in memory), and 90 GiB for SAMtools OpenMP (which uses the -m option-argument to size only the contiguous-memory buffer for BAM records, rather than as an approximate total limit that also attempts to account for the ancillary buf[] array).

5.1.1 Burst Buffer with Internal Sort

In addition to the Lustre parallel file system, another available storage tier on Cori is the Burst Buffer [3], a Cray DataWarp I/O accelerator. The Burst Buffer provides high-bandwidth, low-latency I/O via SSDs deployed in special Burst Buffer nodes on the Cray Aries network. To minimize the possibility of I/O performance degradation due to contention from other jobs utilizing the shared Lustre parallel file system, the input BAM file was staged to the Burst Buffer before execution, and the output BAM file was written to the Burst Buffer. The job script directive DW jobdw capacity=80GiB access_mode=private type=scratch reserved ample space on a single Burst Buffer SSD.

The I/O transfer size for reads and writes should be at least 512 KiB to get good performance with the Burst Buffer due to the lack of client-side file system caching (as of this writing, this issue is slated to be addressed in a future release of the Cray DataWarp software) [3]. HTSlib uses the stat() system call to query the file system for the preferred I/O block size, and as a result performs sufficiently-large 8 MiB writes to the Burst Buffer. However, HTSlib caps reads to 32 KiB, causing a severe performance hit. As a workaround, the Cray IOBUF library was used to prefetch the input BAM file and deliver 32 KiB reads to HTSlib from in-memory IOBUF buffers. Using an empirically-chosen two 32M buffers (by setting the IOBUF_PARAMS environment variable to count=2:size=32M) for the input BAM file resulted in greatly improved read performance, increasing the average read() rate from almost 70 MB/s to almost 2.4 GB/s (as observed by the IOBUF asynchronous reads from the Burst Buffer, with an effective throughput of almost 5 GB/s delivered to the read() system calls in HTSlib).

5.1.2 Internal Sort Results

The resulting run times are presented in Fig. 4. The timings were fairly consistent between trials, with an average relative standard deviation (RSD) of 0.74% for SAMtools 1.3.1, and 0.85% for SAMtools OpenMP.

The single-threaded performance of SAMtools OpenMP was only slightly better than SAMtools 1.3.1, with a 2.5% decrease in average run time. This was a smaller performance improvement than anticipated—the memory allocation/deallocation optimizations alone (Section 4.2) were expected to contribute to a greater reduction in run time. Additionally, an unexpected apparent superlinear speedup of SAMtools OpenMP was observed at 2 threads. The underperformance of single-threaded SAMtools OpenMP could possibly be attributed to the order in which the applications were run: a total of 10 job scripts were submitted, each running all thread counts once for both SAMtools OpenMP and SAMtools 1.3.1. The first application run in this loop was SAMtools OpenMP with 1 thread. Further tests suggested that the first SAMtools sort execution in a job script experienced slightly worse performance than subsequent executions. This could not be attributed to increased read transfer time, as the reported IOBUF performance metrics indicated the aggregate time of all read() system calls to the input BAM file ranged from 5.7 to 7.2 seconds among single-threaded SAMtools OpenMP executions. Additional analysis is needed to pinpoint the cause.

Activating the multi-threading code paths in each version at two threads, SAMtools OpenMP was noticeably faster than SAMtools 1.3.1 (1.36X speedup), with the performance gap widening as threads were added. The best average run time of SAMtools OpenMP (with 63 threads) outperformed the best average run time of SAMtools 1.3.1 (with 64 threads) by a factor of 5.9X.
Peak memory usage for SAMtools 1.3.1 (expressed in terms of the virtual-memory upper bound measured by the SLURM MaxVMSize job accounting field, as the physical-memory MaxRSS was not reliably recorded) ranged from approximately 117.46 GiB with 1 thread to 121.5 GiB with 64 threads. For SAMtools OpenMP, MaxVMSize ranged from approximately 92.67 GiB (observed with 2 threads) to 96.9 GiB (63 threads). This suggests that in addition to less time overhead, the SAMtools OpenMP approach of allocating a single contiguous-memory buffer to store all BAM records also has less space overhead than the SAMtools 1.3.1 approach of two memory allocations per BAM record, allowing larger BAM files to be sorted within a given physical-memory limit.

5.2 External Sort

Due to the increased wall time required to perform the external sort, only three trials were performed at each thread count. The trials could not be run consecutively on the same node due to the queue wall-time limit, so they were submitted as individual jobs, and thus scheduled to run on any available node.

5.2.1 Burst Buffer with External Sort

As with the internal sort, the Burst Buffer was used to stage the input BAM file and receive the output BAM file. The external sort required additional storage for intermediate sorted BAM files. This additional storage was provisioned and striped across three Burst Buffer SSDs using the job script directive DW jobdw capacity=639GB access_mode=striped type=scratch (note that the granularity of the Burst Buffer stripes is 213 GB).

Utilizing the IOBUF library with SAMtools 1.3.1 proved challenging. As previously mentioned, the IOBUF library is necessary for SAMtools to get good read performance from the Burst Buffer, but is not necessary for good write performance. Moreover, the IOBUF library is not thread safe, and concurrent writes by different threads to even different temporary sorted BAM files caused frequent runtime errors. However, IOBUF buffers both reads and writes to files matching user-specified filename patterns, and cannot be configured to buffer only reads, but not writes (or vice versa), for a particular file. Therefore, a workaround was implemented via a small source-code patch (Online Resource 2) to SAMtools 1.3.1 that caused a .write extension to be added to the temporary sorted BAM filenames when written. After the temporary BAM files were closed, and before they were opened again for reading during the final merge, they were renamed without the .write suffix. This allowed the filename patterns specified in the IOBUF_PARAMS environment variable to match the input and temporary files only when read, excluding the output and temporary files when written.

As described in Section 4.3, the SAMtools 1.3.1 external sort produces an increasing number of temporary sorted BAM files as the number of threads increases (e.g., with 64 threads, 760 temporary files were generated for the benchmark data set). The IOBUF_MAX_FILES environment variable was increased from the default 256 to 768 for the 32 and 64 thread trials to allow IOBUF to handle the larger number of file descriptors associated with the additional temporary files. To reduce IOBUF memory requirements, two 8M IOBUF buffers (rather than two 32M buffers, as was used for the input BAM file) were dedicated to each temporary file. The memory limit specified for the SAMtools 1.3.1 external sort (-m option-argument) was decreased relative to the internal-sort limit to 100 GiB to accommodate the extra memory used by IOBUF.

SAMtools OpenMP writes only one temporary sorted BAM file for each chunk of (uncompressed) BAM records read into memory, rather than one for each thread. This results in fewer, larger intermediate BAM files involved in the final merge (e.g., with the benchmark data set at all thread counts, only 6 temporary BAM files plus almost 71 million BAM records from the last unwritten chunk in memory). Because synchronization ensures that only one thread is writing at a time, SAMtools OpenMP can safely utilize the non-thread-safe IOBUF library. This obviates the need for a workaround (like the modification to SAMtools 1.3.1) that prevents IOBUF from handling writes to the temporary BAM files. While SAMtools OpenMP could dedicate more memory for sorting due to fewer intermediate BAM files (and thus fewer IOBUF buffers), for consistency the authors chose to set the memory allocated to the contiguous-memory BAM buffer (-m option-argument) at 100 GiB, corresponding to the amount specified for SAMtools 1.3.1.

5.2.2 External Sort Results

The external sort benchmark results are presented in Fig. 4. The single-threaded performance improvement of SAMtools OpenMP was ∼6.6%. This could be attributed to not only the memory-allocation-related optimizations (Section 4.2), but also the fact that the last chunk of ∼71 million sorted BAM records was merged directly from memory (Section 4.3) with the other 6 sorted intermediate BAM files, rather than being encoded and output to the Burst Buffer before the final merge (note that due to differences in its memory accounting, single-threaded SAMtools 1.3.1 generated 11 intermediate sorted BAM files). At ≥ 8 threads, SAMtools OpenMP outperformed SAMtools 1.3.1 by approximately a factor of 2. There was little variation in runtimes between either version at any thread count: the average RSD was 0.39% for SAMtools 1.3.1, and 0.40% for SAMtools OpenMP.

Fig. 4 (Left) Run times in seconds for the internal sort of a 24.6 GiB BAM file from ∼207 million 100 bp reads (average of 10 trials). (Right) Run times for the external sort of a 271.4 GiB BAM file from ∼570 million 250 bp reads (average of 3 trials). SAMtools 1.3.1 was patched to avoid significant performance degradation when using the Burst Buffer.

The approximate peak memory usage (MaxVMSize) recorded for SAMtools 1.3.1 ranged from ∼104 GiB (1 thread) to 122.3 GiB (64 threads). Memory usage scaled with the number of threads primarily due to the increased number of IOBUF buffers required to accommodate the larger number of intermediate sorted BAM files. The corresponding approximate range recorded for SAMtools OpenMP—104 GiB (1 thread) to 105.8 GiB (64 threads)—indicated relatively stable memory usage, with more memory headroom available at all thread counts.

6 Future Work

When decoding sequentially-read compressed BAM files, a dedicated read-ahead thread could increase the rate at which decoding tasks are generated, and reduce the frequency with which threads are idle. During the merge phase of an external sort, multi-threaded decoding (without a per-file read-ahead thread) of the sorted temporary BAM files being merged could potentially improve performance.

Intel's QuickAssist hardware accelerator has been shown to improve the performance of zlib compression by over 20X [10], and provides hardware-accelerated decompression as well. Because a large fraction of the computation time in the SAMtools BAM sort workflow concerns BAM compression and decompression, this hardware accelerator could potentially have a major impact on the performance of this workflow.

A distributed-memory implementation (e.g., utilizing MPI or Unified Parallel C) could allow larger BAM files to be sorted in memory.

The concepts described in this paper could be implemented to support CRAM encoding/decoding in HTSlib.

7 Conclusions

SAMtools is a fundamental component of many HTS workflows, and processes a significant fraction of the HTS alignment data that is generated. Performance improvements to this important tool have the potential to reduce time to solution for a variety of genomics problems facing agriculture, oncology, pathology, and other life sciences.

Performance analysis exposed unexpected bottlenecks in the SAMtools internal sort workflow. Subsequent performance enhancements to the SAMtools BAM sorting code described in this paper improved scalability for both internal (5.9X speedup with 32 processor cores) and external (1.98X speedup) sorting. However, there are still performance challenges to address.

OpenMP facilitated many of the performance gains. Its high-level API allowed task parallelism to be succinctly retrofitted in parts of the SAMtools and HTSlib codebase that could benefit; in some cases supplanting the existing pthreads code, while in others providing parallelism that had not previously been expressed. The many optimization techniques demonstrated in this paper, such as the flexible-array-member-facilitated contiguous-memory optimization and the OpenMP-compatible two-circular-buffer encoding routine, can be applied to other problems.

Genomics data production is accelerating, challenging the existing software tools that process and analyze it. Processor roadmaps that resulted in rapidly increasing serial performance in successive generations were abandoned over a decade ago. Software tools must be parallelized to extract performance gains from increasingly-parallel modern multi-core processors. High-level APIs such as OpenMP are needed to effectively express this parallelism. Performance tools are needed to analyze complicated applications to pinpoint bottlenecks and determine where optimization might be productive. But perhaps most difficult is the challenge of assembling an interdisciplinary team of computational genomics experts and HPC software developers to craft performant, functional genomics software tools that can handle increasingly-big genomics data.

8 Acknowledgment

This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

References

1. (2009) Picard. URL https://broadinstitute.github.io/picard/
2. Adhianto L, Banerjee S, Fagan M, Krentel M, Marin G, Mellor-Crummey J, Tallent NR (2010) HPCToolkit: tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience 22(6):685–701, DOI 10.1002/cpe.1553
3. Bhimji W, Bard D, Romanus M, Paul D, Ovsyannikov A, Friesen B, Bryson M, Correa J, Lockwood GK, Tsulaia V, et al (2016) Accelerating Science with the NERSC Burst Buffer Early User Program. In: 2016 Cray User Group (CUG 2016), URL https://cug.org/proceedings/cug2016_proceedings/includes/files/pap162.pdf
4. Bonfield JK (2014) The Scramble conversion tool. Bioinformatics 30(19):2818–2819, DOI 10.1093/bioinformatics/btu390
5. Consortium TGP (2015) A global reference for human genetic variation. Nature 526(7571):68–74, DOI 10.1038/nature15393
6. Declerck T, Antypas K, Bard D, Bhimji W, Canon S, Cholia S, He HY, Jacobsen D, Prabhat, Wright NJ (2016) Cori - A System to Support Data-Intensive Computing. In: 2016 Cray User Group (CUG 2016), URL https://cug.org/proceedings/cug2016_proceedings/includes/files/pap171.pdf
7. Diekmann R, Gehring J, Luling R, Monien B, Nubel M, Wanka R (1994) Sorting large data sets on a massively parallel system. In: Proceedings of 1994 6th IEEE Symposium on Parallel and Distributed Processing, pp 2–9, DOI 10.1109/SPDP.1994.346188
8. Faust GG, Hall IM (2014) SAMBLASTER: fast duplicate marking and structural variant read extraction. Bioinformatics 30(17):2503–2505, DOI 10.1093/bioinformatics/btu314
9. Herzeel C, Costanza P, Decap D, Fostier J, Reumers J (2015) elPrep: High-Performance Preparation of Sequence Alignment/Map Files for Variant Calling. PLoS ONE 10(7):1–16, DOI 10.1371/journal.pone.0132868
10. Intel Corporation (2015) Programming Intel QuickAssist Technology Hardware Accelerators for Optimal Performance. Tech. rep., URL https://01.org/sites/default/files/page/332125_002_0.pdf
11. Kelly BJ, Fitch JR, Hu Y, Corsmeier DJ, Zhong H, Wetzel AN, Nordquist RD, Newsom DL, White P (2015) Churchill: an ultra-fast, deterministic, highly scalable and balanced parallelization strategy for the discovery of human genetic variation in clinical and population-scale genomics. Genome Biology 16(1):6, DOI 10.1186/s13059-014-0577-x

12. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16):2078–2079, DOI 10.1093/bioinformatics/btp352
13. Lin M (2014) Faster BAM sorting with SAMtools and RocksDB. URL http://devblog.dnanexus.com/faster-bam-sorting-with-samtools-and-rocksdb/
14. Mellor-Crummey JM, Scott ML (1991) Algorithms for Scalable Synchronization on Shared-memory Multiprocessors. ACM Trans Comput Syst 9(1):21–65, DOI 10.1145/103727.103729, URL http://doi.acm.org/10.1145/103727.103729
15. OpenMP Architecture Review Board (2013) OpenMP Application Program Interface, Version 4.0. URL http://www.openmp.org/resources/openmp-compilers/
16. Puckelwartz MJ, Pesce LL, Nelakuditi V, Dellefave-Castillo L, Golbus JR, Day SM, Cappola TP, Dorn GW II, Foster IT, McNally EM (2014) Supercomputing for the parallelization of whole genome analysis. Bioinformatics 30(11):1508, DOI 10.1093/bioinformatics/btu071
17. Raczy C, Petrovski R, Saunders CT, Chorny I, Kruglyak S, Margulies EH, Chuang HY, Källberg M, Kumar SA, Liao A, Little KM, Strömberg MP, Tanner SW (2013) Isaac: ultra-fast whole-genome secondary analysis on Illumina sequencing platforms. Bioinformatics 29(16):2041, DOI 10.1093/bioinformatics/btt314
18. Rengasamy V, Madduri K (2016) SPRITE: A Fast Parallel SNP Detection Pipeline. Springer International Publishing, Cham, pp 159–177, DOI 10.1007/978-3-319-41321-1_9
19. Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, Iyer R, Schatz MC, Sinha S, Robinson GE (2015) Big Data: Astronomical or Genomical? PLoS Biol 13(7):1–11, DOI 10.1371/journal.pbio.1002195
20. Tarasov A, Vilella AJ, Cuppen E, Nijman IJ, Prins P (2015) Sambamba: fast processing of NGS alignment formats. Bioinformatics 31(12):2032–2034, DOI 10.1093/bioinformatics/btv098
21. Tischler G (2017) biobambam2. URL https://github.com/gt1/biobambam2
22. Weeks NT, Luecke GR (in press) Performance Analysis and Optimization of SAMtools Sorting. In: 4th International Workshop on Parallelism in Bioinformatics (PBio2016)
23. Wetterstrand K (2016) DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). URL http://www.genome.gov/sequencingcostsdata

Nathan T. Weeks received B.S. degrees in Computer Science (2002) and Mathematics (2003) from South Dakota State University, and an M.S. (2012) in Computer Science from Iowa State University. He is currently pursuing a Ph.D. in Computer Science at Iowa State University. His research interests include parallel computing, software application optimization, and bioinformatics.

Glenn R. Luecke received his B.S. degree from Michigan State University in Mathematics and his Ph.D. in Mathematics from the California Institute of Technology. He is currently Professor of Mathematics, adjunct Professor of Computer Engineering, Senior Member of the ACM, Director of an HPC group, and in charge of HPC education and training at Iowa State University. Professor Luecke's HPC group is involved in research in the areas of parallel algorithms, parallel tools, and the evaluation of high performance computing systems. He has had over 60 graduate students, visiting scholars, and post-doctoral students.