In-Memory Genomics Data Processing Using Apache Arrow
bioRxiv preprint doi: https://doi.org/10.1101/741843; this version posted April 6, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

ArrowSAM: In-Memory Genomics Data Processing Using Apache Arrow

Tanveer Ahmad, Nauman Ahmed, Johan Peltenburg, Zaid Al-Ars
Accelerated Big Data Systems
Delft University of Technology, Delft, Netherlands

Abstract—The rapidly growing size of genomics databases, driven by advances in sequencing technologies, demands fast and cost-effective processing. However, processing this data creates many challenges, particularly in selecting appropriate algorithms and computing platforms. Computing systems need data closer to the processor for fast processing. Traditionally, due to cost, volatility and other physical constraints of DRAM, it was not feasible to place large working data sets in memory. However, newly emerging storage class memories allow storing and processing big data closer to the processor. In this work, we show how the commonly used genomics data format, Sequence Alignment/Map (SAM), can be represented in the Apache Arrow in-memory data format to benefit from in-memory processing and to ensure better scalability through shared memory objects, by avoiding large (de)serialization overheads in cross-language interoperability. To demonstrate the benefits of such a system, we propose ArrowSAM, an in-memory SAM format that uses the Apache Arrow framework, and integrate it into genome pre-processing pipelines including BWA-MEM, Picard and Sambamba. Results show 15x and 2.4x speedups as compared to Picard and Sambamba, respectively. The code and scripts for running all workflows are freely available at https://github.com/abs-tudelft/ArrowSAM.

Index Terms—Genomics, Whole Genome/Exome Sequencing, Big Data, Apache Arrow, In-Memory, Parallel Processing

I. INTRODUCTION

Genomics is projected to generate the largest big data sets globally, which requires modifying existing tools to take advantage of new developments in big data analytics and new memory technologies to ensure better performance and high throughput. In addition, new applications using DNA data are becoming ever more complex, such as the study of large sets of complex genomics events like gene isoform reconstruction and sequencing large numbers of individuals with the aim of fully characterizing genomes at high resolution [1]. This underscores the need for efficient and cost-effective DNA analysis infrastructures.

At the same time, genomics is a young field. To process and analyze genomics data, the research community is actively working to develop new, efficient and optimized algorithms, techniques and tools, usually programmed in a variety of languages, such as C, Java or Python. These tools share common characteristics that impose limitations on the performance achievable by genomics pipelines:

• These tools are developed to use traditional I/O file systems, which incur a huge I/O bottleneck in computation due to disk bandwidth [2]. Each tool reads from disk, computes and writes back to disk.
• Due to the virtualized nature of some popular languages used to develop genomics tools (such as Java), these tools are not well suited to exploit modern hardware features like single-instruction multiple-data (SIMD) vectorization and accelerators (GPUs or FPGAs).

This paper proposes a new approach for representing genomics data sets, based on recent developments in big data analytics, to improve the performance and efficiency of genomics pipelines. Our approach consists of the following main contributions:

• We propose an in-memory SAM data representation, called ArrowSAM, created in Apache Arrow to place genome data in RecordBatches of immutable shared memory objects for inter-process communication. We use DRAM for ArrowSAM placement and inter-process access.
• We adapt existing widely used genomics data pre-processing applications (for alignment, sorting and duplicates removal) to use the Apache Arrow framework and to benefit from immutable shared memory plasma objects in inter-process communication.
• We compare various workflows for genome pre-processing, using different techniques for in-memory data communication and placement (for intermediate applications), and show that the ArrowSAM in-memory columnar data representation outperforms the alternatives.

The rest of this paper is organized as follows. Section II discusses background information on genomics tools and the Apache Arrow big data format. Section III presents the new ArrowSAM genomics data format. Section IV shows how to integrate Apache Arrow into existing genomics tools, while Section V discusses the measurement results of these new tools. Section VI presents related work in the field. Finally, Section VII ends with the conclusions.

II. BACKGROUND

This section provides a short description of DNA sequence data pre-processing tools, followed by a brief introduction to the Apache Arrow framework.
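As a rough illustration of what a column-wise view of SAM records looks like, the sketch below splits tab-separated SAM lines into one list per field and then derives a coordinate-sorted row order. It is a plain-Python stand-in and does not use Apache Arrow or the ArrowSAM code. The names `SAM_FIELDS`, `to_columns` and `sort_order` are hypothetical, the field types follow the SAM specification rather than any schema chosen by the authors, and sorting reference names lexicographically is a simplification.

```python
# Illustrative sketch only: a plain-Python stand-in for a columnar SAM layout.
# It does not use Apache Arrow; all names below are hypothetical.

# The 11 mandatory SAM fields with assumed Python types.
SAM_FIELDS = [
    ("QNAME", str), ("FLAG", int), ("RNAME", str), ("POS", int),
    ("MAPQ", int), ("CIGAR", str), ("RNEXT", str), ("PNEXT", int),
    ("TLEN", int), ("SEQ", str), ("QUAL", str),
]


def to_columns(sam_lines):
    """Turn tab-separated SAM records into one list per field (column-wise)."""
    columns = {name: [] for name, _ in SAM_FIELDS}
    columns["TAG"] = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):  # skip blanks and headers
            continue
        parts = line.rstrip("\n").split("\t")
        if len(parts) < len(SAM_FIELDS):
            raise ValueError("SAM record has fewer than 11 mandatory fields")
        for (name, cast), value in zip(SAM_FIELDS, parts):
            columns[name].append(cast(value))
        # Everything after the mandatory fields is kept as one optional-tag string.
        columns["TAG"].append("\t".join(parts[len(SAM_FIELDS):]))
    return columns


def sort_order(columns):
    """Row indices ordered by (RNAME, POS); names compare lexicographically."""
    rows = range(len(columns["QNAME"]))
    return sorted(rows, key=lambda i: (columns["RNAME"][i], columns["POS"][i]))


records = [
    "r004\t304\t1\t30\t17\t7M5I2H\t*\t0\t0\tTTTGC\t*\tSA:Z:ref,9",
    "r003\t204\t1\t29\t17\t6M5I5H\t*\t0\t0\tTAGGC\t*\tSA:Z:ref,9",
]
cols = to_columns(records)
order = sort_order(cols)
print([cols["QNAME"][i] for i in order])
```

Because each field lives in its own contiguous list, a step such as sorting only has to touch the `RNAME` and `POS` columns and can hand later steps a row order instead of rewriting whole records.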
[Figure 1. Panels: a) Detailed design of ArrowSAM: FASTQ sequence data (Illumina HiSeqX/NextSeq, Oxford Nanopore, PacBio) and FASTA reference genome data feed a genomic pipeline (Mapping, Sorting, MarkDuplicate, BaseRecalibration, ApplyBQSR, HaplotypeCaller) that keeps SAM data in memory through the Apache Arrow APIs for C/C++, Python and Java, with FASTQ, FASTA and VCF files in storage; b) In-Memory Plasma Object holding a RecordBatch; c) SAM ASCII file, e.g. r003\t204\t1\t29\t17\t6M5I5H\t*\t0\t0\tTAGGC\t*\tSA:Z:ref,9; d) ArrowSAM data, the same records split into columns; e) Schema information: QNAME string, FLAG int, RNAME int, POS int, MAPQ int, CIGAR string, RNEXT int, PNEXT int, TLEN int, SEQ string, QUAL string, TAG string.]

Fig. 1. a) Genomics pipeline using ArrowSAM format for all intermediate steps to allow in-memory intermediate data storage, which means I/O disk access is only needed to load data into memory at the beginning and to write data back to disk at the end. b) Arrow RecordBatch enclosed in plasma object store. c) SAM file in ASCII text format. d) SAM file in RecordBatch format. e) Schema specifies the data types of ArrowSAM.

A. DNA pre-processing

After DNA data is read by sequencing machines, alignment tools align reads to the different chromosomes of a reference genome and generate an output file in the SAM format. BWA-MEM [3] is a widely used tool for this purpose. The generated SAM file describes various aspects of the alignment result, such as map position and map quality. SAM is the most commonly used alignment/mapping format. To eliminate some systematic errors in the reads, some additional data pre-processing and cleaning steps are subsequently performed, like sorting the reads according to their chromosome name and position. Picard [4] and Sambamba [5] are commonly used for such operations. This is followed by the mark duplicates step, where duplicate reads are removed by comparing the reads having the same map positions and orientation and selecting the read with the highest quality score. Duplicate reads are generated by the wet-lab procedure of creating multiple copies of DNA molecules to make sure there are enough samples of each molecule to facilitate the sequencing process. Again, Picard and Sambamba are commonly used here.

B. Apache Arrow

To manage and process large data sets, many different big data frameworks have been created. Some examples include Apache Hadoop, Spark, Flink, MapReduce and Dask. These frameworks provide highly scalable, portable and programmable environments to improve [...] process specific computationally intensive parts of big data applications. On the other hand, heterogeneous components like FPGAs and GPUs are being increasingly used in cluster and cloud computing environments to improve performance of big data processing. These components are, however, programmed using very close-to-metal languages like C/C++ or even hardware-description languages. The multitude of technologies used often results in a highly heterogeneous system stack that can be hard to optimize for performance. Moreover, combining processes programmed in different languages induces large inter-process communication overheads (so-called data (de)serialization) whenever the data is moved between such processes.

To mitigate this problem, the Apache Arrow [6] project was initiated to provide an open standardized format and interfaces for tabular data in memory. Using language-specific libraries, multiple languages can share in-memory data without any copying or serialization. This is done using the plasma inter-process communication component of Arrow, which handles shared memory pools across different processes in a pipeline [7].

III. ARROWSAM DATA FORMAT

This paper proposes a new in-memory genomics SAM format based on Apache Arrow. Such a representation can benefit from two aspects to improve overall system throughput: