Delft University of Technology

ArrowSAM: In-Memory Genomics Data Processing Using Apache Arrow
Ahmad, Tanveer; Ahmed, Nauman; Peltenburg, Johan; Al-Ars, Zaid

DOI: 10.1109/ICCAIS48893.2020.9096725
Publication date: 2020
Document Version: Accepted author manuscript
Published in: 2020 3rd International Conference on Computer Applications & Information Security (ICCAIS)

Citation (APA): Ahmad, T., Ahmed, N., Peltenburg, J., & Al-Ars, Z. (2020). ArrowSAM: In-Memory Genomics Data Processing Using Apache Arrow. In 2020 3rd International Conference on Computer Applications & Information Security (ICCAIS): Proceedings (pp. 1-6). [9096725] IEEE. https://doi.org/10.1109/ICCAIS48893.2020.9096725

Important note: To cite this publication, please use the final published version (if applicable). Please check the document version above.

Copyright: Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Takedown policy: Please contact us and provide details if you believe this document breaches copyrights. We will remove access to the work immediately and investigate your claim.

This work is downloaded from Delft University of Technology. For technical reasons the number of authors shown on this cover page is limited to a maximum of 10.

© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ArrowSAM: In-Memory Genomics Data Processing Using Apache Arrow

Tanveer Ahmad, Nauman Ahmed, Johan Peltenburg, Zaid Al-Ars
Accelerated Big Data Systems, Delft University of Technology, Delft, Netherlands
ft.ahmad, [email protected]

Abstract—The rapidly growing size of genomics databases, driven by advances in sequencing technologies, demands fast and cost-effective processing. However, processing this data creates many challenges, particularly in selecting appropriate algorithms and computing platforms. Computing systems need data closer to the processor for fast processing. Traditionally, due to cost, volatility and other physical constraints of DRAM, it was not feasible to place large amounts of working data sets in memory. However, new emerging storage class memories allow storing and processing big data closer to the processor. In this work, we show how the commonly used genomics data format, Sequence Alignment/Map (SAM), can be presented in the Apache Arrow in-memory data representation to benefit from in-memory processing and to ensure better scalability through shared memory objects, by avoiding large (de)serialization overheads in cross-language interoperability. To demonstrate the benefits of such a system, we propose ArrowSAM, an in-memory SAM format that uses the Apache Arrow framework, and integrate it into genome pre-processing pipelines including BWA-MEM, Picard and Sambamba. Results show 15x and 2.4x speedups as compared to Picard and Sambamba, respectively. The code and scripts for running all workflows are freely available at https://github.com/abs-tudelft/ArrowSAM.

Index Terms—Genomics, Whole Genome/Exome Sequencing, Big Data, Apache Arrow, In-Memory, Parallel Processing

I. INTRODUCTION

Genomics is projected to generate the largest big data sets globally, which requires modifying existing tools to take advantage of new developments in big data analytics and new memory technologies to ensure better performance and high throughput. In addition, new applications using DNA data are becoming ever more complex, such as the study of large sets of complex genomics events like gene isoform reconstruction and sequencing large numbers of individuals with the aim of fully characterizing genomes at high resolution [1]. This underscores the need for efficient and cost-effective DNA analysis infrastructures.

At the same time, genomics is a young field. To process and analyze genomics data, the research community is actively working to develop new, efficient and optimized algorithms, techniques and tools, usually programmed in a variety of languages, such as C, Java or Python. These tools share common characteristics that impose limitations on the performance achievable by genomics pipelines:

• These tools are developed to use traditional I/O file systems, which incurs a huge I/O bottleneck in computation due to disk bandwidth [2]. Each tool reads from the disks, computes and writes back to disk.
• Due to the virtualized nature of some popular languages used to develop genomics tools (such as Java), these tools are not well suited to exploit modern hardware features like single-instruction multiple-data (SIMD) vectorization and accelerators (GPUs or FPGAs).

This paper proposes a new approach for representing genomics data sets, based on recent developments in big data analytics, to improve the performance and efficiency of genomics pipelines. Our approach consists of the following main contributions:

• We propose an in-memory SAM data representation, called ArrowSAM, created in Apache Arrow to place genome data in RecordBatches of immutable shared memory objects for inter-process communication. We use DRAM for ArrowSAM placement and inter-process access.
• We adapt existing widely used genomics data pre-processing applications (for alignment, sorting and duplicates removal) to use the Apache Arrow framework and to benefit from immutable shared memory plasma objects in inter-process communication.
• We compare various workflows for genome pre-processing, using different techniques for in-memory data communication and placement (for intermediate applications), and show that the ArrowSAM in-memory columnar data representation outperforms the alternatives.

The rest of this paper is organized as follows. Section II discusses background information on genomics tools and the Apache Arrow big data format. Section III presents the new ArrowSAM genomics data format. Section IV shows how to integrate Apache Arrow into existing genomics tools, while Section V discusses the measurement results of these new tools. Section VI presents related work in the field. Finally, Section VII ends with the conclusions.

II. BACKGROUND

This section provides a short description of DNA sequence data pre-processing tools, followed by a brief introduction to the Apache Arrow framework.

[Fig. 1: a) the detailed design of ArrowSAM in a genomics pipeline: sequencers such as Illumina HiSeqX/NextSeq, Oxford Nanopore and PacBio produce FASTQ data, which is processed against a FASTA reference genome through Mapping, Sorting, MarkDuplicate, BaseRecalibration, ApplyBQSR and HaplotypeCaller stages, producing a VCF file; b) in-memory Plasma object(s) enclosing an Arrow RecordBatch; c) a SAM ASCII file with example reads "r003\t204\t1\t29\t17\t6M5I5H\t*\t0\t0\tTAGGC\t*\tSA:Z:ref,9" and "r004\t304\t1\t30\t17\t7M5I2H\t*\t0\t0\tTTTGC\t*\tSA:Z:ref,9"; d) the same data in ArrowSAM form; e) schema information: QNAME (string), FLAG (int), RNAME (int), POS (int), MAPQ (int), CIGAR (string), RNEXT (int), PNEXT (int), TLEN (int), SEQ (string), QUAL (string), TAG (string), accessed through Apache Arrow APIs from programming frameworks such as C/C++, Python and Java.]
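The key idea behind ArrowSAM is that each SAM field becomes a typed column rather than one slot in a row of tab-separated text. The following plain-Python sketch (using lists only, not the authors' Arrow-based implementation) illustrates this row-to-column conversion for the example reads shown in panel c of Fig. 1; the `TAG` column collecting all optional fields into one string is an assumption mirroring the figure's schema.

```python
# Sketch: converting row-oriented SAM text lines into a columnar layout
# that mirrors the ArrowSAM schema of Fig. 1e. Illustration only; the
# paper's implementation builds Arrow RecordBatches instead of lists.

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL", "TAG"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}  # int-typed columns

def sam_to_columns(lines):
    """Split each SAM line into its 11 mandatory fields plus optional tags,
    appending every value to the column for its field."""
    columns = {name: [] for name in SAM_FIELDS}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        mandatory, tags = parts[:11], parts[11:]
        for name, value in zip(SAM_FIELDS, mandatory + ["\t".join(tags)]):
            columns[name].append(int(value) if name in INT_FIELDS else value)
    return columns

# The two example reads from the SAM ASCII file in Fig. 1c:
lines = [
    "r003\t204\t1\t29\t17\t6M5I5H\t*\t0\t0\tTAGGC\t*\tSA:Z:ref,9",
    "r004\t304\t1\t30\t17\t7M5I2H\t*\t0\t0\tTTTGC\t*\tSA:Z:ref,9",
]
cols = sam_to_columns(lines)
print(cols["QNAME"])  # -> ['r003', 'r004']
print(cols["POS"])    # -> [29, 30]
```

With this layout, a tool that only needs, say, the `POS` column can scan one contiguous array instead of parsing every full record, which is the access pattern Arrow's columnar format is designed for.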
Fig. 1. a) Genomics pipeline using the ArrowSAM format for all intermediate steps to allow in-memory intermediate data storage, which means I/O disk access is only needed to load data into memory at the beginning and to write data back to disk at the end. b) Arrow RecordBatch enclosed in a Plasma object store. c) SAM file in ASCII text format. d) SAM file in RecordBatch format. e) Schema specifying the data types of ArrowSAM.

A. DNA pre-processing

After DNA data is read by sequencing machines, alignment tools align reads to the different chromosomes of a reference genome and generate an output file in the SAM format. BWA-MEM [3] is a widely used tool for this purpose. The generated SAM file describes various aspects of the alignment result, such as map position and map quality. SAM is the most commonly used alignment/mapping format. To eliminate some systematic errors in the reads, additional data pre-processing and cleaning steps are subsequently performed, such as sorting the reads according to their chromosome name and position. Picard [4] and Sambamba [5] are tools commonly used for such operations. This is followed by the mark-duplicates step, where duplicate reads are removed by comparing the reads having the same map positions and

… process specific computationally intensive parts of big data applications. On the other hand, heterogeneous components like FPGAs and GPUs are being increasingly used in cluster and cloud computing environments to improve the performance of big data processing. These components are, however, programmed using very close-to-metal languages like C/C++ or even hardware-description languages. The multitude of technologies used often results in a highly heterogeneous system stack that can be hard to optimize for performance. Moreover, combining processes programmed in different languages induces large inter-process communication overheads (so-called data (de)serialization) whenever the data is moved between such processes. To mitigate this problem, the Apache Arrow [6] project
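As a minimal illustration of the sorting and duplicate-removal steps described above, the following Python sketch orders reads by chromosome and position and then keeps one read per coordinate. This is deliberately simplified: real tools such as Picard and Sambamba also take strand, mate position and base qualities into account when marking duplicates, and typically flag rather than delete them.

```python
# Simplified sketch of coordinate sorting followed by duplicate removal,
# keyed only on (RNAME, POS). Illustration of the data flow, not a
# faithful reimplementation of Picard or Sambamba.

def sort_by_coordinate(reads):
    # reads: list of dicts with at least RNAME and POS keys
    return sorted(reads, key=lambda r: (r["RNAME"], r["POS"]))

def remove_duplicates(sorted_reads):
    # Keep the first read seen at each (RNAME, POS); drop the rest.
    seen, unique = set(), []
    for r in sorted_reads:
        key = (r["RNAME"], r["POS"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

reads = [
    {"QNAME": "r004", "RNAME": "1", "POS": 30},
    {"QNAME": "r003", "RNAME": "1", "POS": 29},
    {"QNAME": "r005", "RNAME": "1", "POS": 29},  # same position as r003
]
deduped = remove_duplicates(sort_by_coordinate(reads))
print([r["QNAME"] for r in deduped])  # -> ['r003', 'r004']
```

Because Python's sort is stable, the first read at a shared coordinate in input order survives; production duplicate-markers instead pick the read with the highest base-quality sum.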
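The inter-process communication idea the paper builds on, placing immutable, fixed-layout data in shared memory so that another process can attach to it by name without any (de)serialization, can be sketched with the Python standard library alone. The snippet below uses `multiprocessing.shared_memory` purely as a stand-in for the Arrow Plasma object store used in the paper; the column values are illustrative.

```python
import struct
from multiprocessing import shared_memory

# Producer side: pack a fixed-width int64 column (e.g. a POS column)
# directly into a named shared-memory segment.
positions = [29, 30, 28]
shm = shared_memory.SharedMemory(create=True, size=8 * len(positions))
struct.pack_into(f"<{len(positions)}q", shm.buf, 0, *positions)

# Consumer side: attach to the same segment by name and read the values
# in place -- no copy of the record data, no serialization step.
view = shared_memory.SharedMemory(name=shm.name)
readback = list(struct.unpack_from(f"<{len(positions)}q", view.buf, 0))
view.close()

shm.close()
shm.unlink()
print(readback)  # -> [29, 30, 28]
```

The fixed-width, contiguous layout is what makes this work: both processes agree on the byte layout up front, which is exactly the role Arrow's standardized columnar format plays across languages.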