
ArrowSAM: In-Memory Genomics Data Processing through Apache Arrow Framework

Tanveer Ahmad
Dept. of Quantum and Computer Engineering
Delft University of Technology
Delft, Netherlands
[email protected]

Nauman Ahmed
Dept. of Quantum and Computer Engineering
Delft University of Technology
Delft, Netherlands
[email protected]

Johan Peltenburg
Dept. of Quantum and Computer Engineering
Delft University of Technology
Delft, Netherlands
[email protected]

Zaid Al-Ars
Dept. of Quantum and Computer Engineering
Delft University of Technology
Delft, Netherlands
[email protected]

Abstract—The rapidly growing volume of human genomics data, driven by advances in sequencing technologies, demands fast and cost-effective processing. However, processing this data brings some challenges, particularly in selecting appropriate algorithms and computing platforms. Computing systems need data close to the processor for fast processing. Previously, due to the cost, volatility and other physical constraints of DRAM, it was not feasible to place large working data sets in memory. However, new emerging storage class memories allow storing and processing big data closer to the processor. In this work, we show how a commonly used genomics data format, Sequence Alignment/Map (SAM), can be represented in the Apache Arrow in-memory data format to take advantage of in-memory processing and to ensure better scalability through the shared memory Plasma Object Store, avoiding large (de)serialization overheads in cross-language interoperability. To demonstrate the benefits of such a system, we present an in-memory SAM representation, which we call ArrowSAM, and integrate the Apache Arrow framework into genome pre-processing applications, including BWA-MEM, sorting and Picard, as use cases. Our implementation comprises three components. First, we integrated Apache Arrow into BWA-MEM to write output SAM data in ArrowSAM. Second, we sorted all the ArrowSAM data by coordinates in parallel through pandas dataframes. Finally, Apache Arrow was integrated into the HTSJDK library (used in Picard for disk I/O handling), where all ArrowSAM data is processed in parallel for duplicates removal. This implementation gives promising performance improvements for genome data pre-processing in terms of both speedup and system resource utilization. Due to the columnar data format, better cache locality is exploited in both applications, and shared memory objects enable parallel processing.

Index Terms—Genomics, Whole Genome/Exome Sequencing, Big Data, Apache Arrow, In-Memory, Parallel Processing

I. INTRODUCTION

Genomics is projected to become the field that generates the largest big data sets globally, which requires modifying existing tools to take advantage of new developments in memory technologies to ensure better performance and higher throughput. At the end of the first Human Genome Project [1990–2003], a final draft sequence of the euchromatic portion of the human genome containing approximately 2.85 billion nucleotides was announced [1]. This project also identified 20,000–25,000 human protein-coding genes. Since then, genomics data has been increasing rapidly due to innovations in genome sequencing technologies and analysis methods. Second generation sequencing (NGS) technologies like Illumina's HiSeqX and NextSeq produce whole genome, high throughput and high quality short read data at a total cost of $1K per genome, which is expected to drop below $100 for more advanced sequencing technologies. Third generation sequencing technologies are now capable of sequencing reads of more than 10 kilo-base-pairs (kbp) in length, such as Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) Single Molecule Real-Time (SMRT) sequencing. The ongoing pace of these technologies promises even longer reads of ~100 kbp on average. Long reads produced by third generation sequencing technologies provide the prospect to fully characterize genomes at high resolution for precision medicine [2]. Unfortunately, currently available long read sequencing technologies produce data with high error rates: ONT generates data with a mean error rate of ~40%, and PacBio of ~15%, as compared to Illumina short reads with less than 2% error rate [3].

With the continued decrease in the price of DNA sequencing, the bottleneck shifts from generating DNA data to the computation and analysis of DNA information, which needs to keep pace with the high throughput of sequencing machines and at the same time account for the imperfections of generated data. New applications using DNA data are becoming ever more complex, such as the study of large sets of complex genomics events like gene isoform reconstruction, and the sequencing of large numbers of individuals with the aim of fully characterizing genomes at high resolution [2]. This underscores the need for efficient and cost effective DNA analysis infrastructures.

New storage-class memory (SCM) technologies will soon replace the existing long-latency, block-based hard disk drive and solid-state drive storage mediums. Intel's Phase-Change Memory (PCM) based Optane DC (Data Center) Persistent Memory is one of the first candidates in this paradigm to accelerate big data workloads for in-memory analytics and to provide fast startup times for legacy applications/virtual machines in cloud environments [4]. These memories have a higher latency than DRAM but provide huge capacity and byte addressability at lower cost. Though this memory technology is not an immediate replacement of main memory, the new features it provides make it usable in conjunction with storage and memory, as an additional intermediate tier of the memory hierarchy.

A. Problem Definition

Bioinformatics is a young field. To process and analyze genomics data, the research community is actively working to develop new, efficient and optimized algorithms, techniques and tools, usually programmed in a variety of languages, such as C, Java or Python. These tools share common characteristics that impose limitations on the performance achievable by genomics pipelines.

• These tools are developed to use traditional I/O file systems, which incur a huge I/O bottleneck in computation due to disk bandwidth [5]. Each tool reads from the I/O disks, computes and writes back to disk.
• Due to the virtualized nature of some popular languages used to develop genomics tools (such as Java), these tools cannot exploit modern hardware features like multi-core parallelization, single instruction, multiple data (SIMD) vectorization and accelerators (GPUs or FPGAs) very well.

B. Contributions

The main contributions of this work are as follows:

• An in-memory SAM data representation (ArrowSAM) created in Apache Arrow to place genome data in RecordBatches of immutable shared memory objects for inter-process communication. We use DRAM for ArrowSAM placement and inter-process access.
• Existing widely used genomics data pre-processing applications (for alignment, sorting and duplicates removal) are integrated into the Apache Arrow framework to exploit the benefits of immutable shared memory Plasma objects in inter-process communication.
• A fair comparison of different workflows for genome pre-processing, which use different techniques for in-memory data communication and placement (for intermediate applications), is presented to show how the in-memory columnar data representation outperforms memory-based file systems like ramDisk and pipes.

II. BACKGROUND

This section provides a short description of DNA sequence data pre-processing, variant calling analysis and the tools used for this purpose. A brief introduction to the Apache Arrow framework, which we use for the in-memory SAM data representation, and its Plasma shared memory API is also given.

1) Pre-processing/Cleaning: Alignment tools align reads to the different chromosomes of a reference genome and generate an output file in the SAM format, describing various aspects of the alignment result, such as map position and map quality. The SAM format is the most commonly used alignment/mapping format. To eliminate some systematic errors in the reads, additional data pre-processing and cleaning steps are subsequently performed, like sorting the reads according to their chromosome name and position. This is followed by the mark duplicates step, where duplicate reads are removed by comparing the reads having the same map positions and orientation and selecting the read with the highest quality score. Duplicate reads are generated due to the wetlab procedure of creating multiple copies of DNA molecules to make sure there are enough samples of each molecule to facilitate the sequencing process. Samtools [6], Picard [7], Sambamba [8] and Samblaster [9] are some tools commonly used for such operations.

2) Variant Discovery: Variant discovery is the process of identifying the presence of single nucleotide variants (SNVs) and insertions and deletions (indels) in individually sequenced genome data. These variants may be germline variations (the variations in an individual's DNA inherited from parents) or somatic mutations (the variations that occur in cells other than germ cells, which can cause cancer or other diseases). GATK and Freebayes are commonly used open-source tools for such analysis. The output of these tools is generated in the variant calling format (VCF) to visualize and further analyze the detected variations.

3) Variant Calling/Analysis Tools: There are many commercial and open-source tools available to build a whole genome/exome analysis pipeline. Some of these tools include: Galaxy [10], [11], SevenBridges [12], GenomeNext [13], SpeedSeq [14], BcBioNextgen [15], DNAp [16], and the GATK [17] best practices pipeline. GATK is one of the most popular and commonly used pipelines for DNA variant calling.

A. Apache Arrow

To manage and process large data sets, many different frameworks have been created. Some examples include Hadoop, Spark, Flink, MapReduce and Dask. These frameworks provide highly scalable, portable and programmable environments to improve the storage and computational performance of big data analytics applications. They are generally built on top of high-level language frameworks such as Java and Python. On the other hand, heterogeneous components like FPGAs and GPUs are being increasingly used in cluster and cloud computing environments to improve the performance of big data processing. These components are, however, programmed using very close-to-metal languages like C/C++ or even hardware-description languages. The multitude of technologies used often results in a highly heterogeneous system stack that can be hard to optimize for performance.

TABLE I
ARROW BUFFERS LAYOUT

Arrow Buffers for: X (Validity, Values), Y (Offsets, Values), Z (Values)

Index | Validity (bit) | Values (Int32) | Offsets (Int32) | Values (Utf8) | Values (Double)
0     | 1              | 555            | 0               | A             | 5.7866
1     | 1              | 56565          | 5               | r             | 0.0
2     | 0              | null           | 9               | r             | 3.14
3     | 1              | 2019           |                 | o             |
4     | 0              | null           |                 | w             |
5     |                |                |                 | D             |
6     |                |                |                 | a             |
7     |                |                |                 | t             |
8     |                |                |                 | a             |
9     |                |                |                 | !             |

TABLE II
RECORD BATCH FOR DATA IN TABLE I

Field X | Field Y | Field Z
555     | "Arrow" | 5.7866
56565   | "Data"  | 0.0
null    | "!"     | 3.14

TABLE III
SCHEMA FOR RECORD BATCH IN TABLE II

Field X: Int32 (nullable),
Field Y: Utf8,
Field Z: Double

While combining different techniques has many advantages, there are also drawbacks. The high amount of heterogeneity causes a requirement of data (de)serialization whenever data is moved between components that are implemented in different technologies. An example where (de)serialization takes place in such a system stack is illustrated in Fig. 1. To mitigate this problem, the Apache Arrow [18] project was initiated by the Apache Foundation in 2016. The main focus of this project is to provide an open, standardized format and interfaces for tabular data in memory. Through language-specific libraries, multiple languages can share data without any copying or serialization. This in-memory access of data through Apache Arrow is illustrated in Fig. 2. At the time of writing, Apache Arrow supports the following languages: C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust. Interfaces exist for GPGPU programming through arrow-cuda-glib. External tools to support FPGA accelerators also exist through the Fletcher project [19].

Fig. 1. An example where (de)serialization takes place when data is exchanged between different frameworks and accelerators.

In the Arrow format, data entries (records) are stored in a table called a RecordBatch. Each record field is stored in a separate column of the RecordBatch table in a manner that is as contiguous as possible in memory. This is called an Arrow Array, which can store data of different types, i.e., int, float, strings, binary, timestamps and lists, but also nested types (such as lists of lists, etc.). Arrays may have different types of physical buffers to store data. Examples include a validity bitmap buffer (a single bit stores whether the corresponding entry in the value buffer is null or valid), an offset buffer (offset values for variable-length character strings) and value buffers (storing fixed-width types as in a C-like array). Each RecordBatch contains metadata, called a schema, that represents the data types and names of the stored fields in the RecordBatch. An example of a RecordBatch with three fields is shown in Table II, its corresponding schema in Table III and the Arrow buffers layout in Table I. This layout provides higher spatial locality when iterating over column-contiguous data entries for better CPU cache performance. SIMD (single instruction, multiple data) vector operations can also benefit from such a layout, as vector elements are already properly aligned in memory. Moreover, Arrow can efficiently manage big chunks of memory on its own, without any interaction of a specific software language runtime, particularly garbage-collected ones. In this way, large data sets can be stored outside the heaps of virtual machines or interpreters, which are often optimized to work with few short-lived objects rather than many objects that are used throughout a whole big data processing pipeline. Furthermore, movement or non-functional copies of large data sets across heterogeneous component boundaries are prevented, including changing the form of the data (serialization overhead).
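The layout of Tables I to III maps directly onto the Arrow API. The following is a minimal sketch using the Python Arrow bindings, with the same example values as Table II; the field names and values are purely illustrative:

```python
import pyarrow as pa

# Build the three columns of Table II; None becomes a 0 bit in the validity bitmap.
x = pa.array([555, 56565, None], type=pa.int32())
y = pa.array(["Arrow", "Data", "!"], type=pa.string())  # variable-length, backed by an offsets buffer
z = pa.array([5.7866, 0.0, 3.14], type=pa.float64())

# A RecordBatch combines the columns with a schema (field names and types), as in Table III.
batch = pa.RecordBatch.from_arrays([x, y, z], names=["X", "Y", "Z"])
print(batch.schema)    # X: int32, Y: string, Z: double
print(batch.num_rows)  # 3

# Each column is backed by physical buffers (validity bitmap, offsets, values), as in Table I.
for name, col in zip(batch.schema.names, batch.columns):
    print(name, [None if b is None else b.size for b in col.buffers()])
```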

Fig. 2. Apache Arrow provides a unified in-memory format for data placement to be used in many languages and platforms, avoiding the (de)serialization overhead.

B. Plasma In-Memory Object Store

Plasma is an inter-process communication (IPC) component of Arrow that handles shared memory pools across different heterogeneous systems [20]. To perform IPC, processes can create Plasma objects inside the shared memory pool, which are typically the data buffers underlying an Arrow RecordBatch. Through the shared memory pool, Plasma enables zero-copy data sharing between processes.
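As an illustration of this zero-copy hand-off, the sketch below writes a RecordBatch into a running Plasma store and reads it back, as a consumer process would. It assumes a store has already been started on the hypothetical socket /tmp/plasma; note that the plasma.connect() signature and the availability of the plasma module differ between pyarrow releases (older versions such as 0.11 take extra arguments, and recent releases have removed the module), so this is a sketch of the mechanism rather than the paper's exact integration code.

```python
import numpy as np
import pyarrow as pa
import pyarrow.plasma as plasma  # available in older pyarrow releases

client = plasma.connect("/tmp/plasma")            # assumed socket of a running plasma_store
object_id = plasma.ObjectID(np.random.bytes(20))  # 20-byte object identifier

batch = pa.RecordBatch.from_arrays(
    [pa.array([555, 56565, None], type=pa.int32())], names=["X"])

# Measure the serialized size, then stream the batch into a Plasma-owned buffer.
mock = pa.MockOutputStream()
writer = pa.RecordBatchStreamWriter(mock, batch.schema)
writer.write_batch(batch)
writer.close()

buf = client.create(object_id, mock.size())
writer = pa.RecordBatchStreamWriter(pa.FixedSizeBufferWriter(buf), batch.schema)
writer.write_batch(batch)
writer.close()
client.seal(object_id)  # after sealing, any process can map the object read-only

# In a consumer process: zero-copy read of the shared object.
[data] = client.get_buffers([object_id])
reader = pa.RecordBatchStreamReader(pa.BufferReader(data))
print(reader.read_next_batch().to_pandas())
```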

III. MOTIVATION

This paper proposes a new in-memory genomics SAM format based on Apache Arrow. Such a representation can benefit from two aspects to improve overall system throughput: one is related to the tabular nature of genomics data and the other is related to technology. Using Arrow ensures efficient genomics data placement in memory to gain maximum throughput and parallel data access efficiency. The tabular genomics data format can benefit from the standardized, cross-language in-memory data representation of Arrow, which provides insight into the organization of the data sets through its schema. Arrow also keeps data in a contiguous and columnar format to exploit cache spatial locality and SIMD vectorization capabilities. For variable length data, Arrow uses offsets instead of lengths for data access, which in turn enables parallel access to such data anywhere in the memory pool. The Plasma Object Store provides a lightweight (de)serialization interface for in-memory stored objects. These features make the Arrow framework a good choice for in-memory genome data processing.

From a memory technology point of view, the use of traditional DRAM memory was limited due to cost, volatility and other physical constraints. Therefore, it was not feasible to store large amounts of active application data in memory. However, with the emergence of new persistent and NVM technologies, it becomes feasible to place large sets of active data in memory, close to the processor.

IV. IMPLEMENTATION

In order to enable genomics applications to use the Apache Arrow framework, two different contributions are needed. First, we need to define an Arrow in-memory representation of the corresponding genomics data format. Second, the applications and tools using the data need to be adapted to access the new format, as shown in Figure 3 (a). In the following, these two contributions are discussed.

A. In-Memory SAM Format (ArrowSAM)

The SAM file format is an ASCII based, tab delimited text format to represent sequence data, shown in Figure 3 (c). Its in-memory SAM representation is a columnar format that consists of the same fields (columns) used in SAM to store the corresponding sequence mapping data, as shown in Figure 3 (d). The data type of each field is stored in a schema, as shown in Figure 3 (e). Arrow's tabular data form, called a RecordBatch, is used to store the data in contiguous memory chunks, as shown in Figure 3 (b). Each RecordBatch is a combination of a schema, which specifies the data types of the fields of the SAM record, and the data itself.
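A minimal sketch of such a schema using the Python Arrow bindings is shown below. The eleven mandatory SAM fields are standard, but the exact type mapping (and any optional-tag column) is an assumption here rather than a copy of the schema in Figure 3 (e):

```python
import pyarrow as pa

# One possible Arrow schema for the mandatory SAM fields (assumed type mapping).
arrowsam_schema = pa.schema([
    ("QNAME", pa.string()),  # query template name
    ("FLAG",  pa.int32()),   # bitwise flag (bit 0x400 marks a duplicate)
    ("RNAME", pa.string()),  # reference (chromosome) name
    ("POS",   pa.int32()),   # 1-based leftmost mapping position
    ("MAPQ",  pa.int32()),   # mapping quality
    ("CIGAR", pa.string()),  # CIGAR string
    ("RNEXT", pa.string()),  # reference name of the mate read
    ("PNEXT", pa.int32()),   # position of the mate read
    ("TLEN",  pa.int32()),   # observed template length
    ("SEQ",   pa.string()),  # segment sequence
    ("QUAL",  pa.string()),  # base quality string
])

# One RecordBatch per chromosome is then built from column arrays, e.g. an empty one:
batch = pa.RecordBatch.from_arrays(
    [pa.array([], type=f.type) for f in arrowsam_schema],
    names=arrowsam_schema.names)
```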
B. BWA-MEM Integration

BWA-MEM aligns the raw FASTQ read sequences against a large reference genome such as that of a human. We used the in-memory format to store the mapping data produced by BWA-MEM from the query and reference genome files (FASTQ and FASTA). We modified BWA-MEM to use the Arrow libraries to write the sequence mapping data of each chromosome (1-22, X, Y and M) in a separate Arrow RecordBatch. At the end of the alignment process, all the RecordBatches are assigned to a shared memory pool of Plasma objects.

C. Sorting through Pandas Dataframes

Pandas is a powerful and easy to use Python library that provides data structures and data cleaning and analysis tools. Dataframes is an in-memory data library that provides structures to store different types of data in a tabular format and to perform operations on the data in columns/rows. Any row in a dataframe can be accessed by its index, while a column can be accessed by its name; a column can also be a series in pandas. Using dataframes illustrates the powerful capabilities of in-memory data representation. First of all, dataframes are able to sort the chromosomes in parallel using the pandas built-in sorting function with the Python Arrow bindings (PyArrow) while accessing data residing in memory, which takes place across two applications written in different languages (one in C and the other in Python). Secondly, tools like Picard, SAMTools and Sambamba are used to sort the SAM file according to the chromosome name and start positions. This type of sorting becomes computationally intensive when the whole SAM file needs to be parsed and sorted based on the values stored in only two fields of that file. This can be parallelized and made more efficient in our approach. Using pandas dataframes, the sorting of each individual chromosome is performed based on the start position of the reads in that particular chromosome. All the RecordBatches are fed to pandas dataframes to sort all the chromosomes in parallel. After sorting, the sorted chromosome RecordBatches are assigned to Plasma shared memory again for subsequent applications to access.
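As an illustration of this step, the sketch below pulls one per-chromosome RecordBatch into a pandas dataframe, sorts it on the read start position and converts it back to Arrow; in the actual pipeline each chromosome is handled by a separate process against the Plasma store, and the column name POS follows the assumed schema sketch above:

```python
import pyarrow as pa

def sort_chromosome(batch: pa.RecordBatch) -> pa.RecordBatch:
    """Sort one chromosome's ArrowSAM RecordBatch by read start position."""
    # to_pandas() exposes the Arrow columns as a dataframe; POS is the
    # assumed name of the start-position column (see the schema sketch above).
    df = batch.to_pandas().sort_values(by="POS")
    return pa.RecordBatch.from_pandas(df, preserve_index=False)

# In the pipeline, one such call runs per chromosome (1-22, X, Y, M), each in
# its own process that fetches its RecordBatch from, and re-publishes the
# sorted result to, the Plasma shared memory store.
```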

Fig. 3. a) Genomics pipeline using the ArrowSAM representation for all intermediate steps; this implementation only reads from I/O at the start and writes to I/O at the end. b) Arrow RecordBatch enclosed in the Plasma Object Store. c) SAM file in ASCII text format. d) SAM file in RecordBatch format. e) Schema specifying the data types of ArrowSAM.

D. Picard MarkDuplicate Integration

After sorting the chromosome data by coordinates, the duplicate reads with low quality should be removed. The Picard MarkDuplicate tool is considered a benchmark for this task. This tool reads the SAM file twice, first to build the sorted read-end lists and then to remove the marked duplicates from the file. To overcome this I/O overhead, we read the data in the ArrowSAM format in memory only once, accessing just the fields (QNAME, FLAG, RNAME, POS, CIGAR and RNEXT) that are actually needed to perform the MarkDuplicate operation. We modified htsjdk (a Java API used in Picard, GATK and many other tools for managing I/O access of high-throughput sequencing data files) and the MarkDuplicate tool inside Picard to read the data from all the RecordBatches in parallel from Plasma shared memory. Our implementation processes this data in parallel in Picard and writes back the updated FLAG field in ArrowSAM, which sets the duplicate bit. After finishing this process, the shared memory Plasma objects are available to further variant calling processes for in-memory and parallel execution.

V. EVALUATION

This section evaluates the scalability, throughput and speedup we have achieved for the pre-processing of sequence data in the sorting and mark duplicates stages against existing frameworks.

A. Experimental Setup

All the experiments and comparisons are performed on a dual socket Intel Xeon server with E5-2680 v4 CPUs running at 2.40 GHz. Each processor has 14 physical cores with support for 28 hyper-threaded jobs. Both processors are connected through Intel QPI (QuickPath Interconnect) and share memory through a NUMA (non-uniform memory access) architecture. A total of 192 GB of DDR4 DRAM with a maximum bandwidth of 76.8 GB/s is available for the whole system, along with 1 TB of local storage. The CentOS 7.3 Minimal Server operating system is installed. The tools, frameworks and libraries used in the whole setup are listed in Table IV with their versions for future reference. We used the Perf [21] tools to collect cache and system utilization statistics.

TABLE IV
TOOLS AND LIBRARIES USED IN THE EXPERIMENTAL SETUP

Tools/APIs          | Version | Author
BWA-MEM             | 0.7.17  | Heng Li
Picard              | 2.18.14 | Broad Institute
Sambamba            | v0.6.8  | Dlang for Bioinformatics
elPrep              | v4.1.5  | ExaScience
Arrow C/C++/Java    | 0.11.0  | Apache
PyArrow             | 0.11.0  | Apache
Plasma Object Store | 0.11.0  | Apache

Data Set

We use the Illumina HiSeq generated NA12878 dataset of human whole exome sequencing (WES) with 30x sequencing coverage, paired-end reads and a read length of 100 bps. Similarly, for whole genome sequencing (WGS), we use the Illumina HiSeq generated NA12878 sample SRR622461 with a sequencing coverage of 2x, paired-end reads and a read length of 100 bps. Human Genome Reference Build 37 (GRCh37/hg19) is used as the reference genome. All workflows in our experiments use these data sets for both WES and WGS.

Code and Scripts Availability

The code for the in-memory ArrowSAM representation, all related Apache Arrow libraries for the C, Java and Python languages and the Plasma shared memory process are installed in a Singularity container, which is freely available at https://github.com/abs-tudelft/arrow-gen. The scripts for running all workflows are also available in the same root directory.

B. Performance Evaluation

In this section, we compare our approach with state-of-the-art tools and approaches used for pre-processing of high throughput genomics sequencing data. All speedups are compared for the best case scenario, i.e. when ramDisk is used for Picard and Sambamba and maximum in-memory settings are enabled for elPrep.

Fig. 5. Execution time of Picard, Sambamba, elPrep and Arrow based Sorting and MarkDuplicate for Whole Exome data set.

Fig. 4. Execution time of Picard, Sambamba, elPrep and Arrow based Sorting and MarkDuplicate for the Whole Genome data set.

1) Picard: A number of Picard tools are considered as benchmarks in genome analysis pipelines, such as Picard MarkDuplicate. This tool was adapted to use our Arrow in-memory representation. The MarkDuplicate process is compute intensive, but a significant amount of time, approximately 30%, is spent in I/O operations. Picard uses htsjdk as a base library to read/write SAM/BAM/CRAM files. We modified this tool from two perspectives:

• Instead of reading from and writing to files, it now reads/writes from in-memory RecordBatches, with just those fields/columns which are necessary for MarkDuplicate operations.
• Picard is single threaded; we changed it to multi-threaded so that each thread can operate on a separate chromosome data set.

The first modification provides the benefit of using only the required fields to perform the MarkDuplicate operation instead of parsing all the reads in the SAM file. Our implementation gives 8x and 21x speed-ups over Picard SamSort for the genome and exome data sets, respectively. Similarly, we achieve 21x and 18x speed-ups for the genome and exome data sets, respectively, for the Picard MarkDuplicate stage, as shown in Fig. 5.

2) Sambamba: Sambamba is a multi-threaded tool to manipulate SAM/BAM/CRAM files for pre-processing steps in variant calling pipelines. This tool gives linear performance up to 8 threads, but adding more threads provides diminishing returns in performance on a multi-core system. The main reason behind this is the file system itself. The I/O communication gets saturated by initiating more threads, and CPU performance also degrades because of cache contention [8]. Our implementation gives 2x and 2x speed-ups over Sambamba Sort for the genome and exome data sets, respectively. Similarly, we achieve 1.8x and 3x speed-ups for the genome and exome data sets, respectively, for the Sambamba MarkDup stage, as shown in Fig. 5.

3) elPrep: elPrep is also a set of multi-threaded tools for pre-processing SAM/BAM files for variant calling in memory. Despite exploiting parallelism and in-memory processing, the results reported in its paper show a maximum of 13x speedup over the GATK best practices pipeline for whole-exome data and 7.4x for whole-genome data while using maximum memory and storage footprints. We have also tested and compared this software with our implementation for the pre-processing applications, and the results show that our implementation gives more than 5x speedup over elPrep. elPrep performs sorting and mark duplicates in a single command; the total run time for both stages is divided equally in the run-time graphs of Fig. 5.

Samblaster is also used for pre-processing SAM/BAM files and is faster than Sambamba, but it has not been considered for performance comparison here because it produces different output for the MarkDuplicate stage than the Picard algorithm.

C. Discussion

1) Parallelization and Scalability: In both the sorting and duplicates removal stages, the parallelization offered by shared memory Plasma objects resulted in a large speedup in overall run time. All 25 chromosomes are sorted and duplicates are removed in parallel. The speedup can be improved further by dividing the chromosomes further, without using any big data frameworks. Figure 7 and Figure 8 show the CPU utilization in the original Picard and in our implementation, respectively, for WES. In Picard sorting, the CPU utilization is poor and the tool is mostly waiting for I/O; the average CPU utilization is only 5%. Picard MarkDuplicate also has very low CPU utilization, although better than sorting. On the other hand, the CPU utilization of our implementation, which uses pandas dataframes, is much better than Picard sorting.
The CPU utilization of MarkDuplicate in our implementation remains close to 95% during the whole execution stage. The improved CPU utilization is due to Arrow in-memory storage and the parallel execution of processes, each working on a different chromosome.

2) Memory Accesses: Picard tools read a whole line from the SAM file and then parse/extract all the fields from it.

Fig. 6. A comparison of L1, L2 and LLC cache statistics for the Picard Sorting and MarkDuplicate applications and their respective ArrowSAM implementations for the Whole Exome data set.

Fig. 7. CPU resources utilization summary for running Picard Sorting and MarkDuplicate for the Whole Exome data set.

Fig. 8. CPU resources utilization summary for running Arrow based Picard Sorting and MarkDuplicate using ArrowSAM for the Whole Exome data set.

Using the Arrow in-memory columnar format, we only access those SAM fields which are required by the specific tool.

3) Cache Locality: Due to the Arrow in-memory columnar data format, our implementation is able to better exploit cache locality. As shown in Figure 6, accesses at all cache levels decrease due to the smaller number of in-memory fields that need to be accessed for the sorting and mark duplicates processes in WES. The cache miss rate also decreases at all cache levels, particularly in the level-1 cache.

4) Memory Usage: The ArrowSAM data representation is solely memory based, so all the data is placed in the shared memory pool of Plasma objects. After sorting, the previous Plasma objects can be removed to free space for the new sorted data set, which is used by the next applications; other than this, no additional memory is required for intermediate operations. The memory used by ArrowSAM and elPrep for the WES and WGS data sets in the pre-processing applications is shown in Table V.

TABLE V
MEMORY FOOTPRINT FOR IN-MEMORY PROCESSING TOOLS

Tool                          | Exome | Genome
elPrep                        | 32GB  | 68GB
Arrow based Pandas and Picard | 20GB  | 48GB

VI. RELATED WORK

Many in-memory implementations of the genomic variant discovery pipeline have been presented previously; almost all of these implementations are cluster scaled and do not exploit single node performance. A lot of them use the MapReduce programming model to distribute the jobs on a cluster, and some of them take benefit of the Spark framework [22] for in-memory data management. Our focus is to exploit the performance of single node systems.

elPrep [23] is a set of tools for pre-processing SAM/BAM files for variant calling. It is a multi-threaded execution engine which processes the data in memory instead of reading and writing to I/O for each operation. Despite exploiting parallelism and in-memory processing, the results show a maximum of 2-3x speedup achieved in all the benchmarks, as the authors compared different whole genome and exome data sets with single threaded Picard and SAMTools. In elPrep 4 [24], the authors reported a 13x speedup over the GATK best practices pipeline for whole-exome data and a 7.4x speedup for whole-genome data using maximum memory and storage footprints. They also compare the results on a cluster to show that this platform is scalable and cost efficient for high performance computing infrastructure. In memory-driven computing [25], a huge pool of different types of memories is created and connected to the processing resources through the Gen-Z communication protocol. The memory is shared across the running processes to avoid intermediate I/O operations. This system also allows byte-addressability and load/store instructions to access memory. The authors reported a 5.9x speedup over the baseline implementation for some assembly algorithms; the source code is not available.

VII. CONCLUSION AND FUTURE WORK

This paper proposes a new in-memory SAM data representation that makes use of the in-memory capabilities of the Apache Arrow format. This format brings four distinct benefits to genomics data access:

1) Data access can benefit from the standardized, cross-language in-memory data representation of Arrow, which provides insight into the organization of the data sets through its schema.
2) It also keeps data in a contiguous and columnar format to exploit cache spatial locality and SIMD vectorization capabilities.
3) For variable length data, Arrow uses offsets instead of lengths for data access, which in turn enables parallel access to such data anywhere in the memory pool.
4) The Plasma Object Store provides a lightweight (de)serialization interface for in-memory stored objects.

We have also shown the potential of using the in-memory Apache Arrow columnar format for genomic data storage and processing. In two use case examples, we first implemented the Sorting stage and secondly modified the MarkDuplicate stage to process data in memory through shared memory Plasma objects in parallel and efficiently. Further, employing chromosome level load balancing techniques can achieve more speedup and can increase overall system throughput. We also plan to extend our implementation to the complete variant calling pipeline, and to use Apache Arrow in big data frameworks like Spark for cluster scale scalability of genomics applications. Though more ArrowSAM fields will be accessed in the variant calling processes, and caches may become dirty more often, particularly for long reads, there still exists an opportunity for parallel execution and cross-language interoperability.

ACKNOWLEDGMENT

Thanks to Peter van't Hof from Leiden University Medical Center (LUMC), Netherlands for fruitful discussions which motivated us to adopt the Apache Arrow framework for genomics in-memory data representation and processing.

REFERENCES

[1] I. H. G. S. Consortium, "Finishing the euchromatic sequence of the human genome," Nature, vol. 431, no. 7011, pp. 931–945, 2004. [Online]. Available: https://doi.org/10.1038/nature03001
[2] D. Gurdasani, M. S. Sandhu, T. Porter, M. O. Pollard, and A. J. Mentzer, "Long reads: their purpose and place," Human Molecular Genetics, vol. 27, no. R2, pp. R234–R241, 2018. [Online]. Available: https://doi.org/10.1093/hmg/ddy177
[3] J. Eid, A. Fehr et al., "Real-time DNA sequencing from single polymerase molecules," Science, vol. 323, no. 5910, pp. 133–138, 2009. [Online]. Available: https://science.sciencemag.org/content/323/5910/133
[4] H.-S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Rajendran, M. Asheghi, and K. E. Goodson, "Phase change memory," Proceedings of the IEEE, vol. 98, no. 12, pp. 2201–2227, Dec 2010.
[5] Y. Diao, A. Roy, and T. Bloom, "Building highly-optimized, low-latency pipelines for genomic data analysis."
[6] H. Li, "The sequence alignment/map format and SAMtools," Bioinformatics, vol. 25, pp. 2078–2079, 2009.
[7] "Picard toolkit," http://broadinstitute.github.io/picard/, 2019.
[8] A. Tarasov, A. J. Vilella, E. Cuppen, I. J. Nijman, and P. Prins, "Sambamba: fast processing of NGS alignment formats," Bioinformatics, vol. 31, no. 12, pp. 2032–2034, Jun 2015. [Online]. Available: https://www.ncbi.nlm.nih.gov/pubmed/25697820
[9] G. G. Faust and I. M. Hall, "SAMBLASTER: fast duplicate marking and structural variant read extraction," Bioinformatics, vol. 30, no. 17, pp. 2503–2505, Sep 2014. [Online]. Available: https://www.ncbi.nlm.nih.gov/pubmed/24812344
[10] J. Goecks, A. Nekrutenko, J. Taylor, and T. G. Team, "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences," Genome Biology, vol. 11, no. 8, p. R86, 2010. [Online]. Available: https://doi.org/10.1186/gb-2010-11-8-r86
[11] D. Blankenberg, G. V. Kuster, N. Coraor, G. Ananda, R. Lazarus, M. Mangan, A. Nekrutenko, and J. Taylor, "Galaxy: A web-based genome analysis tool for experimentalists," Current Protocols in Molecular Biology, vol. 89, no. 1, pp. 19.10.1–19.10.21, 2010.
[12] S. Bridges. (2019) Seven Bridges: Platform for genomic and healthcare data analytics. [Online]. Available: https://www.sevenbridges.com/
[13] GenomeNext. (2019) GenomeNext: Intelligent translational medicine. [Online]. Available: https://www.genomenext.com/
[14] C. Chiang, R. M. Layer, G. G. Faust, M. R. Lindberg, D. B. Rose, E. P. Garrison, G. T. Marth, A. R. Quinlan, and I. M. Hall, "SpeedSeq: ultra-fast personal genome analysis and interpretation," Nature Methods, vol. 12, pp. 966 EP –, Aug 2015. [Online]. Available: https://doi.org/10.1038/nmeth.3505
[15] R. Valls Guimera, "Bcbio-nextgen: Automated, distributed, next-gen sequencing pipeline," EMBnet.journal, vol. 17, p. 30, 2012.
[16] J. L. Causey, C. Ashby, K. Walker, Z. P. Wang, M. Yang, Y. Guan, J. H. Moore, and X. Huang, "DNAp: A pipeline for DNA-seq data analysis," Scientific Reports, vol. 8, no. 1, p. 6793, 2018. [Online]. Available: https://doi.org/10.1038/s41598-018-25022-6
[17] B. Institute. (2019) Introduction to the GATK best practices. [Online]. Available: https://software.broadinstitute.org/gatk/best-practices/

[18] Apache. (2019) Apache Arrow: A cross-language development platform for in-memory data. [Online]. Available: https://arrow.apache.org/
[19] J. Peltenburg, J. van Straten, M. Brobbel, H. P. Hofstee, and Z. Al-Ars, "Supporting columnar in-memory formats on FPGA: The hardware design of Fletcher for Apache Arrow," in Applied Reconfigurable Computing, C. Hochberger, B. Nelson, A. Koch, R. Woods, and P. Diniz, Eds. Cham: Springer International Publishing, 2019, pp. 32–47.
[20] Apache. (2019) Plasma in-memory object store. [Online]. Available: https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
[21] L. Torvalds. (2019) perf: Linux profiling with performance counters.
[22] Apache. (2019) Apache Spark: Lightning-fast unified analytics engine. [Online]. Available: https://spark.apache.org/
[23] C. Herzeel, P. Costanza, D. Decap, J. Fostier, and J. Reumers, "elPrep: High-performance preparation of sequence alignment/map files for variant calling," PLOS ONE, vol. 10, no. 7, p. e0132868, Jul. 2015. [Online]. Available: https://doi.org/10.1371/journal.pone.0132868
[24] C. Herzeel, P. Costanza, D. Decap, J. Fostier, and W. Verachtert, "elPrep 4: A multithreaded framework for sequence analysis," PLOS ONE, vol. 14, no. 2, p. e0209523, Feb. 2019. [Online]. Available: https://doi.org/10.1371/journal.pone.0209523
[25] M. Becker, M. Chabbi, S. Warnat-Herresthal, K. Klee, J. Schulte-Schrepping, P. Biernat, P. Guenther, K. Bassler, R. Craig, H. Schultze, S. Singhal, T. Ulas, and J. L. Schultze, "Memory-driven computing accelerates genomic data processing," Jan. 2019. [Online]. Available: https://doi.org/10.1101/519579