ADAM: Genomics Formats and Processing Patterns for Cloud Scale Computing

Matt Massie Frank Nothaft Christopher Hartl Christos Kozanitis André Schumacher Anthony D. Joseph David A. Patterson

Electrical Engineering and Computer Sciences University of California at Berkeley

Technical Report No. UCB/EECS-2013-207 http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-207.html

December 15, 2013 Copyright © 2013, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission. ADAM: Genomics Formats and Processing Patterns for Cloud Scale Computing

Matt Massie1, Frank Austin Nothaft1, Christopher Hartl1,2, Christos Kozanitis1, Andr´eSchumacher3, Anthony D. Joseph1, and David Patterson1

1Department of Electrical Engineering and Computer Science, University of California, Berkeley 2The of MIT and Harvard 3International Computer Science Institute (ICSI), University of California, Berkeley

Executive Summary mented in Apache Avro—a cross-platform/language Current genomics data formats and processing serialization format—they eliminate the need for the pipelines are not designed to scale well to large development of language-specific libraries for format datasets. The current Sequence/Binary Alignmen- decoding/encoding, which eliminates the possibility t/Map (SAM/BAM) formats were intended for sin- of library incompatibilities. gle node processing [18]. There have been attempts to A key feature of ADAM is that any application that adapt BAM to distributed computing environments, implements the ADAM schema is compatible with but they see limited scalability past eight nodes [22]. ADAM. This is important, as it prevents applications Additionally, due to the lack of an explicit data from being locked into a specific tool or pattern. The schema, there are well known incompatibilities be- ADAM stack is inspired by the “narrow waist” of tween libraries that implement SAM/BAM/Variant the Internet Protocol (IP) suite (see Figure 2). We Call Format (VCF) data access. consider the explicit use of a schema in this format To address these problems, we introduce ADAM, to be the greatest contribution of the ADAM stack. a set of formats, APIs, and processing stage imple- In addition to the advantages outlined above, mentations for genomic data. ADAM is fully open ADAM eliminates the file headers in modern ge- source under the Apache 2 license, and is imple- nomics formats. All header information is available mented on top of Avro and Parquet [5, 26] for data inside of each individual record. The variant and storage. Our reference pipeline is implemented on top genotype formats also demonstrate two significant of Spark, a high performance in-memory map-reduce improvements. First, these formats are co-designed system [32]. This combination provides the following so that variant data can be seamlessly calculated from advantages: 1) Avro provides explicit data schema a given collection of sample genotypes. Secondly, access in C/C++/C#, Java/Scala, Python, php, and these formats are designed to flexibly accommodate Ruby; 2) Parquet allows access by database systems annotations without cluttering the core variant/geno- like Impala and Shark; and 3) Spark improves perfor- type schema. In addition to the benefits described mance through in-memory caching and reducing disk above, ADAM files are up to 25% smaller on disk I/O. than compressed BAM files without losing any infor- In addition to improving the format’s cross- mation. platform portability, these changes lead to signifi- The ADAM processing pipeline uses Spark as a cant performance improvements. On a single node, compute engine and Parquet for data access. Spark we are able to speedup sort and duplicate marking is an in-memory MapReduce framework which mini- by 2×. More importantly, on a 250 Gigabyte (GB) mizes I/O accesses. We chose Parquet for data stor- high (60×) coverage human genome, this system age as it is an open-source columnar store that is achieves a 50× speedup on a 100 node computing designed for distribution across multiple computers cluster (see Table 1), fulfilling the promise of scala- with high compression. Additionally, Parquet sup- bility of ADAM. 
ports efficient methods (predicates and projections) The ADAM format provides explicit schemas for for accessing only a specific segment or fields of a read and reference oriented (pileup) sequence data, file, which can provide significant (2-10×) additional variants, and genotypes. As the schemas are imple- speedup for genomics data access patterns.

1 1 Introduction cloud computing performance, which allows us to par- allelize read translation steps that are required be- Although the cost of data processing has not histor- tween alignment and variant calling. In this paper, ically been an issue for genomic studies, the falling we provide an introduction to the data formats (§2) cost of genetic will soon turn computa- and pipelines used to process genomes (§3), introduce tional tasks into a dominant cost [21]. The process of the ADAM formats and data access application pro- transforming reads from alignment to variant-calling gramming interfaces (APIs) (§5) and the programing ready reads involves several processing stages includ- model (§6). Finally, we review the performance and ing duplicate marking, base score quality recalibra- compression gains we achieve (§7), and outline fu- tion, and local realignment. Currently, these stages ture enhancements to ADAM on which we are work- have involved reading a Sequence/Binary Alignment ing (§8). Map (SAM/BAM) file, performing transformations on the data, and writing this data back out to disk as a new SAM/BAM file [18]. 2 Current Genomics Storage Because of the amount of time these transforma- Standards tions spend shuffling data to and from disk, these transformations have become the bottleneck in mod- The current de facto standard for the storage and ern variant calling pipelines. It currently takes three processing of genomics data in read format is BAM. days to perform these transformations on a high cov- BAM is a binary file format that implements the SAM 1 erage BAM file . standard for read storage [18]. BAM files can be Byrewriting these applications to make use of mod- operated on in several languages, using APIs either ern in-memory MapReduce frameworks like Apache from SAMtools[18] (in C++), Picard[3] (in Java), Spark [32], these transformations can now be com- and Hadoop-BAM[22] (in Hadoop MapReduce via pleted in under two hours. Sorting and duplicate Java). mapping alone can be accelerated from 1.5 days on a BAM provides more efficient access and compres- single node to take less than one hour on a commod- sion than the SAM file format, as its binary encoding ity cluster. reduces several fields into a single byte, and elimi- Table 1 previews the performance of ADAM for nates text processing on load. However, BAM has sort and duplicate marking. been criticized because it is difficult to process. For example, the three main APIs that implement the format each note that they do not fully implement Table 1: Sort and Duplicate Marking for NA12878 the format due to its complexity. Additionally, the file format is difficult to use in multiprocessing envi- Software EC2 profile Wall Clock Time ronments due to its use of a centralized header; the Picard 1.103 1 hs1.8xlarge 17h 44m Hadoop-BAM implementation notes that it’s scal- ADAM 0.5.0 1 hs1.8xlarge 8h 56m ability is limited to distributed processing environ- ADAM 0.5.0 32 cr1.8xlarge 33m ments of less than eight machines. ADAM 0.5.0 100 m2.4xlarge 21m In response to the growing size of sequencing Software EC2 profile Wall Clock Time files2, a variety of compression methods have come Picard 1.103 1 hs1.8xlarge 20h 22m to light [16, 11, 27, 23, 6, 9, 30, 14, 13]. Slim- ADAM 0.5.0 100 m2.4xlarge 29m Gene [16], cSRA [28], and CRAM [11] use reference based compression techniques to losslessly represent reads. However, they advocate in favor of lossy qual- The format we present also provides many pro- ity value representations. 
The former two use lower gramming efficiencies, as it is easy to change evidence quantization levels to represent quality values and access techniques, and it is directly portable across CRAM uses user defined budgets to store only frac- many programming languages. tions of a quality string. In the same spirit, Illumina In this paper, we introduce ADAM, which is a pro- recently presented a systematic approach of qual- gramming framework, a set of APIs, and a set of data ity score removal in [13] which safely ignores quality formats for cloud scale genomic processing. These scores from predictable nucleotides; these are bases frameworks and formats scale efficiently to modern that always appear after certain words. It is also

1A 250GB BAM file at 30× coverage. See §7.2 for a longer 2High coverage BAM files can be approximately 250 GB for discussion a human genome.

2 worth mentioning that the standard configurations of pipeline. However, the intermediate read process- cSRA and CRAM discard the optional fields of the ing stages are responsible for the majority of exe- BAM format and also simplify the QNAME field. cution time. Table 2 shows a breakdown of stage execution time for the version 2.7 of the Genome Analysis Toolkit (GATK), a popular variant call- 3 Genomic Processing ing pipeline [19]. The numbers in the table came Pipelines from running on the NA12878 high-coverage human genome. After sequencing and alignment, there are a few com- mon steps in modern genomic processing pipelines for Table 2: Processing Stage Times for GATK Pipeline producing variant call ready reads. Figure 1 illus- trates the typical pipeline for variant calling. These Stage GATK 2.7/NA12878 steps minimize the amount of erroneous data in the Mark Duplicates 13 hours input read set by eliminating duplicate data, calibrat- BQSR 9 hours ing the quality scores assigned to bases (base quality Realignment 32 hours score recalibration (BQSR), and verifying the align- Call Variants 8 hours ment of short inserts/deletions (indels). Total 62 hours

To provide the readers with a background about Raw Reads From Sequencer how the stages of this pipeline work, we discuss the four algorithms that implement the intermediate read processing stages. For a detailed discussion of these algorithms, we refer readers to DePristo et al [10]. Align To Reference 1. Sorting: This phase performs a pass over the reads and sorts them by the reference position at the start of their alignment. Sort By Reference Order 2. Duplicate Removal: An insufficient number of sample molecules immediately prior to poly- merase chain reaction (PCR) can cause the gen- Eliminate Duplicate Reads eration of duplicate DNA sequencing reads. De- tection of duplicate reads requires matching all reads by their 50 position and orientation after Base Quality Score read alignment. Reads with identical position Recalibration and orientation are assumed to be duplicates. When a group of duplicate reads is found, each read is scored, and all but the top-scoring read Processing Stages Of Interest Of Stages Processing Local Indel Realignment are marked as duplicates. Approximately two- (Optional) thirds of duplicates marked in this way are true duplicates caused by PCR-induced duplication while the remaining third are caused by the ran- Call Variants dom distribution of read ends[8]. There is cur- rently no way to separate “true” duplicates from randomly occurring duplicates. 3. Base Quality Score Recalibration: During Figure 1: Variant Calling Pipeline the sequencing process, systemic errors occur that lead to the incorrect assignment of base Traditionally, bioinformaticians have focused on quality scores. In this step, a statistical model improving the accuracy of the algorithms used for of the quality scores is built and is then used to alignment and variant calling. There is obvious merit revise the measured quality scores. to this approach—these two steps are the dominant 4. Local Realignment: For performance reasons, contributors to the accuracy of the variant calling all modern aligners use algorithms that provide

3 approximate alignments. This approximation can cause reads with evidence of indels to have Stack for slight misalignments with the reference. In this Genomics Systems stage, we use fully accurate Application methods (Smith-Waterman algorithm [24]) to lo- cally realign reads that contain evidence of short Variant Calling indels. This pipeline step is omitted in some vari- Presentation ant calling pipelines if the variant calling algo- rithm used is not susceptible to these local align- Enriched Read/Pileup ment errors. Evidence Access Spark, Shark, Hadoop For current implementations of these read process- ing steps, performance is limited by disk bandwidth. Schema This bottleneck occurs because the operations read ADAM Schema in a SAM/BAM file, perform a bit of processing, and write the data to disk as a new SAM/BAM file. We Materialized Data address this problem by performing our processing it- Parquet eratively in memory. The four read processing stages can then be chained together, eliminating three long Data Distribution writes to disk and an additional three long reads from HDFS, Tachyon disk. The stages themselves cannot be performed in parallel, but the operations inside each stage are data Physical Storage parallel. Disk

4 ADAM Design Philosophy Figure 2: Stack Models for Networking and Genomics Modern pipelines have been designed without a model for how the system should grow or 4. Data Schema: This layer specifies the repre- for how components of the analytical system should sentation of data when it is accessed, and forms connect. We seek to provide a more principled model the narrow waist of the pipeline. for system composition. Our system architecture was inspired heavily by 5. Evidence Access: This layer implements ef- the Open Systems Interconnection (OSI) model for ficient methods for performing common access networking services [33]. This conceptual model stan- patterns such as random database queries, or se- dardized the internal functionality of a networked quentially reading records from a flat file. computer system, and its “narrow waisted” design 6. Presentation: The presentation layer pro- was critical to the development of the services that vides the application developer with efficient and would become the modern internet. Figure 2 presents straightforward methods for querying the char- a similar decomposition of services for genomics data. acteristics of individual portions of genetic data. The seven layers of our stack model are decom- posed as follows, traditionally numbered from bottom 7. Application: Applications like variant calling to top: and alignment are implemented in this layer.

1. Physical Storage: This layer coordinates data A well defined software stack has several significant writes to physical media, usually magnetic disk. advantages. By limiting application interactions with layers lower than the API, application developers are 2. Data Distribution: This layer manages access, given a clear and consistent view of the data they are replication, and distribution of the genomics files processing. By divorcing the API from the data ac- that have been written to disk. cess layer, we unlock significant flexibility. With care- ful design in the data format and data access layers, 3. Materialized Data: This layer encodes the we can seamlessly support conventional flat file ac- patterns for how data is encoded and stored. cess patterns, while also allowing easy access to data This provides read/write efficiency and compres- with database methods (see §8.1). By treating the sion.

4 compute substrate and storage as separate layers, we schema is essential, as it makes it inexpensive to sup- also drastically increase the portability of the APIs port novel access patterns, which can enable new al- that we implement. gorithms and applications. This approach is a significant improvement over In the next two sections, we discuss the implemen- current approaches which intertwine all layers to- tation of ADAM, and how it was designed to satisfy gether — if all layers are intertwined, new APIs must these goals. We then analyze the performance of our be developed to support new compute substrates [22], system and the impact of our design philosophy. and new access patterns must be carefully retrofitted to the data format that is in use [15]. We can distill our philosophy into three observa- 5 Data Format and API tions that drive our design decisions: ADAM contains formats for storing read and ref- 1. Scalability is a primary concern for mod- erence oriented sequence information, and vari- ern genomics systems: Processing pipelines ant/genotype data. The read oriented sequence operate on data that can range in size from format is forwards and backwards compatible with tens of gigabytes to petabytes. We cannot re- BAM/SAM, and the variant/genotype data is for- move processing stages, as this has an unaccept- wards and backwards compatible with VCF. We pro- able impact on accuracy. To increase pipeline vide two APIs: throughput and thereby reduce latency, our sys- 1. A data format/access API implemented on top tems must be able to parallelize efficiently to tens of Apache Avro and Parquet to hundreds of compute nodes. 2. A data transformation API implemented on top 2. Bioinformaticans should be concerned of Apache Spark with data, not formats: The SAM/BAM for- mats do not provide a schema for the data they In this section, we provide brief introductions to store. This omission limits visibility into the these two APIs. We then introduce the data repre- structure of data being processed. Addition- sentations, discuss the content of each representation, ally, it significantly increases implementation dif- and introduce the transforms that we provide for each ficulty: by making the format a primary concern, representation. significant work has been required to introduce programming interfaces that allow these formats The data representation for the ADAM format is to be read from different programming languages described using the open source Apache Avro data and environments [18, 3, 22]. serialization system [5]. The Avro system also pro- vides a human readable schema description language 3. Openness and flexibility are critical to that can auto-generate the schema implementation adoption: As the cost of sequencing continues in several common languages including C/C++/C#, to drop [21], the processing of genomics data will Java/Scala, php, Python, and Ruby. This flexibility become more widespread. By maintaining an provides a significant cross-compatibility advantage open source standard that can be flexibly modi- over the BAM/SAM format, where the data format fied, we allow all the members of the community has only been implemented for C/C++ through Sam- to contribute to improving our technologies. De- tools and for Java through Picard [18, 3]. 
To com- veloping a flexible stack is also important; ge- plicate matters, in BAM/SAM there are well known netic data will be processed in a host of het- incompatibilities between these APIs implemented erogenous computing environments, and with a in different languages. This problem persists across variety of access patterns. By making our format APIs for accessing variant data (discussed in §5.3). easy to adopt to different techniques, we maxi- We layer this data format inside of the Parquet mize its use. column store [26]. Parquet is an open source colum- nar storage format that was designed by Cloudera We believe that the greatest contribution of our and Twitter. It can be used as a single flat file, a dis- stack design is the explicit schema we have intro- tributed file, or as a database inside of Hive, Shark, or duced. This innovation alone drives the flexibility of Impala. Columnar stores like Parquet provide several the stack: on top of a single exposed schema, it is significant advantages when storing genomic data: trivial to change the evidence access layer to repre- sent the relevant data in array or tabular form. A • Column stores achieve high compres- sion [4]. Compression reduces storage space on

5 disk, and also improves serialization performance BAM File Format ADAM File Format inside of MapReduce frameworks. We are able to achieve up to a 0.75× lossless compression ratio Header: n = 500 chr: c20 * 5, … ; seq: Reference 1 TCGA, GAAT, CCGAT, when compared to BAM, and have especially im- Sample 1 TTGCAC, CCGT, … ; pressive results on quality scores. Compression 1: c20, TCGA, 4M; 2: c20, cigar: 4M, 4M1D, 5M, GAAT, 4M1D; 3: c20, CCGAT, 6M, 3M1D1M, … ; ref: is discussed in more detail in §7.3. 5M; 4: c20, TTGCAC, 6M; 5: ref1 * 500; sample: c20, CCGT, 3M1D1M; … sample1 * 500; • Column stores enable predicate push- down, which minimizes the amount of data read from disk. [17] When using pred- Figure 3: Comparative Visualization of BAM and icate pushdown, we deserialize specific fields in ADAM File Formats a record from disk, and apply them to a predi- cate function. We then only read the full record if it passes the predicate. This is useful when al [31] for a discussion of RDDs) containing ge- implementing filters that check read quality. nomic data, and several enriched types. These trans- form steps provide several frequently used process- • Varied data projections can be achieved ing stages, like sorting genomic data, deduplicating with column stores. This option means that reads, or transforming reads into pileups. By im- we can choose to only read several fields from a plementing these transformations on top of RDDs, record, which improves performance for applica- we allow for efficient, distributed in-memory process- tions that do not read all the fields of a record. ing and achieve significant speedups. In this section, Additionally, we do not pay for null fields in we describe the transformations that we provide per records that we store. This feature allows us to each datatype. In section §7.2, we discuss the perfor- easily implement lossy compression on top of the mance characteristics of sorting and duplicate mark- ADAM format, and also allows us to eliminate ing. Detailed descriptions of several of the algorithms the FASTQ standard. that underly these transformations are provided in appendix B. By supporting varied data projections with low cost for field nullification, we can implement lossy 5.1 Read Oriented Storage compression on top of the ADAM format. We dis- cuss this further in §7.3. Table 3 defines our default read oriented data for- Figure 3 shows how ADAM in Parquet compares to mat. This format provides all of the fields supported BAM. We remove the file header from BAM, and in- by the SAM format. As mentioned above, to make stead distribute these values across all of the records it easier to split the file between multiple machines stored. This dissemination eliminates global infor- for distributed processing, we have eliminated the file mation and makes the file much simpler to distribute header. Instead, the data encoded in the file header across machines. This distribution is effectively free can be reproduced from every single record. Because in a columnar store, as the store just notes that the our columnar store supports dictionary encoding and information is replicated across many reads. Addi- these fields are replicated across all records, replicat- tionally, Parquet writes data to disk in regularly sized ing this data across all records has a negligible cost row groups (from [26], see Parquet format readme); in terms of file size on disk. this data size allows parallelism across rows by en- This format is used to store all data on disk. 
In abling the columnar store to be split into independent addition to this format, we provide an enhanced API groups of reads. for accessing read data, and several transformations. The internal data types presented in this section This enhanced API converts several string fields into make up the narrow waist of our proposed genomics native objects (e.g. CIGAR, mismatching positions, stack in Figure 2, as discussed in the prior section. base quality scores), and provides several additional In §8.1 and §9.1, we review how this abstraction al- convenience methods. We supply several transforma- lows us to support different data access frameworks tions for RDD data: and storage techniques to tailor the stack for the needs of a specific application. • Sort: Sorts reads by reference position For application developers who are using Apache • Mark Duplicates: Marks duplicate reads Spark, we provide an additional API. This API con- sists of transform steps for RDDs (see Zaharia et • Base Quality Score Recalibration: Normal- izes the distribution of base quality scores

6 duplicates, BQSR, and indel realignment are dis- Table 3: Read Oriented Format cussed in appendix B. Group Field Type General Reference Name String 5.2 Reference Oriented Storage Reference ID Int Start Long In addition to storing sequences as reads, we provide Mapping Quality Int a storage format for reference oriented (pileup) data. Read Name String Table 4 documents this format. This pileup format is Sequence String also used to implement a data storage format similar Mate Reference String to the GATK’s ReducedReads format [10]. Mate Alignment Start Long Mate Reference ID Int Cigar String Table 4: Reference Oriented Format Base Quality String Group Field Type Flags Read Paired Boolean General Reference Name String Proper Pair Boolean Reference ID Int Read Mapped Boolean Position Long Mate Mapped Boolean Range Offset Int Read Negative Strand Boolean Range Length Int Mate Negative Strand Boolean Reference Base Base First Of Pair Boolean Read Base Base Second Of Pair Boolean Base Quality Int Primary Alignment Boolean Mapping Quality Int Failed Vendor Checks Boolean Number Soft Clipped Int Duplicate Read Boolean Number Reverse Strand Int Attributes Mismatching Positions String Count At Position Int Other Attributes String Read Group Sequencing Center String Read Group Sequencing Center String Description String Description String Run Date Long Run Date Long Flow Order String Flow Order String Key Sequence String Key Sequence String Library String Library String Median Insert Int Median Insert Int Platform String Platform String Platform Unit String Platform Unit String Sample String Sample String

We treat inserts as an inserted sequence range at the locus position. For bases that are an alignment • Realign Indels: Locally realigns indels that 3 have been misaligned during global alignment match against the reference , we store the read base and set the range offset and length to 0. For deletions, • Read To Reference: Transforms data to be we perform the same operation for each reference po- reference oriented (creates pileups) sition in the deletion, but set the read base to null. For inserts, we set the range length to the length of • Flag Stat: Computes statistics about the the insert. We step through the insert from the start boolean flags across records of the read to the end of the read, and increment the • Create Sequence Dictionary: Creates a dic- range offset at each position. tionary summarizing data across multiple - 3Following the conventions for CIGAR strings, an align- ples ment match does not necessarily correspond to a sequence match. An alignment match simply means that the base is not part of an indel. The base may not match the reference The performance of sort and mark duplicates are base at the loci. discussed in §7.2. The implementations of sort, mark

7 As noted earlier, this datatype supports an opera- • repository maintainers, who need to adjudicate tion similar to the GATK’s ReducedRead datatype. whose (complex) fields and parsers are widely- We refer to these datatypes as aggregated pileups. used enough to merit inclusion and official sup- The aggregation transformation is done by grouping port. together all bases that share a common reference po- sition and read base. Within an insert, we group The “narrow waist” of the ADAM Schema divides further by the position in the insert. Once the bases the world of parsing and file formats from the world have been grouped together, we average the base and of data presentation and processing. Relying on the mapping quality scores. We also count the number of allegory of the network stack, just as the transport soft clipped bases that show up at this location, and layer need not worry about how to retrieve data from the amount of bases that are mapped to the reverse across a network boundary, the access layer of the strand. In our RDD API, we provide an additional bioinformatic process should not need to worry about ADAMRod datatype which represents pileup “rods,” parsing or formatting raw data blobs. which are pileup bases grouped by reference position. The VCF format has repeated violations of this principle. While the VCF format is supposed to render a schema for all key value pairs in its 5.3 Variant and Genotype Storage Info and Format fields, a number of bioinfor- The VCF specification more closely resembles an matics tools do not provide this parsing informa- exchange format than a data format. In partic- tion, and so types must be inferred. In addition, some tools (in particular annotation tools such as ular, the per-site and per-sample fields (“Info” VariantEffectPredictor[20]) serialize rich data types and “Format”) define arbitrary key-value pairs, for which parsing information is stored in the header of (such as nested maps) into the Info field, describing the file. This format enables side computation—such their format as simply “String”. as classifying a particular SNP as likely or unlikely By adhering rigidly to specific schema, not only do to damage the structure of a protein—to operate on we encourage separation of concerns, we also limit ex- a VCF and store their results. A typical analysis tensibility abuse, and the practice of looking at online pipeline may comprise of several steps which read in documentation to understand how to parse a field in the VCF produced by the previous step, add addi- a VCF. The schema is the documentation, and the tional annotations to the Info and Format fields, parsing is not the tool writer’s concern! and write a new VCF. These issues extend beyond tools that produce mal- Maintaining full compatibility with VCF presents formed VCFs or uninterpretable schema; the format a structural challenge. Each (properly constructed) is ill-supported by APIs that intend to support them. VCF file contains the data schema for its Info and For instance, PyVCF always splits on commas, even Format fields within the file header. Potentially, if the number of values is defined as 1. Thus, it every VCF file utilizes a different schema. mis-parses VariantPredictorOutput which is a sin- Maintaining support for these catch-all fields by gle (though doubly-delimited by “,” and “—”) field. 
incorporating the file schema within an ADAM-VCF Similarly to facilitate compression, per-sample fields Record violates the layer separation that is a part of (are not required to propagate all values. the design goals: the Evidence Access layer (layer 5 To address this issue, we eliminate the attributes in Figure 2) would take on the role of reading format from the Variant and Genotype schemas. Instead, we information from a record, and using that to parse provide an internal relational model that allows addi- the contents of an information blob. Furthermore, tional fields to be provided by specifying an explicit the format information is “hidden” from the Schema schema. At runtime, an ADAMVariantContext class layer (layer 4 in Figure 2): the system cannot know is provided that handles the correct joining of this what information the record should contain until it data. This class eliminates the need for an attributes looks at the format information for the record. This section and the parsing incompatibilities related to approach presents challenges to this section, while maintaining the ability to arbitrar- ily introduce new fields as necessary. This approach • tools, which cannot check whether a field they provides all the flexibility of the current VCF format, expect is part of the record schema (until run- while eliminating the pain points. time); An additional benefit of the new Variant and Genotype schemas is that they have been designed • developers, who will have to write their own to ensure that all Variant call information can be parsers for these unstructured blobs; and reconstructed using the called sample genotypes.

8 Table 5: Genotype Format

Field Type Reference ID String Table 6: Variant Format Reference Name String Reference Length Long Field Type Reference URL String Reference ID String Position Long Reference Name String Sample ID String Reference Length Long Ploidy Int Reference URL String Haplotype Number Int Position Long Allele Variant Type Variant Type Reference Allele String Allele String Is Reference? Boolean Is Reference? Boolean Variant String Reference Allele String Variant Type Variant Type Expected Allele Dosage Double ID String Genotype Quality Int Quality Int Depth Int Filters String Phred Likelihoods String Filters Run? Boolean Phred Posterior Likelihoods String Allele Frequency String Ploidy State Genotype Likelihoods String RMS Base Quality Int Haplotype Quality Int Site RMS MapQ Double RMS Base Quality Int MapQ=0 Count Int RMS Mapping Quality Int Samples With Data Int Reads Mapped Forward Strand Int Total Number Of Samples Int Reads Mapped, MapQ == 0 Int Strand Bias Double SV Type SV Type SV Type SV Type SV Length Long SV Length Long SV Is Precise? Boolean SV Is Precise? Boolean SV End Long SV End Long SV Confidence Interval Start Low Long SV Confidence Interval Start Low Long SV Confidence Interval Start High Long SV Confidence Interval Start High Long SV Confidence Interval End Low Long SV Confidence Interval End Low Long SV Confidence Interval End High Long SV Confidence Interval End High Long Is Phased? Boolean Is Phase Switch? Boolean Phase Set ID String Phase Quality Int

9 This functionality is important for genomics projects are most interested in validating scalability and where many samples are being sequenced and called building intuition for the efficiencies provided by together, but where different views of the sample pop- variable projection and predicate pushdown. ulation are wished to be provided to different sci- entists. In this situation, the total call set would • 7.2. Applications: The main goal of ADAM is be filtered for a subset of samples. From these col- to accelerate the read processing pipelines that lected genotypes, we would then recompute the vari- were introduced in §3. In this section, we re- ant calls and variant call statistics, which would then view the throughput of our processing pipeline be provided as an output. This transformation is pro- for both single nodes and for clusters, despite vided by our RDD-based API. Additionally, we pro- the focus of ADAM being the latter. vide an ADAMVariantContext datatype, which ag- • 7.3. Compression: Here, we review the reduc- gregates variant, genotype, and annotation data at a tions in disk space that we are able to achieve single genomic locus. when compared against BAM. We also review how we can achieve further compression through aggregating pileup data, and how to implement 6 In-Memory Programming both lossy and reference-based compression top Model of ADAM.

As discussed in §3, the main bottleneck in current ge- 7.1 Microbenchmarks nomics processing pipelines is packaging data up on disk in a BAM file after each processing step. While To validate the performance of ADAM, we cre- this process is useful as it maintains data lineage4, it ated several microbenchmarks. These microbench- significantly decreases the throughput of the process- marks are meant to demonstrate the pure read/write ing pipeline. throughput of ADAM, and to show how the system For the processing pipeline we have built on top scales in a cluster. of the ADAM format, we have minimized disk ac- To validate these benchmarks, we implemented the cesses (read one file at the start of the pipeline, and relevant tests in both ADAM and Hadoop-BAM [22], write one file after all transformations have been com- running on Spark on an Amazon Elastic Compute pleted). Instead, we cache the reads that we are 2 (EC2) cluster. We used the m2.4xlarge instance transforming in memory, and chain multiple trans- type for all of our tests. This instance type is an 8 formations together. Our pipeline is implemented on core memory optimized machine with 68.4 Gibibytes top of the Spark MapReduce framework, where reads (GiB) of memory per machine. are stored as a map using the Resilient Distributed We ran these benchmarks against HG00096, a low- Dataset (RDD) primitive. coverage human genome (15 GB BAM) from the 1,000 Genomes Project. The file used for these experiments can be found on the 1000 Genomes 7 Performance ftp site, ftp.1000genomes.ebi.ac.uk in directory /vol1/ftp/data/HG00096/alignment/. This section reviews performance metrics pertinent to the ADAM file format, and applications written on top of ADAM. Our performance analysis is parti- 7.1.1 Read/Write Performance tioned into three sections: The decision to use a columnar store was based on • 7.1. Microbenchmarks: In this section, we the observation that most genomics applications are compare several test cases that have little com- read-heavy. In this section, we aim to quantify the putation, and review ADAM against Hadoop- improvements in read performance that we achieve BAM. This section looks to evaluate the disk ac- by using a column store, and how far these gains can cess performance of ADAM. In this section, we be maximized through the use of projections.

4Since we store the intermediate data from each processing step, we can later redo all pipeline processing after a certain Read Length and Quality: This microbench- stage without needing to rerun the earlier stages. This comes mark scans all the reads in the file, and collects at the obvious cost of space on disk. This tradeoff makes sense statistics about the length of the read, and the map- if the cost of keeping data on disk and reloading it later is lower than the cost of recomputing the data. This does not ping quality score. Figure 4 shows the results of this necessarily hold for modern MapReduce frameworks [31]. benchmark. Ideal speedup curves are plotted along

10 with the speedup measurements. The speedups in columns needed to identify whether the record should this graph are plotted relative to the single machine be fully deserialized. We then process those fields; if Hadoop-BAM run. they pass the predicate, we then deserialize the full record.

Speedup on Length/Quality Statistic Test (Low Coverage) We can prove that this process requires no addi- tional reads. Intuitively, this is true as the initial 128 columns read would need to be read if they were to 64 be used in the filter later. However, this can be for- 32 malized. We provide a proof for this in Section C 16 of the Appendix. In this section, we seek to provide 8 an intuition for how predicate pushdown can improve 4 the performance of actual genomics workloads. We 2 apply predicate pushdown to implement read quality 1 Speedup (Normalized to BAM) BAM predication, mapping score predication, and predica- 1/2 ADAM tion by chromosome. ADAM Min ADAM Avg 1 2 4 8 16 Quality Flags Predicate: This microbenchmark Number of Machines scans the reads in a file, and filters out reads that do not meet a specified quality threshold. We use the following quality threshold: Figure 4: Read Length and Quality Speedup for HG00096 • The read is mapped. This benchmark demonstrates several things: • The read is at its primary alignment position.

• ADAM demonstrates superior scalability over • The read did not fail vendor quality checks. Hadoop-BAM. This performance is likely be- cause ADAM files can be more evenly distributed • The read has not been marked as a duplicate. across machines. Hadoop-BAM must guess the proper partitioning when splitting files [22]. The This threshold is typically applied as one of the first Parquet file format that ADAM is based on is operations in any read processing pipeline. partitioned evenly when it is written[26]. This paritioning is critical for small files such as the Gapped Predicate: In many applications, the ap- one used in this test — in a 16 machine cluster, plication seeks to look at a region of the genome (e.g. there is only 1 GB of data per node, so imbal- a single chromosome or gene). We include this pred- ances on the order of the Hadoop file split size icate as a proxy: the gapped predicate looks for 1 (128 MB) can lead to 40% differences in loading out of every n reads. This selectivity achieves per- between nodes. formance analogous to filtering on a single gene, but provides an upper bound as we pay a penalty due to • Projections can significantly accelerate compu- our predicate hits being randomly distributed. tation, if the calculation does not need all the data. By projecting the minimum dataset nec- To validate the performance of predicate push- essary to calculate the target of this benchmark, down, we ran the predicates described above on the we can accelerate the benchmark by 8×. Even a HG00096 genome on a single AWS m2.4xlarge in- more modest projection leads to a 2× improve- stance. The goal of this was to determine the rough ment in performance. overhead of predicate projection. Figure 5 shows performance for the gapped predi- cate sweeping n. 7.1.2 Predicate Pushdown There are several conclusions to be drawn from this Predicate pushdown is one of the significant advan- experiment: tages afforded to us through the use of Parquet [26]. In traditional pipelines, the full record is deserial- • As is demonstrated by equation (2) in Sec- ized from disk, and the filter is implemented as a tion C of the Appendix, we see the largest Map/Reduce processing stage. Parquet provides us speedup when our predicate read set is signifi- with predicate pushdown: we deserialize only the cantly smaller than our projection read set. For

11 formed statically at file load-in should be performed Gapped Read Filtering Test (Low Coverage) using predicate pushdown. ADAM ADAM Min ADAM Avg 4 7.2 Applications

ADAM relies on Apache Spark for in-memory, map-

2 reduce style processing. At the heart of Spark is a programming construct, called a Resilient Dis- tributed Dataset (RDD), which represents a fault- 1 tolerant collection of elements that are operated on in parallel.

Speedup (Normalized to Reading Without Predicate) Programming with an RDD is very similar to pro- 1 2 4 8 gramming with any standard Scala collection like Seq 1/n or List. This polymorphism allows developers to focus on the algorithm they’re developing without concerning themselves with the details of distributed Figure 5: Gapped Predicate on HG00096 computation. The example code presented in this section highlight the succinctness and expressiveness smaller projections, we see a fairly substantial of Spark Scala code. While ADAM is currently writ- reduction in speedup. ten exclusively in Scala, Apache Spark also has Java and Python support. • As the number of records that passes the predi- Code written in Apache Spark can be run multi- cate drops, the number of reads done to evaluate threaded on a single machine or across a large clus- the predicate begins to become a dominant term. ter without modification. For these applications to This insight has one important implication: for be run in a distributed fashion, the data needs to applications that plan to access a very small seg- be loaded onto the Hadoop Distribute File System ment of the data, accessing the data as a flat file (HDFS). HDFS breaks the file into blocks, distributes with a predicate is likely not the most efficient them across the cluster and maintains information access pattern. Rather, if the predicate is regu- about block (and replica) locations. Breaking large lar, it is likely better to access the data through files like BAM files into blocks allows systems like a indexed database. Spark to create tasks that run across the entire clus- ter processing the records inside of each block. We additionally calculated the performance im- The NA12878 high-coverage BAM file used for provement that was achieved by using predicate push- these experiments can be found on the 1000 Genomes down instead of a Spark filter. Table 7 lists these ftp site, ftp.1000genomes.ebi.ac.uk in directory results. /vol1/ftp/data/NA12878/high coverage alignment/. The next two subsections compare BAM perfor- Table 7: Predicate Pushdown Speedup vs. Filtering mance of these applications on a single node to ADAM on a single node and then to ADAM on a Predicate Min Avg Max Filtered cluster. Locus 1.19 1.17 0.94 3% Gapped, n = 1 1.0 1.0 1.01 0% Gapped, n = 2 1.22 1.27 1.29 50% 7.2.1 Single Node Performance Gapped, n = 4 1.31 1.42 1.49 75% Gapped, n = 8 1.37 1.58 1.67 87.5% A single hs1.8xlarge Amazon EC2 instance was used to compare the performance of ADAM to Picard Tools 1.103. This hs1.8xlarge instance has 8 cores, 117 GB Memory, and 24 2,048 GB HDD drives that These results are promising, as they indicate that can deliver 2.4 GB/s of 2 MB sequential read perfor- predicate pushdown is faster than a Spark filter in mance and 2.6 GB/s of sequential write performance. most cases. We believe that the only case that showed a slowdown (max projection on Locus predicate) is The 24 disk were configured into a single RAID 0 due to an issue in our test setup. We therefore can configuration using mdadm to provide a single 44TB conclude that filtering operations that can be per- partition for tests. Measurements using hdparm

12 showed buffered disk reads at 1.4 GB/sec and cached GB of data during the sort. reads at 6.1 GB/sec. The AMI used ()ami-be1c848e) was a stock Ama- ADAM Whole Genome Sorting: The follow- zon Linux AMI for x86 64 paravirtualized and EBS ing ADAM command was used to sort the NA12878 backed. Aside from creating the RAID 0 partition whole genome BAM file and save the results to the above, the only modification to the system was in- ADAM Avro/Parquet format. creasing the limit for file handles from 1024 to 65535. This increase was needed to handle the large number $ java -Xmx111g \ of files that Picard and ADAM spill to disk during -Dspark.default.parallelism=32 \ processing. All single-node tests were performed on -Dspark.local.dir=/mnt/spark \ the same EC2 instance. -jar adam-0.5.0-SNAPSHOT.jar \ transform -sort_reads \ OpenJDK 1.6.0 24 (Sun Microsystems Inc.) was NA12878.bam \ used for all single-node tests. NA12878.sorted.adam

Picard Whole Genome Sorting: Picard’s Sam- ADAM took 8 hours 56 minutes wall clock time Sort tool was used to sort the NA12878 whole genome to sort the whole genome using 6 hours of user CPU BAM file using the following options as output by Pi- time and 2 hours 52 minutes of system CPU time. card at startup. During the initial key sort stage, ADAM was read- ing the BAM at 75M/s with about 60% CPU uti- net.sf.picard.sam.SortSam \ lization for a little over 90 minutes. During the INPUT=/mnt/NA12878.bam \ OUTPUT=/mnt/NA12878.sorted.bam \ map phase, dstat reported 40M/s reading and 85M/s SORT_ORDER=coordinate \ writing to disk. The shuffle wrote 452.6 GB of data VALIDATION_STRINGENCY=SILENT \ to disk. During the final stage of saving the results MAX_RECORDS_IN_RAM=5000000 \ to the Avro/Parquet format, all CPUs were at 100% VERBOSITY=INFO \ usage and disk i/o was relatively lighter at 25M/s. QUIET=false \ RAM usage peaked at 73 GB of the 111 GB available COMPRESSION_LEVEL=5 \ to the VM. CREATE_INDEX=false \ CREATE_MD5_FILE=false Whole Genome Mark Duplicates: Picard’s MarkDuplicates tool was used to mark duplicates in The RAID 0 partition was also used as the the NA12878 whole genome BAM file. java temporary directory using Java property java.io.tmpdir ensuring that Picard was spilling net.sf.picard.sam.MarkDuplicates \ intermediate files to a fast and large partition. Pi- INPUT=[/mnt/NA12878.sorted.bam] \ card was also given 64 GB of memory using the Java OUTPUT=/mnt/NA12878.markdups.bam \ -Xmx64g flag. Because of the large amount of mem- METRICS_FILE=metrics.txt \ ory available, the MAX RECORDS IN RAM was set to VALIDATION_STRINGENCY=SILENT \ 5000000 which is ten times larger than the Picard PROGRAM_RECORD_ID=MarkDuplicates \ default. PROGRAM_GROUP_NAME=MarkDuplicates \ REMOVE_DUPLICATES=false \ Picard took 17 hours 44 minutes wall clock time to ASSUME_SORTED=false \ sort the NA12878 whole genome BAM file spending MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 \ 17 hours 31 minutes of user CPU time and 1 hour of MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 \ system CPU time. SORTING_COLLECTION_SIZE_RATIO=0.25 \ Since Picard isn’t multi-threaded, only a single READ_NAME_REGEX= CPU was utilitized at nearly 100% during the en- [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* \ OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 \ tire sort. Memory usage didn’t grow beyond 30GB VERBOSITY=INFO \ (even though 64GB was available) which seems to in- QUIET=false \ dicate that MAX RECORDS IN RAM could have been set COMPRESSION_LEVEL=5 \ even higher. dstat reported disk utilization of 40M/s MAX_RECORDS_IN_RAM=500000 \ reading and writing during the initial 4.5 hours of CREATE_INDEX=false \ sorting which dropped to 5M/s reading and 8M/s CREATE_MD5_FILE=false writing during the last 13.25 hours when the final merging of spilled files occured. Picard spilled 386

13 Single Node Performance Summary: ADAM tools to find reads for a specific area of the reference essentially doubles performance over BAM on a single and makes finding mates in paired-ends reads more node, although cluster performance is the real goal of efficient. Many utilities will only except BAM files ADAM. (Single node mark duplicate performance is that have been sorted, e.g. Picard MarkDuplicates. still being collected.) Each read is mapped to a tuple with the two values: position and read. Unmapped reads that a value of Table 8: Single Node Runtime Comparison None for ReferencePosition are sent to the end of the file. Application Runtime ADAM Runtime SamSort 17 h 44 m 8 h 56 m Table 10: Sort NA12878

Software EC2 profile Wall clock time Picard Tools 1.103 1 hs1.8xlarge 17h 44m ADAM 0.5.0 32 cr1.8xlarge 33m 7.2.2 Cluster Performance ADAM 0.5.0 100 m2.4xlarge 21m

Unlike current genomic software systems, ADAM can scale out to reduce the time for data analysis by dis- ADAM was run on an EC2 cluster with 32 tributing computation across a cluster. Thus, cluster cr1.8xlarge machines and a cluster with 100 performance is the critical test for ADAM. m2.4xlarge instances. Monitoring the Spark console during sorting re- Flagstat: Even the simplest application can benefit vealed that the sortByKey operation (on line 10 from the ADAM format and execution environment. above) took a total of 1 min 37 secs. The map op- For example, the flagstat command prints eration (at line 2 above) took 1 min 48 secs, and the read statistics on a BAM file, e.g. remainder of the time, 29 min 25 secs, was spent writ- ing the results to HDFS. There were a total of 1586 $ samtools flagstat NA12878.bam tasks with a majority finishing in under 10 minutes. 1685229894 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 1622164506 + 0 mapped (96.26\%:-nan\%) 1685229894 + 0 paired in sequencing Mark Duplicates: The ADAM algorithm for 842614875 + 0 read1 marking duplicate is nearly identical to Picard’s 842615019 + 0 read2 1602206371 + 0 properly paired (95.07\%:-nan\%) MarkDuplicates command. 1611286771 + 0 with itself and mate mapped 10877735 + 0 singletons (0.65\%:-nan\%) 3847907 + 0 with mate mapped to a different chr 2174068 + 0 with mate mapped to a different chr (mapQ>=5) Table 11: Mark Duplicates for NA12878

Software EC2 profile Wall clock time ADAM has a flagstat command which provides Picard 1.103 1 hs1.8xlarge 20 h 6 m identical information in the same text format. ADAM 0.5.0 100 m2.4xlarge 29m

Table 9: Time to Run Flagstat on High-Coverage NA12878 7.2.3 Tuning Spark on EC2 Software Wall clock time Samtools 0.1.19 25m 24s All applications were run on Amazon EC2 with mini- ADAM 0.5.0 0m 46s mal tuning. The performance numbers in this report should not be considered best case performance num- bers for ADAM. The ADAM flagstat command was run on EC2 An important Apache Spark property is with 32 cr1.8xlarge machines which is, not supris- spark.default.parallelism which controls ingly, why ADAM ran 34 times faster. While the the number of tasks to use for distributed shuffle samtools flagstat command is multi-threaded, it operations. This value needs to be tuned to match can only be run on a single machine. the underlying cluster that Spark is running on. The following graph shows CPU time series for two Sort: When coordinate sorting a BAM file, reads sort operations on a 32-node cr1.8xlarge EC2 clus- are reordered by reference id and alignment start po- ter running back to back with different values for sition. This sorting makes it easier for downstream

14 8.1 Database Integration Table 12: Time to Sort NA12878 ADAM format plays a central role in a library that we spark.default.parallelism Wall clock time develop that will allow sequencing data to be queried 500 47 min 10 secs from the popular SQL based warehouse software such 48 32 min 38 secs as Shark [29], Hive [1] and Impala [2]. In fact, the Avro/Parquet storage backend enables the direct us- age of ADAM files as SQL tables; this integration spark.default.parallelism. The early CPU pro- is facilitated through libraries that integrate Parquet file is when spark.default.parallelism is set to with the aforementioned software [26]. 500 and the latter profile is when it’s set to 48. You Of course, there is more work remaining before can see from the graph that with the higher level Shark provides biologically relevant functionality, of parallelism there is a significantly higher amount such as the one that is outlined in Bafna et al [7]. of system CPU (in red) as 32 CPUs contend for ac- Shark needs to be enhanced with libraries that pro- cess to the two disks in each cr1.8xlarge instance. vide efficient indexing, storage management and in- While cr1.8xlarge instances provide plenty of mem- terval query handling. The indexing scheme that we ory, they do not have enough disks to help distribute develop enables range based read retrieval, such as the I/O load. the indexing mechanism of samtools, quick access to mate pair reads such as the indexing mechanism of CPU Graph.png GQL [15] and also text based search features on fields such as the read sequence and the CIGAR string. To keep storage size under control, we implement a set of heuristics that prevent the materialization of queries with reproducible output. Finally, given that most biological data is modelled after intervals on the human genome, our library includes user defined functions that implement interval operators such as the IntervalJoin operator of GQL.

8.2 API Support Across Languages

Currently, the ADAM API and example read process- Figure 6: Comparison of 500 to 48 parallelism ing pipeline are implemented in the Scala language. However, long term we plan to make the API acces- sible to users of other languages. ADAM is built on 7.3 Compression top of Avro, Parquet, and Spark. Spark has bind- ings for Scala, Java, and Python, and Parquet im- By using dictionary coding, run length encoding with plementations exist for Java and C++. As the data bit-packing, and GZIP compression, ADAM files can model is implemented on top of Avro, the data model be stored using less disk space than an equivalent can also be autogenerated for C/C++/C#, Python, BAM file. Typically, ADAM is up to 75% of the size Ruby, and php. We hope to make API implementa- of the compressed BAM for the same data. If one is tions available for some of these languages at a later performing pileup aggregation, compression can im- date. prove to over 50%. 8.3 BQSR and Indel Realignment

8 Future Work BQSR and indel realignment implementations are available in ADAM, but have not yet been fully vali- Beyond the initial release of the ADAM data formats, dated for correctness nor has their performance been API, and read transformations, we are working on fully characterized. We anticipate this work to be several other extensions to ADAM. These extensions complete in early 2014. strive to make ADAM accessible to more users and to more programming patterns.

8.4 Variant and Genotype Format

Although we have defined a draft variant and genotype format, we are currently working with a multi-institution team to refine these formats. Our aim is to clarify problems present with the current VCF specification, and to make the variant and genotype formats more amenable to computational analysis, to curation and distribution, and to clinical use. We plan to release a follow-on report when this format is stable.

9 Discussion

As discussed in this paper, ADAM presents a significant jump for genomics data processing. To summarize, the benefits are as follows:

• ADAM achieves approximately a 30× to 50× speedup when compared to other modern processing pipelines.

• ADAM can successfully scale to clusters larger than 100 nodes.

• ADAM presents an explicit data schema, which makes it straightforward for the data access layer (layer 3 in Figure 2) to be changed to support new access patterns.

In this section, we discuss a few tradeoffs inherent to ADAM.

9.1 Columnar vs. Row Storage

In our current implementation, we use the Parquet columnar store to implement the data access layer. We chose columnar storage because we believe it is a better fit for bioinformatics workloads than flat row storage; this discussion is introduced in §5.

Typically, column-based storage is preferable unless the workload is write heavy; write-heavy workloads perform better with row-based storage formats [25]. Genomics pipelines tend to skew in favor of reading data: pipelines tend to read a large dataset, prune/reduce the data, perform an analysis, and then write out a significantly smaller file that contains the results of the analysis.

It is conceivable, however, that emerging workloads could change to be write heavy. We note that although our current implementation would not be optimized for these workloads, the layered model proposed in Figure 2 allows us to easily change our implementation. Specifically, we can swap out columnar storage in the data access layer with row-oriented storage (the Avro serialization system that we use to define ADAM is itself a performant row-oriented system). The rest of the implementation in the system would be isolated from this change: algorithms in the pipeline would be implemented on top of the higher-level API, and the compute substrate would interact with the data access layer to implement the process of reading and writing data.

9.2 Programming Model Improvements

Beyond improving performance by processing data in memory, our programming model improves programmer efficiency. We build our operations as a set of transforms that extend the RDD processing available in Spark. These transformations allow a very straightforward programming model that has been demonstrated to significantly reduce code size [31]. Additionally, the use of Scala, an object-functional, statically typed language that supports type inference, provides further gains.

Additional gains come from our data model. We expose a clear schema, and build an API that expands upon this schema to implement commonly needed functions. This API represents layer 6 in the stack we proposed in Figure 2. To reduce the cost of this API, we make any additional transformations from our internal schema to the data types presented in our API lazy. This eliminates the cost of providing these functions if they are not used. Additionally, if an additional transformation is required, we only pay the cost the first time the transformation is performed.

In total, these improvements can double programmer productivity. This gain is demonstrated through GAParquet, which implements some of the functionality of the GATK on top of the ADAM stack [12]. In the DiagnoseTargets stage of the GATK, the number of lines of code (LOC) needed to implement the stage dropped from roughly 400 to 200. We also see a reduction in the lines of code needed to implement the read transformation pipeline described in §3. Table 13 compares LOC for the GATK/GAParquet and for the read transformation pipeline.

Table 13: Lines of Code for ADAM and Original Implementation

  Application              Original   ADAM
  GATK DiagnoseTargets
    Walker                 326        134
    Subclass               93         66
    Total                  419        200
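To illustrate the lazy schema-to-API conversion described above, a wrapper along the following lines (hypothetical names, not the actual ADAM API) defers the cost of computing derived fields until a caller first touches them:

import scala.language.implicitConversions

// Sketch: a schema-level record plus a lazily constructed "rich" wrapper that
// adds derived fields. The derived values are only computed on first access,
// and are then memoized, so callers that never use them pay nothing.
case class Record(referenceId: Int, start: Long, cigar: String)

class RichRecord(val record: Record) {
  lazy val referencePosition: (Int, Long) = (record.referenceId, record.start)
  lazy val cigarElements: Seq[(Int, Char)] =
    """(\d+)([MIDNSHPX=])""".r
      .findAllMatchIn(record.cigar)
      .map(m => (m.group(1).toInt, m.group(2).head))
      .toSeq
}

object RichRecord {
  // Implicit conversion: the thin wrapper is all that is built eagerly.
  implicit def toRich(record: Record): RichRecord = new RichRecord(record)
}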

10 Summary

In this technical report, we have presented ADAM, a new data storage format and processing pipeline for genomics data. ADAM makes use of efficient columnar storage systems to improve the lossless compression available for storing read data, and uses in-memory processing techniques to eliminate the read processing bottleneck faced by modern variant calling pipelines. On top of the file formats that we have implemented, we also present APIs that enhance developer access to read, pileup, genotype, and variant data.

We are currently in the process of extending ADAM to support SQL querying of genomic data, and of extending our API to more programming languages.

ADAM promises to improve the development of applications that process genomic data, by removing current difficulties with the extraction and loading of data and by providing simple and performant programming abstractions for processing this data.

A Availability

The ADAM source code is available on GitHub at http://www.github.com/bigdatagenomics/adam, and the ADAM project website is at http://adam.cs.berkeley.edu. Additionally, ADAM is deployed through Maven with the following dependencies:

<dependency>
  <groupId>edu.berkeley.cs.amplab.adam</groupId>
  <artifactId>adam-format</artifactId>
  <version>0.5.0-SNAPSHOT</version>
</dependency>

<dependency>
  <groupId>edu.berkeley.cs.amplab.adam</groupId>
  <artifactId>adam-commands</artifactId>
  <version>0.5.0-SNAPSHOT</version>
</dependency>

At publication time, the current version of ADAM is 0.5.0. ADAM is open source and is released under the Apache 2 license.
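For Scala projects built with sbt, the equivalent coordinates would presumably be declared as follows (assuming the same Maven repository is resolvable from the build):

// Hypothetical sbt equivalent of the Maven dependencies listed above.
libraryDependencies ++= Seq(
  "edu.berkeley.cs.amplab.adam" % "adam-format" % "0.5.0-SNAPSHOT",
  "edu.berkeley.cs.amplab.adam" % "adam-commands" % "0.5.0-SNAPSHOT"
)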

B Algorithm Implementations

As ADAM is open source, the source code for the system is freely available. However, the implementations of the algorithms are not always trivial. In this section, we highlight the algorithms that underlie our read processing pipeline.

B.1 Sort Implementation

The ADAM Scala code for sorting a BAM file is succinct and relatively easy to follow. It is provided below for reference.

def adamSortReadsByReferencePosition(): RDD[ADAMRecord] = {
  rdd.map(p => {
    val referencePos = ReferencePosition(p) match {
      case None =>
        // Move unmapped reads to the end
        ReferencePosition(Int.MaxValue, Long.MaxValue)
      case Some(pos) => pos
    }
    (referencePos, p)
  }).sortByKey().map(p => p._2)
}

B.2 BQSR Implementation

Base quality score recalibration is an important early data-normalization step in the bioinformatics pipeline, and, after alignment, it is the next most costly step. Quality score recalibration can vastly improve the accuracy of variant calls, particularly for pileup-based callers like the UnifiedGenotyper or Samtools mpileup. Because of this, it is likely to remain a part of bioinformatics pipelines.

BQSR is also an interesting algorithm in that it does not neatly fit into the framework of map-reduce (the design philosophy of the GATK). Instead, it is an embarrassingly parallelizable aggregate. The ADAM implementation is:

def computeTable(rdd: Records, dbsnp: Mask): RecalTable = {
  rdd.aggregate(new RecalTable)(
    (table, read) => { table + read },
    (table1, table2) => { table1 ++ table2 })
}
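For readers unfamiliar with Spark's aggregate, the same seqOp/combOp structure is shown below on a simpler, self-contained problem (histogramming base quality scores). The types are illustrative only; this function is not part of ADAM.

import org.apache.spark.rdd.RDD

// Sketch: build a quality-score histogram with aggregate. Each task folds its
// partition into a local map (seqOp); the per-partition maps are then merged
// pairwise (combOp), just as the RecalTable instances are merged above.
def qualityHistogram(quals: RDD[Int]): Map[Int, Long] = {
  quals.aggregate(Map.empty[Int, Long])(
    (hist, q) => hist + (q -> (hist.getOrElse(q, 0L) + 1L)), // seqOp
    (h1, h2) => h2.foldLeft(h1) { case (acc, (q, c)) =>      // combOp
      acc + (q -> (acc.getOrElse(q, 0L) + c))
    })
}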

The ADAM implementation of BQSR utilizes the MD field to identify bases in the read that mismatch the reference. This enables base quality score recalibration to be entirely reference-free, avoiding the need to have a central FASTA store for the human reference. However, dbSNP is still needed to mask out positions that are polymorphic (otherwise, errors due to real variation will severely bias the error rate estimates).
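The MD tag is defined by the SAM specification as runs of matching-base counts interleaved with mismatching reference bases and "^"-prefixed deletions. A standalone sketch of how mismatch offsets could be recovered from it (an illustration only, not the ADAM parser) is:

// Sketch: return the offsets of mismatching bases, relative to the aligned
// (CIGAR-matched) portion of the read, by walking an MD tag such as
// "10A5^AC6". Numbers are matched-base run lengths, letters are reference
// bases at mismatches, and "^"-prefixed strings are deletions, which consume
// no read bases. Insertions and clipping are not represented in MD.
def mismatchOffsetsFromMd(md: String): Seq[Int] = {
  val token = """(\d+)|(\^[A-Za-z]+)|([A-Za-z])""".r
  var readOffset = 0
  val mismatches = scala.collection.mutable.ArrayBuffer[Int]()
  token.findAllMatchIn(md).foreach { m =>
    if (m.group(1) != null) {
      readOffset += m.group(1).toInt   // run of matching bases
    } else if (m.group(3) != null) {
      mismatches += readOffset         // a single mismatched base
      readOffset += 1
    }                                  // deletions advance only the reference
  }
  mismatches.toSeq
}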

B.3 Indel Realignment Implementation

Indel realignment is implemented as a two-step process. In the first step, we identify regions that have evidence of an insertion or deletion. After these regions are identified, we generate candidate haplotypes and realign reads to minimize the overall quantity of mismatches in the region. The quality of mismatches near an indel serves as a good proxy for the local quality of an alignment. This is due to the nature of indel alignment errors: when an indel is misaligned, it causes a temporary shift in the read sequence against the reference sequence. This shift manifests as a run of several bases with mismatches due to their incorrect alignment.

B.3.1 Realignment Target Identification

Realignment target identification is done by converting our reads into reference-oriented "rods" (also known as pileups: a group of bases that are all aligned to a specific locus on the reference). At each locus where there is evidence of an insertion or a deletion, we create a target marker. We also create a target if there is evidence of a mismatch. These targets contain the indel range or mismatch positions on the reference, and the range on the reference covered by reads that overlap these sites.

After an initial set of targets is placed, we merge targets together. This is necessary because, during the read realignment process, each read can only be realigned once. This necessitates that all reads are members of either one or zero realignment targets. Practically, this means that over the set of all realignment targets, no two targets overlap.

The core of our target identification algorithm can be found below.

def findTargets(reads: RDD[ADAMRecord]): TreeSet[IndelRealignmentTarget] = {

  // convert reads to rods
  val processor = new Read2PileupProcessor
  val rods: RDD[Seq[ADAMPileup]] = reads.flatMap(processor.readToPileups(_))
    .groupBy(_.getPosition).map(_._2)

  // map and merge targets
  val targetSet = rods.map(IndelRealignmentTarget(_))
    .filter(!_.isEmpty)
    .keyBy(_.getReadRange.start)
    .sortByKey()
    .map(new TreeSet()(TargetOrdering) + _._2)
    .fold(new TreeSet()(TargetOrdering))(joinTargets)

  targetSet
}

To generate the initial unmerged set of targets, we rely on the ADAM toolkit's pileup generation utilities (see §5.2). We generate realignment targets for all pileups, even if they do not have indel or mismatch evidence, and then eliminate pileups that do not contain indels or mismatches with a filtering stage that discards empty targets. To merge overlapping targets, we map all of the targets into a sorted set. This set is implemented using red-black trees, which allows for efficient merges; the merges are implemented with the tail-call-recursive joinTargets function:

@tailrec def joinTargets(
    first: TreeSet[IndelRealignmentTarget],
    second: TreeSet[IndelRealignmentTarget]): TreeSet[IndelRealignmentTarget] = {

  if (!TargetOrdering.overlap(first.last, second.head)) {
    first.union(second)
  } else {
    joinTargets(first - first.last + first.last.merge(second.head),
                second - second.head)
  }
}

As we are performing a fold over an RDD that is sorted by the starting position of each target on the reference sequence, we know a priori that the elements in the "first" set will always be ordered earlier relative to the elements in the "second" set. However, there can still be overlap between the two sets, as this ordering does not account for the end positions of the targets. If there is overlap between the last target in the "first" set and the first target in the "second" set, we merge these two elements and try to merge the two sets again.


B.3.2 Candidate Generation and Realignment

Candidate generation is a several-step process:

1. Realignment targets must "collect" the reads that they contain.

2. For each realignment group, we must generate a new set of candidate haplotype alignments.

3. These candidate alignments must then be tested and compared to the current reference haplotype.

4. If a candidate haplotype is sufficiently better than the reference, the reads are realigned.

The mapping of reads to realignment targets is done through a tail-recursive function that performs a binary search across the sorted set of indel realignment targets:

@tailrec def mapToTarget(
    read: ADAMRecord,
    targets: TreeSet[IndelRealignmentTarget]): IndelRealignmentTarget = {

  if (targets.size == 1) {
    if (TargetOrdering.equals(targets.head, read)) {
      targets.head
    } else {
      IndelRealignmentTarget.emptyTarget
    }
  } else {
    val (head, tail) = targets.splitAt(targets.size / 2)
    val reducedSet = if (TargetOrdering.lt(tail.head, read)) {
      head
    } else {
      tail
    }
    mapToTarget(read, reducedSet)
  }
}

This function is applied as a groupBy against all reads; that is, the function is mapped over the RDD that contains all reads, and a new RDD is generated in which all reads that returned the same indel realignment target are grouped together into a list.

Once all reads are grouped, we identify new candidate alignments. However, before we do this, we left-align all indels. For many reads that show evidence of a single indel, this can eliminate mismatches that occur after the indel. This involves shifting the indel location to the "left" (that is, to a lower position against the reference sequence) by the length of the indel. After this, if the read still shows mismatches, we generate a new consensus alignment. This is done with the generateAlternateConsensus function, which distills the indel evidence out of the read.

def generateAlternateConsensus(
    sequence: String,
    start: Long,
    cigar: Cigar): Option[Consensus] = {

  var readPos = 0
  var referencePos = start

  if (cigar.getCigarElements.filter(elem =>
        elem.getOperator == CigarOperator.I ||
        elem.getOperator == CigarOperator.D).length == 1) {
    cigar.getCigarElements.foreach(cigarElement => {
      cigarElement.getOperator match {
        case CigarOperator.I => return Some(new Consensus(
          sequence.substring(readPos, readPos + cigarElement.getLength),
          referencePos to referencePos))
        case CigarOperator.D => return Some(new Consensus(
          "", referencePos until (referencePos + cigarElement.getLength)))
        case _ => {
          if (cigarElement.getOperator.consumesReadBases &&
              cigarElement.getOperator.consumesReferenceBases) {
            readPos += cigarElement.getLength
            referencePos += cigarElement.getLength
          } else {
            return None
          }
        }
      }
    })
    None
  } else {
    None
  }
}

From these consensuses, we generate new haplotypes by inserting the indel consensus into the reference sequence. The quality of each haplotype is measured by sliding each read across the new haplotype, using mismatch quality. Mismatch quality is defined, for a given alignment, as the sum of the quality scores of all bases that mismatch against the current alignment. While sliding each read across the new haplotype, we aggregate the mismatch quality scores, and we take the minimum of all of these scores and the mismatch quality of the original alignment. This sweep is performed using the sweepReadOverReferenceForQuality function:

def sweepReadOverReferenceForQuality(
    read: String,
    reference: String,
    qualities: Seq[Int]): (Int, Int) = {

  var qualityScores = List[(Int, Int)]()

  for (i <- 0 until (reference.length - read.length)) {
    qualityScores = (sumMismatchQualityIgnoreCigar(
      read,
      reference.substring(i, i + read.length),
      qualities),
      i) :: qualityScores
  }

  qualityScores.reduce((p1: (Int, Int), p2: (Int, Int)) => {
    if (p1._1 < p2._1) {
      p1
    } else {
      p2
    }
  })
}

If the consensus with the lowest mismatch quality score has a log-odds ratio (LOD) greater than 5.0 with respect to the reference, we realign the reads. This is done by recomputing the CIGAR and MD tag for each new alignment. Realigned reads have their mapping quality score increased by 10 on the Phred scale.
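The report does not spell out how the LOD is derived from the mismatch-quality sums. One common convention in realigners is to treat the Phred-scaled quality improvement divided by 10 as the log-odds ratio; under that assumption (and with illustrative names that are not part of the ADAM API), the accept-and-annotate step might look like:

// Sketch under an assumed LOD definition: Phred-scaled mismatch-quality
// improvement divided by 10. This is an illustration, not the ADAM code.
case class RealignmentDecision(realign: Boolean, newMapq: Int)

def decideRealignment(referenceMismatchQuality: Int,
                      bestConsensusMismatchQuality: Int,
                      originalMapq: Int,
                      lodThreshold: Double = 5.0): RealignmentDecision = {
  val lod = (referenceMismatchQuality - bestConsensusMismatchQuality) / 10.0
  if (lod > lodThreshold) {
    // Realigned reads get a 10-point (Phred-scale) mapping-quality boost.
    RealignmentDecision(realign = true, newMapq = originalMapq + 10)
  } else {
    RealignmentDecision(realign = false, newMapq = originalMapq)
  }
}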

B.4 Duplicate Marking Implementation

The following ADAM code, reformatted for this report, expresses the algorithm succinctly in roughly 40 lines of Scala.

for (((leftPos, library), readsByLeftPos) <-
       rdd.adamSingleReadBuckets()
          .keyBy(ReferencePositionPair(_))
          .groupBy(leftPositionAndLibrary);
     buckets <- {
       leftPos match {
         // These are all unmapped reads. There is no way to determine
         // if they are duplicates.
         case None =>
           markReads(readsByLeftPos.unzip._2, areDups = false)
         // These reads have their left position mapped.
         case Some(leftPosWithOrientation) =>
           // Group the reads by their right position
           val readsByRightPos = readsByLeftPos.groupBy(rightPosition)
           // Find any reads with no right position
           val fragments = readsByRightPos.get(None)
           // Check if we have any pairs (reads with a right position)
           val hasPairs = readsByRightPos.keys.exists(_.isDefined)
           if (hasPairs) {
             // Since we have pairs, mark all fragments as duplicates
             val processedFrags = if (fragments.isDefined) {
               markReads(fragments.get.unzip._2, areDups = true)
             } else {
               Seq.empty
             }
             val processedPairs =
               for (buckets <- (readsByRightPos - None).values;
                    processedPair <- scoreAndMarkReads(buckets.unzip._2))
                 yield processedPair
             processedPairs ++ processedFrags
           } else if (fragments.isDefined) {
             // No pairs. Score the fragments.
             scoreAndMarkReads(fragments.get.unzip._2)
           } else {
             Seq.empty
           }
       }
     };
     read <- buckets.allReads) yield read

First, all reads with the same record group name and read name are collected into buckets (adamSingleReadBuckets). These buckets contain the read and its optional mate or secondary alignments. Each read bucket is then keyed by its 5′ position and orientation, and the buckets are grouped together by left (lowest-coordinate) position, orientation, and library name.

The body of the loop then processes each group of read buckets that share a common left 5′ position. Unmapped reads are never marked as duplicates, as their position is not known. Mapped reads with a common left position are separated into paired reads and fragments. Fragments, in this context, are reads that have no mate, or that should have a mate that does not exist.

If there are pairs in a group, all fragments are marked as duplicates, and the paired reads are grouped by their right 5′ position. All paired reads that have a common right and left 5′ position are scored, and all but the highest-scoring read are marked as duplicates.

If there are no pairs in a group, all fragments are scored, and all but the highest-scoring fragment are marked as duplicates.
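The scoring function itself is not defined in this report. One plausible choice, assumed in the sketch below (which is not the ADAM implementation), is the Picard convention of summing the base qualities that are at least 15:

// Sketch: score a read by summing its base qualities >= 15; among reads that
// share 5' positions, keep the best scorer and mark the rest as duplicates.
// The Read type and its fields are illustrative only.
case class Read(name: String, qualities: Seq[Int], var duplicate: Boolean = false)

def score(read: Read): Int = read.qualities.filter(_ >= 15).sum

def markAllButBest(reads: Seq[Read]): Seq[Read] = {
  if (reads.nonEmpty) {
    val best = reads.maxBy(score)
    reads.foreach(r => r.duplicate = (r ne best))
  }
  reads
}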

C Predicate Pushdown Proof

Theorem 1. Predicate pushdown requires no additional disk accesses.

Proof. If $R$ is the set of records to read, $\mathit{Proj}$ is the projection we are using, and $\mathit{size}(i, j)$ returns the amount of data recorded in column $j$ of record $i$, the total data read without predicate pushdown is:

\[
\sum_{m \in R} \sum_{i \in \mathit{Proj}} \mathit{size}(m, i) \tag{1}
\]

By using predicate pushdown, we reduce this total to:

\[
\sum_{m \in R} \sum_{i \in \mathit{Pred}} \mathit{size}(m, i) \;+\; \sum_{n \in R_{p=\mathit{True}}} \sum_{i \in \mathit{Proj} \setminus \mathit{Pred}} \mathit{size}(n, i) \tag{2}
\]

Here, $\mathit{Pred}$ represents the set of columns used in the predicate, and $R_{p=\mathit{True}}$ represents the subset of records in $R$ that pass the predicate function. We can show that if no records fail the predicate (i.e., $R = R_{p=\mathit{True}}$), then equations (1) and (2) become equal. This is because, by definition, $\mathit{Proj} = \mathit{Pred} \cup (\mathit{Proj} \setminus \mathit{Pred})$.
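To make the comparison concrete, consider a hypothetical projection of three columns with average per-record sizes $s_1$, $s_2$, and $s_3$, of which only the first is used by the predicate, and suppose a fraction $f$ of the records pass it. Expressions (1) and (2) then become

\[
|R|\,(s_1 + s_2 + s_3)
\qquad \text{and} \qquad
|R|\,s_1 + f\,|R|\,(s_2 + s_3),
\]

which coincide when $f = 1$ (no records filtered) and otherwise make the pushdown strictly cheaper, never more expensive.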

References

[1] Apache Hive. http://hive.apache.org/.

[2] Cloudera Impala. http://www.cloudera.com/content/cloudera/en/products/cdh/impala.html.

[3] Picard. http://picard.sourceforge.org.

[4] Abadi, D., Madden, S., and Ferreira, M. Integrating compression and execution in column-oriented database systems. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (2006), ACM, pp. 671–682.

[5] Apache. Avro. http://avro.apache.org.

[6] Asnani, H., Bharadia, D., Chowdhury, M., Ochoa, I., Sharon, I., and Weissman, T. Lossy compression of quality values via rate distortion theory. ArXiv e-prints (July 2012).

[7] Bafna, V., Deutsch, A., Heiberg, A., Kozanitis, C., Ohno-Machado, L., and Varghese, G. Abstractions for genomics: or, which way to the Genomic Information Age? Communications of the ACM (2013). To appear.

[8] Bainbridge, M., Wang, M., Burgess, D., Kovar, C., Rodesch, M., D'Ascenzo, M., Kitzman, J., Wu, Y.-Q., Newsham, I., Richmond, T., Jeddeloh, J., Muzny, D., Albert, T., and Gibbs, R. Whole exome capture in solution with 3 Gbp of data. Genome Biology 11, 6 (2010), R62.

[9] Cox, A. J., Bauer, M. J., Jakobi, T., and Rosone, G. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics 28, 11 (Jun 2012), 1415–1419.

[10] DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., Philippakis, A. A., del Angel, G., Rivas, M. A., Hanna, M., et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43, 5 (2011), 491–498.

[11] Fritz, M. H.-Y., Leinonen, R., Cochrane, G., and Birney, E. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Research 21, 5 (2011), 734–740.

[12] Hartl, C. GAParquet. http://www.github.com/chartl/gaparquet.

[13] Janin, L., Rosone, G., and Cox, A. J. Adaptive reference-free compression of sequence quality scores. Bioinformatics (Jun 2013).

[14] Jones, D. C., Ruzzo, W. L., Peng, X., and Katze, M. G. Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Research (Aug 2012).

[15] Kozanitis, C., Heiberg, A., Varghese, G., and Bafna, V. Using Genome Query Language to uncover genetic variation. Bioinformatics (2013).

[16] Kozanitis, C., Saunders, C., Kruglyak, S., Bafna, V., and Varghese, G. Compressing genomic sequence fragments using SlimGene. Journal of Computational Biology 18, 3 (Mar 2011), 401–413.

[17] Lamb, A., Fuller, M., Varadarajan, R., Tran, N., Vandiver, B., Doshi, L., and Bear, C. The Vertica analytic database: C-Store 7 years later. Proceedings of the VLDB Endowment 5, 12 (2012), 1790–1801.

[18] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 16 (2009), 2078–2079.

[19] McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 20, 9 (2010), 1297–1303.

[20] McLaren, W., Pritchard, B., Rios, D., Chen, Y., Flicek, P., and Cunningham, F. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26, 16 (2010), 2069–2070.

[21] NHGRI. DNA sequencing costs. http://www.genome.gov/sequencingcosts/.

[22] Niemenmaa, M., Kallio, A., Schumacher, A., Klemelä, P., Korpelainen, E., and Heljanko, K. Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics 28, 6 (2012), 876–877.

[23] Popitsch, N., and von Haeseler, A. NGC: lossless and lossy compression of aligned high-throughput sequencing data. Nucleic Acids Research (Oct 2012).

[24] Smith, T. F., and Waterman, M. S. Comparison of biosequences. Advances in Applied Mathematics 2, 4 (1981), 482–489.

[25] Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O'Neil, E., et al. C-Store: a column-oriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases (2005), VLDB Endowment, pp. 553–564.

[26] Twitter and Cloudera. Parquet. http://www.parquet.io.

[27] Wan, R., Anh, V. N., and Asai, K. Transformations for the compression of FASTQ quality scores of next-generation sequencing data. Bioinformatics 28, 5 (Mar 2012), 628–635.

[28] Wheeler, D. L., et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 36 (Jan 2008), 13–21.

[29] Xin, R. S., Rosen, J., Zaharia, M., Franklin, M. J., Shenker, S., and Stoica, I. Shark: SQL and rich analytics at scale. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2013), SIGMOD '13, ACM, pp. 13–24.

[30] Yanovsky, V. ReCoil: an algorithm for compression of extremely large datasets of DNA data. Algorithms for Molecular Biology 6 (2011), 23.

[31] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M., Shenker, S., and Stoica, I. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (2012), USENIX Association, p. 2.

[32] Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (2010), p. 10.

[33] Zimmermann, H. OSI reference model: the ISO model of architecture for open systems interconnection. IEEE Transactions on Communications 28, 4 (1980), 425–432.
