<<

arXiv:1207.2424v1 [q-bio.QM] 10 Jul 2012

Compression of next-generation sequencing reads aided by highly efficient de novo assembly

Daniel C. Jones∗1, Walter L. Ruzzo†1,2,3, Xinxia Peng‡4, and Michael G. Katze§4

1 Department of Computer Science and Engineering, University of Washington
2 Department of Genome Sciences, University of Washington
3 Fred Hutchinson Cancer Research Center
4 Department of Microbiology, University of Washington

∗[email protected]  †[email protected]  ‡[email protected]  §[email protected]

November 1, 2018

Abstract

We present Quip, a lossless compression algorithm for next-generation sequencing data in the FASTQ and SAM/BAM formats. In addition to implementing reference-based compression, we have developed, to our knowledge, the first assembly-based compressor, using a novel de novo assembly algorithm. A probabilistic data structure is used to dramatically reduce the memory required by traditional de Bruijn graph assemblers, allowing millions of reads to be assembled very efficiently. Read sequences are then stored as positions within the assembled contigs. This is combined with statistical compression of read identifiers, quality scores, alignment information, and sequences, effectively collapsing very large datasets to less than 15% of their original size with no loss of information.

Availability: Quip is freely available under the BSD license from http://cs.washington.edu/homes/dcjones/quip

1 Introduction

With the development of next-generation sequencing (NGS) technology, researchers have had to quickly adapt to cope with the vast increase in raw data. Experiments that would previously have been conducted with microarrays and resulted in several megabytes of data are now performed by sequencing, producing many gigabytes and demanding a significant investment in computational infrastructure. While the cost of disk storage has also steadily decreased over time, it has not matched the dramatic change in the cost and volume of sequencing. A transformative breakthrough in storage technology may occur in the coming years, but the $1000 genome is certain to arrive before the era of the $100 petabyte hard disk. As cloud computing and software as a service become increasingly relevant to molecular biology research, hours spent transferring NGS datasets to servers for analysis will often delay meaningful results. More researchers will be forced to maximize bandwidth by physically transporting storage media off-site (via the "sneakernet"), an expensive and logistically complicated option. These difficulties will only be amplified as exponentially more sequencing data is generated, implying that even moderate gains in domain-specific compression methods will translate into a significant reduction over time in the cost of managing these massive datasets.

Storage and analysis of NGS data centers primarily around two formats that have arisen recently as de facto standards: FASTQ and SAM. FASTQ stores, in addition to nucleotide sequences, a unique identifier for each read and quality scores, which encode estimates of the probability that each base is correctly called. For its simplicity, FASTQ is a surprisingly ill-defined format. The closest thing to an accepted specification is the description by Cock et al. (2010), but the format arose ad hoc from multiple sources (primarily Sanger and Solexa/Illumina), so a number of variations exist, particularly in how quality scores are encoded. The SAM format is far more complex but also more tightly defined, and comes with a reference implementation in the form of SAMtools (Li et al., 2009). It is able to store alignment information in addition to read identifiers, sequences, and quality scores. SAM files, which are stored in plain text, can also be converted to the BAM format, a compressed binary version of SAM, which is far more compact and allows for relatively efficient random access.

Compression of nucleotide sequences has been the target of some interest, but compressing NGS data, made up of millions of short fragments of a greater whole, combined with metadata in the form of read identifiers and quality scores, presents a very different problem and demands new techniques. Splitting the data into separate contexts for read identifiers, sequences, and quality scores and compressing them with the Lempel-Ziv algorithm and Huffman coding has been explored by Tembe et al. (2010) and Deorowicz and Grabowski (2011), who demonstrate the promise of domain-specific compression with significant gains over general-purpose programs like gzip and bzip2.

Kozanitis et al. (2011) and Hsi-Yang Fritz et al. (2011) proposed reference-based compression methods, exploiting the redundant nature of the data by aligning reads to a known reference genome sequence and storing genomic positions in place of nucleotide sequences. Decompression is then performed by copying the read sequences from the genome. Though any differences from the reference sequence must also be stored, reference-based approaches can achieve much higher compression, and they grow increasingly efficient with longer read lengths, since storing a genomic position requires the same amount of space regardless of the length of the read.

This idea is also explored in the Goby format (http://campagnelab.org/software/goby/), which has been proposed as an alternative to SAM/BAM, the primary functional difference being that sequences of aligned reads are not stored but looked up in a reference genome when needed (frequently they are not). For some applications, reference-based compression can be taken much further by storing only SNP information, summarizing a sequencing experiment in several megabytes (Christley et al., 2009). However, even when SNP calls are all that is needed, discarding the raw reads would prevent any reanalysis of the data.

While a reference-based approach typically results in superior compression, it has a number of disadvantages. Most evidently, an appropriate reference sequence database is not always available, particularly in the case of metagenomic sequencing. One could be contrived by compiling a set of genomes from species expected to be represented in the sample; however, a high degree of expertise is required to curate and manage such a project-dependent database. Secondly, there is the practical concern that files compressed with a reference-based approach are not self-contained. Decompression requires precisely the same reference database used for compression, and if it is lost or forgotten the compressed data becomes inaccessible.

Another recurring theme in the growing literature on short read compression is lossy encoding of quality scores. This follows naturally from the realization that quality scores are particularly difficult to compress. Unlike read identifiers, which are highly redundant, or nucleotide sequences, which contain some structure, quality scores are inconsistently encoded between protocols and computational pipelines and are often simply high-entropy. It is dissatisfying that metadata (quality scores) should consume more space than primary data (nucleotide sequences). Yet, also dissatisfying to many researchers is the thought of discarding information without a very good understanding of its effect on downstream analysis.

A number of lossy compression algorithms for quality scores have been proposed, including various binning schemes implemented in QScores-Archiver (Wan et al., 2011) and SCALCE (http://scalce.sourceforge.net), scaling to a reduced alphabet with randomized rounding in SlimGene (Kozanitis et al., 2011), and discarding quality scores for bases which match a reference sequence in Cramtools (Hsi-Yang Fritz et al., 2011). In SCALCE, SlimGene, and Cramtools, quality scores may also be losslessly compressed. Kozanitis et al. (2011) analyzed the effects of their algorithm on downstream analysis. Their results suggest that while some SNP calls are affected, they are primarily marginal, low-confidence calls between hetero- and homozygosity.

Decreasing the entropy of quality scores while retaining accuracy is an important goal, but successful lossy compression demands an understanding of what is lost. For example, lossy audio compression (e.g. MP3) is grounded in psychoacoustic principles, preferentially discarding the least perceptible sound. Conjuring a similarly principled method for NGS quality scores is difficult given that both the algorithms that generate them and the algorithms that are informed by them are moving targets. In the analysis by Kozanitis et al. (2011), the authors are appropriately cautious in interpreting their results, pointing out that "there are dozens of downstream applications and much work needs to be done to ensure that coarsely quantized quality values will be acceptable for users."

In the following sections we describe and evaluate Quip, a new lossless compression algorithm which leverages a variety of techniques to achieve very high compression over sequencing data of many types, yet remains efficient and practical. We have implemented this approach in an open-source tool that is capable of compressing both SAM/BAM and FASTQ files, retaining all information from the original file.

2 Materials & Methods

2.1 Statistical Compression

Our approach is founded on statistical compression using arithmetic coding, a form of entropy coding which approaches optimality, but requires some care to implement efficiently (see Said (2004) for an excellent review). Arithmetic coding can be thought of as a refinement of Huffman coding, the major advantage being that it is able to assign codes of a non-integral number of bits. If a symbol appears with probability 0.1, it can be encoded near to its optimal code length of -log2(0.1) ≈ 3.3 bits. Despite its power, it has historically seen much less use than Huffman coding, due in large part to fear of infringing on a number of patents that have now expired.

Arithmetic coding is a particularly elegant means of compression in that it allows a complete separation between statistical modeling and encoding. In Quip, the same arithmetic coder is used to encode quality scores, read identifiers, nucleotide sequences, and alignment information, but with very different statistical models for each, which gives it a tremendous advantage over general-purpose compression algorithms that lump everything into a single context. Furthermore, all parts of our algorithm use adaptive modeling: parameters are trained and updated as data is compressed, so that an increasingly tight fit and higher compression is achieved on large files.
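To make the adaptive modeling concrete, the sketch below maintains per-symbol counts that are updated as data is coded and reports the information content, in bits, that an ideal arithmetic coder would spend on each symbol. It is a minimal illustration under our own simplifying assumptions (a single flat context, add-one smoothing, the class name), not Quip's implementation.

    import math

    class AdaptiveModel:
        """Adaptive frequency model for a single context: counts are
        updated as symbols are coded, so the fit tightens over time."""

        def __init__(self, alphabet_size):
            self.counts = [1] * alphabet_size   # add-one smoothing (assumption)
            self.total = alphabet_size

        def code_length(self, symbol):
            # bits an ideal arithmetic coder would emit for this symbol
            return -math.log2(self.counts[symbol] / self.total)

        def update(self, symbol):
            self.counts[symbol] += 1
            self.total += 1

    # A heavily skewed stream is coded at well under 2 bits per symbol.
    model = AdaptiveModel(alphabet_size=4)
    stream = [0] * 900 + [1] * 60 + [2] * 30 + [3] * 10
    bits = 0.0
    for s in stream:
        bits += model.code_length(s)
        model.update(s)
    print(f"{bits / len(stream):.3f} bits per symbol")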

Read Identifiers

The only requirement of read identifiers is that they uniquely identify the read. A single integer would do, but typically each read comes with a complex string containing the instrument name, run identifier, flow cell identifier, and tile coordinates. Much of this information is the same for every read and is simply repeated, inflating the file size.

To remove this redundancy, we use a form of delta encoding. A parser tokenizes the ID into separate fields which are then compared to the previous ID. Tokens that remain the same from read to read (e.g. instrument name) can be compressed to a negligible amount of space — arithmetic coding produces codes of less than 1 bit in such cases. Numerical tokens are recognized and stored efficiently, either directly or as an offset from the token in the same position in the previous read. Otherwise, non-identical tokens are encoded by matching as much of the prefix as possible to the previous read's token before directly encoding the non-matching suffix.
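The sketch below illustrates the idea with a simple tokenizer and the three encoding cases (repeat, numeric offset, shared prefix plus literal suffix). The token grammar, the operation names, and the example identifiers are our own illustrative assumptions, not the parser actually used by Quip.

    import os
    import re

    def tokenize(read_id):
        # split into runs of digits, runs of letters, and separator runs
        return re.findall(r"\d+|[A-Za-z]+|[^A-Za-z\d]+", read_id)

    def delta_encode(prev_tokens, tokens):
        """Encode an ID relative to the previous one (sketch only)."""
        ops = []
        for i, tok in enumerate(tokens):
            prev = prev_tokens[i] if i < len(prev_tokens) else ""
            if tok == prev:
                ops.append(("repeat",))              # near 0 bits once modeled
            elif tok.isdigit() and prev.isdigit():
                ops.append(("offset", int(tok) - int(prev)))
            else:
                n = len(os.path.commonprefix([tok, prev]))
                ops.append(("literal", n, tok[n:]))  # shared prefix length + suffix
        return ops

    # Hypothetical Illumina-style identifiers: only the coordinates change.
    prev = tokenize("@INSTR:42:FCX:1:1101:1224:2130/1")
    curr = tokenize("@INSTR:42:FCX:1:1101:1259:2142/1")
    print(delta_encode(prev, curr))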

The end result is that read IDs, which are often 50 bytes or longer, are typically stored in 2-4 bytes. Notably, in reads produced from Illumina instruments, most parts of the ID can be compressed to consume almost no space; the remaining few bytes are accounted for by tile coordinates. These coordinates are almost never needed in downstream analysis, so removing them as a preprocessing step would shrink file sizes even further. The parser used is suitably general so that no change to the compression algorithm would be needed.

Nucleotide Sequences

To compress nucleotide sequences, we adopt a very simple model based on high-order Markov chains. The nucleotide at a given position in a read is predicted using the preceding twelve positions. This model uses more memory than traditional general-purpose compression algorithms (4^13 = 67,108,864 parameters are needed, each represented in 32 bits), but it is simple and extremely efficient (very little computation is required, and run time is limited primarily by memory latency, as lookups in such a large table result in frequent cache misses).

An order-12 Markov chain also requires a very large amount of data to train, but there is no shortage with the datasets we wish to compress. Though less adept at compressing extremely short files, after compressing several million reads the parameters are tightly fit to the nucleotide composition of the dataset, so that the remaining reads will be highly compressed. Compressing larger files only results in a tighter fit and higher compression.
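A minimal version of this model is sketched below: a flat table of 32-bit counters indexed by the preceding twelve bases (roughly 256 MB, matching the 4^13 parameters above), queried and updated one base at a time. The handling of the first few bases of a read (a shorter, implicitly padded context) and the add-one initialization are our own simplifications.

    import math
    from array import array

    K = 12                                  # order of the Markov chain
    BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

    # 4^(K+1) = 67,108,864 32-bit counters: one row of 4 per 12-mer context
    counts = array("I", [1]) * (4 ** (K + 1))

    def row_index(prev_bases):
        idx = 0
        for b in prev_bases[-K:]:           # shorter contexts at read starts
            idx = idx * 4 + BASE[b]
        return idx * 4

    def code_length_and_update(prev_bases, base):
        row = row_index(prev_bases)
        total = sum(counts[row:row + 4])
        bits = -math.log2(counts[row + BASE[base]] / total)
        counts[row + BASE[base]] += 1       # adaptive update
        return bits

    read = "ACGTACGTTTGACCATGCAGGG"
    total_bits = sum(code_length_and_update(read[:i], b) for i, b in enumerate(read))
    print(f"{total_bits / len(read):.2f} bits per base")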

Quality Scores

It has been previously noted that the quality score at a given position is highly correlated with the score at the preceding position (Kozanitis et al., 2011). This makes a Markov chain a natural model, but unlike nucleotides, quality scores are drawn from a much larger alphabet (typically 41-46 distinct scores). This limits the order of the Markov chain: long chains will require a great deal of space and take an unrealistic amount of data to train.

To reduce the number of parameters, we use an order-3 Markov chain, but coarsely bin the distal two positions. In addition to the preceding three positions, we condition on the position within the read and a running count of the number of large jumps in quality scores between adjacent positions (where a "large jump" is defined as |q_i - q_{i-1}| > 1), which allows reads with highly variable quality scores to be encoded using separate models. Both of these variables are binned to control the number of parameters.
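The sketch below shows one way such a conditioning context could be assembled. The specific bin widths, the number of position bins, and the cap on the jump count are illustrative assumptions, since the text above states only that these quantities are coarsely binned.

    def quality_context(quals, i, read_len, pos_bins=8, max_jumps=4):
        """Context for the score at position i: the previous score kept
        exact, two distal scores coarsely binned, plus a binned read
        position and a capped count of large jumps (sketch only)."""
        q1 = quals[i - 1] if i >= 1 else 0
        q2 = quals[i - 2] // 10 if i >= 2 else 0    # illustrative bin width
        q3 = quals[i - 3] // 20 if i >= 3 else 0
        jumps = sum(1 for j in range(1, i) if abs(quals[j] - quals[j - 1]) > 1)
        pos_bin = min(i * pos_bins // read_len, pos_bins - 1)
        return (q1, q2, q3, pos_bin, min(jumps, max_jumps - 1))

    # Each distinct context selects its own adaptive model, as in the
    # arithmetic coding sketch above.
    quals = [38, 38, 37, 35, 35, 30, 22, 25, 31, 33]
    print(quality_context(quals, 6, read_len=len(quals)))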

2.2 Reference-Based Compression

We have also implemented lossless reference-based compression. Given aligned reads in SAM or BAM format, and the reference sequence to which they are aligned (in FASTA format), the reads are compressed preserving all information in the SAM/BAM file, including the header, read IDs, alignment information, and all optional fields allowed by the SAM format. Unaligned reads are retained and compressed using the Markov chain model.

2.3 Assembly-Based Compression

To complement the reference-based approach, we developed an assembly-based approach which offers some of the advantages of reference-based compression, but requires no external sequence database and produces files which are entirely self-contained. We use the first (by default) 2.5 million reads to assemble contigs, which are then used in place of a reference sequence database to encode aligned reads compactly as positions.
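Both the reference-based and assembly-based modes rest on the same idea: an aligned read is reduced to a position in some target sequence (a reference chromosome or an assembled contig) plus whatever differs from it. The sketch below illustrates that round trip; the record layout, function names, and example sequences are ours and do not reflect Quip's on-disk format.

    def encode_aligned(read, target, position):
        """Store a read as its position in the target plus its mismatches."""
        diffs = [(i, b) for i, b in enumerate(read) if target[position + i] != b]
        return {"pos": position, "length": len(read), "diffs": diffs}

    def decode_aligned(record, target):
        seq = list(target[record["pos"]:record["pos"] + record["length"]])
        for i, b in record["diffs"]:
            seq[i] = b
        return "".join(seq)

    # Hypothetical contig and read with a single mismatch.
    contig = "ACGTTTGACCATGCAGGGCCTTAACGT"
    read = "GACCATGCTGGG"
    record = encode_aligned(read, contig, position=6)
    assert decode_aligned(record, contig) == read
    print(record)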

Once contigs are assembled, read sequences are aligned using a simple "seed and extend" method: 12-mer seeds are matched using a hash table, and candidate alignments are evaluated using Hamming distance. The best (lowest Hamming distance) alignment is chosen, assuming it falls below a given cutoff, and the read is encoded as a position within the contig set. Roughly, this can be thought of as a variation on the Lempel-Ziv algorithm: as sequences are read, they are matched to previously observed data, or in this case, contigs assembled from previously observed data. These contigs are not explicitly stored, but rather reassembled during decompression.

Traditionally, de novo assembly is extremely computationally intensive. The most commonly used technique involves constructing a de Bruijn graph, a directed graph in which each vertex represents a nucleotide k-mer present in the data for some fixed k (e.g., k = 25 is a common choice). A directed edge from a k-mer u to v occurs if and only if the (k-1)-mer suffix of u is also the prefix of v. In principle, given such a graph, an assembly can be produced by finding an Eulerian path, i.e., a path that follows each edge in the graph exactly once (Pevzner et al., 2001). In practice, since NGS data has a non-negligible error rate, assemblers augment each vertex with the number of observed occurrences of the k-mer and leverage these counts using a variety of heuristics to filter out spurious paths.

A significant bottleneck of the de Bruijn graph approach is building an implicit representation of the graph by counting and storing k-mer occurrences in a hash table. The assembler implemented in Quip overcomes this bottleneck to a large extent by using a data structure based on the Bloom filter to count k-mers. The Bloom filter (Bloom, 1970) is a probabilistic data structure that represents a set of elements extremely compactly, at the cost of elements occasionally colliding and incorrectly being reported as present in the set. It is probabilistic in the sense that these collisions occur pseudo-randomly, determined by the size of the table and the hash functions chosen, but generally with low probability.

The Bloom filter is generalized in the counting Bloom filter, in which an arbitrary count can be associated with each element (Fan et al., 2000), and further refined in the d-left counting Bloom filter (dlCBF) (Bonomi et al., 2006), which requires significantly less space than the already quite space-efficient counting Bloom filter. We base our assembler on a realization of the dlCBF. Because we use a probabilistic data structure, k-mers are occasionally reported to have incorrect (inflated) counts. The assembly can be made less accurate by these incorrect counts; however, a poor assembly only results in slightly increasing the compression ratio. Compression remains lossless regardless of the assembly quality, and in practice collisions in the dlCBF occur at a very low rate (this is explored in the results section).

Given a probabilistic de Bruijn graph, we assemble contigs using a very simple greedy approach. A read sequence is used as a seed and extended on both ends one nucleotide at a time by repeatedly finding the most abundant k-mer that overlaps the end of the contig by k-1 bases. More sophisticated heuristics have been developed, but we choose to focus on efficiency, sacrificing a degree of accuracy.
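The following is a minimal sketch of that greedy extension, with a plain dictionary of k-mer counts standing in for the dlCBF. The k-mer size, the stopping threshold, the crude loop-avoidance step, and extending only rightward are our own choices for the example, not Quip's parameters.

    from collections import Counter

    BASES = "ACGT"

    def count_kmers(reads, k):
        counts = Counter()
        for r in reads:
            for i in range(len(r) - k + 1):
                counts[r[i:i + k]] += 1
        return counts

    def extend_right(seed, counts, k, min_count=2):
        """Greedily extend a seed by choosing the most abundant k-mer that
        overlaps the contig end by k-1 bases (rightward only, for brevity)."""
        contig = seed
        while True:
            suffix = contig[-(k - 1):]
            best = max(BASES, key=lambda b: counts[suffix + b])
            if counts[suffix + best] < min_count:
                break
            contig += best
            counts[suffix + best] = 0    # crude loop avoidance for the sketch
        return contig

    reads = ["ACGTACGGA", "GTACGGATT", "CGGATTGCA", "GATTGCAAC"]
    counts = count_kmers(reads, k=5)
    print(extend_right(reads[0], counts, k=5))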
Counting k-mers efficiently with the help of Bloom filters was previously explored by Melsted and Pritchard (2011), who use it in addition to, rather than in place of, a hash table. The Bloom filter is used as a "staging area" to store k-mers occurring only once, reducing the memory required by the hash table. Concurrently with our work, Pell et al. (2011) have also developed a probabilistic de Bruijn graph assembler, but using a non-counting Bloom filter. While they demonstrate a significant reduction in memory use, unlike other de Bruijn graph assemblers, only the presence or absence of a k-mer is stored, not its abundance, which is essential information when the goal is producing accurate contigs.

2.4 Metadata

In designing the file format used by Quip, we included several useful features to protect data integrity. First, output is divided into blocks of several megabytes each. In each block a separate 64-bit checksum is computed for read identifiers, nucleotide sequences, and quality scores. When the archive is decompressed, these checksums are recomputed on the decompressed data and compared to the stored checksums, verifying the correctness of the output.
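A sketch of the per-block integrity check is given below. The checksum function Quip actually uses is not specified here, so an 8-byte BLAKE2b digest stands in for the 64-bit checksum, and the stream names are illustrative.

    import hashlib

    def checksum64(lines):
        """64-bit digest over one stream of a block (BLAKE2b stands in
        for whatever checksum the real format uses)."""
        h = hashlib.blake2b(digest_size=8)
        for line in lines:
            h.update(line.encode())
        return h.hexdigest()

    def block_checksums(block):
        return {stream: checksum64(lines) for stream, lines in block.items()}

    block = {
        "ids":       ["read1", "read2"],
        "sequences": ["ACGTACGT", "TTGACCAT"],
        "qualities": ["IIIIHHHH", "GGGGFFFF"],
    }
    stored = block_checksums(block)
    # On decompression, recompute over the decompressed streams and compare.
    assert stored == block_checksums(block)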

The integrity of an archived dataset can also be checked with the "quip --test" command.

Apart from data corruption, reference-based compression creates the possibility of data loss if the reference used for compression is lost, or an incorrect reference is used. To protect against this, Quip files store a 64-bit hash of the reference sequence, ensuring that the same sequence is used for decompression. To assist in locating the correct reference, the file name and the lengths and names of the sequences used in compression are also stored and accessible without decompression.

Additionally, block headers store the number of reads and bases compressed in the block, allowing summary statistics of a dataset to be listed without decompression using the "quip --list" command.

3 Results

3.1 Compression of Sequencing Data

We compared Quip to three commonplace general-purpose compression algorithms: gzip, bzip2, and xz, as well as more recently developed domain-specific compression algorithms: DSRC (Deorowicz and Grabowski, 2011) and Cramtools (Hsi-Yang Fritz et al., 2011). Other methods have been proposed (Tembe et al., 2010; Kozanitis et al., 2011; Bhola et al., 2011), but without publicly available software we were unable to independently evaluate them. We have also restricted our focus to lossless compression, and have not evaluated a number of promising lossy methods (Hsi-Yang Fritz et al., 2011; Wan et al., 2011), nor methods only capable of compressing nucleotide sequences (Cox et al., 2012). We invoked Quip in three ways: using only statistical compression ("quip"), using reference-based compression ("quip -r"), and finally with assembly-based compression ("quip -a"). Table 1 gives an overview of the methods evaluated.

We acquired six datasets (Table 2) from the Sequence Read Archive (Leinonen et al., 2011), representing a broad sampling of recent applications of next-generation sequencing. Genome and exome sequencing data was taken from the 1000 Genomes Project (The 1000 Genomes Consortium, 2010), total and mRNA data from the Illumina BodyMap 2.0 dataset (Asmann et al., 2012), Runx2 ChIP-Seq performed in a prostate cancer cell line (Little et al., 2011), and metagenomic DNA sequencing of biofilm found in the extremely acidic drainage from the Richmond Mine in California (Denef et al., 2010).

The Sequence Read Archive provides data in its own SRA compression format, which we also evaluate here. Programs for working with SRA files are provided in the SRA Toolkit, but a convenient means of converting from FASTQ to SRA is not, so we have not measured compression time and memory in this case.

In the case of reference-based compression implemented in Cramtools and Quip, we aligned reads from all but the metagenomic data to the GRCh37/hg19 assembly of the human genome using GSNAP (Wu and Nacu, 2010), generating a BAM file. Splice-junction prediction was enabled for the two RNA-Seq datasets. In the case of multiple alignments, we removed all but the primary (i.e., highest scoring). Quip is able to store secondary alignments, but the version of Cramtools we evaluated was not. When the purpose of alignment is simply to compactly archive the reads, secondary alignments have no purpose and merely inflate the file size, but in downstream analysis they are often extremely informative and retaining them may be desirable in some cases.

It is important to note that pure lossless compression is not the intended goal of Cramtools. Though we invoked Cramtools with options to include all quality scores (--capture-all-quality-scores), optional tags (--capture-all-tags), and retain unaligned reads (--include-unmapped-reads), it is unable to store the original read IDs. This puts it at an advantage when comparing compressed file size, as all other programs compared were entirely lossless, but as the only other available reference-based compression method, it is a useful comparison.

With each method we compressed and then decompressed each dataset on a server with a 2.8 GHz Intel Xeon processor and 64 GB of memory. In addition to recording the compressed file size (Figure 1), we measured wall clock run-time (Figure 3) and maximum memory usage (Table 3) using the Unix time command. Run time was normalized to the size of the original FASTQ file, measuring throughput in megabytes per second.

For the reference-based compressors, time spent aligning reads to the genome was not included, though it required up to several hours. Furthermore, while Quip is able to compress aligned reads in an uncompressed SAM file and then decompress back to SAM (or FASTQ), the input and output of Cramtools is necessarily a BAM file, and so we also use BAM files as input and output in our benchmarks of "quip -r". This conflates measurements of compression and decompression time: since BAM files are compressed with zlib, compression times for the reference-based compressors include the time needed to decompress the BAM file, and decompression times include re-compressing the output to BAM. Consequently, decompression speeds in particular for both Cramtools and reference-based Quip ("quip -r") appear significantly lower than they otherwise might.

In the single-genome samples, we find that reference-based compression using Quip consistently results in the smallest file size (i.e., highest compression), yet even without a reference sequence, compression is quite high, even matching the reference-based approach used by Cramtools in two of the datasets. Assembly-based compression provides almost no advantage in the single-genome ChIP-Seq, Whole Genome, and Exome datasets. Because we assemble only several million reads (2.5 million, by default), coverage is only enough to assemble representatives of certain families of repeats in these datasets.

In the RNA-Seq datasets, more is assembled (approximately 35% and 70% of the reads are aligned to assembled contigs in the total RNA-Seq and mRNA-Seq datasets, respectively), but this only provides a small increase in compression overall. Most probably, the aligned reads in these datasets are relatively low-entropy, composed largely of protein-coding sequence, and would be encoded only moderately less compactly with the Markov chain method.

Nevertheless, assembly-based compression does provide a significant advantage in the metagenomic dataset. Nearly 95% of reads align to assembled contigs, resulting in a large reduction in compressed file size. Though this may be the result of low species diversity and may not extend to all metagenomic datasets, it shows that assembly-based compression is beneficial in certain cases. Though it comes at some cost, we are able to perform assembly with exceedingly high efficiency: memory usage is increased only by 400 MB, and compression time is faster than all but DSRC and assembly-free Quip, even while assembling 2.5 million reads and aligning millions more reads to the assembled contigs.

With all invocations of Quip, the size of the resulting file is dominated, sometimes overwhelmingly, by quality scores (Figure 2). There is not a direct way to measure this using the other methods evaluated, but this result suggests that increased compression of nucleotide sequences will result in diminishing returns.

                 Average Memory Usage (MB)
    Method        Compression    Decompression
    gzip          0.8            0.7
    bzip2         7.0            4.0
    xz            96.2           9.0
    sra           NA             46.9
    dsrc          28.3           15.1
    cramtools     6803.8         6749.3
    quip          412.1          412.9
    quip -a       823.0          794.1
    quip -r       1710.0         1717.2

Table 3: Maximum memory usage was recorded for each program and dataset. We list the average across datasets of these measurements. For most of the programs, memory usage varied by less than 10% across datasets and is well summarized by the mean, with the exception of SRA. Decompression used less than 60 MB except in the human exome sequencing dataset, in which over 1.4 GB was used. We report the mean with this outlier removed.

    Program                Input              Methods
    gzip (1.4)             Any                Lempel-Ziv
    bzip2 (1.0.6)          Any                Burrows-Wheeler and Huffman coding
    xz (5.0.3)             Any                Lempel-Ziv with arithmetic coding
    sra (2.1.10)           FASTQ              No published description
    dsrc (0.3)             FASTQ              Lempel-Ziv and Huffman coding
    cramtools (0.85-b51)   BAM                Reference-based compression
    quip (1.1.0)           FASTQ, SAM, BAM    Statistical modeling with arithmetic coding
    quip -a (1.1.0)        FASTQ, SAM, BAM    Quip with assembly-based compression
    quip -r (1.1.0)        SAM, BAM           Quip with reference-based compression

Table 1: Methods evaluated in the results section. All methods compared are lossless, with the exception of Cramtools which does not preserve the original read identifiers.

    Accession Number   Description        Source                                                           Read Length   Size (GB)
    SRR400039          Whole Genome       The 1000 Genomes Project (The 1000 Genomes Consortium, 2010)    101           67.6
    SRR125858          Exome              The 1000 Genomes Project (The 1000 Genomes Consortium, 2010)    76            55.2
    SRR372816          Runx2 ChIP-Seq     Little, et al. (Little et al., 2011)                             50            14.6
    ERR030867          Total RNA          The BodyMap 2.0 Project, Asmann, et al. (Asmann et al., 2012)   100           19.5
    ERR030894          mRNA               The BodyMap 2.0 Project, Asmann, et al. (Asmann et al., 2012)   75            16.4
    SRR359032          Metagenomic DNA    Denef, et al. (Denef et al., 2010)                               100           8.9

Table 2: We evaluated compression on a single lane of sequencing data taken from a broad collection of seven studies. Except for the metagenomic data, each was generated from human samples. Uncompressed file sizes are shown in gigabytes.

Figure 1: One lane of sequencing data from each of six publicly available datasets (Table 2) was compressed using a variety of methods (Table 1). The size of the compressed data is plotted in proportion to the size of uncompressed data. Note that all methods evaluated are entirely lossless with the exception of Cramtools, marked in this plot with an asterisk, which does not preserve read IDs, giving it some advantage in these comparisons. Reference-based compression methods, in which an external sequence database is used, were not applied to the metagenomic dataset for lack of an obvious reference. These plots are marked "N/A".

8 Figure 2: After compressing with Quip using the base algorithm (labeled “quip”), assembly-based compression (“quip -a”), and reference-based compression (“quip -r”), we measure the size of read identifiers, nucleotide sequences, and quality scores in proportion to the total compressed file size. Reference-based compression (“quip -r”) was not applied to the metagenomic data.

Figure 3: The wall clock run-time of each evaluation was recorded using the Unix time command. Run-time is normalized by the size of the original FASTQ file to obtain a measure of compression and decompression throughput, plotted here in megabytes per second. Compression speed is plotted to the left of the zero axis, and decompression speed to the right. There is not a convenient means of converting from FASTQ to SRA, so compression time is not measured for SRA. Additionally, reference-based methods are not applied to the metagenomic data. In the case of reference-based compression implemented in "cramtools" and "quip -r", input and output are in the BAM format, so the speeds plotted here include time needed to decompress and compress the BAM input and output, respectively.

3.2 Characteristics of the d-left Counting Bloom Filter

Though our primary goal is efficient compression of sequencing data, the assembly algorithm we developed to achieve this is of independent interest. Only very recently has the idea of using probabilistic data structures in assembly been broached, and to our knowledge, we are the first to build a functioning assembler using any version of the counting Bloom filter. Data structures based on the Bloom filter make a trade-off between space and the probability of inserted elements colliding. Tables can be made arbitrarily small at the cost of increasing the number of collisions. To elucidate this trade-off we performed a simple simulation comparing the dlCBF to a hash table.

Comparisons of data structures are notoriously sensitive to the specifics of the implementation. To perform a meaningful benchmark, we compared our dlCBF implementation to the sparsehash library (http://code.google.com/p/sparsehash/), an open-source hash table implementation with the expressed goal of maximizing space efficiency. Among many other uses, it is the core data structure in the ABySS (Simpson and Durbin, 2011) and PASHA (Liu et al., 2011) assemblers.

We randomly generated 10 million unique, uniformly distributed 25-mers and inserted them into a hash table, allocated in advance to be of minimal size while avoiding resizing and rehashing. We repeated this with dlCBF tables of increasing sizes. Upon insertion, a collision occurs when the hash functions computed on the inserted k-mer collide with a previously inserted k-mer. An insertion may also fail when a fixed-size table is filled to capacity and no empty cells are available. For simplicity, we count these occurrences also as collisions. Wall clock run-time and maximum memory usage were both recorded using the Unix time command.

We find that with only 20% of the space of the hash table, the dlCBF accrues a collision rate of less than 0.001 (Figure 4). While the hash table performed the 10 million insertions in 7.34 seconds, it required only 4.48 seconds on average for the dlCBF to do the same, with both simulations run on a 2.8 GHz Intel Xeon processor. Table size did not greatly affect the runtime of the dlCBF. The dlCBF implemented in Quip uses a fixed table size, conservatively set to allow for 100 million unique k-mers with a collision rate of approximately 0.08% in simulations.

Figure 4: The trade-off between memory usage and false positive rate in the dlCBF is evaluated by inserting 10 million unique 25-mers into tables of increasing size. Memory usage is reported as the proportion of the memory used by a memory-efficient hash table to do the same.

Though the authors of sparsehash claim only a 4-bit overhead for each entry, and have gone to considerable effort to achieve such efficiency, it still must store the k-mer itself, encoded in 64 bits. The dlCBF avoids this, storing instead a 14-bit "fingerprint", or hash, of a k-mer, resulting in the large savings we observe. Of course, a 25-mer cannot be uniquely identified with 14 bits. False positives are thus introduced, yet they are kept at a very low rate by the d-left hashing scheme: several hash functions are used, so that multiple hash values must collide simultaneously to produce a false positive, an infrequent event if reasonably high-quality hash functions are chosen.

A previous analysis of the dlCBF by Zhang et al. (2009) compared it to two other variations of the counting Bloom filter and concluded that "the dlCBF outperforms the others remarkably, in terms of both space efficiency and accuracy." Overall, this data structure appears particularly well suited to highly efficient de novo assembly.

4 Discussion

Compared to the only other published and freely available reference-based compressor, Cramtools, we see a significant reduction in compressed file size, despite read identifiers being discarded by Cramtools and retained by Quip. This is combined with dramatically reduced memory usage, comparable run-time (slightly faster compression paired with varying relative decompression speeds), and increased flexibility (e.g. Quip is able to store multiple alignments of the same read and does not require that the BAM file be sorted or indexed). Conversely, Cramtools implements some potentially very useful lossy methods not provided by Quip. For example, a number of options are available for selectively discarding quality scores that are deemed unnecessary, which can significantly reduce compressed file sizes.

We find that assembly-based compression offers only a small advantage in many datasets. Because we assemble a relatively small subset of the data, coverage is too low when sequencing a large portion of a eukaryotic genome. In principle, the assembler used could be scaled to handle exponentially more reads, but at the cost of being increasingly computationally intensive. In such cases, the reference-based approach may be more appropriate. Yet, in the metagenomic dataset we evaluate, where there is no obvious reference sequence, nucleotide sequences are reduced in size by over 40% (Figure 2). While limited in scope, for certain projects assembly-based compression can prove to be invaluable, and the algorithm devised here makes it a tractable option.

The Lempel-Ziv algorithm, particularly as implemented in gzip/zlib, has become a reasonable default choice for compression. The zlib library has matured and stabilized over the course of two decades and is widely available. The BAM and Goby formats both use zlib for compression, and compressing FASTQ files with gzip is still common practice. Despite its ubiquity, our benchmarks show that it is remarkably poorly suited to NGS data. Both compression ratio and compression time were inferior to the other programs evaluated. For most purposes, the gains in decompression time do not make up for its shortcomings. With the more sophisticated variation implemented in xz, compression is improved but at the cost of tremendously slow compression.

Our use of high-order Markov chain models and de novo assembly results in a program that uses significantly more memory than the others tested, with the exception of Cramtools. Though limiting the memory used by a general-purpose compression program enables it to be used on a wider variety of systems, this is less important in this domain-specific application. Common analyses of next-generation sequencing data, whether alignment, assembly, isoform quantification, peak calling, or SNP calling, all require significant computational resources. Targeting low-memory systems would not be of particular benefit: next-generation sequencing precludes previous-generation hardware.

Though memory consumption is not a top priority, run-time is important. Newer instruments like the HiSeq 2000 produce far more data per lane than previously possible. And, as the cost of sequencing continues to decrease, experiments will involve more conditions, replicates, and timepoints. Quip is able to compress NGS data at three times the speed of gzip, while performing de novo assembly of millions of reads, and up to five times as fast without the assembly step. Only DSRC is faster, but with consistently lower compression. In addition, our reference-based compression algorithm matches the speed of Cramtools but with substantially better lossless compression.

As illustrated in Figure 2, the size of compressed NGS data is dominated by quality scores. More sophisticated compression of nucleotide sequences may be possible but will result in only minor reductions to the overall file size. The largest benefit would be seen by reducing quality scores. While it is easy to reduce the size of quality data by coarse binning or other lossy methods, it is very hard to determine the effect such transformations will have on downstream analysis. Future work should concentrate on studying lossy quality score compression, strictly guided by minimizing loss of accuracy in alignment, SNP calling, and other applications. Quality scores are encoded in Quip using an algorithm that is suitably general so that lossy transformations (e.g., binning) can be automatically exploited, and can be treated as an optional preprocessing step.

Other aspects of the algorithm presented here can be improved, and will be with future versions of the software. A large body of work exists exploring heuristics to improve the quality of de novo assembly; however, the algorithm in Quip uses a very simple greedy approach. Assembly could likely be improved without greatly reducing efficiency by exploring more sophisticated methods. We currently perform no special handling of paired-end reads, so that mates are compressed independently of each other. In principle some gains to compression could be made by exploiting pair information. We also intend to implement parallel compression and decompression. This is non-trivial, but quite possible and worthwhile given the abundance of data and the ubiquity of multi-core processors.

Combining reference-based and assembly-based techniques with carefully tuned statistical compression, the algorithm presented in Quip probes the limit to which next-generation sequencing data can be compressed losslessly, yet remains efficient enough to be a practical tool when coping with the deluge of data that biology research is now presented with.

5 Funding

This work was supported, in whole or in part, by Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN272200800060C, and by the Public Health Service grant (P51RR000166) from the National Institutes of Health.

6 Acknowledgments

We are grateful to James Bonfield, Matt Mahoney, Seth Hillbrand, and other participants in the Pistoia Alliance Sequence Squeeze competition, whose work inspired many improvements to our own. While preparing the manuscript, Richard Green provided valuable feedback and suggestions.

Conflict of interest statement. None declared.

References

Peter J. A. Cock, Christopher J. Fields, Naohisa Goto, Michael L. Heuer, and Peter M. Rice. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research, 38(6):1767–71, April 2010. ISSN 1362-4962. doi: 10.1093/nar/gkp1137.

Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078–9, August 2009. ISSN 1367-4811. doi: 10.1093/bioinformatics/btp352.

Waibhav Tembe, James Lowey, and Edward Suh. G-SQZ: compact encoding of genomic sequence and quality data. Bioinformatics, 26(17):2192–2194, September 2010. ISSN 1367-4811. doi: 10.1093/bioinformatics/btq346.

Sebastian Deorowicz and Szymon Grabowski. Compression of DNA sequence reads in FASTQ format. Bioinformatics (Oxford, England), 27(6):860–2, March 2011. ISSN 1367-4811. doi: 10.1093/bioinformatics/btr014.

Christos Kozanitis, Chris Saunders, Semyon Kruglyak, Vineet Bafna, and George Varghese. Compressing genomic sequence fragments using SlimGene. Journal of Computational Biology: a Journal of Computational Molecular Cell Biology, 18(3):401–13, March 2011. ISSN 1557-8666. doi: 10.1089/cmb.2010.0253.

Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane, and Ewan Birney. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Research, 21(5):734–40, May 2011. ISSN 1549-5469. doi: 10.1101/gr.114819.110.

Scott Christley, Yiming Lu, Chen Li, and Xiaohui Xie. Human genomes as email attachments. Bioinformatics (Oxford, England), 25(2):274–5, January 2009. ISSN 1367-4811. doi: 10.1093/bioinformatics/btn582.

Raymond Wan, Vo Ngoc Anh, and Kiyoshi Asai. Transformations for the Compression of FASTQ Quality Scores of Next Generation Sequencing Data. Bioinformatics (Oxford, England), 28(5):628–635, December 2011. ISSN 1367-4811. doi: 10.1093/bioinformatics/btr689.

Amir Said. Introduction to Arithmetic Coding - Theory and Practice. Hewlett-Packard Laboratories Report, HPL-2004-7, 2004.

P. A. Pevzner, H. Tang, and M. S. Waterman. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the United States of America, 98(17):9748–53, August 2001. ISSN 0027-8424. doi: 10.1073/pnas.171285098.

Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, July 1970. ISSN 0001-0782. doi: 10.1145/362686.362692.

Li Fan, Pei Cao, J. Almeida, and A. Z. Broder. Summary cache: a scalable wide-area Web cache sharing protocol. IEEE/ACM Transactions on Networking, 8(3):281–293, June 2000. ISSN 1063-6692. doi: 10.1109/90.851975.

Flavio Bonomi, Michael Mitzenmacher, and Rina Panigrahy. An improved construction for counting Bloom filters. 14th Annual European Symposium on Algorithms, LNCS 4168, pages 684–695, 2006.

Pall Melsted and Jonathan K. Pritchard. Efficient counting of k-mers in DNA sequences using a Bloom Filter. BMC Bioinformatics, 12(1):333, 2011. ISSN 1471-2105. doi: 10.1186/1471-2105-12-333.

Jason Pell, Arend Hintze, Rosangela Canino-Koning, Adina Howe, James M. Tiedje, and C. Titus Brown. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. arXiv preprint, 1(1):1–11, 2011.

Vishal Bhola, Ajit S. Bopardikar, Rangavittal Narayanan, Kyusang Lee, and TaeJin Ahn. No-Reference Compression of Genomic Data Stored in FASTQ Format. 2011 IEEE International Conference on Bioinformatics and Biomedicine, pages 147–150, November 2011. doi: 10.1109/BIBM.2011.110.

A. J. Cox, M. J. Bauer, T. Jakobi, and G. Rosone. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics, 28(11):1415–1419, May 2012. ISSN 1367-4803. doi: 10.1093/bioinformatics/bts173.

Rasko Leinonen, Hideaki Sugawara, and Martin Shumway. The Sequence Read Archive. Nucleic Acids Research, 39(Database issue):D19–21, January 2011. ISSN 1362-4962. doi: 10.1093/nar/gkq1019.

The 1000 Genomes Consortium. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061–73, October 2010. ISSN 1476-4687. doi: 10.1038/nature09534.

Yan W. Asmann, Brian M. Necela, Krishna R. Kalari, Asif Hossain, Tiffany R. Baker, Jennifer M. Carr, Caroline Davis, Julie E. Getz, Galen Hostetter, Xing Li, Sarah A. McLaughlin, Derek C. Radisky, Gary P. Schroth, Heather E. Cunliffe, Edith A. Perez, and E. Aubrey Thompson. Detection of Redundant Fusion Transcripts as Biomarkers or Disease-Specific Therapeutic Targets in Breast Cancer. Cancer Research, pages 1921–1928, April 2012. ISSN 1538-7445. doi: 10.1158/0008-5472.CAN-11-3142.

Gillian H. Little, Houtan Noushmehr, Sanjeev K. Baniwal, Benjamin P. Berman, Gerhard A. Coetzee, and Baruch Frenkel. Genome-wide Runx2 occupancy in prostate cancer cells suggests a role in regulating secretion. Nucleic Acids Research, pages 1–10, December 2011. ISSN 1362-4962. doi: 10.1093/nar/gkr1219.

Vincent J. Denef, Ryan S. Mueller, and Jillian F. Banfield. AMD biofilms: using model communities to study microbial evolution and ecological complexity in nature. The ISME Journal, 4(5):599–610, May 2010. ISSN 1751-7370. doi: 10.1038/ismej.2009.158.

Thomas D. Wu and Serban Nacu. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics (Oxford, England), 26(7):873–81, April 2010. ISSN 1367-4811. doi: 10.1093/bioinformatics/btq057.

J. T. Simpson and R. Durbin. Efficient de novo assembly of large genomes using compressed data structures. Genome Research, December 2011. ISSN 1088-9051. doi: 10.1101/gr.126953.111.

Yongchao Liu, Bertil Schmidt, and Douglas L Maskell. Parallelized short read assembly of large genomes using de Bruijn graphs. BMC bioinformatics, 12(1):354, January 2011. ISSN 1471-2105. doi: 10.1186/1471-2105-12-354.

Jin Zhang, Jiangxing Wu, Julong Lan, and Jianqiang Liu. Perfor- mance evaluation and comparison of three counting Bloom filter schemes. Journal of Electronics (China), 26(3):332–340, May 2009. ISSN 0217-9822. doi: 10.1007/s11767-008-0031-x.
