
Parallel decompression of gzip-compressed files and random access to DNA sequences

Maël Kerbiriou
INRIA Lille - Nord Europe
CRIStAL, Université de Lille
Lille, France
[email protected]

Rayan Chikhi
C3BI USR 3756
Institut Pasteur & CNRS
Paris, France
[email protected]

Abstract—Decompressing a file made by the gzip program at an arbitrary location is in principle impossible, due to the nature of the compression algorithm. Consequently, no existing program can take advantage of parallelism to rapidly decompress large gzip-compressed files. This is an unsatisfactory bottleneck, especially for the analysis of large sequencing data experiments. Here we propose a parallel algorithm and an implementation, pugz, that performs fast and exact decompression of any text file. We show that pugz is an order of magnitude faster than gunzip, and 5x faster than a highly-optimized sequential implementation (libdeflate). We also study the related problem of random access to compressed data. We give simple models and experimental results that shed light on the structure of gzip-compressed files containing DNA sequences. Preliminary results show that random access to sequences within a gzip-compressed FASTQ file is almost always feasible at low compression levels, yet is approximate at higher compression levels.

Index Terms—compression, bioinformatics, DNA sequences, parallel

I. INTRODUCTION

The compression of genomic sequencing data continues to be a busy research area, using fundamental techniques such as Lempel-Ziv parsing [16], the Burrows-Wheeler transform [5], or de Bruijn graphs [2]. Despite many promising advances, the most popular and practical program for compressing raw sequencing data nowadays remains the Lempel-Ziv based gzip program, arguably due to its speed and availability on all systems. There exist parallel programs for speeding up gzip compression, e.g. pigz¹. The underlying compression algorithm of gzip, DEFLATE, easily lends itself to processing blocks of data concurrently.

However, the decompression aspects are more challenging and have been less studied, leaving us with only sequential algorithms. Investigating decompression would not be a pressing matter if it was already an IO-bound operation, but actually only around 30-50 MB of compressed data per second can be processed by gunzip on a modern processor. This is much below the 500 (resp. 100-200) MB/sec read throughput of SATA solid-state (resp. mechanical) drives. In the near future, the wide availability of NVMe solid-state drives will allow for even faster reads, up to 3 GB/sec. Therefore a potential slowdown of 1-2 orders of magnitude exists at the beginning of many tools, and in particular bioinformatics pipelines that take compressed files as input.

¹ github.com/madler/pigz
If one could efficiently perform random accesses to content in gzipped files, it would unlock multi-core decompression. As a result, data would be streamed significantly faster than if it was read from an uncompressed file. However, the DEFLATE compression algorithm in gzip works in such a way that accessing a random location does in principle require reading the whole file up to the desired location. This is because (1) variable-length blocks are created without any index nor explicit block starts/ends, and (2) back-references are made to strings that appear previously in the decompressed file, which is the essence of Lempel-Ziv parsing. A more precise description of DEFLATE is provided later in the article.

To build some intuition on why random access is hard, let an example file contain n identical repetitions of a sentence followed by m identical repetitions of a completely different sentence. Suppose that neither the sentence lengths, n, nor m are known. The compression algorithm in gzip uses heuristics yet is very likely to always encode repetitions of the sentences using back-references and not in clear form, except at two locations: (i) the beginning of the file, and (ii) the DEFLATE block(s) where the second sentence appears for the first time. A random access towards the end of the file would require to somehow obtain information that is present only at location (ii). Yet, since this location cannot be easily guessed, it is challenging to design a better strategy than to fully decompress the file.

There exist techniques to create so-called "blocked gzip" files, i.e. files that are compressed as a sequence of small and independent blocks. Such files can also contain an additional index that enables random access and parallel decompression. However, as we will see in Section II, these techniques are (i) not so widespread and (ii) yield worse compression ratios.

In this work, we will focus on the parallel decompression of files compressed by the gzip program. In bioinformatics, virtually every tool that processes large amounts of raw sequencing data begins by reading large .fastq.gz file(s). We first present simple models to provide a general understanding of gzip compression. Parallel decompression would be straightforward if one could perform decompression at random locations. We therefore start by studying the feasibility of random access in compressed files, with a focus on files containing DNA sequences. Random access turns out to be challenging for files with high compression ratio. We thus propose an alternative strategy via the design of a new, general-purpose, exact and parallel gzip decompression algorithm that can process any arbitrary ASCII file. We demonstrate the usefulness of this algorithm by showing significant speed-ups compared to the baseline method (gunzip) and a highly optimized implementation (libdeflate).
II. RELATED WORKS

Several algorithms have been proposed for the compression of genomic data (for a survey, see [6]). Yet, the most popular solution arguably remains gzip. It achieves roughly a 3-fold size reduction on files in the FASTQ format. While files produced by gzip are not designed to make random access efficient, there exist variants that permit this operation. For instance, a modification of Lempel-Ziv parsing has been proposed to support random access at the expense of a slightly lower compression ratio [10], but the gzip program does not implement it. In the same spirit, the tabix/bgzip programs from the SAMtools/HTSlib project [12] create so-called blocked files that are indexed and gzip-compatible². One can also create an index of blocks within a classical gzip-compressed file for later faster access to random locations [11]. The index is created during an initial sequential decompression of the whole file. This essentially solves random access, except that the technique is not so widespread, as it requires a separate file or a different file format, and does not apply when one only needs to read a given compressed file once. Anecdotally, a majority of compressed files hosted in the Sequence Read Archive repository and uploaded in 2018 are not compressed in blocks.

Previous efforts have aimed to improve compression via better algorithms or hardware acceleration, e.g. in [1], [14]. Efficient decompression has been relatively less studied, except in the following works. A fine-grained hardware-accelerated decompression algorithm achieves a 2x speedup compared to gunzip [15]. A recent unpublished highly-optimized sequential implementation (libdeflate) achieves roughly a 3x speedup³. Perhaps the closest work to ours is an attempt to recover corrupted DEFLATE-compressed files [3], [4]. It is proposed that damaged gzip-compressed files containing English text can be partially reconstructed from random locations. However, the bulk of the method hinges on frequency analysis, which does not directly translate to files containing DNA, and also settles for lossy reconstruction.

² blastedbio.blogspot.fr/2011/11/bgzf-blocked-bigger-better-gzip.
³ github.com/ebiggers/libdeflate
III. DEFLATE ALGORITHM

A file compressed using gzip (or zlib) consists of a header, a sequence of bytes compressed using the DEFLATE algorithm, and a final checksum [8]. In this section we provide a high-level description of DEFLATE, where many details will be omitted [7]. Input data is partitioned into blocks, that may or may not be compressed, depending on whether the data is compressible. To simplify the exposition we will consider (in this section only) the creation of a single compressed block, which is performed in two stages. The first stage is LZ77-style parsing, which encodes the input data into a succession of literals (i.e. unmodified bytes) and offset/length integer pairs, later denoted as matches, referring to previous occurrences of decompressed characters. In the original definition of LZ77 [16], the parsing has to be of the form "match-literal-match-literal-...", as follows.

Definition 1 (LZ77 parsing). Let s be a string. An LZ77 parsing of s is a sequence of phrases p1, ..., pz, where each pi is a tuple consisting of two integers (offset and length) and a character. The first phrase p1 encodes an empty match (0, 0) along with the first character of s. If p1, ..., pi-1 represent s up to position j, then pi records the offset and length of the longest prefix of s[j+1 ...] that occurs in s[... j], along with the character that immediately follows the prefix.

We now define a more general flavor that better abstracts the product of DEFLATE.

Definition 2 (mixed LZ77-style parsing). Let s be a string. A mixed LZ77-style parsing (mLZ) of s is a sequence p'1, ..., p'z, where each p'i is either a single character of s or a pair of offset/length integers that refers to a substring of s within p'1, ..., p'i-1.

LZ77 parsing can be seen as a special case of mLZ. In DEFLATE, matches are of length between 3 and 258 bytes, and are made within a sliding window of length 32 KB, i.e. offsets are never larger than 32 KB.

The second stage of DEFLATE uses Huffman coding to entropy-compress the output of the first stage. In practice, DEFLATE uses two Huffman code trees, but going into more details will not be necessary here. An important point however is that Huffman compression is reset at the beginning of each block. To summarize, we can abstract the compression process as follows.

Algorithm 1 (DEFLATE compression). Let s be a string. DEFLATE compression produces a Huffman-compressed representation from a mixed LZ77-style parsing of s.

Note that Algorithm 1 is not fully specified, as there are multiple ways to encode a string as a mLZ (e.g. using more literals at the expense of longer matches, or the opposite). Decompression, however, is more straightforward. DEFLATE decompression processes a stream of Huffman-coded symbols, and writes decoded bytes into a circular buffer of fixed size W = 32 KB, that we call context. A data byte is simply appended to the buffer if it corresponds to a literal value; otherwise a match is decoded by copying a certain number of bytes from a previous position in the buffer.
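As an illustration of this context mechanism, the sketch below replays a toy mixed LZ77-style parsing into an output buffer the way the decompression procedure described next does; the Phrase type, the in-memory (non-circular) buffer and the example parsing are simplifying assumptions for illustration, not DEFLATE's actual bit-level format.

```cpp
// Minimal model of DEFLATE-style decompression: literals are appended, matches
// copy bytes from at most 32 KB back. The Phrase struct and the toy parsing in
// main() are illustrative only.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Phrase {
    bool is_literal;
    char literal;     // used when is_literal is true
    uint32_t offset;  // distance back into the already-decoded text
    uint32_t length;  // number of bytes to copy
};

static const uint32_t W = 32 * 1024;  // DEFLATE context size

std::string decompress(const std::vector<Phrase>& parsing) {
    std::string out;
    for (const Phrase& p : parsing) {
        if (p.is_literal) {
            out.push_back(p.literal);                       // literal byte
        } else if (p.offset >= 1 && p.offset <= W && p.offset <= out.size()) {
            for (uint32_t i = 0; i < p.length; ++i)         // byte-by-byte copy, so
                out.push_back(out[out.size() - p.offset]);  // overlapping matches work
        } else {
            break;  // invalid back-reference (cf. Appendix X-A checks)
        }
    }
    return out;
}

int main() {
    std::vector<Phrase> parsing;
    for (char c : std::string("GATTACA"))
        parsing.push_back({true, c, 0, 0});
    parsing.push_back({false, 0, 7, 14});      // copy 14 bytes from 7 back (overlaps)
    std::cout << decompress(parsing) << "\n";  // GATTACAGATTACAGATTACA
}
```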

Algorithm 2 (DEFLATE decompression). The algorithm decompress(C, w) takes as input a representation of an entropy-compressed mixed LZ77-style parsing C, and a circular buffer w (context). While parsing C from left to right, decoded literals are sequentially appended to w, and matches are copied from previous positions in w.

In essence, gzip performs decompression by calling decompress(C, w) on a stream of compressed data, using an initially empty context w. Clearly the first phrase of C needs to be a literal. And at the beginning of C, matches can only copy characters from what was previously appended to the context.

IV. FEASIBILITY OF RANDOM ACCESS

A. Problem statement

Random access could be performed into positions in the original uncompressed file, or positions within the compressed stream. In this article we will exclusively focus on the latter. Performing decompression at a random location in a DEFLATE stream requires the knowledge of two pieces of information: the start position of a nearby block, and a fully decompressed 32 KB context. We will later see that a block position can be reliably guessed using exhaustive search. Obtaining a decompressed context is the challenging part. We formulate the problem as follows:

Problem 1 (Random access in DEFLATE stream). Given a mixed LZ77-style parsing C corresponding to the compression of an unknown string s, return the smallest possible integer i and a context w such that decompress(C[i ...], w) returns a suffix of s.

Here, the input C can be thought of as a suffix of a mLZ produced by Definition 1. We allow a prefix of C to be discarded, typically to avoid having to decompress any back-reference to the (unknown) context that precedes C. Note that the problem does not appear to be well-formulated, as checking that the decoded string is indeed a suffix of s is not possible without the knowledge of s. However, one can make sure that there is no ambiguity as to how the output could have been decompressed.
B. Undetermined context propagation technique

When starting decompression at a random location, reliably guessing the contents of the context w that precedes the decompression location does not appear to be a trivial task. Therefore it makes sense to start with an undetermined context, e.g. a context that consists only of question mark '?' characters, that we will call undetermined characters.

We postulate that for certain compressed files, running the decompress procedure for a certain amount of time ends up with decoded contexts that no longer contain undetermined characters. On the one hand, back-references (matches) to initially undetermined characters propagate those characters into later contexts. On the other hand, characters that are compressed as literal values contribute to making the current context less undetermined, and are possibly propagated into future contexts as well. Thus if sufficiently many literals are produced by the compression algorithm, there is a chance that any undetermined character no longer gets back-referenced after a sufficiently long decompression. An illustration of the process is given in Figure 1.

C. Experimental test on random DNA

To further study the postulate above, we experimentally created a random DNA string of length 1 Mbp and compressed it with gzip at various compression levels. For each level we ran decompress starting from the second compressed block, using a fully undetermined context. At the default compression level (-6), the average offset of matches is o_a = 3602. We then counted the number of undetermined decompressed characters in non-overlapping windows of size o_a. While gzip uses a sliding window for back-references, using non-overlapping windows more appropriately mimics the model developed later in Section V-C.

We refer the reader to the top part of Figure 2, disregarding for now the "model" line. At the default and -4 compression levels, after decompressing around 150 x 3600 = 540 KB of random DNA compressed data, the proportion of undetermined characters in a window vanishes. At compression level -9, the percentage of undetermined characters eventually vanishes after window index 790 (not shown in Figure 2). If no undetermined character remains, what is decompressed must be a correct suffix. Thus, this experiment hints that the approach proposed in Section IV-B provides a solution to Problem 1 on files containing random DNA that were compressed with level -4 and higher.

It may appear intriguing that at the lowest compression levels (-1 to -3), the whole file starting from block 2 is encoded using only matches and zero literals, and thus solving Problem 1 appears to be impossible. In fact this phenomenon occurs also on much larger files (200 MB) containing random DNA. However, we will later see that the model proposed in Section V-A explains this effect, and that it is not a critical problem as we are mainly interested in FASTQ files. Indeed, FASTQ files have a different structure than files containing random DNA and their compression using gzip behaves differently, as we see next.

D. Compression of FASTQ-like strings

The FASTQ format is a text file format that consists of groups of four lines: a header, a DNA sequence, another header, and a string of ASCII-encoded quality values. Since DNA sequences are separated by headers and quality values, compression of FASTQ files yields larger offsets than in files containing only random DNA (i.e. as in Section IV-C). This will have an impact on compression as, at least at the lowest compression levels, gzip favors matches with small offsets and may output literals instead of matches with large offsets.

We created a FASTQ-like string of length 150 MB by repeating 150 random DNA characters followed by 300 'x' characters. Figure 2 (bottom) shows that, at the lowest compression level, all DNA characters eventually become decompressed as literals, but only after around 25 MB (> 7000 o_a) of decompressed data. Consequently, random access in FASTQ-like files appears feasible at any compression level.
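As a concrete rendering of the experiments of Sections IV-B and IV-C, the sketch below starts decoding with a fully undetermined ('?') context and reports the fraction of undetermined bytes per window; it reuses the toy Phrase representation from the earlier listing rather than a real DEFLATE stream.

```cpp
// Sketch of the Sections IV-B/IV-C measurement: decode a toy mixed LZ77-style
// parsing starting from a fully undetermined ('?') 32 KB context and report
// the fraction of undetermined bytes per window of the decoded text.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Phrase { bool is_literal; char literal; uint32_t offset, length; };

std::vector<double> undetermined_per_window(const std::vector<Phrase>& parsing,
                                            size_t window) {
    std::string buf(32 * 1024, '?');   // undetermined initial context
    const size_t start = buf.size();   // decoded text begins here
    for (const Phrase& p : parsing) {
        if (p.is_literal) buf.push_back(p.literal);
        else for (uint32_t i = 0; i < p.length; ++i)
            buf.push_back(buf[buf.size() - p.offset]);   // may propagate '?'
    }
    std::vector<double> frac;
    for (size_t w = start; w < buf.size(); w += window) {
        size_t end = std::min(buf.size(), w + window), n = 0;
        for (size_t i = w; i < end; ++i) n += (buf[i] == '?');
        frac.push_back(double(n) / double(end - w));
    }
    return frac;
}

int main() {
    // Matches reaching into the unknown context copy '?'s; later matches that
    // only reach freshly decoded literals do not, so the fraction drops.
    std::vector<Phrase> parsing = {
        {true, 'A', 0, 0}, {true, 'C', 0, 0},
        {false, 0, 100, 4},                 // copies 4 undetermined bytes
        {true, 'G', 0, 0}, {true, 'T', 0, 0},
        {false, 0, 2, 8},                   // copies only determined bytes
    };
    for (double f : undetermined_per_window(parsing, 8))
        std::cout << f << "\n";             // prints 0.5 then 0
}
```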

Fig. 1. Illustration of the decompression process starting from a random location in a gzip-compressed FASTQ file. The location of the first following block is determined. A 32 KB context is initialized, consisting of undetermined ’?’ characters. Then, all subsequent blocks are decompressed into literals and back-references. Each back-reference copies characters from its 32 KB context. The first 192 bytes of a selection of compressed blocks are shown, on a real Illumina FASTQ file. Subsequent blocks contain less and less undetermined characters, as more literals are decompressed and back-referenced. decompressed data. Consequently, random access in FASTQ- string using only matches of length 3 or more, i.e. without like files appears feasible at any compression level. emitting any literal after some point. In such a scenario, Problem 1 would seem impossible to solve. This is indeed V. GZIP COMPRESSIONOF DNA SEQUENCES what we had observed in Section IV-C when a random DNA In this section, we provide simple models that explain our string is encoded using gzip -1 (lowest compression level). observations from Section IV-C. Concretely, we will quantify the number of literals emitted during gzip compression of B. Non-greedy parsing DNA sequences. Fortunately, gzip attempts to maximize compression ratio A. Simplified model for the compression of random DNA and sometimes uses literals instead of matches. It can indeed Let us consider two blocks of random4 DNA of length be less costly to emit a couple literals, say 3 literals encoded W each, and let us compress the second block using only using 4 bits each (12 bits total), as opposed to a short match matches to the first one. The DEFLATE algorithm would that has a large offset, say 32000, that would require around 16 compress the second block using also matches within itself bits to be encoded in the worst case. While this effect would be or matches overlapping both blocks, but separating matches interesting to model, we turn our attention to another related between blocks will simplify the analysis. Under another property of the DEFLATE algorithm called lazy matching [7] simplifying assumption that each position has an independent or non-greedy parsing (as analyzed in [9]). This is a strategy chance to match, the probability that a match of length k that improves compression by favoring longer matches. occurs at any given position in the second block is Algorithm 3 (non-greedy parsing [7]). The mLZ of a string 1 W −k+1 −4−k(W −k+1) s is constructed sequentially, by induction, considering the pk = 1 − (1 − ) ≈ 1 − e 4k following cases. If the maximal length of a match at position by a Poisson approximation. The probability that all positions i is l, but the maximal length of a match at position i + 1 is W −k+1 in the second block have a match of length k is pk . For l2 > l, then emit literal s[i] followed by one of the longest the minimum matching length in gzip (k = 3) and its typical match(es) at position i + 1 and advance in s by l2 positions. 15 −225 W −k+1 context size (W = 2 ), pk ≥ 1 − 10 and pk ≥ If not, emit a match at position i if one exists (and advance 1 − (W − k + 1)10−225 ≥ 1 − 10−220 by Bernoulli inequality. by l positions), otherwise emit the literal s[i] and advance by Therefore all positions in the second block have a match, with one position. probability essentially 1. 
For DNA sequences, as we will see next, non-greedy parsing This model hints that a DEFLATE-based algorithm such as turns out to greatly increase the number of literals being gzip could in principle encode the suffix of a random DNA emitted during the compression. In gzip, it is always used 4A random DNA model may appear inapplicable to files containing short except in the three lowest compression modes (flags -1, -2, sequencing reads, but in fact previous studies have established that reads or -3). are difficult to compress [6], in particular with gzip, making them behave similarly to random data. To check this, we extracted 32 KB windows of sequences positions 0, 1 MB and 20 MB of 10 Illumina datasets (from the Re- C. Compression of random DNA under non-greedy parsing sults section) and tested their randomness via compression [13]. All windows except in 2 datasets showed compression ratios above 2.1 bits/character using Keeping our previous compression model of two blocks -9, i.e. above a naive 2-bit conversion and indicating that the files of random DNA, along with the assumption that matches behave similarly to random sequences. The remaining windows in 2 datasets ` compressed to respectively 1.7 and 1.9 bits/character but the corresponding occur independently, let p be the probability of a literal being reads had low GC-content and adapter sequences, respectively. emitted at a given position under non-greedy parsing. Using E`/W = 4% of characters would be encoded in the second block as literals. 1.0 gzip -1 Currently our compression model has only two blocks. gzip -4 Consider an extension to n blocks, still with matches restricted 0.8 gzip default to within a block before the current one. How many literals or gzip -9 copies of literals would make up subsequent blocks? Let Li be 0.6 model percentage of literals or copies of literals at block 1 ≤ i ≤ n. ` Our previous estimations gave L1 = E /W . Suppose that any ` 0.4 subsequent block is compressed using E literals due to non- greedy parsing, and the remaining characters are independently sampled from the previous block. Then we obtain a classic 0.2

B. Non-greedy parsing

Fortunately, gzip attempts to maximize compression ratio and sometimes uses literals instead of matches. It can indeed be less costly to emit a couple of literals, say 3 literals encoded using 4 bits each (12 bits total), as opposed to a short match that has a large offset, say 32000, that would require around 16 bits to be encoded in the worst case. While this effect would be interesting to model, we turn our attention to another related property of the DEFLATE algorithm called lazy matching [7] or non-greedy parsing (as analyzed in [9]). This is a strategy that improves compression by favoring longer matches.

Algorithm 3 (non-greedy parsing [7]). The mLZ of a string s is constructed sequentially, by induction, considering the following cases. If the maximal length of a match at position i is l, but the maximal length of a match at position i+1 is l2 > l, then emit literal s[i] followed by one of the longest match(es) at position i+1 and advance in s by l2 positions. If not, emit a match at position i if one exists (and advance by l positions), otherwise emit the literal s[i] and advance by one position.

For DNA sequences, as we will see next, non-greedy parsing turns out to greatly increase the number of literals being emitted during the compression. In gzip, it is always used except in the three lowest compression modes (flags -1, -2, or -3).
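A sketch of the decision rule of Algorithm 3 follows; the naive quadratic longest_match stands in for gzip's hash-chain matcher, and phrases are printed instead of being Huffman-coded.

```cpp
// Sketch of Algorithm 3 (lazy / non-greedy parsing). longest_match is a naive
// search over the 32 KB window; match lengths follow DEFLATE's 3-258 range.
#include <algorithm>
#include <cstdio>
#include <string>

static const size_t W = 32 * 1024, MIN_MATCH = 3, MAX_MATCH = 258;

// Length of the longest previous-occurrence match starting at position i.
static size_t longest_match(const std::string& s, size_t i) {
    size_t best = 0, lo = (i > W) ? i - W : 0;
    for (size_t j = lo; j < i; ++j) {
        size_t l = 0;
        while (l < MAX_MATCH && i + l < s.size() && s[j + l] == s[i + l]) ++l;
        best = std::max(best, l);
    }
    return (best >= MIN_MATCH) ? best : 0;
}

static void lazy_parse(const std::string& s) {
    for (size_t i = 0; i < s.size();) {
        size_t l = longest_match(s, i);
        size_t l2 = (i + 1 < s.size()) ? longest_match(s, i + 1) : 0;
        if (l2 > l) {                              // a longer match starts just after:
            std::printf("literal '%c'\n", s[i]);   // emit s[i] as a literal,
            std::printf("match len=%zu\n", l2);    // then take the later match
            i += 1 + l2;
        } else if (l > 0) {
            std::printf("match len=%zu\n", l);
            i += l;
        } else {
            std::printf("literal '%c'\n", s[i]);
            i += 1;
        }
    }
}

int main() { lazy_parse("ACGACGTACGTACGT"); }
```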

C. Compression of random DNA under non-greedy parsing

Keeping our previous compression model of two blocks of random DNA, along with the assumption that matches occur independently, let p^ℓ be the probability of a literal being emitted at a given position under non-greedy parsing. Using a decomposition into disjoint events depending on the length of the first match,

    p^ℓ = Σ_{k>=3} p_k (1 - p_{k+1}) p_{k+1}

where p_k (1 - p_{k+1}) is the probability that the current position has a maximal match of length k, and p_{k+1} the probability that a match of length at least (k+1) occurs at the next position. Given an average match length l_a, an experimentally-verified approximation for the number of literals being emitted due to non-greedy parsing is E^ℓ = p^ℓ W / (l_a + 2), guided by the intuition that on average only one position out of l_a + 1 will be available for matching, and non-greedy parsing adds one literal (hence the term l_a + 2). For W = 2^15 and l_a experimentally determined to be 7.6, E^ℓ ≈ 1283. Thus in our model, around E^ℓ/W ≈ 4% of characters would be encoded in the second block as literals.

Currently our compression model has only two blocks. Consider an extension to n blocks, still with matches restricted to within a block before the current one. How many literals or copies of literals would make up subsequent blocks? Let L_i be the percentage of literals or copies of literals at block 1 <= i <= n. Our previous estimations gave L_1 = E^ℓ/W. Suppose that any subsequent block is compressed using E^ℓ literals due to non-greedy parsing, and the remaining characters are independently sampled from the previous block. Then we obtain a simple recurrence:

    L_{i+1} = (E^ℓ + (W - E^ℓ) L_i) / W = L_1 + (1 - L_1) L_i,

whose closed form is L_{i+1} = 1 - (1 - L_1)^{i+1}, since 1 - L_{i+1} = (1 - L_1)(1 - L_i). Therefore, under this simplified non-greedy parsing model of the compression of random DNA, the number of characters that are not copies of literals decreases exponentially.
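A few lines suffice to tabulate this model: with L_1 = 4%, the undetermined fraction 1 - L_i = (1 - L_1)^i becomes negligible (about 0.2%) after roughly 150 blocks, in line with Figure 2 (the block indices chosen below are arbitrary).

```cpp
// Tabulating the Section V-C model: the fraction of block i that still derives
// from the undetermined initial context is 1 - L_i = (1 - L_1)^i, with L_1 = 4%.
#include <cmath>
#include <cstdio>
#include <initializer_list>

int main() {
    const double L1 = 0.04;
    for (int i : {1, 10, 50, 100, 150})
        std::printf("block %3d: undetermined fraction %.4f\n",
                    i, std::pow(1.0 - L1, i));
}
```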

D. Experimental validation

We now refer the reader to the "model" line of the top part of Figure 2, which plots the proportion (1 - L_i) of undetermined characters remaining within a window, using L_1 = 4%. We notice that this non-greedy parsing model fits reasonably well the actual behavior of gzip at the default compression level. As a side note, the maximum compression level (-9) produces matches of higher average offset (o_a = 12755) that also fit our non-greedy model (data not shown).

Fig. 2. Counting undetermined characters in the decompression stream of gzip-compressed files: random DNA (top) and FASTQ-like with random DNA sequences (bottom). Files were decompressed from block 2 starting with an undetermined context. In the first four lines of the legend, the x axis corresponds to indices of non-overlapping windows of average offset length o_a. The y axis corresponds to the percentage of undetermined characters in a window. The "model" line corresponds to our idealized non-greedy parsing model developed in Section V-C: here the x axis corresponds to indices i and the y axis to the values (1 - L_i).

VI. ALGORITHM AND IMPLEMENTATION DETAILS

So far, it appears possible to (i) extract suffixes from a random location in a gzip-compressed FASTQ file, which would lead to (ii) decompressing whole gzip-compressed FASTQ files and possibly even arbitrary files efficiently in parallel. In this section, we describe two key components to achieve these goals: robust detection of DEFLATE block start positions, and a heuristic algorithm that extracts DNA sequences from a decompressed DEFLATE block. Finally, we will describe a general parallel decompression algorithm.

A. Detection of DEFLATE block start positions

DEFLATE blocks that make up a gzip-compressed file are neither indexed nor aligned on byte boundaries. Therefore they can occur at any bit offset. Once a block start position is known, decompressing the block gives the bit position of the next block. Hence, once the position of an initial block is determined, the rest of the file can be decompressed. Our strategy consists in trying to run the decompression routine from every bit position. Each time the decompression of a putative block fails, we backtrack to the next bit that follows the start position of the failed block. From a performance standpoint, it is important to fail early and as quickly as possible. We implemented a set of stringent checks (see Appendix X-A), and used branch probability hints to guide the C++ compiler.

Once a block decompression succeeds, we decompress five more blocks to confirm that we are effectively synced to the DEFLATE stream. A failure to do so would backtrack to the bit after the first decompressed block. Overall this strategy robustly finds the next block start position after a certain bit offset, in around 100-300 milliseconds.
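The scanning strategy can be sketched as follows; decode_block is an injected stand-in for the modified libdeflate block decoder plus the Appendix X-A checks, and the synthetic oracle in main() exists only to keep the example self-contained.

```cpp
// Sketch of the Section VI-A scan: try decoding from every bit offset and keep
// the first offset from which several consecutive putative blocks decode
// cleanly ("five more blocks" in the text).
#include <functional>
#include <iostream>

// Returns the bit offset just past a well-formed block starting at `bit`,
// or -1 if no valid block starts there.
using BlockDecoder = std::function<long long(long long bit)>;

long long find_block_start(long long start_bit, long long last_bit,
                           const BlockDecoder& decode_block,
                           int confirmations = 5) {
    for (long long bit = start_bit; bit < last_bit; ++bit) {
        long long pos = bit;
        int ok = 0;
        while (ok < confirmations && pos < last_bit) {
            long long next = decode_block(pos);
            if (next < 0) break;   // fail fast and resume the scan at bit + 1
            pos = next;            // one block's end is the next block's start
            ++ok;
        }
        if (ok == confirmations) return bit;
    }
    return -1;
}

int main() {
    // Synthetic stand-in: pretend blocks start every 1000 bits from offset 4242.
    BlockDecoder fake = [](long long bit) -> long long {
        return (bit >= 4242 && (bit - 4242) % 1000 == 0) ? bit + 1000 : -1;
    };
    std::cout << find_block_start(5000, 1000000, fake) << "\n";  // prints 5242
}
```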
B. Heuristic determination of sequence-resolved blocks

We provide a prototype software implementation that performs random access to DNA sequences in a compressed FASTQ file: https://github.com/rchikhi/fqgz. The implementation uses a heuristic linear-time procedure to extract DNA-like segments from a decompressed FASTQ file block that still contains undetermined characters. For details, see Appendix X-B.

A gzip-compressed block is said to be sequence-resolved if the parser returns sufficiently many sequences (determined by a fixed threshold), and none of these sequences contain an undetermined character. There may still be undetermined characters elsewhere in the block, i.e. in headers and/or quality values. We note also that our current implementation may, in pathological cases, (1) wrongly determine that a block is sequence-resolved or (2) fail to recognize that a block is sequence-resolved. Note also that subsequent blocks following a sequence-resolved block are not necessarily sequence-resolved.

In practice, none of the above problems are critical, as the implementation will only be used to estimate, in Section VII-A, whether undetermined nucleotides remain in DNA sequences long after a random access is performed using an undetermined context. In its current state, the implementation is not a robust FASTQ parser. However, it is suitable for forensics applications, e.g. when dealing with data corruption in compressed FASTQ files.

As random access appears to be a challenging task, we now turn to a different strategy for whole-file decompression.

C. Full file decompression in two passes (pugz)

To decompress a gzip file in parallel, we propose an exact and general-purpose parallel decompression algorithm. It is based on breaking the compressed file at block boundaries, into n >= 2 roughly equal parts C1, ..., Cn.

So far we have considered that, in a random access setting, the initial context of a block should contain repetitions of the same undetermined character (as per Section IV-B). We propose here a finer technique that consists in representing the initial context using a sequence of unique symbols ŵ = [U0, ..., U32767], allowing to keep track of their propagation when back-referenced by matches in the decompressed stream.

This special context encoding can be leveraged to design a parallel decompression algorithm in two passes. During the first pass, the gzip-compressed file is decompressed in parallel on n >= 2 threads but no character is output. The origin of all back-referenced characters is recorded, and traced back to the initial context of the thread.

Formally, each thread i >= 0 runs decompress(Ci, ŵi), yielding decompressed text Di. Let wi+1 be the last 32 KB characters of Di. The idea is that symbols in wi+1 can be used to resolve the initially unknown context ŵi+1 that was given to thread i+1. In particular, observe that thread 0 produces a suffix w1 that does not contain any undetermined character. But in general, wi+1 may contain undetermined characters for i >= 1. This is why we need a second pass, where for i = 2 ... n, a sequential replacement of all undetermined characters in wi is performed using characters from wi-1. Then, after all the wi for i >= 1 no longer contain any undetermined character, in parallel each thread i >= 1 replaces occurrences of symbol Uj in Di by wi[j] for all 0 <= j <= 32767, yielding the fully decompressed file. See Figure 3 for an illustration.

While this scheme requires holding the whole decompressed stream in memory before translation, it enables the parallel decompression of any kind of data without relying on heuristics or prior knowledge of the file structure (unlike our FASTQ heuristic). The memory requirements can be reduced by processing in parallel only a portion of the file at a time. A software implementation, created by modifying an existing highly-optimized single-threaded software (libdeflate), is available at: https://github.com/Piezoid/pugz.
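A minimal sketch of the two-pass idea, on the toy Phrase representation used earlier: the first pass decodes with placeholder symbols U_0..U_32767 (encoded here as negative integers, an implementation choice made only for illustration; pugz's actual representation may differ), and the second pass substitutes the resolved 32 KB context.

```cpp
// Sketch of the pugz two-pass scheme (Section VI-C). During the first pass the
// initial context is 32768 unique placeholders U_j (stored as -1-j), so
// back-references into the unknown context stay traceable. Once the true
// context w (last 32 KB of the previous thread's output) is known, the second
// pass rewrites every placeholder U_j into w[j].
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Phrase { bool is_literal; char literal; uint32_t offset, length; };
static const uint32_t W = 32 * 1024;

// First pass: decode one part with a fully symbolic initial context.
std::vector<int32_t> decode_with_placeholders(const std::vector<Phrase>& parsing) {
    std::vector<int32_t> buf(W);
    for (size_t j = 0; j < W; ++j) buf[j] = -1 - (int32_t)j;   // U_j
    for (const Phrase& p : parsing) {
        if (p.is_literal) buf.push_back((unsigned char)p.literal);
        else for (uint32_t i = 0; i < p.length; ++i)
            buf.push_back(buf[buf.size() - p.offset]);          // propagates U_j
    }
    return std::vector<int32_t>(buf.begin() + W, buf.end());    // drop the context
}

// Second pass: substitute the resolved context into the symbolic stream.
std::string translate(const std::vector<int32_t>& symbolic, const std::string& w) {
    std::string out;
    for (int32_t v : symbolic)
        out.push_back(v >= 0 ? (char)v : w[(size_t)(-1 - v)]);
    return out;
}

int main() {
    std::vector<Phrase> part = { {false, 0, W, 3}, {true, '!', 0, 0} };
    std::vector<int32_t> symbolic = decode_with_placeholders(part);
    std::string w(W, 'N'); w[0] = 'A'; w[1] = 'C'; w[2] = 'G';   // resolved context
    std::cout << translate(symbolic, w) << "\n";                 // prints "ACG!"
}
```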
VII. EXPERIMENTAL RESULTS

We downloaded 192.8 GB of compressed FASTQ files (100 files) from the European Nucleotide Archive. The first file of each of the latest experiments posted in ftp://ftp.sra.ebi.ac.uk/vol1/ as of April 2018 was selected, excluding files below 200 MB. We also excluded special blocked/multi-part gzip files, as they are currently handled by neither libdeflate nor pugz, but this is only due to an unimplemented special case in DEFLATE decompression. The list of files is available in our Github repository. Experiments were run on a 2x12-core 2.3 GHz Xeon E5-2670 v3 machine with 512 GB of RAM and NAS storage.

A. Success rate of decompressing at random location

We investigated whether one could retrieve all sequences after decompressing at a random location in a compressed FASTQ file. This is a slightly easier flavor of Problem 1, as it focuses only on the DNA sequences inside a FASTQ file, disregarding headers and quality values. We first noticed that the compression level of a file plays a critical role. Therefore we partitioned our dataset into three compression levels, according to the compression level reported for each file: lowest (likely the result of gzip -1), normal (any parameter between -2 and -8, usually -6), and highest (gzip -9). Note that other gzip-compatible compressors may report a compression level that does not match the performance of gzip.


Fig. 3. Two-pass decompression (pugz). The first pass decompresses each part of the gzip file in parallel with initially undetermined windows ŵ containing unique symbols, for tracking back-references. The second pass resolves these back-references with the initial context obtained from the previous part in the decompressed stream.

Table I summarizes the results. At the lowest compression level, virtually no sequence after a sequence-resolved block contains undetermined characters. Only in two files (ERA966074 and ERA990245), a dozen sequences out of respectively 11 and 27 million contain undetermined nucleotides. Therefore, one can perform virtually exact random accesses to low-compression files, requiring only around 52 MB of decompression to 'prime' the context.

TABLE I
COMPRESSION LEVEL, NUMBER AND SIZE OF FILES IN OUR DATASET.

Compress. level | Number of files | Total size (GB) | Delay to sequence-resolved block (MB) | Unambiguous sequences (%)
Lowest          | 26              | 53.8            | 52.4 ± 55.8                           | 100.0 ± 0.0
Normal          | 68              | 111.8           | 387.5 ± 731.6                         | 72.5 ± 37.6
Highest         | 6               | 27.2            | 1,292.6 ± 1,531.9                     | 36.8 ± 45.2
Total           | 100             | 192.8           | 317.8 ± 703.7                         | 77.5 ± 36.5

We performed random access decompression of sequences as per Section VI-B at 4 different locations in each file: 1/4, 1/3, 1/2, and 2/3 of the total file size, each time until the end of the file. "Delay to sequence-resolved block" reports the average number of bytes decompressed until a sequence-resolved block is found. "Unambiguous sequences" gives the percentage of sequences without any undetermined character returned by the heuristic parser after the first sequence-resolved block.

However, at the normal and highest compression levels, only respectively 72% and 37% of sequences on average are fully determined after a sequence-resolved block. This is likely due to back-references that occur between DNA sequences and quality sequences or headers, which can also harbor DNA characters. At the normal compression level, in 48% of the files (data not shown), nearly all returned sequences (99.9-100%) are unambiguous. For the rest of the files, either no sequence-resolved block is found or a variable fraction of sequences contain undetermined characters.

B. Propagation of initial contexts

To further understand why some FASTQ files lend themselves to random access decompression and some do not, we instrumented our implementation to track how far characters from the initial undetermined context travel along matches. We also compared the undetermined context with the corresponding actual context and annotated each character by type: DNA, quality value, sequence header, or quality header (usually just the '+' character). Figure 4 shows two instances of FASTQ files being decompressed from a random location.

We observe that in the top plot (normal compression), none of the initial context characters that encode DNA sequences remain in matches after around position 2^21 in the decoded stream, but a small amount of quality values do. Some header characters remain until the end of the file. In other files with normal compression levels, none of the quality values from the initial context remain until the end of the file (data not shown). However, in the bottom plot (highest compression), parts of the DNA sequences remain in matches until the end of the file. This is likely due to gzip trying harder at finding long/far matches instead of outputting literals.

Fig. 4. Number and type of characters copied from an initial context during decompression of two gzip-compressed FASTQ files, starting from a location inside the file. Characters are counted in sliding windows of size 32 KB. The analyzed files are (top) DRC_BKV_01_R1.fastq.gz (normal compression) from ERA972077, decompressed at offset 160 MB, and (bottom) CNC_CasAF3_CRI1strepeat_rep1_R1.fastq.gz (highest compression) from ERA987833, decompressed at offset 210 MB.

C. Parallel decompression speed

We performed parallel decompression of three FASTQ files (of sizes 3-7.5 GB) at normal compression level. The files were preloaded into system memory to avoid IO bottlenecks: the purpose of this benchmark is to compare pure decompression speeds of gunzip, libdeflate, and pugz. We ran pugz using 32 threads; gunzip and libdeflate cannot be multi-threaded.

We further observed that synchronizing outputs between threads, or piping to wc, degrades performance (10-20%). Therefore we redirected all outputs to /dev/null. For pugz, we allowed each thread to write to the output without synchronization, to mimic the behavior of a FASTQ parser (as in some applications, the order of the reads is irrelevant).

Table II shows that pugz is 16.5x faster than gunzip and 5.2x faster than libdeflate. Figure 5 shows parallel scaling performance.

TABLE II
DECOMPRESSION SPEEDS (MEGABYTES OF COMPRESSED DATA PER SECOND) FOR SEQUENTIAL AND PARALLEL GZIP-COMPATIBLE SOFTWARE.

Method       | gunzip | libdeflate | pugz, 32 threads
Speed (MB/s) | 37     | 118        | 611

The 3 first FASTQ files of experiments ERA970963, ERA973411 and ERA981545 were preloaded into memory and decompressed three times each. The average wall-clock time over compressed file size was recorded.

VIII. DISCUSSION

We developed the first parallel decompression algorithm for gzip-compressed ASCII files. Note that the current implementation requires the whole decompressed file to reside in memory, yet further engineering efforts could lift this limitation with little projected impact on performance. It also does not compute the cyclic redundancy check (CRC32), nor handle multi-part gzip files. Furthermore, it may be possible to decompress non-ASCII (binary) files, but this would require more careful determination of block boundaries.

We also demonstrated for the first time the decompression of sequences at random locations in gzip-compressed FASTQ files. The method is near-exact at low compression levels, but often a large fraction of sequences contains undetermined characters at higher compression levels. This is due to long chains of back-references that trace back to the initial 32 KB context containing undetermined characters. It did not escape our attention that guessing those undetermined characters could be possible, but we did not yet explore this direction.


Fig. 5. Decompression speed of pugz when executed with 2–32 threads compared to gzip, libdeflate, and the command cat as an upper bound. Lines show mean bandwidth while error bars represent standard deviation.

REFERENCES

[1] Mohamed S Abdelfattah, Andrei Hagiescu, and Deshanand Singh. Gzip on a chip: High performance lossless data compression on FPGAs using OpenCL. In Proceedings of the International Workshop on OpenCL 2013 & 2014, page 4. ACM, 2014.
[2] Gaëtan Benoit, Claire Lemaitre, Dominique Lavenier, Erwan Drezen, Thibault Dayris, Raluca Uricaru, and Guillaume Rizk. Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC Bioinformatics, 16(1):288, 2015.
[3] Ralf D Brown. Reconstructing corrupt deflated files. Digital Investigation, 8:S125-S131, 2011.
[4] Ralf D Brown. Improved recovery and reconstruction of deflated files. Digital Investigation, 10:S21-S29, 2013.
[5] Michael Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Report 124, Digital Systems Research Center, Palo Alto, CA, USA, May 1994.
[6] Sebastian Deorowicz and Szymon Grabowski. Data compression for sequencing data. Algorithms for Molecular Biology, 8(1):25, 2013.
[7] L. Peter Deutsch. DEFLATE compressed data format specification version 1.3. RFC 1951, May 1996.
[8] L. Peter Deutsch and Jean-Loup Gailly. ZLIB compressed data format specification version 3.3. RFC 1950, May 1996.
[9] R. Nigel Horspool. The effect of non-greedy parsing in Ziv-Lempel compression methods. In Data Compression Conference, 1995. DCC '95. Proceedings, pages 302-311. IEEE, 1995.
[10] Sebastian Kreft and Gonzalo Navarro. LZ77-like compression with fast random access. In Proceedings of the 2010 Data Compression Conference, DCC '10, pages 239-248, 2010.
[11] Heng Li. Random access to zlib-compressed files, 2014. URL: http://lh3.github.io/2014/07/05/random-access-to-zlib-compressed-files.
[12] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. The sequence alignment/map format and SAMtools. Bioinformatics, 25(16):2078-2079, 2009.
[13] B Ya Ryabko and VA Monarev. Using information theory approach to randomness testing. Journal of Statistical Planning and Inference, 133(1):95-110, 2005.
[14] Kunihiko Sadakane and Hiroshi Imai. Improving the speed of LZ77 compression by hashing and suffix sorting. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 83(12):2689-2698, 2000.
[15] Evangelia Sitaridi, Rene Mueller, Tim Kaldewey, Guy Lohman, and Kenneth A Ross. Massively-parallel lossless data decompression. In Parallel Processing (ICPP), 2016 45th International Conference on, pages 242-247. IEEE, 2016.
[16] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337-343, 1977.

IX. ACKNOWLEDGEMENTS

This work was supported by an Inria ADT grant (SeedLib) and the INCEPTION project (PIA/ANR-16-CONV-0005).

X. APPENDIX

A. DEFLATE block decompression checks

• The first bit of the block needs to be 0, indicating that it is not the last block in the stream. This implies that we will never seek to the very last block.
• The next two bits encode the block type; there are only three valid types, and the remaining 2-bit code is invalid.
• The next bits may contain a dynamic Huffman tree, whose encoding may be invalid in different ways.
• A decompressed block may only contain valid ASCII characters.
• Back-references must not have an offset exceeding the context size (32 KiB).
• A decompressed block should be larger than 1 KiB and smaller than 4 MiB.

B. Heuristic extraction of DNA sequences in a DEFLATE block

Given a decompressed block, the procedure returns all maximal non-overlapping substrings that match the following grammar: T D+ (U+ D+)* T, where T is a newline or undetermined character, D is a nucleotide (A, C, T, G, N), U is an undetermined character, and the notations X+ (resp. X*) classically denote a non-empty (resp. possibly empty) sequence of consecutive X's. The leading and trailing T characters of each sequence are removed from the results, but are needed to filter out parts of quality sequences that look like DNA. In the implementation, matches shorter than a minimum read length are discarded, and a special case is needed to handle sequences that span two blocks.

Some of the false positive sequences returned by the parser can be trivially detected and removed on files where reads all have the same length. However, due to possible matches to a DNA sequence from quality sequences or even headers (both of which often contain undetermined characters), a sequence may have undetermined characters even after a sequence-resolved block.
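One possible rendering of this grammar as a linear scan is sketched below; representing undetermined characters by '?' and the min_len threshold of 10 are illustrative assumptions, and the actual parser is the one in the fqgz prototype.

```cpp
// Sketch of the Appendix X-B extraction: return maximal substrings of a
// decompressed block matching T D+ (U+ D+)* T, where T is a newline or an
// undetermined character ('?'), D a nucleotide (A,C,G,T,N) and U an
// undetermined character. The surrounding T characters are required but
// stripped; min_len is the minimum-read-length filter mentioned in the text.
#include <iostream>
#include <string>
#include <vector>

static bool isT(char c) { return c == '\n' || c == '?'; }
static bool isD(char c) { return c=='A'||c=='C'||c=='G'||c=='T'||c=='N'; }
static bool isU(char c) { return c == '?'; }

std::vector<std::string> extract_dna_like(const std::string& s, size_t min_len = 10) {
    std::vector<std::string> out;
    for (size_t i = 0; i + 1 < s.size(); ++i) {
        if (!isT(s[i]) || !isD(s[i + 1])) continue;
        size_t j = i + 1;
        while (j < s.size() && isD(s[j])) ++j;          // D+
        while (j < s.size() && isU(s[j])) {             // (U+ D+)*
            size_t k = j;
            while (k < s.size() && isU(s[k])) ++k;
            if (k < s.size() && isD(s[k])) {
                j = k;
                while (j < s.size() && isD(s[j])) ++j;
            } else break;
        }
        if (j < s.size() && isT(s[j]) && j - (i + 1) >= min_len)
            out.push_back(s.substr(i + 1, j - i - 1));  // strip leading/trailing T
        i = j - 1;                                      // continue after this run
    }
    return out;
}

int main() {
    std::string block = "@read1\nACGT??ACGTACGTAC\n+\nIIIIIIIIIIIIIIII\n";
    for (const std::string& seq : extract_dna_like(block))
        std::cout << seq << "\n";   // prints ACGT??ACGTACGTAC
}
```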