
Contrasting the Performance of Compression Algorithms on Genomic Data

Cornel Constantinescu, IBM Research Almaden

Outline of the Talk:

• Introduction / Motivation
• Data used in experiments
• General purpose compressors comparison
• Simple Improvements
• Special purpose compression
• Transparent compression – working on compressed data (prototype)
• Parallelism / Multithreading
• Conclusion

Introduction / Motivation

• Despite the large number of research papers and compression algorithms proposed for compressing genomic data generated by sequencing machines, by far the most commonly used compression algorithm in the industry for FASTQ data is gzip.

• The main drawbacks of the proposed alternative special-purpose compression algorithms are:
  • slow speed of either compression or decompression or both, and also their
  • brittleness, by making various limiting assumptions about the input FASTQ format (for example, the structure of the headers or fixed lengths of the records [1]) in order to further improve their specialized compression.

1. Ibrahim Numanagic, James K Bonfield, Faraz Hach, Jan Voges, Jörn Ostermann, Claudio Alberti, Marco Mattavelli, and S Cenk Sahinalp. Comparison of high-throughput sequencing data compression tools. Nature Methods, 13(12):1005–1008, October 2016.

2 General Purpose Compression of Genomic Data

As stated earlier, gzip/zlib compression is the method of choice by the industry for FASTQ genomic data. FASTQ is a text-based format (ASCII readable text) for storing a biological sequence and the corresponding quality scores. Each sequence letter and quality score is encoded with a single ASCII character. FASTQ data is structured in four fields per record (a “read”). The first field is the SEQUENCE ID, or the header of the read. The second field contains the sequence data (consisting of the four nucleotides A, C, G, T and possibly N); the third field consists of a ’+’ character, optionally followed by the same data as in the header. The fourth field consists of the quality values associated with the nucleotides in the sequence data field. Each quality value is represented as an ASCII character and there are exactly as many quality values as nucleotides (the sizes of the second and the fourth field in each read are equal). For more details on the FASTQ data format please see [7].
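To make the four-field record structure concrete, here is a minimal parsing sketch (illustrative only, not the parser used in our algorithms; the file handling and names are assumptions) that reads one record per iteration and checks that the sequence and quality fields have equal length:

    #include <fstream>
    #include <iostream>
    #include <string>

    // One FASTQ record ("read"): four newline-terminated fields.
    struct FastqRecord {
        std::string header;    // field 1: '@' + SEQUENCE ID / description
        std::string sequence;  // field 2: A, C, G, T and possibly N
        std::string plus;      // field 3: '+', optionally followed by the header again
        std::string quality;   // field 4: one ASCII quality character per nucleotide
    };

    // Reads the next record; returns false at end of file or on a malformed record.
    bool readRecord(std::istream& in, FastqRecord& r) {
        if (!std::getline(in, r.header))   return false;
        if (!std::getline(in, r.sequence)) return false;
        if (!std::getline(in, r.plus))     return false;
        if (!std::getline(in, r.quality))  return false;
        // Format invariants: header starts with '@', separator with '+',
        // and fields 2 and 4 have the same length.
        return !r.header.empty() && r.header[0] == '@' &&
               !r.plus.empty()   && r.plus[0] == '+' &&
               r.sequence.size() == r.quality.size();
    }

    int main(int argc, char** argv) {
        if (argc < 2) return 1;
        std::ifstream in(argv[1]);
        FastqRecord rec;
        long long reads = 0, bases = 0;
        while (readRecord(in, rec)) { ++reads; bases += rec.sequence.size(); }
        std::cout << reads << " reads, " << bases << " bases\n";
        return 0;
    }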

FASTQ/FASTA Data          Source (ftp://ftp.ncbi.nih.gov/1000genomes/ftp/)                   Raw Size
ERR317482_1.fastq         https://www.ebi.ac.uk/ena/data/view/ERR317482                      6.8 GB
ERR317482_2.fastq         https://www.ebi.ac.uk/ena/data/view/ERR317482                      6.8 GB
ERR034549_1.filt.fastq    data/NA12777/sequence_read/ERR034549_1.filt.fastq.gz               17 GB
ERR034549_2.filt.fastq    data/NA12777/sequence_read/ERR034549_2.filt.fastq.gz               17 GB
ERR125594_1.filt.fastq    data/NA21122/sequence_read/ERR125594_1.filt.fastq.gz               13 GB
ERR125594_2.filt.fastq    data/NA21122/sequence_read/ERR125594_2.filt.fastq.gz               13 GB
SRR099453_1.filt.fastq    data/NA11920/sequence_read/SRR099453_1.filt.fastq.gz               22 GB
SRR099453_2.filt.fastq    data/NA11920/sequence_read/SRR099453_2.filt.fastq.gz               22 GB
SRR764769_1.filt.fastq    data/NA21122/sequence_read/SRR764769_1.filt.fastq.gz               8.6 GB
SRR764769_2.filt.fastq    data/NA21122/sequence_read/SRR764769_2.filt.fastq.gz               8.6 GB
human_g1k_v37.fasta       ftp://ftp.broadinstitute.org/bundle/b37/human_g1k_v37.fasta.gz     3.0 GB

Table 1: Selection of FASTQ data used in this paper: file names, URLs and sizes.

To get a sense of how compressible FASTQ genomic data is, we started by running various popular general purpose compressors on FASTQ data publicly available from the 1000 Genomes Project.

Figure 1: Compression ratios of popular general purpose compressors on FASTQ data (smaller is better).

The FASTQ data samples (names, URLs and sizes) used in the experiments reported in this paper are described in Table 1. Figures 1, 2 and 3 show the compression and decompression speeds¹ in MB/s and the compression ratios (as compressed size / original size) on the 6.8 GB FASTQ file ERR317482_1.fastq.

¹ Throughout this paper we refer to MB or GB with regard to powers of 2, i.e. 1024² bytes or 1024³ bytes, respectively.


Figure 2: Compression speeds of popular general purpose compressors on FASTQ data.

Figure 3: Decompression speeds of popular general purpose compressors on FASTQ data.

The compression level used for each algorithm is given in parentheses in the charts; a lower value usually favors speed at the cost of compression, while a higher value favors higher compression ratios at the cost of speed. The charts show that zstd with compression level 1 achieves very high compression and decompression speeds while still providing excellent compression ratios, comparable to those of gzip with higher compression levels. So we decided to use zstd inside a better modeling scheme for FASTQ data. We present these improvements in Section 4. Recently, the Broad Institute consortium and Intel added some improvements to the zlib implementation targeting genomic compression for the latest version of the GATK genome analysis toolkit [2] (widely used to analyse NGS data). We refer to this version of zlib as IPP zlib (Intel Performance Primitives zlib) in this paper.
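For reference, the sketch below shows how a buffer can be compressed with the standard zstd C API at a chosen level and how the compression ratio used in the charts (compressed size / original size) is computed. It is a minimal illustration on an assumed toy input, not the benchmarking harness used for the figures:

    #include <cstdio>
    #include <vector>
    #include <zstd.h>

    // Compress `src` with zstd at the given level and return the compression
    // ratio (compressed size / original size), the metric shown in the charts.
    static double zstdRatio(const std::vector<char>& src, int level) {
        std::vector<char> dst(ZSTD_compressBound(src.size()));
        size_t csize = ZSTD_compress(dst.data(), dst.size(),
                                     src.data(), src.size(), level);
        if (ZSTD_isError(csize)) {
            std::fprintf(stderr, "zstd error: %s\n", ZSTD_getErrorName(csize));
            return -1.0;
        }
        return static_cast<double>(csize) / static_cast<double>(src.size());
    }

    int main() {
        // Illustrative input only; the paper measures multi-GB FASTQ files.
        std::vector<char> block(64 * 1024, 'A');
        std::printf("level 1 ratio: %.3f\n", zstdRatio(block, 1));
        std::printf("level 9 ratio: %.3f\n", zstdRatio(block, 9));
        return 0;
    }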


Modeling Improvement:

• Arrange the data before compression so that components of the same type, taken across reads, are grouped together: all headers grouped, all sequences grouped, and so on.
• Each component, or group of components (e.g. sequences and quality values), can be compressed with the same algorithm, as we did in our alphaq algorithm, or with a specific coder better suited to or faster on that data type (e.g. Lempel-Ziv for headers and a Huffman or arithmetic encoder for sequences and quality values).

4 Simple improvements of FASTQ genomic data compression

Figure 6 shows a block (of say 64KB) of FASTQ data that can be compressed directly (row-wise, read after read) or, better, compressed column-wise, where a “column” is made up of the same field taken across reads. The fields in a FASTQ read are in fact lines terminated by a newline character, so the fields are straightforward to parse.

[Figure 6 diagram: a block of reads, each consisting of <SEQUENCE ID>, <sequence>, <+ [SEQUENCE ID, description]> and <quality values>, shown both as a direct coded block and as a columnar coded block.]

Figure 6: Columnar compression of a block of FASTQ data file as performed in alpha{q,e,h}.

Each column can be compressed with the same algorithm, as we did in our alphaq algorithm using zstd to compress each column, or different compression algorithms, better suited to or faster on the specific column data type, can be used for each column. In Section 3 we showed that the entropy coders used by zstd, zstd-huf and zstd-fse perform well and are fast on the sequences and quality values fields. In our alphah and alphae algorithms we use zstd to compress columns 1 and 3 of the FASTQ format and zstd-huf (or zstd-fse) to compress columns 2 and 4 (sequences and quality values), respectively. Figures 7, 8 and 9 show comparisons of compression ratios, compression speeds and decompression speeds using zlib, zstd and lz4 (direct coding of the blocks) versus the block columnar methods alphaq, alphae and alphah on a wide set of real FASTQ data. Note that all the coders (direct and columnar) in Figures 7, 8 and 9 perform independent compression of small blocks (partitions) of 64KB maximum. We note that alphah is the fastest FASTQ compression/decompression algorithm and its compression ratio is clearly better than those of zlib, zstd and lz4, while still comparable to alphae, which achieves slightly better compression ratios at the cost of speed.
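The following is a minimal sketch of this columnar arrangement (an illustration under simple assumptions: records are well formed, and plain ZSTD_compress stands in for the zstd-huf / zstd-fse entropy stages used in alphah and alphae, whose internals are not shown here):

    #include <sstream>
    #include <string>
    #include <vector>
    #include <zstd.h>

    // Split a block of FASTQ text into four "columns": headers, sequences,
    // '+' lines and quality values, each concatenated across the reads.
    static void splitColumns(const std::string& block, std::string cols[4]) {
        std::istringstream in(block);
        std::string line;
        for (int field = 0; std::getline(in, line); field = (field + 1) % 4) {
            cols[field] += line;
            cols[field] += '\n';
        }
    }

    // Compress each column independently; returns the total compressed size.
    // In alpha{e,h}, columns 1 and 3 go through zstd and columns 2 and 4 through
    // its Huffman/FSE entropy stages; here every column uses ZSTD_compress.
    static size_t compressColumns(const std::string cols[4], int level,
                                  std::vector<std::string>& out) {
        size_t total = 0;
        out.assign(4, std::string());
        for (int i = 0; i < 4; ++i) {
            out[i].resize(ZSTD_compressBound(cols[i].size()));
            size_t csize = ZSTD_compress(&out[i][0], out[i].size(),
                                         cols[i].data(), cols[i].size(), level);
            if (ZSTD_isError(csize)) return 0;
            out[i].resize(csize);
            total += csize;
        }
        return total;
    }

    int main() {
        const std::string block =
            "@read1\nACGTN\n+\n!!!!!\n"
            "@read2\nGGCCA\n+\n#####\n";   // tiny illustrative block, not real data
        std::string cols[4];
        splitColumns(block, cols);
        std::vector<std::string> compressed;
        return compressColumns(cols, 1, compressed) == 0 ? 1 : 0;
    }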

Figure 7: Compression ratios of alpha{q,e,h} and current general purpose alternatives.

[Chart: Compression Ratios [%] For A Selection of FASTQ Files (the files in Table 1) – gzip(-9), gzip(-1), alphae, alphah]

[Chart: Compression Speeds [MB/s] For A Selection of FASTQ Files – alphae, alphah, gzip(-9), gzip(-1)]

[Chart: Decompression Speeds [MB/s] For A Selection of FASTQ Files – alphae, alphah, gzip(-9), gzip(-1)]

Fast and Efficient Compression of Next Generation Sequencing Data

Cornel Constantinescu and Gero Schmidt
IBM Almaden Research Center, San Jose, California, USA

Despite the large number of research papers and compression algorithms proposed for compressing FASTQ genomic data generated by sequencing machines, by far the most commonly used compression algorithm in the industry for FASTQ data is gzip. The main drawback of the proposed alternative special-purpose compression algorithms is the slow speed of either compression or decompression or both, and also their brittleness, by making various limiting assumptions about the input FASTQ format (for example, the structure of the headers or fixed lengths of the records [1]) in order to further improve their specialized compression. In this paper we propose simple, effective and very fast compression/decompression methods for FASTQ genomic data using general purpose compression components in a columnar compression model, robust to possible changes in the FASTQ format specification. The compression speed as observed for ERR317482_1 FASTQ data (≈7 GB) in our tests is about 40X better than default gzip and 4.5X-7.4X faster than the latest, best proposals like DSRC v2 and slimfastq as featured in [1]. Decompression speeds are 4X-5X faster than gzip and 5X-17X faster than the specialized FASTQ compressors DSRC and slimfastq, a more robust re-implementation of Fqzcomp/Fastqz. Compression ratios are 5.6%-11.4% better than the gzip flavors and just 6.4%-9.7% worse than the best, more complex coders, as reported in [1].

Comparison with specialized FASTQ compressors:

Figure 1: Compression ratio and speeds of alpha{h,e,q}, DSRC, slimfastq and gzip.

These new and robust compression methods that we propose in this paper offer promising alternatives for the industry to better adapt to the very fast increasing amount of FASTQ genomic data prompted by personalized genomic medicine.

[1] Ibrahim Numanagic, James K Bonfield, Faraz Hach, Jan Voges, Jörn Ostermann, Claudio Alberti, Marco Mattavelli, and S Cenk Sahinalp. Comparison of high-throughput sequencing data compression tools. Nature Methods, 13(12):1005–1008, October 2016.

Compressed Data Layout in Spectrum Scale (GPFS):

[Diagram: original data groups vs. compressed groups on disk; legend: header, compressed partition, ZNULL block, wasted space.]

Compression Group:
• The file is divided into fixed-size “compression groups” (for example, 5 blocks per group).
• Each group is compressed into a smaller set of blocks (--> sparse).
• A header in each group maps original to compressed data locations.

Partitions – units of compression for fast random access:
• Original data blocks are partitioned into chunks, named partitions (32-64KB each).
• Partitions are independently compressed to ensure fast random access.

- Genomic compression is not available in the IBM Spectrum Scale product.
- The benchmarks were obtained on a prototype.
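As a rough illustration of the header-to-partition mapping described above (a hypothetical sketch only; the actual IBM Spectrum Scale on-disk structures are not public, and the field names below are invented):

    #include <cstdint>
    #include <vector>

    // Hypothetical metadata for one partition inside a compression group.
    struct PartitionEntry {
        uint32_t originalOffset;    // start of this 32-64KB partition in the original group
        uint32_t compressedOffset;  // where its compressed bytes begin in the group
        uint32_t compressedLength;  // 0 could mark an incompressible partition stored raw
    };

    // Hypothetical header for one compression group (e.g. 5 file blocks).
    struct CompressionGroupHeader {
        uint32_t blocksInGroup;                  // e.g. 5 original blocks per group
        uint32_t compressedBlocksUsed;           // smaller set of blocks written; the rest stay sparse
        std::vector<PartitionEntry> partitions;  // map from original to compressed locations
    };

    // Random access: find the partition holding `offsetInGroup`, so only that
    // single partition needs to be decompressed to serve a read.
    inline const PartitionEntry* findPartition(const CompressionGroupHeader& h,
                                               uint32_t offsetInGroup) {
        const PartitionEntry* found = nullptr;
        for (const PartitionEntry& p : h.partitions) {
            if (p.originalOffset <= offsetInGroup) found = &p;  // entries sorted by offset
            else break;
        }
        return found;
    }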

[Chart: Compression Ratios (%) in GPFS vs. offline for ERR317482_1 & 2 (6.8 GB) – alphae, alphah, gzip(-1); series: ERR_1_GPFS, ERR_2_GPFS, ERR_1_OFFLINE, ERR_2_OFFLINE]

[Chart: Read Speed (MB/s) in Spectrum Scale for ERR317482_1 and ERR317482_2 – alphae, alphah, gzip(-1)]

[Chart: Compression Speed (MB/s) in GPFS on ERR317482_1 & 2 – alphae, alphah, gzip(-1)]

Advantages of transparent compression of genomic data:

• Simplifies the user’s interaction, allowing them to concentrate on biological research rather than dealing with manual compression and decompression of genomic data.

• Eases the challenge of capacity management of genomic data files for IT infrastructure teams in the genomic sequencing domain.

• Researchers and/or genomic analysis pipelines can work on the files as if they were uncompressed and do not need to care whether a file is actually compressed or not.

• The infrastructure team can compress the file independently and transparently in the background at any time to save storage space without interfering with the work of the user / researcher or genomic pipeline.

Multithreaded Compression with Alpha{e,h} vs. Pigz:
• The parallel implementation of Alpha{e,h} is done using OpenMP in C/C++ (see the sketch below).
• Pigz is open source, implemented by the gzip developers using Pthreads in C.
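A minimal sketch of the OpenMP approach (an illustration under assumptions: independent 64KB partitions as in the earlier sections, with plain ZSTD_compress standing in for the alpha{e,h} columnar coder):

    #include <algorithm>
    #include <omp.h>
    #include <string>
    #include <vector>
    #include <zstd.h>

    // Compress independent partitions of `input` in parallel. Because each
    // partition is self-contained (as in alpha{e,h} and in the GPFS prototype),
    // the loop iterations share no state and parallelize trivially.
    std::vector<std::string> compressParallel(const std::string& input,
                                              size_t partitionSize = 64 * 1024,
                                              int level = 1) {
        const size_t nParts = (input.size() + partitionSize - 1) / partitionSize;
        std::vector<std::string> out(nParts);

        #pragma omp parallel for schedule(dynamic)
        for (long long i = 0; i < static_cast<long long>(nParts); ++i) {
            const size_t off = static_cast<size_t>(i) * partitionSize;
            const size_t len = std::min(partitionSize, input.size() - off);
            std::string& dst = out[static_cast<size_t>(i)];
            dst.resize(ZSTD_compressBound(len));
            size_t csize = ZSTD_compress(&dst[0], dst.size(),
                                         input.data() + off, len, level);
            dst.resize(ZSTD_isError(csize) ? 0 : csize);  // empty marks an error
        }
        return out;
    }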

[Chart: Compression Speed (MB/s) on ERR034549_1.fq (17GB) vs. number of threads (1-96) – pigz and alphah]

- Using one thread, Alphah is about as fast as pigz using 96 threads! (Intel Xeon Platinum 8160, 2.1 GHz, 48 cores / 96 CPUs)

[Chart: Multithreaded Compression Speed (MB/s) on ERR31_1.fq (7GB) of three algorithms – alphae, alphah, pigz – at 1, 4, 8 and 16 threads]

[Chart: Multithreaded Decompression Speed (MB/s) of three algorithms (max. 16 threads) – alphae, alphah, pigz – at 1, 4, 8 and 16 threads]

• Compression ratios are the same as for sequential execution:

FASTQ Data File    Original (MB)   Pigz (MB)   Alphah (MB)   Alphae (MB)   % better (alphae vs pigz)
ERR317482_1.fq     6,949           2,641       2,204         2,138         24%
ERR034549_1.fq     16,603          5,766       5,145         4,993         15%

Multithreaded FASTA Compression vs. Pigz:

[Chart: Compression Speed (MB/s) on ERR317482_1.seq (2.7GB) vs. number of threads (1-16) – kmc_ac, kmc_hc, pigz]

FASTA Data         Original (MB)   Pigz (MB)   Kmc_ac (MB)   Kmc_hc (MB)   % better (kmc_ac vs pigz)
ERR317482_1.seq    2,719           802         692           751           16%
ERR317482_2.seq    2,719           807         692           751           17%

(*) kmc stands for k-mer compression.
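The internals of kmc are not described here, but the headroom over gzip on sequence-only (FASTA) data is easy to see from the alphabet size alone: the sketch below (an illustration only, not the kmc algorithm) packs A/C/G/T into 2 bits each, i.e. 25% of the original size before any entropy coding; a real coder must also handle N's, headers and line breaks.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Pack A/C/G/T into 2 bits per base (4 bases per byte). Purely illustrative.
    std::vector<uint8_t> packBases(const std::string& seq) {
        auto code = [](char c) -> uint8_t {
            switch (c) {
                case 'A': return 0; case 'C': return 1;
                case 'G': return 2; case 'T': return 3;
                default:  return 0;   // placeholder; a real coder escapes other symbols
            }
        };
        std::vector<uint8_t> out((seq.size() + 3) / 4, 0);
        for (size_t i = 0; i < seq.size(); ++i)
            out[i / 4] |= static_cast<uint8_t>(code(seq[i]) << ((i % 4) * 2));
        return out;
    }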

Summary:

• Better and faster compression methods for genomic data (than gzip) exist.

• It would be good for FASTQ data generated by sequencing machines to be compressed with these methods (e.g. alphae/h).

• Adding transparent genomic compression to a file system can alleviate many problems for users and IT personnel, with no need for standardization.