LW-Fqzip 2: a Parallelized Reference-Based Compression of FASTQ Files Zhi-An Huang1†, Zhenkun Wen1†, Qingjin Deng1, Ying Chu1, Yiwen Sun2 and Zexuan Zhu1*

Huang et al. BMC Bioinformatics (2017) 18:179 DOI 10.1186/s12859-017-1588-x

SOFTWARE Open Access LW-FQZip 2: a parallelized reference-based compression of FASTQ files Zhi-An Huang1†, Zhenkun Wen1†, Qingjin Deng1, Ying Chu1, Yiwen Sun2 and Zexuan Zhu1*

Abstract Background: The rapid progress of high-throughput DNA sequencing techniques has dramatically reduced the costs of whole genome sequencing, which leads to revolutionary advances in gene industry. The explosively increasing volume of raw data outpaces the decreasing disk cost and the storage of huge sequencing data has become a bottleneck of downstream analyses. Data compression is considered as a solution to reduce the dependency on storage. Efficient sequencing data compression methods are highly demanded. Results: In this article, we present a lossless reference-based compression method namely LW-FQZip 2 targeted at FASTQ files. LW-FQZip 2 is improved from LW-FQZip 1 by introducing more efficient coding scheme and parallelism. Particularly, LW-FQZip 2 is equipped with a light-weight mapping model, bitwise prediction by partial matching model, arithmetic coding, and multi-threading parallelism. LW-FQZip 2 is evaluated on both short-read and long-read data generated from various sequencing platforms. The experimental results show that LW-FQZip 2 is able to obtain promising compression ratios at reasonable time and memory space costs. Conclusions: The competence enables LW-FQZip 2 to serve as a candidate tool for archival or space-sensitive applications of high-throughput DNA sequencing data. LW-FQZip 2 is freely available at http://csse.szu.edu.cn/staff/ zhuzx/LWFQZip2 and https://github.com/Zhuzxlab/LW-FQZip2. Keywords: High-throughput sequencing, Sequencing data compression, Reference- based compression, Sequence alignment

Background compression methods oriented to high-throughput DNA The rapid progress of high-throughput DNA sequencing sequencing data are highly demanded. techniques has dramatically reduced the costs of whole Many specialized compression methods have been pro- genome sequencing, which leads to revolutionary ad- posed for sequencing data in raw FASTQ format [7–11], vances in gene industry [1, 2]. Genome studies have pro- sequencing reads (DNA nucleotides only) [12–15] and/or duced tremendous volume of data that poses great aligned SAM/BAM format [16]. Depending on whether challenges to storage and transfer. Data compression extra reference genomes are required, these compression becomes necessary as a silver-bullet solution to ease the methods normally can be classified into reference-based dilemma [3–6]. Nevertheless, the popular generic com- and reference-free methods. pression tools, such as gzip (http://www.gzip.org/) and Reference-based methods align the targeted sequences bzip2 (http://www.bzip.org), cannot obtain satisfactory to some external reference sequence(s) for identifying performance on high-throughput DNA sequencing data, the similarity between them. The variances of the align- because they do not utilize the biological characteristics of ment are encoded instead of the original targeted se- the data like repeat fragments and palindromes. Efficient quences. Reference-based methods generally obtain better compression ratios with more time consumption by involv- ing a sequence alignment pre-processing. Representative * Correspondence: [email protected] state-of-the-art reference-based methods include CRAM †Equal contributors [17], Quip [11], and DeeZ [18]. CRAM [17], working with 1 College of Computer Science and Software Engineering, Shenzhen BAM-based input, records the variances of reads and a University, Shenzhen 518060, China ‘ ’ Full list of author information is available at the end of the article reference genome with Huffman coding. Quip [11] with -r

© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Huang et al. BMC Bioinformatics (2017) 18:179 Page 2 of 8

option compresses SAM/BAM data in a standard LW-FQZip 2 is an improved version of LW-FQZip 1 reference-based scheme and employs highly optimized by introducing parallelism and more efficient coding statistical models for various SAM fields, thus leading to schemes. Especially, LW-FQZip 2 is equipped with light- compromising compression rate. It is also applicable to weight mapping model, bitwise prediction by partial reference-based FASTQ compression in ‘-a’ mode where a matching (PPM), arithmetic coding, and multi-threading de novo assembly procedure is introduced to construct ref- parallelism. It can support various FASTQ files gener- erences in the target data rather than using existing refer- ated from the most well-known high-throughput se- ences. DeeZ [18] lowers the cost of representing common quencing platforms and obtain superior compression differences among the reads’ mapping results by implicitly ratios at reasonable time and memory space costs. assembling the underlying donor genome in order to en- code these variants only once. Implementation Reference-free methods compress the raw sequencing The general framework of LW-FQZip 2 is shown in data, mainly in FASTQ format, directly based on their Fig. 1. Firstly, LW-FQZip 2 splits an input FASTQ file natural statistics. For example, FQZcomp [8] uses an into three data streams (i.e., metadata, nucleotide se- order-k context model to predict the nucleotide sequences, and quality scores) and then the nucleotide quences in FASTQ format followed by arithmetic coding sequences (also known as reads) are divided into equal- based compression. DSRC [10] partitions input FASTQ sized sub-blocks which are simultaneously fed to the data into blocks enabling independent compression of light-weight mapping model implemented with multi- them using LZ77 and Huffman encoding schemes. threading. After the sequence mapping, the matching DSRC 2 [19] is an improvement of DSRC by introducing results (i.e., the mapped position, palindrome flag, match threaded parallelism and more efficient coding scheme. length, and match type) and mismatch values are re- LFQC [20] preforms data transformation to the four corded in different intermediate files. Secondly, the fields of the sequencing records in an FASTQ file separ- metadata and quality scores are simultaneously pro- ately, followed by regular data compressor namely zpaq ceeded by abridging the consecutive repeats with incre- and lpaq8. LEON [21] first builds a reference as a prob- mental encoding and run-length-limited encoding, abilistic de Bruijn graph based on bloom filter, and then respectively. Finally, the intermediate files generated records the reads and quality scores as mapped paths in from the metadata, nucleotide sequences, and quality the graph using arithmetic encoding. SCALCE [22] reor- scores streams are compacted with a combination of ganizes reads in an FASTQ file that share common ‘core’ bitwise order-32 PPM model and arithmetic coder in substrings into groups, and then compacts the groups parallel, except that the intermediate file storing mis- using gzip or LZ77-like tools. SeqDB [23] coordinates match values is compressed by a distinct bitwise order- sequences bases and their corresponding quality scores 28 arithmetic coder. In best compression ratio mode, into 2D byte arrays and compresses them with an existing i.e., LW-FQZip 2 with a ‘–g’ option selected, the quality multithreaded compressor Blosc. Quip [11], in addition to scores and mismatch values are compacted using the the reference-based compression mode, also provides zpaq tool (http://mattmahoney.net/dc/zpaq.html) and reference-free compression using arithmetic coding based the other intermediate files encoded with lpaq9m (http:// on high order Markov chains. Instead of exploiting the mattmahoney.net/dc/text.html#1440). LW-FQZip 2 is redundancy of homologous sequences, reference-free available at http://csse.szu.edu.cn/staff/zhuzx/LWFQZip2 methods put more effort into predictive model and coding and https://github.com/Zhuzxlab/LW-FQZip2. The scheme, which tends to improve the time efficiency by pseudo-code of LW-FQZip 2 is provided in Algorithm 1, sacrificing compression ratio to some degree [24]. Additional file 1. The key components of LW-FQZip 2 are This work focus on long-term archiving and space- described as follows. sensitive scenarios, where superior compression ratio is pursued and reference-based methods are more favourable. Compression of metadata and quality scores In our previous work [25], we have proposed a self- In LW-FQZip 2, the metadata are pre-processed with in- contained reference-based method, namely LW-FQZip 1, cremental encoding, with which the variances of one to compress high-throughput DNA sequencing data in raw metadata to its previous neighbour is stored rather than FASTQ format. LW-FQZip 1 introduces a light-weight the original data. The quality scores are pre-processed mapping model to efficiently align short reads against the with run-length-limited encoding. More details of the in- reference sequence based on a k-mer indexing strategy. cremental encoding and run-length-limited encoding are The light-weight mapping model distinguishes LW-FQZip available in [25]. 1 from other reference-based methods for not relying on After pre-processing, the processed data can be com- any external alignment software. Nevertheless, LW-FQZip pressed with lpaq9m (if ‘-g’ option is selected) or a com- 1 is far from satisfactory in terms of compression efficiency. bination of PPM model and arithmetic coder. In the Huang et al. BMC Bioinformatics (2017) 18:179 Page 3 of 8

Fig. 1 The general framework of LW-FQZip 2. Firstly, the input FASTQ file is split into three data streams of metadata, bases, and quality scores. Secondly, the quality scores and metadata are compacted with run-length-limited encoding and incremental encoding, respectively. The nucleotide bases are partitioned and mapped to an external reference sequence based on the light-weight mapping model. Finally, the processed intermediate files from the three streams are compressed with arithmetic coder and/or other specific coding schemes latter case, a binary tree is established to store the pre- positions of k-mer substrings in the reference with some dictive context information. The pre-processed meta- predefined prefix, e.g., ‘CG’. On mapping a read of nu- data and quality scores are transformed into binary cleotide bases X to the reference, the model identifies all streams to train an order-32 PPM model, i.e., prediction k-mer substring included in IR and selects the valuable based on a 4-byte context (higher order can improve k-mer substrings served as seeds, where some restric- the compression ratios by promoting the predictive tions are predefined to eliminate the low-quality k-mer accuracy but it consumes more memory space and run- substrings (e.g., minimum seed length L, mismatch tol- ning time. A trade-off order 32 is adopted in this erance rate e). Based on the selected seeds, multiple study). The binary quality scores are matched against local alignments are performed to identify the maximum the predicted results per bit by producing ‘0’ or ‘1’,and matches. The mapping results of X including the then the context prediction model is updated accord- mapped position, palindrome flag, match length, match ingly. Finally, the prediction results are recorded using type and mismatch values are recorded in some inter- arithmetic coder or zpaq (if ‘-g’ option is selected). mediate files. If no k-mer substring in X is identified in The pseudo-code of the compression with PPM pre- IR, the palindrome of X undergoes the same mapping diction model and arithmetic coding/zpaq is provided procedure. In the case neither X nor its palindrome is in Algorithm 2, Additional file 1. mapped to the reference, the plain X is output for encoding directly. Reference-based compression of nucleotide sequences To improve the matching rate, the unmapped parts using light-weight mapping model of reads are further partitioned into shorter segments The target nucleotide sequences are mapped to an exter- and realigned against the reference where palindrome nal reference sequence based on the light-weight map- match is considered. The mismatched nucleotide bases ping model [25] and the mapping results are recorded are composed exclusively of four characters (i.e., {‘A’, instead of the original sequences. ‘C’, ‘G’, ‘T’}),whicharemucheasiertoencodethan To make this article self-contained, the light-weight quality scores. Therefore, a simpler yet efficient model mapping model is briefly introduced in this subsection. based on the bitwise order-28 arithmetic coder (http:// The mapping model is designed to implement fast align- cs.fit.edu/~mmahoney/compression/text.html#2212) or ment by indexing the k-mer substrings within the refer- zpaq is adopted. If no proper reference is available, ence. A hash table IR is firstly established to save all LW-FQZip 2 also provides an option ‘-a’ to generate a Huang et al. BMC Bioinformatics (2017) 18:179 Page 4 of 8

reference by assembling a portion of reads that contain LW-FQZip 2 is compared with the other state-of-the- the predefined prefix. art lossless FASTQ compressors namely Quip [11], DSRC 2 [19], FQZcomp [8], CRAM [17], LFQC [20], LEON [21] and SCALCE [22]. The general-purpose Blocking and multithreading compression tools, i.e., gzip and bzip2, as well as the ori- Parallelism is introduced to LW-FQZip 2 to improve the ginal LW-FQZip 1 are also included in the comparison computational efficiency using the Pthreads library. In as baselines. All compared methods are configured to the mapping procedure, the input FASTQ file is parti- obtain the best compression ratio (the parameter set- tioned into b (empirically set to 10 in this study) equal- tings of all methods and the software version informa- sized blocks. Accordingly, b threads are simultaneously tion are provided in the Additional file 1). LW-FQZip 2 executed with each running a light-weight mapping is evaluated on two modes, i.e., the normal mode (‘LW- model for a corresponding block. Afterward, a new sin- FQZip 2’, using arithmetic coders to compress the inter- gle thread is created to collect the mapping results of mediate files) and the best compression ratio mode the previous b threads and dispatch the results into dif- (‘LW-FQZip 2 (−g)’, using lpaq9m and zpaq to compress ferent intermediate files. After the three data streams, the intermediate files). Quip is also executed in two i.e., metadata, quality scores, and nucleotide sequences modes, i.e., the reference-based compression (‘quip -r’) are properly processed, multiple threads are created to and the assembly-based compression (‘quip -a’, a refer- compress the six intermediate files with the correspondence is assembled with a portion of reads, and then a ing encoding schemes as shown in Fig. 1. reference-based compression is conducted using the In summary, LW-FQZip 2 improves LW-FQZip 1 by generated reference). Quip, CRAM, and LW-FQZip 1 introducing multi-threading for the time-consuming read fall within reference-based scheme. The other methods mapping and using more efficient encoding schemes are reference-free methods. based on PPM model, arithmetic coders, lpaq9m and/or The performance of the methods is evaluated in zpaq. The implementation details of LW-FQZip 2 are terms of compression ratio, speed, and memory con- provided in the Additional file 1. sumption. The compression ratios of all compared methods on the ten FASTQ data sets are tabulated in Results and discussion Table 2. The average compression and decompression LW-FQZip 2 is verified using ten representative real- speeds of all methods are plotted in Fig. 2. The mem- world FASTQ files (five short-read data and five long-read ory sizes consumed by the compared methods on data) on a platform running 64-bit Red Hat 4.4.7-16 with each data set are reported in Table 3. More details of four 8-core Intel(R) Xeon(R) E7-8837 CPUs (@2.67GHz the data sets and experimental studies are provided in with Hyper-Threading Technology). These data sets, Additional file 1: Tables S1-S13. The average CPU generated from various well-known high-throughput utilization and version information of all compared DNA sequencing platforms, were downloaded from the methods are provided in Additional file 1: Tables S14 Sequence Read Archive of the National Centre for Bio- and S17, respectively. technology Information (NCBI) [26]. Details of these data From Table 2, it is shown that LW-FQZip 2 and LW- sets are provided in Table 1. FQZip 2 (−g) successfully work on all test data sets.

Table 1 The ten real-world FASTQ data sets used for performance evaluation Datasets Platforms Species Read length (bp) Size (MB) GC content Long-read SRR2916693 454GS Pseudomonas moraviensis 67-1201 425 58.8% SRR2994368 Illumina Miseq Escherichia coli 70-502 4688 49.7% SRR3211986 Pacbio RS Homo sapiens 2-62746 1759 39.6% ERR739513 MinION Phage 5-246140 871 47.9% SRR3190692 Illumina MiSeq Escherichia coli 70-602 11379 52.3%

Short-read ERR385912 Illumina Hiseq 2000 Escherichia coli 51 641 43.5% ERR386131 Ion Torrent PGM Capsicum baccatum 151 1371 50.5% SRR034509 Illumina Analyzer II Escherichia coli 101 5247 52.6% ERR174310 Illumina Hiseq 2000 Homo sapiens 202 105122 N.A. ERR194147 Illumina Hiseq 2000 Homo sapiens 101 202631 40.3% Note: The long-read data sets have variable-length reads, while the short-read data sets have fixed-length reads Huang et al. BMC Bioinformatics (2017) 18:179 Page 5 of 8

Table 2 The compression ratios of the compared methods on ten test data sets LW-FQZip LW-FQZip 2 LW-FQZip Quip Quip DSRC CRAM FQZcomp LFQC LEON SCALCE bzip gzip 2 (−g) 1 (−a) (−r) 2 2 Long- SRR2916693 16.7% 15.3% 18.1% 20.9% 20.5% 20.2% 21.9% 21.6% 12.7% 19.5% 17.2%a 24.2% 29.6% read SRR2994368 17.3% 16.0% 17.9% 20.1% N/A 23.2% 26.4% N/A N/A 23.1% 17.3%a 28.5% 34.2% SRR3211986 33.3% 32.3% N/A 33.3% N/A N/A 33.9% N/A 32.3% N/A 33.4%a 36.4% 42.6% ERR739513 35.2% 34.8% N/A N/A N/A N/A 35.6% N/A 34.9% N/A N/A 39.7% 45.4% SRR3190692 12.7% 11.7% 13.2% 16.5% N/A 20.3% 22.3% N/A N/A 18.1% 12.7%a 24.4% 29.5%

Short- ERR385912 6.4% 5.0% 6.6% 7.2% N/A 7.8% N/A N/A 5.8% 7.0% 6.6%a 13.9% 17.9% read ERR386131 16.5% 16.0% 18.7% 17.7% 16.6% 16.8% 25.5% 24.6% 15.5% N/A 16.6%a 21.5% 26.0% SRR034509 23.7% 22.7% 25.0% 25.1% 24.9% 26.1% 27.4% 26.1% 23.7% 27.9% 24.5%a 31.5% 36.9% ERR174310 21.0% 20.1% N/A 20.0% N/A 20.2% N/A N/A N/A 25.3% 19.6%a 26.2% 31.7% ERR194147 20.1% 14.3% N/A 20.0% N/A 20.3% N/A N/A N/A 20.3% 15.4%a 19.7% 23.6% Compressed Ratio: the compressed file size divided by the original file size; ‘N/A’: the program cannot work on the data, some error occur in program, such as loses fidelity after decompression or decompression failed; ‘a’: the read order is changed after decompression; The best results are highlighted in bold

LW-FQZip 2 (−g) tends to obtain superior compression between LW-FQZip 2, LW-FQZip (−g) and LFQC in ratios to the other methods especially on long-read data. terms of compression ratio, memory usage, and time DSRC 2, Quip, FQZcomp, LEON, SCALCE and LW- consumption in a radar chart in Fig. 3. The results show FQZip 1 suffer from some issues like incompatibility and that LW-FQZip (−g) and LFQC attain slightly better fidelity-loss on the long-read data generated from Pacbio compression ratios than LW-FQZip 2 at the cost of RS and MinION platforms. LW-FQZip 1, CRAM, memory usage and time consumption, respectively. FQZcomp and LFQC also fail on some short-read data Nevertheless, LW-FQZip 2 achieves better compromise sets of large size. LFQC obtains comparable compression over all metrics than the other two methods. ratios to LW-FQZip 2 and LW-FQZip 2 (−g) in the data In terms of compression and decompression speeds, sets it works out. We made an extra comparison analysis LW-FQZip 2 outperforms other reference-based methods

Fig. 2 The average compression and decompression speeds of the compared methods on ten test data sets. The compression speed is calculated as the original file size divided by the compression time. The decompression speed is calculated as the original file size divided by the decompression time Huang et al. BMC Bioinformatics (2017) 18:179 Page 6 of 8

Table 3 The memory usage (MB) of the compared methods on ten test data sets LW- LW- LW- Quip Quip DSRC 2 CRAM FQZcomp LFQC SCALCE LEON bzip 2 gzip FQZip 2 FQZip 2 (−g) FQZip 1 (−a) (−r) SRR2916693 compression 1605 4420 459 759 391 582 784 312 3965 1380 1722 7.6 1.6 decompression 1598 4127 37 756 389 601 620 314 3932 1050 696 4.8 1.5 SRR2994368 compression 1582 14158 1048 2801 389 6175 1355 N/A N/A 4073 5435 7.6 1.6 decompression 1579 13154 69 2198 387 7234 652 N/A N/A 1050 2623 4.8 1.5 SRR3211986 compression 1190 12935 N/A 1098 N/A N/A 5777 N/A 4768 2158 N/A 7.6 1.6 decompression 1528 5657 N/A 1109 N/A N/A 2381 N/A 4320 1035 N/A 4.8 1.5 ERR739513 compression 1283 11511 N/A N/A N/A N/A 3694 N/A 5108 N/A N/A 7.6 1.6 decompression 1403 11079 N/A N/A N/A N/A 1455 N/A 4748 N/A N/A 4.8 1.5 SRR3190692 compression 1726 14560 1058 3552 391 14157 1363 N/A N/A 5219 6776 7.6 1.6 decompression 1725 13329 69 2898 386 14794 661 N/A N/A 1055 3217 4.8 1.5 ERR385912 compression 1603 2793 410 772 389 911 N/A N/A 3140 1422 1717 7.6 1.6 decompression 1603 2655 69 770 392 908 N/A N/A 3060 1040 504 4.8 1.5 ERR386131 compression 1691 12443 1033 771 389 1844 1318 322 5175 1961 N/A 7.6 1.6 decompression 1721 12165 39 768 384 1950 651 319 4835 1049 N/A 4.8 1.5 SRR034509 compression 1748 14670 1073 1531 383 6683 1351 324 5352 4151 4139 7.6 1.6 decompression 1752 7042 71 1218 391 7736 653 309 4859 1050 1799 4.8 1.5 ERR174310 compression 1886 16270 N/A 5333 N/A 5558 N/A N/A N/A 5333 7797 7.6 1.6 decompression 1865 11239 N/A 1156 N/A 18487 N/A N/A N/A 1045 5106 4.8 1.5 ERR194147 compression 1953 17771 N/A 782 N/A 20271 N/A N/A N/A 5380 7192 7.6 1.6 decompression 1963 13908 N/A 780 N/A 24284 N/A N/A N/A 1057 5322 4.8 1.5 Note: The best results are highlighted in bold as shown in Fig. 2. It is worth highlighting that LW- reference-based methods. Among the reference-free FQZip 2 outperforms LW-FQZip 1 in terms of compati- methods, DSRC 2 compresses the fastest with compromis- bility, compression ratio, and speed, which suggests a ing compression ratio by taking full advantage of multi- substantial improvement of LW-FQZip 2 to LW-FQZip 1. threading. Quip and LEON manage to obtain some As expected, reference-free methods tend to be faster than trade-offs between the three evaluation criteria.

Fig. 3 Comparison between LW-FQZip 2, LW-FQZip 2 (−g) and LFQC in a radar chart in terms of average compression ratio, compression time, decompression time, compression memory usage, and decompression usage. In each criterion, the results of the three methods are normalized to the range of [0, 1] and a smaller value, i.e., closer to the centroid, indicates a better performance Huang et al. BMC Bioinformatics (2017) 18:179 Page 7 of 8

SCALCE is a very efficient tool by introducing a Conclusions boosting scheme based on locally consistent parsing This article presents a specialized compression tool LW- (LCP) technique to sort the reads, which enables FQZip 2 for FASTQ files. LW-FQZip 2 shows superior SCALCE to compress similar reads together and at- compression ratios and compatibility with reasonable tain competitive compression ratios at high speed. (de) compression speed and memory space consumption. We also try to adopt the LCP technique in our method It could serve as a candidate tool for archival or storage as the framework shown in Additional file 1: Figure S2. space-sensitive applications of sequencing data. In this attempt, the successfully mapped reads are still The emerging long-read technologies, e.g., Single Mol- compressed with the original LW-FQZip 2, whereas the ecule Real Time (SMRT) sequencing [27] and Nanopore unmapped reads undergo the LCP boosting and gzip sequencing [28], produce much longer DNA sequences, compression. LCP is applied to only a small portion of reportedly providing a more complete picture of genome the reads, yet the results shown in Additional file 1: structure. They are deemed to be a complementary solu- Table S21, suggest that LCP can improve the compres- tion to overcome the shortages of short-read sequencing. sion ratio. Indeed, re-sorting the reads according to their The exponentially increasing long-read sequencing data similarity is really an efficient option to improve the poses new great challenges to the existing specialized compression ratio, especially for archiving-oriented ap- compression methods. LW-FQZip 2 shows good compati- plications. However, since the order of the reads is chan- bility to long-read sequencing data. The current work is ged, it inevitably imposes extra cost if random access of hoped to provide insights into the storage problems of the archive is the concern in the downstream analysis. new sequencing data. More efficient alignment models for LW-FQZip 2 is designed to preserve the original read long-read data will be developed in the future work. order to facilitate the implementation of random access in the future extension of this tool. The compared methods are also tested on the bench- Additional file mark data sets suggested by the MPEG working group on genomic compression (https://github.com/sfu-comp Additional file 1: This document provides the implementation details of LW-FQZip 2 and the detailed experimental results of each compared bio/compression-benchmark/blob/master/samples.md). algorithm in the related comparison study. Algorithm 1. The main The results are presented in Additional file 1: Tables procedure of LW-FQZip 2. Algorithm 2. Compression with PPM prediction S18-S20, where the proposed method shows consistent model and arithmetic coding. Table S1. The performance of LW-FQZip 2. Table S2. The performance of LW-FQZip 2 (−g). Table S3. The performance efficiency. of LW-FQZip 1. Table S4. The performance of Quip (−a). Table S5. The The experimental results suggest that all specialized performance of Quip (−r). Table S6. The performance of DSRC 2. Table S7. methods outperform the general-purpose tools in terms The performance of CRAM. Table S8. The performance of FQZcomp. Table S9. The performance of LFQC. Table S10. The performance of LEON. Table of compression ratio but use more memory space. The S11. The performance of SCALCE. Table S12. The performance of bzip 2. reference-based methods tend to be slower than the Table S13. The performance of gzip. Table S14. The average number of reference-free methods, due to the extra running time CPU cores used by the compared methods. Table S15. The compression ratios and time consumptions of LW-FQZip 2 w/o complementary palindrome involved in the sequence alignment preprocessing. Dif- mapping. Table S16. ThememoryusageoftheLW-FQZip2w/o ferent methods are designed with different strengths and complementary palindrome mapping. Table S17. The version information of can be used for different purposes. There is no single all compared methods. Table S18. The compression ratios of the compared methods on benchmark data provided by MPEG working group on genomic method dominates other methods in all criteria. compression. Table S19. The performance of LW-FQZip 2 on benchmark data The effect of palindrome handling is investigated in provided by MPEG working group on genomic compression. Table S20. The the Additional file 1: Tables S15 and S16. With palin- performance of LW-FQZip 2 (−g) on benchmark data provided by MPEG working group on genomic compression. Table S21. The compression ratios drome handling more reads can be mapped to the refer- of LW-FQZip2 + LCP, LW-FQZip2 -g + LCP, SCALCE, and LW-FQZip 2 on seven ence at a higher speed (the mapping is stopped once the representative data sets. Table S22. The comparison of compression speed of first match is identified), thus the compression ratio and LW-FQZip 2 using SSD and HDD disk systems. Figure S1. The compression speeds of LW-FQZip 2 using different number of threads on five representative speed are improved slightly, while the extra memory data sets. Figure S2. The framework of LW-FQZip 2 with LCP technique. consumption is eligible. (DOCX 229 kb) The effect of thread number is studied in Additional file 1: Figure S1. The compression speed is affected not only by the number of threads but also the disk I/O speed. Abbreviations PPM: Prediction by partial matching; LCP: Locally consistent parsing; As a result, the compression might not be speeded NCBI: National Centre for Biotechnology Information; SMRT: Single up proportionally as the number of threads increases molecule real time; SSD: Solid state disk (as shown in Additional file 1: Figure S1). Using faster disk system like solid state disk (SSD) can help to Acknowledgements speed up both compression and decompression (see We would like to thank the Editor and the reviewers for their precious in the Additional file 1: Table S22). comments on this work which helped improve the quality of this paper. Huang et al. BMC Bioinformatics (2017) 18:179 Page 8 of 8

Funding 9. Tembe W, Lowey J, Suh E. G-SQZ: compact encoding of genomic sequence This work was supported in part by the National Natural Science and quality data. Bioinformatics. 2010;26(17):2192–4. Foundation of China (61471246, 61575125, and 61572328), Guangdong 10. Deorowicz S, Grabowski S. Compression of DNA sequence reads in FASTQ Foundation of Outstanding Young Teachers in Higher Education Institutions format. Bioinformatics. 2011;27(6):860–2. (Yq2013141 and Yq2015141), Guangdong Special Support Program of 11. Jones DC, Ruzzo WL, Peng X, Katze MG. Compression of next-generation Top-notch Young Professionals (2014TQ01X273 and 2015TQ01R453), Guangdong sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Province Ordinary University Characteristic Innovation Project (2015KTSCX122), Res. 2012;40(22):e171. Shenzhen Scientific Research and Development Funding Program 12. Grabowski S, Deorowicz S, Roguski L. Disk-based compression of data from (ZYC201105170243A, JCYJ20150324141711587, CXZZ20140902102350474, genome sequencing. Bioinformatics. 2015;31(9):1389–95. JCYJ20150630105452814, CXZZ20140902160818443, CXZZ20150813151056544, 13. Janin L, Schulz-Trieglaff O, Cox AJ. BEETL-fastq: a searchable compressed and JCYJ20160331114551175), and China-UK Visual Information Processing Lab archive for DNA reads. Bioinformatics. 2014;30(19):2796–801. Foundation. The funding body played no role in the design or conclusions 14. Patro R, Kingsford C. Data-dependent bucketing improves reference-free of the study. compression of sequencing reads. Bioinformatics. 2015;31(17):2770–7. 15. Rozov R, Shamir R, Halperin E. Fast lossless compression via cascading Availability of data and materials Bloom filters. BMC Bioinformatics. 2014;15 Suppl 9:S7. Project name: LW-FQZip 2 16. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Project website: http://csse.szu.edu.cn/staff/zhuzx/LWFQZip2 and https:// Durbin R. The sequence alignment/map format and SAMtools. Bioinformatics. github.com/Zhuzxlab/LW-FQZip2 2009;25(16):2078–9. Operating systems: Linux or MacOS 17. Fritz MH-Y, Leinonen R, Cochrane G, Birney E. Efficient storage of high Programming language: C/C++ throughput DNA sequencing data using reference-based compression. Other requirements: GCC compiler and the archiving tool ‘tar’ Genome Res. 2011;21(5):734–40. License: The MIT License 18. Hach F, Numanagic I, Sahinalp SC. DeeZ: reference-based compression by Any restrictions to use by non-academics: For commercial use, please contact the local assembly. Nat Methods. 2014;11(11):1082–4. authors. All datasets are downloaded from SRA of NCBI. All data supporting the 19. Roguski L, Deorowicz S. DSRC 2–Industry-oriented compression of FASTQ conclusions of this article are included within the article and its additional files. files. Bioinformatics. 2014;30(15):2213–5. 20. Nicolae M, Pathak S, Rajasekaran S. LFQC: a lossless compression algorithm Author’s contributions for FASTQ files. Bioinformatics. 2015;31(20):3276–81. ZH & ZZ conceived the algorithm, developed the program, and wrote the 21. Benoit G, Lemaitre C, Lavenier D, Drezen E, Dayris T, Uricaru R, Rizk G. manuscript. ZW and YS helped with manuscript editing, designed and Reference-free compression of high throughput sequencing data with performed experiments. QD&YC prepared the data sets, carried out analyses and a probabilistic de Bruijn graph. BMC Bioinformatics. 2015;16:288. helped with program design. All authors read and approved the final manuscript. 22. Hach F, Numanagic I, Alkan C, Sahinalp SC. SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics. Competing interests 2012;28(23):3051–7. The authors declare that they have no competing interests. 23. Howison M. High-throughput compression of FASTQ data with SeqDB. IEEE/ ACM Trans Comput Biol Bioinform. 2013;10(1):213–8. Consent for publication 24. Kahn SD. On the future of genomic data. Science. 2011;331(6018):728–9. Not applicable. 25. Zhang Y, Li L, Yang Y, Yang X, He S, Zhu Z. Light-weight reference-based compression of FASTQ data. BMC Bioinformatics. 2015;16:188. Ethics approval and consent to participate 26. Leinonen R, Sugawara H, Shumway M. The sequence read archive. Nucleic Not applicable. Acids Res. 2010;39(suppl_1):D19-D21. 27. Flusberg BA, Webster DR, Lee JH, Travers KJ, Olivares EC, Clark TA, Korlach J, ’ Turner SW. Direct detection of DNA methylation during single-molecule, Publisher sNote real-time sequencing. Nat Methods. 2010;7(6):461–5. Springer Nature remains neutral with regard to jurisdictional claims in published 28. Branton D, Deamer DW, Marziali A, Bayley H, Benner SA, Butler T, Di Ventra maps and institutional affiliations. M, Garaj S, Hibbs A, Huang X. The potential and challenges of nanopore sequencing. Nat Biotechnol. 2008;26(10):1146–53. Author details 1College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China. 2School of Medicine, Shenzhen University, Shenzhen 518060, China.

Received: 7 August 2016 Accepted: 9 March 2017

References 1. van Dijk EL, Auger H, Jaszczyszyn Y, Thermes C. Ten years of next-generation sequencing technology. Trends Genet. 2014;30(9):418–26. 2. Kozanitis C, Heiberg A, Varghese G, Bafna V. Using genome query language Submit your next manuscript to BioMed Central to uncover genetic variation. Bioinformatics. 2014;30(1):1–8. 3. Zhu Z, Zhang Y, Ji Z, He S, Yang X. High-throughput DNA sequence data and we will help you at every step: compression. Brief Bioinform. 2015;16(1):1–15. 4. Numanagic I, Bonfield JK, Hach F. Comparison of high-throughput sequencing • We accept pre-submission inquiries data compression tools. Nat Methods. 2016;13(12):1005–8. • Our selector tool helps you to ﬁnd the most relevant journal 5. Hosseini M, Pratas D, Pinho AJ. A survey on data compression methods for • We provide round the clock customer support biological sequences. Information. 2016;7(4):56. 6. Zhu Z, Li L, Zhang Y, Yang Y, Yang X. CompMap: a reference-based • Convenient online submission compression program to speed up read mapping to related reference • Thorough peer review sequences. Bioinformatics. 2015;31(3):426–8. • Inclusion in PubMed and all major indexing services 7. Zhang, Y, Patel K, Endrawis T, Bowers A, Sun Y. A FASTQ compressor based on integer-mapped k-mer indexing for biologist. Gene. 2016;579(1):75-81. • Maximum visibility for your research 8. Bonfield JK, Mahoney MV. Compression of FASTQ and SAM format sequencing data. PLoS ONE. 2013;8(3):e59190. Submit your manuscript at www.biomedcentral.com/submit