GPU-Accelerated Adaptive Compression Framework for Genomics Data
GuiXin Guo, Shuang Qiu, ZhiQiang Ye, BingQiang Wang (BGI Research, Shenzhen, China); Mian Lu (Institute of HPC, A*STAR, Singapore); Simon See (BGI-NVIDIA Joint Innovation Lab, Shenzhen, China)
GTC 2014, March 24-27, 2014, San Jose, CA. Contact: [email protected], [email protected]
Outline
Ø Introduction
Ø Adaptive Compression Framework
Ø Implementation of Compression Algorithms
Ø Results
Ø Conclusion

Genomics Data: Exponential Growth
Moore’s Law for chips: 2x performance per 18 months
Moore’s Law for genomics: 10x data output per 18 months
(a) The cost of sequencing one megabase (Mb) of DNA has dropped from nearly $6000 in 2001 to slightly more than $0.10 in 2011. (b) The total number of completed genome sequences has grown exponentially as sequencing costs have fallen.
Boyle, Nanette R., and Ryan T. Gill. "Tools for genome-wide strain design and construction." Current Opinion in Biotechnology 23.5 (2012): 666-671.

Can Compression Help?
• Storing and processing the huge volume of genomics data poses serious challenges
• BGI as an example
  – Tens of TBs of data generated per day
  – Tens of PBs of storage (across several sites)
  – Tenfold growth expected in the (not too distant) future
• Observation
  – Computation in genomics features a much lower computation/IO ratio than classical HPC workloads
  – IO (or data movement) becomes more expensive than computation

Compression
• Benefits of compression
  – Reduce storage capacity (especially for archiving)
  – Reduce IO bandwidth (a more balanced computing system architecture)
  – And, of course, save $$$
• Compression is NOT free
  – Squeeze more, compute more
  – Squeeze less, compute less
• Can GPU help?

Take a Look at Genomics Data Files
• Two common characteristics of genomics data files
1. A table containing multiple rows and columns:

@SRR003092.1.1 3046HAAXX:2:1:933:35.1 length=51     <- sequence identifier
GAATAAAGAAAAAATGGAAAACGAAGATGTTGAAATTTTTAATGATTATA  <- sequence bases
+SRR003092.1.1 3046HAAXX:2:1:933:35.1 length=51     <- sequence identifier
I>I:1III9?9&I+II.6*,:'*1.?I%-&&67I0(1.",&$%2,+I4)+  <- quality scores
@SRR003092.2.1 3046HAAXX:2:1:942:57.1 length=51     <- sequence identifier
GTATACGTATTATGAATATACTGATTATATAAGCATAAATAAATAAAATA  <- sequence bases
+SRR003092.2.1 3046HAAXX:2:1:942:57.1 length=51     <- sequence identifier
IIIIIIIIIIIIIIIDIAI8%I-7II9I3I8@(%/EIA/>;G=DI9=8#6  <- quality scores

Example of a FASTQ file containing two sequences. In the column-major table view, the columns are: sequence identifier, sequence bases, sequence identifier, quality scores.
2. Data in the same column share similar characteristics

Workflow of Adaptive Compression Framework
The framework maintains a set of GPU-optimized compression schemes (combinations of the basic algorithms). For each input block (block #i), the data is transformed to column-major layout; the column-major compression engine then tests the candidate schemes against each column, applies the best one, and compresses each column over multiple rounds of processing before writing the output.

Algorithms Optimized (Till Now)
Commonly used compression schemes fall into three classes: transformational, substitutional, and statistical model-based schemes.
Typical basic algorithms, all GPU-accelerated in this work: LZ77, BWT, MTF, Huffman, and the Markov transform.
Novel compression algorithm for quality scores (FASTQ): a first-order Markov model (statistical scheme) combined with sorting the frequencies of character pairs (transformational scheme).

Four Schemes for Different Data
Generic compression methods? Not efficient. Domain-specific methods? They work only on limited data formats.
Raw genomics data is instead routed to one of four schemes:
– Text-like data (e.g. sequence IDs): LZ77 + Huffman
– Data with many similar strings (e.g. DNA sequences): BWT + MTF + Huffman
– Data with a limited alphabet (e.g. quality scores): Markov transform + Huffman
– Randomly distributed data: Huffman
Each column is tested against these schemes to find the best performance.
Column-major compression is flexible for new file formats and extensible with new algorithms. A problem still remains: the serial algorithms are too slow.

Optimization Techniques
Ø Data parallelism: a simple but efficient scheme to parallelize MTF and its reverse. The input is split into data blocks 1..n; each block is transformed independently (algorithm 1..k for compression, the reverse algorithms for decompression), and the per-block results (compressed data 1..n) are merged into the output.
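As a rough CPU sketch of this split-and-merge scheme (illustrative code, not the authors' kernels), MTF can be applied to independent blocks and the per-block outputs concatenated; on the GPU, each data block maps to a thread block:

```python
def mtf_encode(block, alphabet):
    # Move-to-front: emit each symbol's current index in a self-organizing
    # list, then move that symbol to the front.
    table, out = list(alphabet), []
    for s in block:
        i = table.index(s)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def mtf_decode(indices, alphabet):
    # Exact inverse: index into the list, then apply the same update.
    table, out = list(alphabet), []
    for i in indices:
        s = table.pop(i)
        out.append(s)
        table.insert(0, s)
    return "".join(out)

def blockwise_mtf(data, alphabet, n_blocks):
    # Split the input into independent blocks so every block can be
    # transformed in parallel (one GPU thread block per data block;
    # a plain loop here).
    size = -(-len(data) // n_blocks)  # ceiling division
    return [mtf_encode(data[i:i + size], alphabet)
            for i in range(0, len(data), size)]
```

The merge step is a plain concatenation, and decompression decodes each block independently, so the scheme parallelizes trivially at the cost of resetting the MTF table at every block boundary.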
Ø Increase the parallelism of selected algorithms
  • (Slightly) alternate implementations of the algorithms to reduce data dependency
Ø Optimize the implementation on GPU
  • Embrace state-of-the-art, high-performance libraries (e.g. b40c)
  • Better utilization of constant memory and shared memory

Parallel Huffman Encoding and Decoding
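The fixed relation between codeword and code length that this scheme exploits can be reproduced on the CPU with canonical Huffman codes; the sketch below is a standard canonical coder (our assumption for illustration, not the authors' GPU implementation), with the (codeword, code length) tables standing in for the auxiliary tables kept in constant memory:

```python
from collections import Counter
import heapq

def code_lengths(data):
    # Build a Huffman tree and return each symbol's code length.
    freq = Counter(data)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tick = len(heap)  # tie-breaker so dicts are never compared
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (fa + fb, tick, merged))
        tick += 1
    return heap[0][2]

def canonical_codes(lengths):
    # Assign codewords in (length, symbol) order, so codeword values and
    # code lengths obey a fixed, table-friendly relation.
    codes, code, prev_len = {}, 0, 0
    for sym, ln in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (ln - prev_len)
        codes[sym] = (code, ln)
        code += 1
        prev_len = ln
    return codes

def encode(data, codes):
    return "".join(format(c, "0{}b".format(ln))
                   for c, ln in (codes[s] for s in data))

def decode(bits, codes):
    # Decode with a (length, value) -> symbol table, mirroring the
    # auxiliary-table lookup placed in constant memory on the GPU.
    table = {(ln, c): s for s, (c, ln) in codes.items()}
    out, val, ln = [], 0, 0
    for b in bits:
        val, ln = (val << 1) | (b == "1"), ln + 1
        if (ln, val) in table:
            out.append(table[(ln, val)])
            val, ln = 0, 0
    return out
```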
The Huffman tree is built serially from the input data as a single-side growing Huffman tree, which makes decoding memory-efficient: there is a fixed relation between a codeword and its code length. The generated codewords C0, C1, C2, …, Ck and code lengths L0, L1, L2, …, Lk are stored as auxiliary tables in constant memory, with shared memory used for working data. Characters are then encoded in parallel, with a position array recording where each codeword is written.
The encoded data is an r-bit string S = h1 h2 … hr. S is decoded in parallel, with d GPU threads working on each character, where d is the depth of the Huffman tree.

Markov Transform
The transform works on a table of (A, B) pairs, where A stores the frequency of a character pair and B represents the second character of the pair.
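A minimal CPU sketch of this table-based transform (illustrative code; the GPU version builds the histogram with atomicAdd and sorts one row per thread block, as described below):

```python
def markov_transform(codes, alphabet_size):
    # 1. Count the frequency of every adjacent character pair; the GPU
    #    builds this histogram with atomicAdd.
    freq = [[0] * alphabet_size for _ in range(alphabet_size)]
    for prev, cur in zip(codes, codes[1:]):
        freq[prev][cur] += 1
    # 2. Sort each row by descending frequency, keeping only the second
    #    characters as the lookup table (one thread block per row on GPU).
    lookup = [sorted(range(alphabet_size), key=lambda c: -row[c])
              for row in freq]
    # 3. Re-code: the previous character selects the row, and the current
    #    character's column index in that row is its coding value.
    out = [codes[0]] + [lookup[prev].index(cur)
                        for prev, cur in zip(codes, codes[1:])]
    return out, lookup
```

Frequent transitions map to small values (mostly 0s and 1s), which a subsequent Huffman pass compresses well.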
Quality scores are locally alike and highly redundant, yet hard to compress. We propose the Markov transform as a good solution: it is lightweight and has high parallelism.

Example string: 1 3 2 3 1 2 0 3 1 3 1 2 2 3 1 3 1 2 2 2 2

Use adjacent characters in the input to form character pairs and count the frequency of each pair; on the GPU this is parallelized with atomicAdd:

      0      1      2      3
0   (0,0)  (0,1)  (0,2)  (1,3)
1   (0,0)  (0,1)  (3,2)  (3,3)
2   (1,0)  (0,1)  (4,2)  (2,3)
3   (0,0)  (5,1)  (1,2)  (0,3)

Use the frequencies to sort each row of the table (one row is sorted by one thread block):

      0      1      2      3
0   (1,3)  (0,0)  (0,1)  (0,2)
1   (3,2)  (3,3)  (0,0)  (0,1)
2   (4,2)  (2,3)  (1,0)  (0,1)
3   (5,1)  (1,2)  (0,0)  (0,3)

Only the second characters are kept as the lookup table:

    0  1  2  3
0   3  0  1  2
1   2  3  0  1
2   2  3  0  1
3   1  2  0  3

To look up the table, use the previous character as the row index, search for the current character in that row, and take its column index as the coding value. Each character can be processed in parallel.

Coded string: 1 1 1 1 0 0 2 0 0 1 0 0 0 1 0 1 0 0 0 0 0

bzip2: Challenge for GPU Acceleration
[Charts: compression ratio (%) and compression rate (MB/s) for a FASTQ file at compression levels 1-9, comparing gzip, bzip2, lzip, and lzo.]
bzip2 achieves a good and stable compression ratio but a low compression rate, and its data dependencies make GPU acceleration difficult!
Patel, R. A., Zhang, Y., Mak, J., Davidson, A., & Owens, J. D. (2012). Parallel lossless data compression on the GPU (pp. 1-9). IEEE.

bzip2: Workflow
Bzip2-like compression method pipeline, shown on example data (a string of length N): a b a b a c a b a c
1. Burrows-Wheeler Transform (the most time-consuming step!) → BW-transformed string of length N: c c b b b a a a a a
2. Move-to-Front Transform → N-sized byte array of indices into the MTF list: 99 0 99 0 0 99 0 0 0 0
3. Huffman Coding → M-sized bit string of encoded data: 1 0 1 0 0 1 0 0 0 0
Decompression runs the same pipeline in reverse.

Increase Parallelism of BWT
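A CPU sketch of suffix-array-based BWT construction via prefix doubling (illustrative code, not the production kernel; Python's built-in sort stands in for the GPU radix sort):

```python
def suffix_array(s):
    # Prefix doubling: rank every suffix by its first k characters and
    # double k each round. Each round is one sort over (rank, rank+k)
    # pairs -- a radix sort via b40c on the GPU, Python's sort here.
    n = len(s)
    rank, k = [ord(c) for c in s], 1
    while True:
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa = sorted(range(n), key=key)
        new = [0] * n
        for j in range(1, n):
            new[sa[j]] = new[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        if new[sa[-1]] == n - 1:   # all ranks distinct: sorting is done
            return sa
        rank, k = new, 2 * k

def bwt(s):
    # The BWT output is the character preceding each sorted suffix, i.e.
    # the last column of the sorted rotation matrix for this input.
    return "".join(s[i - 1] for i in suffix_array(s))
```

Each while-loop iteration corresponds to one GPU sorting round; after O(log n) rounds all ranks are distinct and the suffix array is complete.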
String: a b a b a c a b a c
The Burrows-Wheeler transformation forms all rotations of the string, sorts them, and records the sorted order as an index; the last column of the sorted rotation matrix is the BWT string. Sorting the rotations is the most compute-intensive step.
For positions 0-9 of S = a b a b a c a b a c, prefix doubling* iteratively refines the rank arrays R1, R2, R4, R8 by radix sorting (using the high-performance sorting library b40c), then transforms the final rank array into the suffix array SA = 0 6 2 8 4 1 7 3 9 5 in parallel, from which the BWT result is obtained in parallel.
* Sun, Weidong, and Zongmin Ma. "Parallel lexicographic names construction with CUDA." Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on. IEEE, 2009.

Improve Parallelism of BWT Reverse
The BWT reverse is serial in nature: starting from a given position, the original string a b a b a c a b a c is recovered backward, one character after another.
SA (suffix array): 0 6 2 8 4 1 7 3 9 5; BWT string: c c b b b a a a a a
Solution: store more indices (e.g. index 0 and index 5) during the BWT process so that the BWT reverse can be parallelized. Sorting plays its role again: sorting the BWT string c c b b b a a a a a yields a a a a a b b b c c together with the index array 5 6 7 8 9 2 3 4 0 1. Different threads then start the BWT reverse at different positions simultaneously, each recovering a different portion of a b a b a c a b a c.

Radix Sort on GPU
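A serial sketch of the LSD radix sort underlying b40c-style GPU sorting (parameters are illustrative; the GPU version parallelizes the histogram and scatter across thread blocks):

```python
def radix_sort(keys, key_bits=32, digit_bits=8):
    # LSD radix sort: r = key_bits / digit_bits counting-sort rounds.
    # Each round reads every key twice (histogram + scatter) and writes
    # it once, giving the (2n + n) * r memory traffic noted on the slide.
    buckets = 1 << digit_bits
    for shift in range(0, key_bits, digit_bits):
        count = [0] * buckets
        for k in keys:                          # histogram pass
            count[(k >> shift) & (buckets - 1)] += 1
        start, total = [0] * buckets, 0
        for d in range(buckets):                # exclusive prefix sum
            start[d], total = total, total + count[d]
        out = [0] * len(keys)
        for k in keys:                          # stable scatter pass
            d = (k >> shift) & (buckets - 1)
            out[start[d]] = k
            start[d] += 1
        keys = out
    return keys
```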
Radix sort processes k-bit keys d bits at a time. Each round histograms the current d-bit digit of every key into 2^d counters (count 1 … count 2^d) and then scatters the keys into the corresponding buckets (bucket 1 … bucket 2^d). The number of rounds is r = k/d, and the memory read/write traffic is (2n + n) * r. The radix sort implemented by b40c is memory-bandwidth bound; sorting is still the bottleneck!

Sample Sort on GPU
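A serial sketch of the sample-sort idea (bucket count and oversampling factor are illustrative):

```python
import random

def sample_sort(keys, n_buckets=3, oversample=2):
    # Sample sort: draw a small random sample, sort it, pick pivots from
    # it, scatter every key into its bucket, then sort each bucket
    # locally. On the GPU, each bucket is sized to fit a thread block's
    # shared memory; total traffic is about 2n for the scatter plus 2n
    # for the local sorts.
    if len(keys) <= n_buckets * oversample:
        return sorted(keys)
    sample = sorted(random.sample(keys, n_buckets * oversample))
    pivots = sample[oversample - 1::oversample][:n_buckets - 1]
    buckets = [[] for _ in range(n_buckets)]
    for k in keys:
        # Count how many pivots k exceeds (a binary search on the GPU).
        buckets[sum(k > p for p in pivots)].append(k)
    out = []
    for b in buckets:
        out.extend(sorted(b))  # local sort, one thread block per bucket
    return out
```

Because the pivots are sorted and every bucket is sorted locally, the concatenation is fully sorted regardless of which sample was drawn.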
Sample sort on GPU*, shown on n keys: 15 12 5 9 2 17 13 8 11 3 6 18 7 1 14 10 16 4
1. Select the samples: 15 9 13 3 7 10
2. Sort the samples: 3 7 9 10 13 15
3. Select the pivots and scatter the keys into buckets:
   Bucket 0: 5 2 3 6 1 4 | Bucket 1: 12 9 8 11 7 10 | Bucket 2: 15 17 13 18 14 16
4. Locally sort each bucket (efficient utilization of shared memory):
   1 2 3 4 5 6 | 7 8 9 10 11 12 | 13 14 15 16 17 18
Memory read and write: 2n + 2n.
* Leischner, N., Osipov, V., & Sanders, P. (2010, April). GPU sample sort. In Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on (pp. 1-10). IEEE.

Performance of GPU-Accelerated Algorithms
Speedups of the GPU implementations over the CPU versions:

Compression method | Transform | Reverse transform
BWT                | 4x        | 31x
MTF                | 8x        | 8x
Markov             | 18x       | N/A
Huffman            | 2x        | 5x

The 31x figure reflects the improved parallelism of the BWT reverse; the 18x figure reflects the intrinsic parallelism of the newly designed Markov transform.
CPU: Intel Xeon E5630 @ 2.53 GHz; GPU: Tesla M2050 / 3 GB; CUDA 4.0

Compression Performance of FASTQ File
Compression method | Compression rate (MB/s) | Decompression rate (MB/s) | Compression
                   | M2050   | K20c          | M2050   | K20c            | ratio (%)
bzip2 (CPU)        | 8.24    |               | 26.64   |                 | 24.31
gzip (CPU)         | 8.45    |               | 114.23  |                 | 29.46
BWT+MTF+Huffman    | 16.04   | 21.95         | 73.90   | 83.66           | 31.06
Markov+Huffman     | 115.60  | 204.86        | 77.23   | 90.87           | 36.68
Huffman            | 179.83  | 215.85        | 128.39  | 142.16          | 60.50
This work          | 77.80   | 97.26         | 124.37  | 127.78          | 24.77

Similar compression ratio to bzip2, with an 11.8x speedup for compression and a 4.8x speedup for decompression (further improvement possible with more work).

Compression Performance of SAM File
Compression method | Compression rate (MB/s) | Decompression rate (MB/s) | Compression
                   | M2050   | K20c          | M2050   | K20c            | ratio (%)
bzip2 (CPU)        | 6.27    |               | 24.63   |                 | 26.71
gzip (CPU)         | 7.73    |               | 106.41  |                 | 32.26
BWT+MTF+Huffman    | 15.99   | 21.78         | 74.14   | 80.04           | 32.55
Markov+Huffman     | 116.71  | 206.18        | 87.09   | 90.14           | 39.66
Huffman            | 177.11  | 222.22        | 127.48  | 144.49          | 57.69
This work          | 87.45   | 98.14         | 139.93  | 149.68          | 26.46

Similar compression ratio to bzip2, with a 15.6x speedup for compression and a 6.1x speedup for decompression (further improvement possible with more work).

Markov Transform for Quality Scores
Compression method | Compression rate (MB/s) | Decompression rate (MB/s) | Compression
                   | M2050   | K20c          | M2050   | K20c            | ratio (%)
bzip2 (CPU)        | 8.93    |               | 22.61   |                 | 38.47
gzip (CPU)         | 9.83    |               | 93.59   |                 | 42.35
BWT+MTF+Huffman    | 9.14    | 12.38         | 57.01   | 75.18           | 43.86
Huffman            | 185.32  | 231.16        | 121.66  | 129.28          | 49.42
Markov+Huffman     | 97.75   | 176.65        | 83.67   | 88.10           | 42.01

Comparison to Domain-Specific Methods
Compression method             | Compression rate (MB/s) | Decompression rate (MB/s) | Compression ratio (%)
gzip                           | 12.2  | 45.4  | 35.35
bzip2                          | 7.0   | 13.0  | 29.05
SCALCE                         | 7.8   | 13.1  | 25.72
DSRC                           | 13.5  | 32.2  | 24.77
quip                           | 8.3   | 10.9  | 22.19
fastqz                         | 4.6   | 3.8   | 21.95
fqzcomp                        | 8.2   | 8.3   | 21.72
Seqsqueeze1                    | 0.6   | 0.6   | 21.87
Column-major block compression | 111.0 | 104.4 | 29.46
Conclusion
• We presented an adaptive compression framework for genomics data, accelerated by GPUs, which works very well
  • Column-major compression
  • A novel algorithm for data like quality scores
  • Generic and extensible
• Compression on GPU is not easy; sorting is still a bottleneck
Contact: [email protected]