GPU Accelerated Adaptive Compression Framework for Genomics Data

GuiXin Guo, Shuang Qiu, ZhiQiang Ye, BingQiang Wang (BGI Research, Shenzhen, China)
Mian Lu (Institute of HPC, A*STAR, Singapore)
Simon See (BGI-NVIDIA Joint Innovation Lab, Shenzhen, China)

GTC 2014, March 24-27, 2014, San Jose, CA
Contact: [email protected], [email protected]

Outline

Ø Introduction
Ø Adaptive Compression Framework
Ø Implementation of Compression Algorithms
Ø Results
Ø Conclusion

Genomics Data: Exponential Growth

Moore’s Law for Chips 2x performance per 18 months

Moore’s Law for Genomics 10x data output per 18 months

(a) The cost of sequencing per megabase of DNA has dropped from nearly $6000 in 2001 to slightly more than $0.10 in 2011. (b) The total number of completed genome sequences has grown exponentially as sequencing costs have decreased.

Boyle, Nanette R., and Ryan T. Gill. "Tools for genome-wide strain design and construction." Current Opinion in Biotechnology 23.5 (2012): 666-671.

Can Compression Help?

• Challenges arise in storing and processing the huge volume of genomics data

• BGI as an example
  – Tens of TB of data generated per day
  – Tens of PB of storage (across several sites)
  – Tenfold more in the (not too distant) future

• Observation
  – Computation in genomics features a much lower computation-to-IO ratio than classical HPC workloads
  – IO (or data movement) becomes more expensive than computation

Compression

• Benefits of compression
  – Reduce storage capacity (especially for archiving)
  – Reduce IO bandwidth (a more balanced computing system architecture)
  – And, of course, save $$$

• Compression is NOT for free
  – Squeeze more, compute more
  – Squeeze less, compute less

• Can GPU help?

Take a Look at Genomics Data Files

• Two common characteristics of genomics data files
  1. The data form a table containing multiple rows and columns

@SRR003092.1.1 3046HAAXX:2:1:933:35.1 length=51      (sequence identifier)
GAATAAAGAAAAAATGGAAAACGAAGATGTTGAAATTTTTAATGATTATA   (sequence bases)
+SRR003092.1.1 3046HAAXX:2:1:933:35.1 length=51      (sequence identifier)
I>I:1III9?9&I+II.6*,:'*1.?I%-&&67I0(1.",&$%2,+I4)+   (quality scores)
@SRR003092.2.1 3046HAAXX:2:1:942:57.1 length=51      (sequence identifier)
GTATACGTATTATGAATATACTGATTATATAAGCATAAATAAATAAAATA   (sequence bases)
+SRR003092.2.1 3046HAAXX:2:1:942:57.1 length=51      (sequence identifier)
IIIIIIIIIIIIIIIDIAI8%I-7II9I3I8@(%/EIA/>;G=DI9=8#6   (quality scores)

Example of a FASTQ file containing two sequences.

Column-major table view:

Columns: sequence identifier | sequence bases | sequence identifier | quality scores

  2. Data in the same column share similar characteristics

Workflow of Adaptive Compression Framework

GPU-optimized compression algorithms are combined into compression schemes (combinations of algorithms). The candidate schemes are tested against each column and the best one is applied.

For each input block #i, the block is transformed to column-major layout and passed to the column-major compression engine, which performs multiple rounds of processing, one per column, before writing the output.
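As an illustration of these two steps, here is a minimal host-side sketch under simple assumptions; the names (toColumnMajor, bestScheme, Scheme) are hypothetical and not taken from the original implementation. It splits a block of FASTQ records into four column streams and, for each column, keeps the candidate scheme whose trial compression is smallest.

```
// Minimal host-side sketch (hypothetical names, not the authors' code):
// split one block of FASTQ text into four column streams, then pick the
// best-performing scheme for each column by trial compression.
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

struct ColumnBlock {
    // 0: sequence identifier, 1: bases, 2: '+' identifier, 3: quality scores
    std::array<std::string, 4> columns;
};

// Transform a block of FASTQ records (4 lines per record) to column-major:
// line i of every record is appended to column (i mod 4).
ColumnBlock toColumnMajor(const std::string& fastqBlock) {
    ColumnBlock block;
    std::istringstream in(fastqBlock);
    std::string line;
    for (std::size_t lineNo = 0; std::getline(in, line); ++lineNo) {
        block.columns[lineNo % 4] += line;
        block.columns[lineNo % 4] += '\n';      // keep row boundaries
    }
    return block;
}

// "Test and apply the best scheme against each column": trial-compress the
// column with every candidate scheme and return the index of the smallest.
using Scheme = std::function<std::string(const std::string&)>;

std::size_t bestScheme(const std::string& column,
                       const std::vector<Scheme>& schemes) {
    std::size_t best = 0, bestSize = SIZE_MAX;
    for (std::size_t i = 0; i < schemes.size(); ++i) {
        std::size_t sz = schemes[i](column).size();
        if (sz < bestSize) { bestSize = sz; best = i; }
    }
    return best;
}
```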

Algorithms Optimized (Till Now)

Commonly used compression schemes: transformational compression schemes, substitutional compression schemes, and statistical model-based compression schemes.

Typical basic algorithms

LZ77, BWT, MTF, Huffman, Markov Transform (all GPU accelerated)

Novel compression algorithm for quality scores (FASTQ)

First-order Markov model (statistical scheme) combined with sorting the frequencies of character pairs (transformational scheme)

Four Schemes for Different Data

Generic compression methods? Not efficient enough.
Domain-specific methods? They work only on limited data formats.

Raw genomics data

Data with many similar strings (e.g. sequence IDs), data with a limited alphabet (e.g. DNA sequences), randomly distributed data (e.g. quality scores), text-like data, …

The GPU-accelerated building blocks are combined into four schemes: LZ77 + Huffman, BWT + MTF + Huffman, Markov transform + Huffman, and Huffman alone. The schemes are tested against each column for best performance.

Column-major compression is flexible for new file formats and extensible for new algorithms. But a problem still remains: the serial algorithms are too slow.

Optimization Techniques

Ø Data parallel: a simple but efficient scheme to parallelize MTF and its reverse (see the sketch below). The input data is split into blocks (Data Block 1 … Data Block n); each block is passed independently through the algorithm pipeline (Algorithm 1 … Algorithm k) to produce its compressed data, and through the reverse pipeline (Reverse algorithm k … Reverse algorithm 1) for decompression; the per-block outputs are then merged.
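A minimal CUDA sketch of the data-parallel MTF step just described, assuming fixed-size chunks with one thread per chunk; the kernel is hypothetical, not the authors' implementation, and a production version would tune chunk size and memory placement.

```
#include <cuda_runtime.h>
#include <stdint.h>

// Each thread runs a serial move-to-front transform over its own chunk,
// so all chunks are processed independently and in parallel.
// Launch with ceil(numChunks / threadsPerBlock) blocks.
__global__ void mtfChunks(const uint8_t* in, uint8_t* out,
                          size_t n, size_t chunkSize) {
    size_t tid   = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t begin = tid * chunkSize;
    if (begin >= n) return;
    size_t end = begin + chunkSize;
    if (end > n) end = n;

    uint8_t list[256];                    // per-thread MTF symbol list
    for (int i = 0; i < 256; ++i) list[i] = (uint8_t)i;

    for (size_t p = begin; p < end; ++p) {
        uint8_t c = in[p];
        int idx = 0;
        while (list[idx] != c) ++idx;     // find the symbol's current position
        out[p] = (uint8_t)idx;            // emit its index
        for (int j = idx; j > 0; --j)     // move the symbol to the front
            list[j] = list[j - 1];
        list[0] = c;
    }
}
```

The reverse MTF is parallelized the same way: each thread rebuilds its own symbol list and reconstructs its chunk independently, and the chunks are merged afterwards.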

Ø Increase the parallelism of selected algorithms
  • (Slightly) alternate implementations of the algorithms to reduce data dependency
Ø Optimize the implementation on GPU
  • Embrace state-of-the-art, high-performance libraries (e.g. b40c)
  • Better utilization of constant memory and shared memory

Parallel Huffman Encoding and Decoding

The Huffman tree is built serially from the input data as a single-side growing Huffman tree, which is memory efficient for decoding and gives a fixed relation between codeword and code length, enabling parallel Huffman decoding. From the tree, auxiliary tables of codewords (C0, C1, C2, …, Ck) and code lengths (L0, L1, L2, …, Lk) are generated and stored in constant memory; shared memory is also used.

Characters are encoded in parallel; a position array holds each character's bit offset in the encoded output.
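A minimal CUDA sketch of this parallel encoding step, assuming the codeword table (C0 … Ck) and code-length table (L0 … Lk) have already been built and copied to constant memory; the kernel and variable names are hypothetical, and a real implementation would compute the position array with a library prefix sum (e.g. thrust::exclusive_scan) between the two kernels.

```
#include <cuda_runtime.h>
#include <stdint.h>

__constant__ uint32_t d_code[256];   // codeword table, code in the low bits
__constant__ uint8_t  d_len[256];    // code-length table

// Pass 1: one thread per character looks up its code length.
__global__ void codeLengths(const uint8_t* in, uint32_t* len, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) len[i] = d_len[in[i]];
}

// (An exclusive prefix sum over len[] yields pos[], the bit offset of every
//  codeword -- the position array. 32-bit offsets suffice per block here.)

// Pass 2: one thread per character writes its codeword at its bit offset,
// LSB-first; atomicOr handles codewords that straddle 32-bit word boundaries.
// outBits must be zero-initialized before the launch.
__global__ void writeCodes(const uint8_t* in, const uint32_t* pos,
                           uint32_t* outBits, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n) return;
    uint32_t code  = d_code[in[i]];
    uint32_t bits  = d_len[in[i]];
    uint32_t off   = pos[i];
    size_t   word  = off / 32;
    uint32_t shift = off % 32;
    atomicOr(&outBits[word], code << shift);
    if (shift + bits > 32)                       // spills into the next word
        atomicOr(&outBits[word + 1], code >> (32 - shift));
}
```

Host code would fill d_code and d_len with cudaMemcpyToSymbol, run the prefix sum over len[] to obtain pos[], and zero-initialize outBits before launching the second kernel.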

Decoding: the encoded data is an r-bit string S = h1 h2 … hr. The string S is decoded in parallel, with d GPU threads (thread 1, thread 2, …, thread d) assigned per character, where d is the depth of the Huffman tree.

Markov Transform

Notation: a table entry (A, B) stores A, the frequency of a character pair, and B, the second character of the pair.

Quality scores are locally alike and have high data redundancy, yet they are hard to compress with generic methods. We propose the Markov transform: a lightweight solution with high parallelism.

Example string: 1 3 2 3 1 2 0 3 1 3 1 2 2 3 1 3 1 2 2 2 2

Step 1: use the adjacent characters of the input to form character pairs and count the frequency of each pair; on the GPU the pairs are counted in parallel using atomicAdd. Rows are indexed by the first character of the pair, and each entry (A, B) holds the frequency A of that pair and its second character B:

  row 0:  (0,0) (0,1) (0,2) (1,3)
  row 1:  (0,0) (0,1) (3,2) (3,3)
  row 2:  (1,0) (0,1) (4,2) (2,3)
  row 3:  (0,0) (5,1) (1,2) (0,3)
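A minimal CUDA sketch of this counting step, assuming the quality scores are already remapped to 0 … ALPHA-1; ALPHA and the kernel name are assumptions, not the authors' code.

```
#include <cuda_runtime.h>
#include <stdint.h>

#define ALPHA 64   // assumed alphabet size for quality scores

// One thread per adjacent pair (s[i], s[i+1]); concurrent updates to the
// same cell of the frequency table are serialized by atomicAdd.
__global__ void countPairs(const uint8_t* s, size_t n,
                           unsigned int* freq /* ALPHA x ALPHA, zeroed */) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i + 1 < n)
        atomicAdd(&freq[s[i] * ALPHA + s[i + 1]], 1u);
}
```

The table is laid out row-major, so the subsequent per-row sort (one thread block per row, as described next) reads contiguous memory.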

Step 2: sort each row of the table by frequency (one row is sorted by one thread block):

  row 0:  (1,3) (0,0) (0,1) (0,2)
  row 1:  (3,2) (3,3) (0,0) (0,1)
  row 2:  (4,2) (2,3) (1,0) (0,1)
  row 3:  (5,1) (1,2) (0,0) (0,3)

Step 3: keep only the second characters to form the lookup table:

  row 0:  3 0 1 2
  row 1:  2 3 0 1
  row 2:  2 3 0 1
  row 3:  1 2 0 3

Step 4: look up the table: use the previous character as the row index, search that row for the current character, and take its column index as the coding value. Each character can be processed in parallel (the first character is kept as-is).

Coded string: 1 1 1 1 0 0 2 0 0 1 0 0 0 1 0 1 0 0 0 0 0

Challenge for GPU Acceleration

Charts: compression ratio (%) and compression rate (MB/s) for a FASTQ file, for gzip, bzip2, lzip, and lzo at compression levels 1 to 9.

bzip2: good & stable compression rao, but low compression rate Data dependency leads to difficules for GPU acceleraon!

Patel, R. A., Zhang, Y., Mak, J., Davidson, A., & Owens, J. D. (2012). Parallel lossless data compression on the GPU (pp. 1-9). IEEE.

bzip2: Workflow

Bzip2-like compression method pipeline. Example data (a string of length N): a b a b a c a b a c

  Burrows-Wheeler Transform (the most time-consuming step!) → BW-transformed string of length N: c c b b b a a a a a
  Move-to-Front Transform → N-sized byte array of indices into the MTF list: 99 0 99 0 0 99 0 0 0 0
  Huffman Coding → compressed data, an M-sized bit string of encoded data: 1 0 1 0 0 1 0 0 0 0
  (Compression runs top to bottom; decompression runs the same stages in reverse.)

Increase Parallelism of BWT

String: a b a b a c a b a c

Burrows-Wheeler Transformation:

All rotations of the string are formed and sorted lexicographically; the BWT string is the last column of the sorted rotations (c c b b b a a a a a), and the original string ends up at Index 0 in the sorted order. This sorting is the most compute-intensive step.
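For reference, a minimal CPU-side sketch of this definition (sort all rotation start indices, then read off the last column); it reproduces the example above and is the serial baseline that the GPU prefix-doubling approach described next replaces.

```
#include <algorithm>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Reference BWT by sorting rotation start indices; returns the transformed
// string and the index of the original rotation ("Index 0" in the example).
std::pair<std::string, size_t> bwtBySorting(const std::string& s) {
    size_t n = s.size();
    std::vector<size_t> rot(n);
    std::iota(rot.begin(), rot.end(), 0);
    std::sort(rot.begin(), rot.end(), [&](size_t a, size_t b) {
        for (size_t k = 0; k < n; ++k) {          // compare rotations a and b
            char ca = s[(a + k) % n], cb = s[(b + k) % n];
            if (ca != cb) return ca < cb;
        }
        return false;
    });
    std::string out(n, ' ');
    size_t origIndex = 0;
    for (size_t i = 0; i < n; ++i) {
        out[i] = s[(rot[i] + n - 1) % n];         // last column of rotation i
        if (rot[i] == 0) origIndex = i;           // where the original string landed
    }
    return {out, origIndex};    // "ababacabac" -> {"ccbbbaaaaa", 0}
}
```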

On the GPU, the high-performance sorting library b40c is used to sort the rank array. Prefix doubling* builds rank arrays R1, R2, R4, R8, … with radix sorting, the rank array is transformed to the suffix array in parallel, and the result of the BWT is obtained in parallel. For S = a b a b a c a b a c (positions 0-9):

  R1: 0 1 0 1 0 2 0 1 0 2
  R2: 0 2 0 2 1 3 0 2 1 3
  R4: 0 3 1 4 2 5 1 4 2 5
  R8: 0 5 2 7 4 9 1 6 3 8
  SA: 0 6 2 8 4 1 7 3 9 5

* Sun, Weidong, and Zongmin Ma. "Parallel lexicographic names construction with CUDA." Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on. IEEE, 2009.

Improve Parallelism of BWT Reverse

BWT reverse: the backward reverse is serial in nature, recovering a b a b a c a b a c one character after another from a start position.

SA (suffix array): 0 6 2 8 4 1 7 3 9 5

BWT string: c c b b b a a a a a

Solution: more indices are stored during the BWT process so that the BWT reverse can be parallelized. Sorting plays its role again: sorting the BWT string c c b b b a a a a a gives a a a a a b b b c c together with the permutation 5 6 7 8 9 2 3 4 0 1 (panels (a)-(f) of the figure). Different threads then start the BWT reverse at different positions simultaneously, each recovering its own partial string (aba, cab, caba, abab, ababa, cabac, …) until the full string ababacabac is rebuilt.

Radix Sort on GPU

Radix sort processes k-bit keys d bits at a time, so the number of sorting rounds is r = k/d. In each round (round 1 … round r), the n keys are counted into 2^d histograms (count 1 … count 2^d) and scattered into buckets (Bucket 1 … Bucket 2^d). Total memory read and write: (2n + n) * r.

Radix sort as implemented by b40c is memory-bandwidth bound. Sorting is still the bottleneck!

Sample Sort on GPU*

Example with n = 18 keys split into Part 0 / Part 1 / Part 2:
  keys:            15 12 5 9 2 17 | 13 8 11 3 6 18 | 7 1 14 10 16 4
  select samples:  15 9 13 3 7 10
  sort samples:    3 7 9 10 13 15   (choose pivots)
  select pivots and scatter keys to the buckets:
    Bucket 0: 5 2 3 6 1 4    Bucket 1: 12 9 8 11 7 10    Bucket 2: 15 17 13 18 14 16
  locally sort each bucket (efficient utilization of shared memory):
    Bucket 0: 1 2 3 4 5 6    Bucket 1: 7 8 9 10 11 12    Bucket 2: 13 14 15 16 17 18

Memory read and write: 2n + 2n
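A short worked comparison of the two memory-traffic estimates, assuming 32-bit keys sorted 4 bits at a time (so r = 32/4 = 8 radix-sort passes; the digit width is an assumed implementation choice, not a figure from the slides):

  radix sort:   (2n + n) * r = 3n * 8 = 24n key reads/writes
  sample sort:   2n + 2n     = 4n     key reads/writes

Under this assumption, sample sort moves roughly 6x less data through global memory, which is why it helps when sorting is memory-bandwidth bound.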

* Leischner, N., Osipov, V., & Sanders, P. (2010, April). GPU sample sort. In Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on (pp. 1-10). IEEE.

Performance of GPU-Accelerated Algorithms

  Compression method   Transform   Reverse transform
  BWT                  4x          31x
  MTF                  8x          8x
  Markov               18x         N/A
  Huffman              2x          5x

The 31x for the BWT reverse transform comes from the improved parallelism of the BWT reverse; the 18x for the Markov transform reflects the intrinsic parallelism of the newly designed algorithm.

Test platform: CPU Intel Xeon E5630 @ 2.53 GHz, GPU Tesla M2050 / 3 GB, CUDA 4.0

Compression Performance of FASTQ File

  Compression method    Compression rate (MB/s)    Decompression rate (MB/s)    Compression
                          M2050        K20c           M2050        K20c          ratio (%)
  bzip2 (CPU)              8.24          -             26.64          -            24.31
  gzip (CPU)               8.45          -            114.23          -            29.46
  BWT+MTF+Huffman         16.04        21.95           73.90        83.66          31.06
  Markov+Huffman         115.60       204.86           77.23        90.87          36.68
  Huffman                179.83       215.85          128.39       142.16          60.50
  This work               77.80        97.26          124.37       127.78          24.77

Similar compression ratio to bzip2, with an 11.8x speedup for compression and a 4.8x speedup for decompression.

(Improvement possible with more work)

Compression Performance of SAM File

  Compression method    Compression rate (MB/s)    Decompression rate (MB/s)    Compression
                          M2050        K20c           M2050        K20c          ratio (%)
  bzip2 (CPU)              6.27          -             24.63          -            26.71
  gzip (CPU)               7.73          -            106.41          -            32.26
  BWT+MTF+Huffman         15.99        21.78           74.14        80.04          32.55
  Markov+Huffman         116.71       206.18           87.09        90.14          39.66
  Huffman                177.11       222.22          127.48       144.49          57.69
  This work               87.45        98.14          139.93       149.68          26.46

Similar compression ratio to bzip2, with a 15.6x speedup for compression and a 6.1x speedup for decompression.

(Improvement possible with more work)

Markov Transform for Quality Scores

  Compression method    Compression rate (MB/s)    Decompression rate (MB/s)    Compression
                          M2050        K20c           M2050        K20c          ratio (%)
  bzip2 (CPU)              8.93          -             22.61          -            38.47
  gzip (CPU)               9.83          -             93.59          -            42.35
  BWT+MTF+Huffman          9.14        12.38           57.01        75.18          43.86
  Huffman                185.32       231.16          121.66       129.28          49.42
  Markov+Huffman          97.75       176.65           83.67        88.10          42.01

Comparison to Domain-Specific Methods

  Compression method               Compression rate (MB/s)   Decompression rate (MB/s)   Compression ratio (%)
  gzip                               12.2                       45.4                       35.35
  bzip2                               7.0                       13.0                       29.05
  SCALCE                              7.8                       13.1                       25.72
  DSRC                               13.5                       32.2                       24.77
  quip                                8.3                       10.9                       22.19
  fastqz                              4.6                        3.8                       21.95
  fqzcomp                             8.2                        8.3                       21.72
  SeqSqueeze1                         0.6                        0.6                       21.87
  Column major block compression    111.0                      104.4                       29.46

Conclusion

• We presented a GPU-accelerated adaptive compression framework for genomics data, which works very well
  – Column-major compression
  – A novel algorithm for data like quality scores
  – Generic and extensible
• Compression on GPU is not easy; sorting is still a bottleneck

Contact: [email protected]