Parallel lossless compression using GPUs
Eva Sitaridi* Rene Mueller Tim Kaldewey
Columbia University IBM Almaden IBM Almaden
[email protected] [email protected] [email protected]
*Work done while interning at IBM Almaden, partially funded by NSF Grant IIS-1218222

Agenda
• Introduction
• Overview of compression algorithms
• GPU implementation
  – LZSS compression
  – Huffman coding
• Experimental results
• Conclusions
Why compression?
• Data volume doubles every 2 years*
  – Data retained for longer periods
  – Data retained for business analytics
• Make better utilization of available storage resources
  – Increase storage capacity
  – Improve backup performance
  – Reduce bandwidth utilization
• Compression should be seamless
• Decompression important for Big Data workloads
*Sybase Adaptive Server Enterprise Data Compression, business white paper, 2012

Compression trade-offs
[Figure: trade-off triangle over the input file: compression ratio, compression speed, decompression speed]
• Resources
  – Memory bandwidth
  – Memory space
  – CPU utilization
  (More important in some cases!)
• Trade-offs
  – Compression speed vs compression efficiency
  – Decompression speed vs compression efficiency
  – Compression speed vs decompression speed

Compression resource intensive
[Figure: compression bandwidth (GB/s, log scale from 0.001 to 1) vs compression efficiency (0 to 0.4) for pigz, lzma, and xz; a compression efficiency of 0.5 means the compressed file is half the original size]
• Default compression level used
• Performance on Intel i7-3930K (6 cores, 3.2 GHz)
• Dataset: English Wikipedia pages, 1 GB XML text dump
Compression libraries
• Deflate format (e.g., gzip)
  – LZ77 compression
  – Huffman coding
  – Single threaded
• snappy
• pigz (parallel gzip)
• xz
• All use LZ-variants

LZSS compression
Input characters (positions 0, 1, 2, 3, …):  ATTACTAGAATGTTACTAATCTGAT…
Output tokens:                               ATTACTAGAATGT(2,5)…
• Literals: unmatched characters, copied through unchanged
• Backreferences: (position, length) pairs pointing back into the sliding window buffer; here (2,5) encodes "TACTA"
• For each position, find the longest match between the sliding window buffer and the unencoded lookahead characters (sketched below)
• Matches shorter than the minimum match length are emitted as literals
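A minimal sequential sketch of the longest-match search above; the function name, buffer layout, and the minimum match length of 3 are illustrative assumptions, not the exact encoder.

    // window/win_len: previously encoded characters;
    // look/look_len:  unencoded lookahead characters.
    // Returns the match length (0 if below MIN_MATCH);
    // *pos receives the match's starting position in the window.
    #define MIN_MATCH 3   // illustrative choice

    int longest_match(const char *window, int win_len,
                      const char *look, int look_len, int *pos)
    {
        int best = 0;
        *pos = 0;
        for (int i = 0; i < win_len; i++) {
            int len = 0;
            while (len < look_len && i + len < win_len &&
                   window[i + len] == look[len])
                len++;
            if (len > best) { best = len; *pos = i; }
        }
        return best >= MIN_MATCH ? best : 0;  // short matches stay literals
    }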
LZSS decompression
Window buffer contents: WIKIPEDIA.CO
Input tokens:           (0,4)M(5,4)COMM…
Output:                 WIKIMEDIACOMM…
• Backreferences are resolved by copying from the window buffer; literals pass through to the output data block
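A sketch of the sequential decode loop implied by the example; the token struct is an illustrative in-memory form (assumed here), not the on-disk format.

    typedef struct { int is_ref; char lit; int pos; int len; } Token;

    int lzss_decode(const Token *t, int ntok, char *out)
    {
        int o = 0;
        for (int i = 0; i < ntok; i++) {
            if (!t[i].is_ref) {
                out[o++] = t[i].lit;               // literal passes through
            } else {
                for (int j = 0; j < t[i].len; j++) // backreference: copy from
                    out[o++] = out[t[i].pos + j];  // earlier output, byte-wise
            }                                      // so overlapping copies work
        }
        return o;                                  // uncompressed size
    }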
Huffman algorithm
• Huffman tree
  – Leaves: encoded symbols
  – Unique prefix for each character
  [Figure: example tree with internal node weights 13, 6, 7, 3, 4 and leaves 'a', 'f', 's', 'e', 'h', 'r'; left edges labeled 0, right edges 1]
• Huffman coding
  – Short codes for frequent characters
• Huffman decoding
  A) Traverse tree to decode
  B) Use look-up tables for faster decoding (sketched below)
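A hedged sketch of option B, table-based decoding: a table indexed by the next LUT_BITS bits returns the symbol and its true code length in one step. LUT_BITS, the entry layout, and the assumption that no code exceeds LUT_BITS bits (with zero-padded input) are illustrative.

    #define LUT_BITS 8   // illustrative table width

    typedef struct { unsigned char sym; unsigned char len; } LutEntry;

    // lut[idx] for every LUT_BITS-bit prefix idx: the symbol whose code
    // starts the prefix, and that code's true length.
    int huff_decode(const unsigned char *bits, long nbits,
                    const LutEntry lut[1 << LUT_BITS], unsigned char *out)
    {
        long bitpos = 0;
        int n = 0;
        while (bitpos < nbits) {
            unsigned idx = 0;                     // peek next LUT_BITS bits
            for (int b = 0; b < LUT_BITS; b++)
                idx = (idx << 1) |
                      ((bits[(bitpos + b) >> 3] >> (7 - ((bitpos + b) & 7))) & 1);
            out[n++] = lut[idx].sym;              // one lookup replaces a
            bitpos  += lut[idx].len;              // whole root-to-leaf walk
        }
        return n;
    }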
What to accelerate?
[Figure: profile of gzip on Intel i7-3930K; input: compressible database column. LZSS longest match: 87.2%; LZSS other: 4.9%; Huffman send bits, Huffman compress block, Huffman count tally, and CRC update: 1.4-1.9% each]
• >85% of time spent on string matching
• Accelerate LZSS first

Why GPUs?
• LZSS string matching is memory bandwidth intensive
  – Leverage GPU bandwidth
                              Intel i7-3930K   Tesla K20X
Memory bandwidth (spec)       51.2 GB/s        250 GB/s
Memory bandwidth (measured)   40.4 GB/s        197 GB/s
#Cores                        6                2688
How to parallelize compression/decompression?
• >1000 cores available!
• Split the input file into independent data blocks
• Naïve approach: each thread processes an independent data/file block
[Figure: input file split into data blocks 1, 2, …, one thread per block]

Memory access pattern
[Figure: actual vs optimal GPU memory access pattern for threads T1-T3 over data blocks 1-3]
• Actual pattern: with data block sizes >32K, each thread of a warp touches a different cache line, so many cache lines are loaded per access
  – Low memory bandwidth
• Optimal GPU pattern: the memory accesses of the threads in a warp fall into the same cache line (coalesced access)
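The difference as a minimal CUDA sketch (kernel names are illustrative, and launches must be sized so the indexing stays in bounds):

    // Naive pattern: thread t starts at its own data block, so one warp's
    // 32 loads touch 32 different cache lines (low effective bandwidth).
    __global__ void strided_read(const char *in, char *out, int block_size)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        out[t] = in[t * block_size];   // one distinct cache line per thread
    }

    // Coalesced pattern: adjacent threads load adjacent bytes, so a warp's
    // loads fall into the same cache line(s).
    __global__ void coalesced_read(const char *in, char *out)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        out[t] = in[t];                // consecutive addresses per warp
    }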
Thread utilization
• SIMT architecture: group execution; each thread T1-T6 matches within its own data block 1-6:

    i = thread id
    j = 0
    …
    while (window[i] == lookahead[j]) {
        j++;
        …
    }

• Different #iterations for each thread:
  – Iteration 1: 6 active threads
  – Iteration 2: 4 active threads
  – Iteration 3: 1 active thread
• Thread utilization: (6+4+1)/(3*6) = 11/18 ≈ 61%

GPU LZSS
• Better approach for general compression: each data block is processed by a thread group
[Figure: input file split into data blocks 1..n; thread group i compresses block i into an intermediate output; a compact step produces the output file]
• Compact the intermediate per-block outputs into the output file
• Store the list of compressed data block offsets → enables parallel decompression (see the prefix-sum sketch below)
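The compact step maps naturally onto an exclusive prefix sum over the per-block compressed sizes; a hedged sketch using Thrust (function and variable names are illustrative, not the implementation used here):

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // sizes[i] = compressed size of data block i (one thread group each);
    // offsets[i] = where block i starts in the compacted output file.
    // The offset list is stored with the file so decompression can later
    // assign blocks to thread groups independently.
    void compute_block_offsets(const thrust::device_vector<int> &sizes,
                               thrust::device_vector<int> &offsets)
    {
        offsets.resize(sizes.size());
        thrust::exclusive_scan(sizes.begin(), sizes.end(), offsets.begin());
        // A copy kernel (not shown) then moves each block's intermediate
        // output to output_file + offsets[i], producing a contiguous file.
    }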
Compression efficiency vs Compression performance
[Figure: GPU LZSS* compression bandwidth vs window size (lookahead: 66 chars, block size: 64K chars): performance drops faster as the window grows, with no gain in compression efficiency]
*Related papers: A. Ozsoy and M. Swany, "CULZSS: LZSS Lossless Data Compression on CUDA"; A. Balevic, "Parallel Variable-Length Encoding on GPGPUs"
GPU LZSS decompression
Compressed input:    CCGA(0,2)CGG(4,3)AGTT
Uncompressed output: CCGACCCGGCCCAGTT
1) Compute total size of tokens (serialized)
2) Read tokens (parallel)
3.1) Compute uncompressed output
3.2) Write uncompressed output
• Problem: backreferences processed in parallel might be dependent!
  – Use the voting function __ballot to detect conflicts (see the sketch after the cases below)

Writing LZSS tokens to output
Case A: all literals
  Tokens: CCGAGATTGAGTT
  1) Write literals (parallel)
Case B: literals & non-conflicting backreferences
  Tokens: CCGA(0,2)CGG(0,3)AGTT
  1) Write literals (parallel)
  2) Write backreferences (parallel)
Case C: literals & conflicting backreferences
  Tokens: CCGA(0,2)CGG(4,3)AGTT
  1) Write literals (parallel)
  2) Write non-conflicting backreferences (parallel)
  3) Write remaining backreferences (serial)
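A hedged sketch of the warp-vote conflict check: each thread owns one backreference, and a reference conflicts when it reads bytes that tokens of the same parallel batch are still writing. The conflict condition and all names are illustrative assumptions, not the exact kernel.

    // batch_start = first output position written by this parallel batch.
    __device__ void write_backref(char *out, int src, int dst, int len,
                                  int batch_start)
    {
        int conflict = (src + len > batch_start); // reads in-flight output?
        unsigned mask = __ballot(conflict);  // __ballot_sync(0xffffffff, ...)
                                             // on CUDA 9 and later
        if (!conflict)                       // step 2: parallel backreferences
            for (int j = 0; j < len; j++)
                out[dst + j] = out[src + j];
        // Step 3: tokens flagged in `mask` are replayed in token order by a
        // single thread after the parallel writes complete (not shown).
    }

In the Case C example, backreference writes start at output position 4, so (0,2) (reading positions 0-1) is safe, while (4,3) reads position 4, which (0,2) is writing in the same batch, and its bit is set in the ballot mask.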
Huffman entropy coding
• Inherently sequential
• Coding challenge: compute the destination of each piece of encoded data
• Decoding challenge: determine codeword boundaries
• Focus on decoding for end-to-end decompression
Parallel Huffman decoding
[Figure: file block of encoded bits split into sub-blocks, with stored offsets 1-4 marking sub-block boundaries]
• During coding
  – Split data blocks into sub-blocks
  – Store sub-block offsets → parallel sub-block decoding (see the kernel sketch below)
• During decoding
  – Use look-up tables for decoding rather than Huffman trees
  – Fit the look-up table in shared memory
  – Reduce the number of codes for length and distance
• Trade compression efficiency for decompression speed
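A hedged sketch of the decoding side: one thread per sub-block starts at its stored bit offset and decodes independently with the shared-memory look-up table. LutEntry and LUT_BITS are reused from the earlier table-decoding sketch; the kernel structure and parameter names are illustrative.

    // bit_offset has nsub+1 entries (the last marks the block end);
    // out_offset gives each sub-block's start in the uncompressed output.
    __global__ void huff_decode_subblocks(const unsigned char *bits,
                                          const long *bit_offset,
                                          const int *out_offset,
                                          const LutEntry *lut_global,
                                          unsigned char *out, int nsub)
    {
        __shared__ LutEntry lut[1 << LUT_BITS];  // table fits in shared memory
        for (int i = threadIdx.x; i < (1 << LUT_BITS); i += blockDim.x)
            lut[i] = lut_global[i];              // cooperative table load
        __syncthreads();

        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= nsub) return;

        long bitpos = bit_offset[s];             // this sub-block's start
        long end    = bit_offset[s + 1];
        int  o      = out_offset[s];
        while (bitpos < end) {
            unsigned idx = 0;                    // peek next LUT_BITS bits
            for (int b = 0; b < LUT_BITS; b++)
                idx = (idx << 1) |
                      ((bits[(bitpos + b) >> 3] >> (7 - ((bitpos + b) & 7))) & 1);
            out[o++]  = lut[idx].sym;            // emit decoded symbol
            bitpos   += lut[idx].len;            // advance by true code length
        }
    }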
28 Experimental system Linux, kernel 3.0.74
Intel i7-3930K Tesla K20x
Memory bandwidth 51.2 GB/s 250 GB/s (Spec) Memory bandwidth 40.4 GB/s 197 GB/s (Measured) Memory capacity 64 GB 6 GB
#Cores 6 (12 threads) 2688 Clock frequency 3.2 GHz 0.732 GHz
Datasets
Dataset             Size     Compression efficiency*
English Wikipedia   1 GB     0.35
Database column     245 MB   0.98
• Datasets already loaded in memory
• No disk I/O
*For the default gzip parameters

Decompression performance
[Figure: decompression bandwidth of the GPU implementation vs CPU libraries]
• Data transfers slow down performance
Hide GPU-CPU transfer I/O using CUDA Streams
• Batch processing (single stream, time →):
    Read B1 → Decode B1 → Decompress B1 → Write B1 → Read B2 → …
• Pipeline PCI/E transfers (one stream per block, time →):
    Stream 1: Read B1 → Decode B1 → Decompress B1 → Write B1
    Stream 2:           Read B2 → Decode B2 → Decompress B2 → …
    Stream 3:                     Read B3 → Decode B3 → …
• Pipeline PCI/E transfers & concurrent kernel execution (time →):
    Stream 1: Read B1 → Decode B1 → Decompress B1 → Write B1
    Stream 2:       Read B2 → Decode B2 → Decompress B2 → Write B2
    Stream 3:             Read B3 → Decode B3 → Decompress B3 → Write B3
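A hedged sketch of the 4-stage pipeline (Read, Decode, Decompress, Write): each block is issued asynchronously on one of a small pool of streams, so block i+1's transfers overlap block i's kernels. The kernel names stand in for the Huffman-decode and LZSS-decompress stages; buffer management and launch dimensions are illustrative.

    #include <cuda_runtime.h>

    // Placeholder stage kernels, assumed defined elsewhere.
    __global__ void huffman_decode(const char *in, char *tokens);
    __global__ void lzss_decompress(const char *tokens, char *out);

    #define NSTREAMS 4   // illustrative pipeline depth

    // h_in/h_out must be pinned (cudaMallocHost) for async copies.
    // d_in/d_tok/d_out are NSTREAMS pre-allocated device buffers; reusing
    // buffer k only on stream k keeps its operations correctly ordered.
    void decompress_pipelined(const char *h_in, char *h_out, int nblocks,
                              size_t in_sz, size_t out_sz,
                              char **d_in, char **d_tok, char **d_out)
    {
        cudaStream_t s[NSTREAMS];
        for (int i = 0; i < NSTREAMS; i++) cudaStreamCreate(&s[i]);

        for (int b = 0; b < nblocks; b++) {
            int k = b % NSTREAMS;
            cudaMemcpyAsync(d_in[k], h_in + b * in_sz, in_sz,
                            cudaMemcpyHostToDevice, s[k]);            // Read B
            huffman_decode<<<64, 256, 0, s[k]>>>(d_in[k], d_tok[k]);  // Decode B
            lzss_decompress<<<64, 256, 0, s[k]>>>(d_tok[k], d_out[k]); // Decompress B
            cudaMemcpyAsync(h_out + b * out_sz, d_out[k], out_sz,
                            cudaMemcpyDeviceToHost, s[k]);            // Write B
        }
        cudaDeviceSynchronize();
        for (int i = 0; i < NSTREAMS; i++) cudaStreamDestroy(s[i]);
    }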
Decompression performance
[Figure: decompression bandwidth with the stream pipeline enabled]
• Data transfer latency hidden

Decompression time breakdown
[Figure: decompression time split between Huffman % and LZSS % for English Wikipedia and the database column]
• LZSS faster for incompressible datasets
Decompression performance vs Compression efficiency
[Figure: English Wikipedia; decompression bandwidth (GB/s, log scale 0.01 to 10) vs compression efficiency (0 to 0.5) for GPU Deflate (with PCI/E transfer), pigz, gzip, lzma, xz, and bzip2]

Conclusions
• Decompression
  – Hide GPU-CPU latency using 4-stage pipelining
  – LZSS faster for incompressible files
• Compression
  – Reduce search time (using hash tables?) (see the sketch below)
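One way the hash-table idea could look, sketched after zlib's hash chains; purely illustrative and not part of this work. The sizes and the hash function are assumptions.

    #define WINDOW_SIZE (1 << 15)
    #define HASH_BITS 15
    #define HASH(p) ((((p)[0] << 10) ^ ((p)[1] << 5) ^ (p)[2]) \
                     & ((1 << HASH_BITS) - 1))

    int head[1 << HASH_BITS];   // newest position per hash, -1 when empty
    int prev[WINDOW_SIZE];      // previous position with the same hash

    // Called once per encoded position: link it into its hash chain.
    void insert_pos(const char *buf, int pos)
    {
        int h = HASH(buf + pos);
        prev[pos % WINDOW_SIZE] = head[h];
        head[h] = pos;
    }
    // The match loop then tries only head[h], prev[head[h] % WINDOW_SIZE],
    // and so on, instead of every window position, shrinking the candidate
    // set for the dominant longest-match step.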
Questions?