Parallel Lossless Compression Using GPUs
Parallel lossless compression using GPUs

Eva Sitaridi* (Columbia University), Rene Mueller (IBM Almaden), Tim Kaldewey (IBM Almaden)
*Work done while interning at IBM Almaden, partially funded by NSF Grant IIS-1218222

Agenda
• Introduction
• Overview of compression algorithms
• GPU implementation
– LZSS compression
– Huffman coding
• Experimental results
• Conclusions

Why compression?
• Data volume doubles every 2 years*
– Data is retained for longer periods
– Data is retained for business analytics
• Make better use of available storage resources
– Increase storage capacity
– Improve backup performance
– Reduce bandwidth utilization
• Compression should be seamless
• Decompression is important for Big Data workloads
*Sybase Adaptive Server Enterprise Data Compression, business white paper, 2012

Compression trade-offs
• Compression ratio: size of the compressed file relative to the initial input file
• Compression speed and decompression speed
• Resources: memory bandwidth, memory space, CPU utilization (which one matters most depends on the use case!)
• Three trade-offs follow: compression speed vs. compression efficiency, decompression speed vs. compression efficiency, and compression speed vs. decompression speed

Compression is resource intensive
[Figure: compression bandwidth (GB/s, log scale from 0.001 to 1) vs. compression efficiency (0 to 0.4) for pigz, gzip, bzip2, lzma, and xz. A compression efficiency of 0.5 means the compressed file is half the original size. Dataset: English Wikipedia pages, 1 GB XML text dump; default compression levels; Intel i7-3930K (6 cores, 3.2 GHz).]

Compression libraries
• Deflate format (gzip): LZ77 compression + Huffman coding, single threaded
• snappy
• pigz (parallel gzip)
• XZ
• All use LZ variants

LZSS compression
• Input characters are encoded as output tokens: literals (unmatched characters) and backreferences (position, length)
• A minimum match length must be reached before a backreference is emitted
• Example: the input ATTACTAGAATGTTACTAATCTGAT… is encoded as ATTACTAGAATGT(2,5)…, where (2,5) points back to the 5-character match TACTA starting at position 2
• To find the longest match, the encoder compares the unencoded lookahead characters against a sliding window buffer of previously seen characters (a CUDA sketch of this search follows below)

LZSS decompression
• Tokens are expanded against the window buffer
• Example: with window buffer contents WIKIPEDIA.CO, the input tokens (0,4)M(5,4)COMM… decode to the output WIKIMEDIACOMM…
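To make the match search concrete, here is a minimal CUDA sketch of a brute-force longest-match kernel. The deck does not show its kernel; the function and parameter names, the window size, and the one-thread-per-window-position layout are assumptions of this sketch (the 66-character lookahead appears later in the deck).

    // Minimal sketch, not the authors' kernel: brute-force longest-match
    // search for LZSS. Sizes and names are illustrative assumptions.
    #include <cuda_runtime.h>

    #define WINDOW_SIZE 4096   // sliding window of previously seen characters
    #define MIN_MATCH   3      // shortest match worth a backreference

    // One thread per candidate start position in the window. Each thread
    // records the length of the match it found; a separate reduction (not
    // shown) then selects the longest one and emits (position, length).
    __global__ void lzss_longest_match(const unsigned char *window,
                                       const unsigned char *lookahead,
                                       int lookahead_len,   // e.g. 66, as in the deck
                                       int *match_len)      // WINDOW_SIZE entries
    {
        int pos = blockIdx.x * blockDim.x + threadIdx.x;
        if (pos >= WINDOW_SIZE) return;

        int len = 0;
        // Compare window[pos..] with the lookahead until the first mismatch.
        // Threads of a warp leave this loop at different iterations, which is
        // exactly the SIMT-utilization problem analyzed later in the deck.
        while (len < lookahead_len &&
               pos + len < WINDOW_SIZE &&
               window[pos + len] == lookahead[len]) {
            ++len;
        }
        match_len[pos] = (len >= MIN_MATCH) ? len : 0;
    }

A host-side or warp-level reduction over match_len then picks the winning (position, length) pair; matches shorter than MIN_MATCH are emitted as literals instead.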
Huffman algorithm
• Huffman tree
– Leaves: the encoded symbols
– Unique prefix code for each character
• Huffman coding: short codes for frequent characters
• Huffman decoding, two options:
A) Traverse the tree to decode
B) Use look-up tables for faster decoding
[Figure: example Huffman tree over the symbols 's', 'e', 'h', 'r', 'a', 'f'; each left edge appends a 0 and each right edge a 1 to the code.]

What to accelerate?
• Profile of gzip on an Intel i7-3930K; input: compressible database column
• The LZSS longest-match search dominates at 87.2%; the remaining components (LZSS other, Huffman send bits, update CRC, Huffman compress block, Huffman count tally) each take roughly 1 to 5%
• >85% of the time is spent on string matching → accelerate LZSS first

Why GPUs?
• LZSS string matching is memory-bandwidth intensive; leverage GPU bandwidth

                              Intel i7-3930K   Tesla K20x
  Memory bandwidth (spec)     51.2 GB/s        250 GB/s
  Memory bandwidth (measured) 40.4 GB/s        197 GB/s
  #Cores                      6                2688

How to parallelize compression/decompression?
• >1000 cores available!
• Naïve approach: split the input file into independent data blocks and let each thread process one block (thread 1 → data block 1, thread 2 → data block 2, …)

Memory access pattern
• With one data block per thread and block sizes above 32K, consecutive threads access addresses that lie in different cache lines, so many cache lines are loaded and memory bandwidth utilization is low
• The optimal GPU access pattern has the memory accesses of consecutive threads fall into the same cache line

Thread utilization
• SIMT architecture: threads execute in groups, in lockstep
• Match loop, one data block per thread:

    i = thread id; j = 0;
    …
    while (window[i] == lookahead[j]) {
        j++;
        …
    }

• Each thread needs a different number of iterations, so the group runs until its slowest thread finishes: with 6 threads, 6 are active in iteration 1, 4 in iteration 2, and only 1 in iteration 3
• (6+4+1)/(3*6) = 11/18 = 61% thread utilization

GPU LZSS
• Better approach: each data block is processed by a thread group, not a single thread
• Compact the per-block intermediate outputs into the output file and store a list of the compressed data block offsets → enables parallel decompression

Compression efficiency vs. compression performance
• GPU LZSS with a 66-character lookahead and 64K-character data blocks: increasing the window size makes performance drop quickly while compression efficiency does not improve
• Related papers: A. Ozsoy and M. Swany, "CULZSS: LZSS Lossless Data Compression on CUDA"; A. Balevic, "Parallel Variable-Length Encoding on GPGPUs"

GPU LZSS decompression
1) Compute the total size of the tokens (serialized)
2) Read the tokens (parallel)
3.1) Compute the uncompressed output size
3.2) Write the uncompressed output
• Example: the compressed input CCGA(0,2)CGG(4,3)AGTT expands to the uncompressed output CCGACCCGGCCCAGTT
• Problem: backreferences processed in parallel might be dependent!
• Use the voting function __ballot to detect conflicts (a sketch follows below)
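A minimal sketch of the warp-vote idea, not the authors' code: each lane holds one token and tests whether its backreference reads a range that another token of the same warp is about to write. The Token layout, the warp-local scope, and the pairwise overlap test are assumptions; the deck names __ballot, which is the Kepler-era form of today's __ballot_sync.

    // Minimal sketch (not the authors' code): detect, within one warp, which
    // backreferences depend on output written by other tokens of the batch.
    #include <cuda_runtime.h>

    struct Token {
        int dst;        // output offset this token writes to
        int len;        // number of characters it produces
        int src;        // window offset it reads from (backreferences only)
        int is_backref; // 0 => literal run (written in an earlier parallel step)
    };

    // Returns a 32-bit mask, one bit per lane: a set bit means that lane's
    // backreference reads bytes another backreference in this warp writes,
    // so it must be deferred to the serial pass (Case C on the slide below).
    __device__ unsigned backref_conflicts(Token t)
    {
        const unsigned full = 0xffffffffu;   // assumes all 32 lanes are active
        int lane_id = threadIdx.x & 31;
        bool conflict = false;
        // Every lane joins the shuffles (required by the *_sync intrinsics),
        // even lanes holding literals, which can never conflict themselves.
        for (int lane = 0; lane < 32; ++lane) {
            int other_dst  = __shfl_sync(full, t.dst, lane);
            int other_len  = __shfl_sync(full, t.len, lane);
            int other_bref = __shfl_sync(full, t.is_backref, lane);
            bool overlap = t.src < other_dst + other_len &&
                           other_dst < t.src + t.len;
            // Literal destinations are already materialized in step 1, so
            // only ranges written by backreferences of other lanes count.
            if (t.is_backref && other_bref && lane != lane_id && overlap)
                conflict = true;
        }
        return __ballot_sync(full, conflict);  // __ballot(conflict) on Kepler
    }

Lanes whose bit is clear can expand their tokens in parallel; the set bits are handled serially afterwards, matching the three cases on the slide that follows.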
Writing LZSS tokens to output
• Case A: all literals, e.g. CCGAGATTGAGTT
1) Write the literals (parallel)
• Case B: literals and non-conflicting backreferences, e.g. CCGA(0,2)CGG(0,3)AGTT
1) Write the literals (parallel)
2) Write the backreferences (parallel)
• Case C: literals and conflicting backreferences, e.g. CCGA(0,2)CGG(4,3)AGTT
1) Write the literals (parallel)
2) Write the non-conflicting backreferences (parallel)
3) Write the remaining backreferences (serial)

Huffman entropy coding
• Inherently sequential
• Coding challenge: compute the destination of the encoded data
• Decoding challenge: determine the codeword boundaries
• Focus on decoding for end-to-end decompression

Parallel Huffman decoding
• During coding
– Split data blocks into sub-blocks
– Store the sub-block offsets → the sub-blocks can be decoded in parallel
• During decoding
– Use look-up tables for decoding rather than Huffman trees
– Fit the look-up table in shared memory
– Reduce the number of codes for length and distance
• Trade compression efficiency for decompression speed (a sketch follows below)
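A minimal sketch of sub-block decoding with a shared-memory look-up table, not the authors' kernel: all names, the 9-bit table, and the literal-only code space are assumptions of this sketch. A real Deflate decoder also resolves length and distance codes, which the deck reduces in number precisely so the table fits in shared memory.

    // Minimal sketch (not the authors' kernel): table-based Huffman decoding
    // of independent sub-blocks. Bit order is LSB-first, as in Deflate.
    #include <cuda_runtime.h>
    #include <stdint.h>

    #define LUT_BITS 9                      // maximum code length covered
    #define LUT_SIZE (1 << LUT_BITS)

    // One entry per 9-bit pattern; codes shorter than 9 bits occupy every
    // index that starts with their bit pattern.
    struct LutEntry { uint8_t symbol; uint8_t nbits; };

    // Peek LUT_BITS bits at an arbitrary bit position (input padded >= 2 bytes).
    __device__ uint32_t peek_bits(const uint8_t *in, uint64_t bitpos)
    {
        uint64_t byte = bitpos >> 3;
        uint32_t w = (uint32_t)in[byte] |
                     ((uint32_t)in[byte + 1] << 8) |
                     ((uint32_t)in[byte + 2] << 16);
        return (w >> (bitpos & 7)) & (LUT_SIZE - 1);
    }

    __global__ void huffman_decode_subblocks(const uint8_t  *in,
                                             const uint64_t *sub_bit_off, // stored during coding
                                             const uint32_t *sub_out_off,
                                             const uint32_t *sub_nsyms,
                                             int             n_subblocks,
                                             const LutEntry *lut_global,
                                             uint8_t        *out)
    {
        // Stage the table in shared memory: it is hit once per decoded symbol.
        __shared__ LutEntry lut[LUT_SIZE];
        for (int i = threadIdx.x; i < LUT_SIZE; i += blockDim.x)
            lut[i] = lut_global[i];
        __syncthreads();

        int sb = blockIdx.x * blockDim.x + threadIdx.x; // one sub-block per thread
        if (sb >= n_subblocks) return;

        uint64_t bitpos = sub_bit_off[sb]; // codeword boundary saved at coding time
        uint32_t opos   = sub_out_off[sb];
        for (uint32_t s = 0; s < sub_nsyms[sb]; ++s) {
            LutEntry e = lut[peek_bits(in, bitpos)];
            out[opos++] = e.symbol;        // emit the decoded symbol
            bitpos     += e.nbits;         // consume only the code's real length
        }
    }

The stored sub-block bit offsets are what make each thread's starting codeword boundary known; without them, a codeword boundary cannot be found without first decoding everything that precedes it.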
Experimental system
• Linux, kernel 3.0.74

                              Intel i7-3930K   Tesla K20x
  Memory bandwidth (spec)     51.2 GB/s        250 GB/s
  Memory bandwidth (measured) 40.4 GB/s        197 GB/s
  Memory capacity             64 GB            6 GB
  #Cores                      6 (12 threads)   2688
  Clock frequency             3.2 GHz          0.732 GHz

Datasets

  Dataset             Size     Comp. efficiency*
  English Wikipedia   1 GB     0.35
  Database column     245 MB   0.98

• Datasets already loaded in memory; no disk I/O
*For the default parameter of gzip

Decompression performance
• [Figure: GPU decompression bandwidth on both datasets.] The GPU-to-CPU data transfers slow down performance.

Hide the GPU-to-CPU transfer I/O using CUDA streams
• Batch processing runs Read B1, Decode B1, Decompress B1, Write B1, Read B2, … strictly in sequence in a single stream
• Pipelining the PCI/E transfers over several streams overlaps the Read of one block with the Decode and Decompress of earlier blocks
• Pipelining the PCI/E transfers plus concurrent kernel execution overlaps all four stages: while B1 is written back, B2 is decompressed and B3 is read and decoded

Decompression performance
• [Figure: GPU decompression bandwidth with the 4-stage pipeline.] The data transfer latency is hidden.

Decompression time breakdown
• [Figure: share of decompression time spent in Huffman decoding vs. LZSS expansion, for English Wikipedia and for the database column.]
• LZSS is faster for incompressible datasets

Decompression performance vs. compression efficiency
• [Figure: decompression bandwidth (GB/s, log scale from 0.01 to 10) vs. compression efficiency (0 to 0.5) on English Wikipedia for GPU Deflate (including the PCI/E transfer), pigz, gzip, lzma, xz, and bzip2; GPU Deflate is the fastest.]

Conclusions
• Decompression
– Hide the GPU-CPU latency using 4-stage pipelining (sketched below)
– LZSS is faster for incompressible files
• Compression
– Reduce the search time (using hash tables?)

Questions?
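To make the 4-stage pipelining named in the conclusions concrete, here is a minimal host-side sketch of how the Read, Decode, Decompress, and Write stages map onto CUDA streams. It is not the authors' implementation: the kernels are stand-in stubs, and the fixed per-block sizes, stream count, and launch geometry are assumptions.

    // Minimal sketch (not the authors' code) of the 4-stage pipeline over
    // CUDA streams. Kernels are stubs; sizes and geometry are assumptions.
    #include <cuda_runtime.h>
    #include <stdint.h>
    #include <stddef.h>

    // Stand-ins for the real Huffman-decode and LZSS-expand kernels sketched
    // earlier; each just copies its input through with a grid-stride loop.
    __global__ void decode_stub(const uint8_t *in, uint8_t *tok, int n)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            tok[i] = in[i];
    }
    __global__ void expand_stub(const uint8_t *tok, uint8_t *out, int n)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            out[i] = tok[i];
    }

    // h_in/h_out must be pinned (cudaMallocHost) for the copies to overlap.
    void decompress_pipelined(const uint8_t *h_in, size_t in_bytes,
                              uint8_t *h_out, size_t out_bytes, int nblocks)
    {
        const int NSTREAMS = 4;
        cudaStream_t stream[NSTREAMS];
        uint8_t *d_in[NSTREAMS], *d_tok[NSTREAMS], *d_out[NSTREAMS];
        for (int s = 0; s < NSTREAMS; ++s) {
            cudaStreamCreate(&stream[s]);
            cudaMalloc(&d_in[s], in_bytes);
            cudaMalloc(&d_tok[s], out_bytes);
            cudaMalloc(&d_out[s], out_bytes);
        }
        for (int b = 0; b < nblocks; ++b) {
            int s = b % NSTREAMS; // in-stream ordering makes buffer reuse safe
            // Read: copy compressed block b to the device.
            cudaMemcpyAsync(d_in[s], h_in + (size_t)b * in_bytes, in_bytes,
                            cudaMemcpyHostToDevice, stream[s]);
            // Decode: Huffman-decode the block into LZSS tokens.
            decode_stub<<<64, 256, 0, stream[s]>>>(d_in[s], d_tok[s],
                                                   (int)in_bytes);
            // Decompress: expand the LZSS tokens to raw output.
            expand_stub<<<64, 256, 0, stream[s]>>>(d_tok[s], d_out[s],
                                                   (int)out_bytes);
            // Write: copy the uncompressed block back to the host. Stages in
            // different streams overlap, hiding the PCI/E transfer latency.
            cudaMemcpyAsync(h_out + (size_t)b * out_bytes, d_out[s], out_bytes,
                            cudaMemcpyDeviceToHost, stream[s]);
        }
        for (int s = 0; s < NSTREAMS; ++s) {
            cudaStreamSynchronize(stream[s]);
            cudaFree(d_in[s]); cudaFree(d_tok[s]); cudaFree(d_out[s]);
            cudaStreamDestroy(stream[s]);
        }
    }

Round-robin over four streams gives each block its own buffers while it is in flight, which is what lets the Read of one block run concurrently with the Decode, Decompress, and Write of its predecessors.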