
Parallel Lossless Compression using GPUs

Eva Sitaridi*, Columbia University, [email protected]
Rene Mueller, IBM Almaden, [email protected]
Tim Kaldewey, IBM Almaden, [email protected]

*Work done while interning at IBM Almaden, partially funded by NSF Grant IIS-1218222

Agenda
• Introduction
• Overview of compression
• GPU implementation
  – LZSS compression
  –
• Experimental results
• Conclusions

Why compression?
• Data volume doubles every 2 years*
  – Data retained for longer periods
  – Data retained for business analytics
• Make better utilization of available storage resources
  – Increase storage capacity
  – Improve backup performance
  – Reduce bandwidth utilization
• Compression should be seamless
• Decompression important for Big Data workloads
*Sybase Adaptive Server Enterprise, business white paper, 2012

Compression trade-offs

• Three-way trade-off between compression ratio, compression speed, and decompression speed:
  – Compression speed vs compression efficiency
  – Decompression speed vs compression efficiency
  – Compression speed vs decompression speed
• Resources consumed: memory bandwidth, memory space, CPU utilization (more important in some cases)

Compression resource intensive

• Dataset: English Wikipedia pages, 1 GB XML text dump
• [Plot: compression bandwidth (GB/s, log scale) vs compression efficiency for pigz, lzma, and xz]
• Compression efficiency = 0.5 means the compressed file is half the original size
• Default compression level used; performance measured on an Intel i7-3930K (6 cores, 3.2 GHz)

Compression libraries
• gzip (DEFLATE format)
  – LZ77 compression
  – Huffman coding
  – Single-threaded
• pigz – parallel gzip
• xz (LZMA)
• All use LZ variants

LZSS compression
• Input characters are encoded into output tokens
  – Literals: unmatched characters
  – Backreferences: (position, length) pairs
• Find the longest match between the sliding window buffer and the unencoded lookahead characters
• Matches shorter than the minimum match length are emitted as literals
• Example: input ATTACTAGAATGTTACTAATCTGAT… encodes to ATTACTAGAATGT(2,5)…
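To make the token construction concrete, here is a minimal C sketch of the match search described above; the buffer layout, MIN_MATCH_LEN, and the simple linear scan are illustrative assumptions rather than the talk's actual implementation.

    /* Sketch: greedy longest-match search over the sliding window (illustrative only). */
    #define MIN_MATCH_LEN 3

    typedef struct { int pos; int len; } Match;

    static Match find_longest_match(const char *window, int window_len,
                                    const char *lookahead, int lookahead_len)
    {
        Match best = { -1, 0 };
        for (int start = 0; start < window_len; ++start) {
            int len = 0;
            while (len < lookahead_len && start + len < window_len &&
                   window[start + len] == lookahead[len])
                ++len;                                    /* extend the match */
            if (len > best.len) { best.pos = start; best.len = len; }
        }
        return best;   /* emit (pos,len) if best.len >= MIN_MATCH_LEN, else a literal */
    }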

LZSS decompression
• Window buffer contents: W I K I P E D I A . O
• Input data block (tokens): (0,4) M (5,4) C O M M …
• Output data block: WIKIMEDIACOMM…
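A sequential decoder corresponding to the example above might look as follows. The Token layout is an assumption made for illustration (a real LZSS stream packs literals and backreferences into a bit format), and in the slide's example the positions point into a window pre-filled with previously decoded data; here, for simplicity, they index the output produced so far.

    /* Sketch: sequential LZSS decoding; positions index the already-decoded data. */
    typedef struct { int is_literal; char ch; int pos; int len; } Token;

    static int lzss_decode(const Token *tokens, int n_tokens, char *out)
    {
        int out_len = 0;
        for (int i = 0; i < n_tokens; ++i) {
            if (tokens[i].is_literal) {
                out[out_len++] = tokens[i].ch;                 /* copy literal */
            } else {
                for (int k = 0; k < tokens[i].len; ++k)        /* copy back-referenced bytes */
                    out[out_len + k] = out[tokens[i].pos + k];
                out_len += tokens[i].len;
            }
        }
        return out_len;
    }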

Huffman coding
• Huffman tree
  – Leaves: encoded symbols
  – Unique prefix for each character
  – [Figure: example tree with leaf symbols 's', 'e', 'h', 'a', 'f']
• Huffman coding
  – Short codes for frequent characters
• Huffman decoding
  A) Traverse the tree to decode
  B) Use look-up tables for faster decoding
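Option (B) can be sketched as follows; the maximum code length and table layout are assumptions, but the idea is the standard one: peek a fixed number of bits, look them up, and advance by the matched code's true length.

    /* Sketch: table-based Huffman decoding (one lookup per symbol instead of a tree walk). */
    #define MAX_BITS 9                                   /* assumed maximum code length */

    typedef struct { unsigned char symbol; unsigned char length; } LutEntry;

    static int decode_symbol(const unsigned char *bits, long *bit_pos,
                             const LutEntry lut[1 << MAX_BITS])
    {
        unsigned idx = 0;
        for (int b = 0; b < MAX_BITS; ++b) {             /* peek the next MAX_BITS bits */
            long p = *bit_pos + b;
            idx = (idx << 1) | ((bits[p >> 3] >> (7 - (p & 7))) & 1u);
        }
        LutEntry e = lut[idx];                           /* table maps bit pattern -> symbol */
        *bit_pos += e.length;                            /* consume only the real code length */
        return e.symbol;
    }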

What to accelerate?
• Profile of gzip on an Intel i7-3930K (input: compressible database column)
• [Pie chart: LZSS longest match 87.2%; LZSS other, Huffman send/count/block, and CRC update make up the remaining slices (4.9%, 1.9%, 1.9%, 1.8%, 1.4%)]
• >85% of time spent on string matching
→ Accelerate LZSS first

Why GPUs?
• LZSS string matching is memory-bandwidth intensive → leverage GPU bandwidth

                                Intel i7-3930K   Tesla K20x
  Memory bandwidth (spec)       51.2 GB/s        250 GB/s
  Memory bandwidth (measured)   40.4 GB/s        197 GB/s
  #Cores                        6                2688

How to parallelize compression/decompression?
• >1000 cores available!
• Split the input file into independent data blocks
• Naïve approach: each thread processes an independent data/file block (data block 1 → thread 1, data block 2 → thread 2, …)
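A CUDA sketch of this naïve mapping; the per-block work is reduced to a trivial byte scan here, whereas a real implementation would run LZSS inside the loop.

    #include <cuda_runtime.h>

    /* Sketch: one thread per independent data block (the naïve mapping). */
    __global__ void process_blocks_naive(const char *input, unsigned *per_block_sum,
                                         int block_size, int n_blocks)
    {
        int b = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread <-> one data block */
        if (b >= n_blocks) return;
        const char *blk = input + (size_t)b * block_size;
        unsigned s = 0;
        for (int i = 0; i < block_size; ++i)             /* neighbouring threads read addresses */
            s += (unsigned char)blk[i];                  /* a whole block apart: not coalesced  */
        per_block_sum[b] = s;
    }

    /* launch: process_blocks_naive<<<(n_blocks + 255) / 256, 256>>>(d_in, d_sums, 64 << 10, n_blocks); */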

Memory access pattern
• Actual pattern: threads T1, T2, T3 each walk their own data block (block size > 32 KB), so many cache lines are loaded → low memory bandwidth
• Optimal GPU pattern: neighbouring threads access the same cache line

Thread utilization
• SIMT architecture: group execution
• Per-thread matching loop:
  i = thread id; j = 0;
  while (window[i] == lookahead[j]) { j++; … }
• Different number of iterations for each thread (T1-T6):
  – Iteration 1: 6 active threads
  – Iteration 2: 4 active threads
  – Iteration 3: 1 active thread
• (6+4+1)/(3*6) = 11/18 = 61% thread utilization
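The loop from the slide written out as a kernel: each thread i tests a different window position, so a warp runs for as many iterations as its longest match and early-finishing threads idle (the 61% utilization above). This is a sketch, not the talk's kernel.

    __global__ void match_lengths(const char *window, int window_len,
                                  const char *lookahead, int lookahead_len,
                                  int *match_len)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* i = candidate match position */
        if (i >= window_len) return;
        int j = 0;
        while (j < lookahead_len && i + j < window_len &&
               window[i + j] == lookahead[j])            /* trip count differs per thread */
            ++j;
        match_len[i] = j;                                /* longest match = max over all i */
    }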

GPU LZSS: general compression
• Better approach: each data block is processed by a thread group
  (data block 1 → thread group 1, …, data block n → thread group n)
• Thread groups write to an intermediate output, which is compacted into the output file
• Store the list of compressed data block offsets → enables parallel decompression
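One possible shape of the compaction step, assuming each thread group wrote its compressed block into a fixed-size intermediate slot: an exclusive scan of the per-block sizes yields the final offsets (the same offsets that are stored for parallel decompression). The helper names and launch shapes are illustrative.

    #include <thrust/device_vector.h>
    #include <thrust/device_ptr.h>
    #include <thrust/scan.h>

    __global__ void compact_blocks(const char *intermediate, const int *size,
                                   const int *offset, char *output, int slot_size)
    {
        int b = blockIdx.x;                               /* one thread block per data block */
        const char *src = intermediate + (size_t)b * slot_size;
        char *dst = output + offset[b];
        for (int i = threadIdx.x; i < size[b]; i += blockDim.x)
            dst[i] = src[i];                              /* cooperative, coalesced copy */
    }

    void compact(const char *d_intermediate, const int *d_size, char *d_output,
                 int slot_size, int n_blocks)
    {
        thrust::device_vector<int> offset(n_blocks);
        thrust::exclusive_scan(thrust::device_pointer_cast(d_size),
                               thrust::device_pointer_cast(d_size) + n_blocks,
                               offset.begin());           /* offsets of compressed blocks */
        compact_blocks<<<n_blocks, 256>>>(d_intermediate, d_size,
                                          thrust::raw_pointer_cast(offset.data()),
                                          d_output, slot_size);
    }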

Compression efficiency vs compression performance
• GPU LZSS*: lookahead 66 chars, block size 64K chars
• [Plot: compression efficiency and speed as a function of window size]
• As the window size grows, performance drops quickly while compression efficiency does not improve

* Related papers:
  A. Ozsoy and M. Swany, "CULZSS: LZSS Lossless Data Compression on CUDA"
  A. Balevic, "Parallel Variable-Length Encoding on GPGPUs"

GPU LZSS decompression
1) Compute total size of tokens (serialized)
2) Read tokens (parallel)
3.1) Compute uncompressed output
3.2) Write uncompressed output
• Example: compressed input CCGA(0,2)CGG(4,3)AGTT → uncompressed output CCGACCCGGCCCAGTT
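Steps 2-3 can be sketched as follows, assuming the variable-length token stream has already been parsed into a token array (step 1, which stays serialized): an exclusive scan of the per-token output sizes gives each token's position in the uncompressed output, after which literals are written in parallel. The Token layout is an assumption for illustration.

    struct Token { unsigned char is_literal; unsigned char ch; unsigned short pos, len; };

    __global__ void token_out_size(const Token *t, int n, int *out_size)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out_size[i] = t[i].is_literal ? 1 : t[i].len;   /* bytes this token produces */
    }
    /* out_pos[] = exclusive prefix sum of out_size[] (e.g. with thrust::exclusive_scan) */

    __global__ void write_literals(const Token *t, int n, const int *out_pos, char *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && t[i].is_literal)
            out[out_pos[i]] = t[i].ch;      /* literals in parallel; backreferences follow below */
    }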

• Problem: backreferences processed in parallel might be dependent!
  → Use the voting function __ballot to detect conflicts

Writing LZSS tokens to output
• Case A: all literals, e.g. CCGAGATTGAGTT
  1) Write literals (parallel)
• Case B: literals & non-conflicting backreferences, e.g. CCGA(0,2)CGG(0,3)AGTT
  1) Write literals (parallel)
  2) Write backreferences (parallel)
• Case C: literals & conflicting backreferences, e.g. CCGA(0,2)CGG(4,3)AGTT
  1) Write literals (parallel)
  2) Write non-conflicting backreferences (parallel)
  3) Write remaining backreferences (serial)
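A sketch of the conflict handling for case C, reusing the illustrative Token layout above. Each lane votes on whether its backreference reads bytes that another backreference of the same batch still has to produce; __ballot_sync (the modern form of the __ballot intrinsic named in the talk) gives the warp a mask of conflicting lanes, which are then resolved serially in token order. The written_by_backref flags and the per-warp serial pass are assumptions for illustration, not the talk's exact scheme; the block size is assumed to be a multiple of 32.

    __global__ void expand_backrefs(const Token *t, int n, const int *out_pos,
                                    const unsigned char *written_by_backref, char *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        bool is_ref = (i < n) && !t[i].is_literal;
        bool conflict = false;
        if (is_ref)
            for (int k = 0; k < t[i].len; ++k)                       /* does the source overlap output */
                conflict |= written_by_backref[t[i].pos + k] != 0;   /* of another backreference?      */

        unsigned mask = __ballot_sync(0xffffffffu, is_ref && conflict);

        if (is_ref && !conflict)                          /* independent refs: copy in parallel */
            for (int k = 0; k < t[i].len; ++k)
                out[out_pos[i] + k] = out[t[i].pos + k];
        __syncwarp();

        if ((threadIdx.x & 31) == 0)                      /* conflicting refs: lane 0 resolves */
            for (int lane = 0; lane < 32; ++lane)         /* them serially, in token order     */
                if (mask & (1u << lane)) {
                    int j = i + lane;
                    for (int k = 0; k < t[j].len; ++k)
                        out[out_pos[j] + k] = out[t[j].pos + k];
                }
    }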

Huffman entropy coding
• Inherently sequential
• Coding challenge
  – Compute destination of encoded data
• Decoding challenge
  – Determine codeword boundaries
• Focus on decoding for end-to-end decompression

Parallel Huffman decoding
• [Figure: a file block of encoded bits (01100110 10111001 11010110 11100001 10111011 01110001 00000010 00001110) split into sub-blocks at offsets 1-4]
• During coding
  – Split data blocks into sub-blocks
  – Store sub-block offsets → parallel sub-block decoding
• During decoding
  – Use look-up tables for decoding rather than Huffman trees
  – Fit the look-up table in shared memory
  – Reduce the number of codes for length and distance
• Trade compression efficiency for decompression speed
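A sketch of the decoding side under these assumptions: one thread per sub-block, starting at the bit offset recorded during coding, with the (deliberately small) decode table staged in shared memory. Table layout, LUT_BITS, and a fixed number of symbols per sub-block are illustrative simplifications.

    struct HuffEntry { unsigned char symbol, length; };
    #define LUT_BITS 9                                     /* assumed maximum code length */

    __global__ void huffman_decode_subblocks(const unsigned char *bits, const long *bit_offset,
                                             const HuffEntry *lut_global, unsigned char *out,
                                             int symbols_per_subblock, int n_subblocks)
    {
        __shared__ HuffEntry lut[1 << LUT_BITS];
        for (int i = threadIdx.x; i < (1 << LUT_BITS); i += blockDim.x)
            lut[i] = lut_global[i];                        /* stage the decode table in shared memory */
        __syncthreads();

        int s = blockIdx.x * blockDim.x + threadIdx.x;     /* one thread per sub-block */
        if (s >= n_subblocks) return;
        long pos = bit_offset[s];                          /* offset stored during coding */
        for (int k = 0; k < symbols_per_subblock; ++k) {
            unsigned idx = 0;
            for (int b = 0; b < LUT_BITS; ++b, ++pos)      /* peek LUT_BITS bits */
                idx = (idx << 1) | ((bits[pos >> 3] >> (7 - (pos & 7))) & 1u);
            HuffEntry e = lut[idx];
            pos -= LUT_BITS - e.length;                    /* rewind the unused bits */
            out[(size_t)s * symbols_per_subblock + k] = e.symbol;
        }
    }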

Experimental system
• Linux, kernel 3.0.74

                                Intel i7-3930K   Tesla K20x
  Memory bandwidth (spec)       51.2 GB/s        250 GB/s
  Memory bandwidth (measured)   40.4 GB/s        197 GB/s
  Memory capacity               64 GB            6 GB
  #Cores                        6 (12 threads)   2688
  Clock frequency               3.2 GHz          0.732 GHz

Datasets

  Dataset             Size     Comp. efficiency*
  English Wikipedia   1 GB     0.35
  Database column     245 MB   0.98

• Datasets already loaded in memory
• No disk I/O
*For the default parameters of gzip

Decompression performance

• [Plot: decompression performance]
• Data transfers slow down performance

Hide GPU-to-CPU transfer I/O using CUDA streams
• Batch processing (single stream): Read B1 → Decode B1 → Decompress B1 → Write B1 → Read B2 → …
• Pipelining PCI/E transfers: with multiple streams, while one stream writes B1, the next already decodes/decompresses B2 and a third reads B3
• Pipelining PCI/E transfers & concurrent kernel execution: the Read, Decode, Decompress and Write stages of B1, B2, B3, … overlap across streams
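A host-side sketch of the 4-stage pipeline, with placeholder kernels standing in for the Decode and Decompress stages; buffer sizes, the stream count, and kernel shapes are assumptions. For the copies to actually overlap with kernels, h_in and h_out should be pinned (cudaHostAlloc).

    #include <cuda_runtime.h>

    __global__ void decode_batch(const char *in, char *tok, int n)      /* placeholder for Huffman decode */
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
            tok[i] = in[i];
    }
    __global__ void decompress_batch(const char *tok, char *out, int n) /* placeholder for LZSS decompress */
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
            out[i] = tok[i];
    }

    void decompress_pipelined(const char *h_in, char *h_out, int n_batches, int batch_bytes)
    {
        const int N_STREAMS = 4;
        cudaStream_t stream[N_STREAMS];
        char *d_in, *d_tok, *d_out;
        cudaMalloc(&d_in,  (size_t)N_STREAMS * batch_bytes);
        cudaMalloc(&d_tok, (size_t)N_STREAMS * batch_bytes);
        cudaMalloc(&d_out, (size_t)N_STREAMS * batch_bytes);
        for (int s = 0; s < N_STREAMS; ++s) cudaStreamCreate(&stream[s]);

        for (int b = 0; b < n_batches; ++b) {
            int s = b % N_STREAMS;                         /* round-robin over streams      */
            size_t slot = (size_t)s * batch_bytes;         /* per-stream device buffer slot */
            cudaMemcpyAsync(d_in + slot, h_in + (size_t)b * batch_bytes, batch_bytes,
                            cudaMemcpyHostToDevice, stream[s]);                               /* Read Bi       */
            decode_batch    <<<128, 256, 0, stream[s]>>>(d_in + slot,  d_tok + slot, batch_bytes); /* Decode Bi     */
            decompress_batch<<<128, 256, 0, stream[s]>>>(d_tok + slot, d_out + slot, batch_bytes); /* Decompress Bi */
            cudaMemcpyAsync(h_out + (size_t)b * batch_bytes, d_out + slot, batch_bytes,
                            cudaMemcpyDeviceToHost, stream[s]);                               /* Write Bi      */
        }
        cudaDeviceSynchronize();
        for (int s = 0; s < N_STREAMS; ++s) cudaStreamDestroy(stream[s]);
        cudaFree(d_in); cudaFree(d_tok); cudaFree(d_out);
    }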

Decompression performance
• Data transfer latency hidden

Decompression time breakdown
• [Pie charts: share of decompression time spent in Huffman vs LZSS, for English Wikipedia and for the database column]

• LZSS faster for incompressible datasets

Decompression performance vs compression efficiency
• [Plot: decompression bandwidth (GB/s, log scale) vs compression efficiency on English Wikipedia; GPU Deflate (with PCI/E transfer) compared against pigz, gzip, lzma, xz, and bzip2]

Conclusions
• Decompression
  – Hide GPU-CPU latency using 4-stage pipelining
  – LZSS faster for incompressible files
• Compression
  – Reduce search time (using hash tables?)

Questions?
