Decompressing Snappy Compressed Files at the Speed of OpenCAPI
Speaker: Jian Fang TU Delft
1 Current Project … Sort SHADE DB DNA Join Spark Scalable CPU Seq Skyline Heterogeneous POWER9 … Accelerated HBM DatabasE HBM
ARROW
OpenCAPI - OpenCAPI OpenCAPI NVMe Memory SNAP SNAP Decom Actions pression Fletcher Fletcher NV Link FPGA Tables FPGA
GPU 2 Big Data Processing Database Analytics
Data Movement
3 • Moving Data from/to??? Storage • Slow, Bottleneck • 1st step in data processing and database analytics LZ4 RLE LZW GZIP Rawstorage Data CPU
Netezza • Decompress-filter Solutions
Compressed storage DecompressedFilteredFPGA CPU Data Data Data Decompress-filter 4 Snappy compression Algorithm
• LZ77-based, byte-level, lossless Goal
• In Hadoop ecosystem 500MB/s OpenCAPI Per Core Speed • Support Parquet, ORC, etc. Core • Low compression ratio i7
• Fast compression and decompression
5 Mechanism of Snappy compression
• Two kinds of token: Literal token and copy token • Find match from previous data • Example:
ABCDEFGHABCD
offset 8, length4
Use (8,4) to replace “ABCD”
7 Snappy Compression
No Match!
Input … ABCDEFGHABCD
Generate a literal token Literal
Match!
Find next match … ABCDEFGHABCD
copy : offset 8 Generate a copy token length 4
Output … (Literal, “ABCDEFGH”) (Copy, 8, 4)
8 Snappy Decompression
…
(Literal, “ABCDEFGH”)
… ABCDEFGH
(Copy, 8, 4)
Copy
… ABCDEFGHABCD
9 Snappy compression Algorithm
• An independent 64KB history for every 64KB data • Token: copy or literal, copy lenth<=64B • Token size: 2 Byte minimum, average ≈< 4 Byte(assumption for design) • Highly data dependent
Benchmark Compression Ratio Average Token Size DNA1 3.31 5.79 DNA2 2.36 7.38 DNA3 2.57 7.03 TPC-H 2.53 2.72 … … …
10
DNA data source: ftp://ftp.ncbi.nlm.nih.gov/ncbi-asn1 Design in FPGA
• An independent 64KB history for every 64KB data • Token: copy or literal, copy lenth<=64B • Token size: 2 Byte minimum, average ≈< 4 Byte(assumption for design) • Highly data dependent
16X 64KB … history
4KB BRAM 11
DNA data source: ftp://ftp.ncbi.nlm.nih.gov/ncbi-asn1 Design Challenge
• Deal with different types of tokens
• Handle various size of tokens
• Parallelize token translation
• Keep up with the interface bandwidth
13 Design Challenge
• Deal with different types of tokens
• Handle various size of tokens
• Parallelize token translation
• Keep up with the interface bandwidth
14 Design Challenge
• Deal with different types of tokens
• Handle various size of tokens
• Parallelize token translation
• Keep up with the interface bandwidth
15 Token Structure Literal Token • LZ77-based, byte-level • In Hadoop ecosystem • Support Parquet, ORC, etc. • Two types of tokens: literal tokens and copy tokens
Literal Token: (literal, length, data) Copy Token : (copy, offset, length)
16
Source: Y. Qiao. An fpga-based snappy decompressor-filter. Master’s thesis, Delft University of Technology, 2018. Token Structure Copy
• LZ77-based, byte-level Token • In Hadoop ecosystem • Support Parquet, ORC, etc. • Two types of tokens: literal tokens and copy tokens
Tokens are in various length
17
Source: Y. Qiao. An fpga-based snappy decompressor-filter. Master’s thesis, Delft University of Technology, 2018. Prior Solutions: Decode all possiblities
18
Source: Y. Qiao. An fpga-based snappy decompressor-filter. Master’s thesis, Delft University of Technology, 2018. Design Challenge
• Deal with different types of tokens
• Handle various size of tokens
• Parallelize token translation
• Keep up with the interface bandwidth
19 Token Dependency
• Read – Read • No dependency
• Write – Write Token1 Token 2 Token3 Token 4 • Never write same bytes Read Read • No dependency Write Write Read Read • Write – Read Write Write • Forwarding
20 Token Dependency – Address Conflict
• Write – Write • Same block, same line
16X …
BRAMs
21 Token Dependency – Address Conflict
• Write – Write • Same block, different lines
16X …
BRAMs
22 Token Dependency – Address Conflict
• Read – Read
16X …
BRAMs
23 Design Challenge
• Deal with different types of tokens
• Handle various size of tokens
• Parallelize token translation Token size: 4B 2 tokens/cycle • Keep up with the interface bandwidth 250MHz --> 13 instances
token size * token/cycle * frequency * instances = OpenCAPI BW
24 IDEA – from token-level to BRAM-level
16X …
16X …
25 Architecture Overview
26 Architecture Overview
• Two-level Parser
27 Architecture Overview
• Two-level Parser
28 Architecture Overview
• Two-level Parser
• Parallel BRAM Operations
29 Architecture Overview
• Two-level Parser
• Parallel BRAM Operations
• Relax Execution Model
30 Parallel BRAM Operations & Relax Execution Model
31 Results
• Implementation Result (on XCKU15P)
Flip-Flop LUTs BRAMs Frequency Power 32K(3.32%) 46K(8.95%) 29(2.95%) 250MHz 2.9W
• Benchmark • Alice’s Adventures in Wonderland : 85KB (149KB) for compressed (uncompressed) file.
• Throughput FPGA(single CPU(Core i7 with decompressor) one thread) Input 3GB/s 300MB/s Output 5GB/s 500MB/s 32 Publish Paper: J. Fang, J. Chen, P. Hofstee, Z. Al-Ars, J. Hidders, A High-Bandwidth Snappy Decompressor Open in Reconfigurable Logic (CODES+ISSS 2018) Source https://github.com/ChenJianyunp/ FPGA-Snappy-Decompressor.git
33 Contact Me: [email protected] (Jian Fang)