
Decompressing Compressed Files at the Speed of OpenCAPI

Speaker: Jian Fang, TU Delft

Current Project

[Figure: project overview. Recoverable labels: SHADE DB (Scalable Heterogeneous Accelerated DatabasE); CPU (POWER9); HBM; Arrow; OpenCAPI; NVMe; Memory; NVLink; GPU; FPGA; SNAP Actions; Fletcher; Sort; Join; Skyline; DNA Seq; Decompression; Tables]

Big Data Processing
Database Analytics

Data Movement

• Moving data from/to storage
• Slow, a bottleneck
• First step in data processing and database analytics
• Solutions
  • Compression (LZ4, RLE, LZW)
  • Decompress-filter (Netezza)

[Figure: data paths from storage to CPU. Recoverable labels: raw data, storage, CPU; compressed data, decompressed data, filtered data, FPGA, decompress-filter]

Snappy compression Algorithm

• LZ77-based, byte-level, lossless
• In the Hadoop ecosystem
• Supports Parquet, ORC, etc.
• Low compression ratio
• Fast compression and decompression

Goal: from 500MB/s per core (Core i7) to OpenCAPI speed

Mechanism of Snappy compression

• Two kinds of token: literal token and copy token
• Find a match in previous data
• Example:

ABCDEFGHABCD

offset 8, length 4

Use (8,4) to replace “ABCD”
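
The (offset, length) pair in the example can be reproduced with a brute-force search. This is only a sketch: the function name `longest_match` is hypothetical, and a real Snappy compressor finds candidates with a hash table rather than scanning all earlier positions.

```python
def longest_match(data: bytes, pos: int, max_len: int = 64):
    """Return (offset, length) of the longest earlier match for data[pos:].

    Brute-force illustration only; max_len=64 follows the copy length
    limit stated in the slides.
    """
    best_offset, best_len = 0, 0
    for start in range(pos):
        length = 0
        # Extend the match while bytes agree and limits are respected.
        while (pos + length < len(data)
               and length < max_len
               and data[start + length] == data[pos + length]):
            length += 1
        if length > best_len:
            best_offset, best_len = pos - start, length
    return best_offset, best_len


print(longest_match(b"ABCDEFGHABCD", 8))  # (8, 4)
```

The second "ABCD" starts at position 8 and matches the data 8 bytes earlier for 4 bytes, hence (8, 4).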

Snappy Compression

Input: … ABCDEFGHABCD

No match! Generate a literal token.

Find next match: … ABCDEFGHABCD

Match! Generate a copy token: offset 8, length 4.

Output: … (Literal, “ABCDEFGH”) (Copy, 8, 4)
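
The walk-through above can be sketched as a toy greedy compressor. The names `toy_compress`, `MIN_MATCH` and `MAX_COPY` are hypothetical, the minimum match length of 4 is an assumption, and the exhaustive search stands in for the hash-based matching a real Snappy compressor uses.

```python
MIN_MATCH = 4   # assumption: shorter matches are emitted as literal bytes
MAX_COPY = 64   # copy length limit stated in the slides


def toy_compress(data: bytes):
    """Turn data into a list of ("Literal", bytes) and ("Copy", offset, length)."""
    tokens, literal, pos = [], bytearray(), 0
    while pos < len(data):
        # Brute-force search for the longest match in the previous data.
        best_off, best_len = 0, 0
        for start in range(pos):
            n = 0
            while (pos + n < len(data) and n < MAX_COPY
                   and data[start + n] == data[pos + n]):
                n += 1
            if n > best_len:
                best_off, best_len = pos - start, n
        if best_len >= MIN_MATCH:
            if literal:  # flush pending literal bytes first
                tokens.append(("Literal", bytes(literal)))
                literal = bytearray()
            tokens.append(("Copy", best_off, best_len))
            pos += best_len
        else:
            literal.append(data[pos])  # no usable match: extend the literal
            pos += 1
    if literal:
        tokens.append(("Literal", bytes(literal)))
    return tokens


print(toy_compress(b"ABCDEFGHABCD"))
# [('Literal', b'ABCDEFGH'), ('Copy', 8, 4)]
```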

Snappy Decompression

(Literal, “ABCDEFGH”)

Output: … ABCDEFGH

(Copy, 8, 4)

Copy 4 bytes from 8 bytes back.

Output: … ABCDEFGHABCD
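
A minimal software sketch of these two steps, assuming tokens are given as tuples in the form used on the slides (the function name `toy_decompress` is hypothetical and this is not the FPGA design):

```python
def toy_decompress(tokens):
    """Rebuild the data from ("Literal", bytes) and ("Copy", offset, length) tokens."""
    out = bytearray()
    for token in tokens:
        if token[0] == "Literal":
            out += token[1]              # literal: append the bytes as they are
        else:
            _, offset, length = token
            for _ in range(length):      # copy: re-read already written output
                out.append(out[-offset])
    return bytes(out)


print(toy_decompress([("Literal", b"ABCDEFGH"), ("Copy", 8, 4)]))
# b'ABCDEFGHABCD'
```

Copying one byte at a time keeps the sketch correct even when the copy length exceeds the offset, so that the source and destination ranges overlap.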

Snappy compression Algorithm

• An independent 64KB history for every 64KB of data
• Token: copy or literal, copy length <= 64B
• Token size: 2 bytes minimum, average about 4 bytes or less (assumption for design)
• Highly data dependent

Benchmark   Compression Ratio   Average Token Size
DNA1        3.31                5.79
DNA2        2.36                7.38
DNA3        2.57                7.03
TPC-H       2.53                2.72
…           …                   …

DNA data source: ftp://ftp.ncbi.nlm.nih.gov/ncbi-asn1

Design in FPGA

[Figure: the 64KB history is built from 16 BRAMs of 4KB each]

Design Challenge

• Deal with different types of tokens

• Handle tokens of various sizes

• Parallelize token translation

• Keep up with the interface bandwidth


Token Structure: Literal Token

• LZ77-based, byte-level
• In the Hadoop ecosystem
• Supports Parquet, ORC, etc.
• Two types of tokens: literal tokens and copy tokens

Literal Token: (literal, length, data)
Copy Token: (copy, offset, length)

Source: Y. Qiao. An FPGA-based Snappy decompressor-filter. Master’s thesis, Delft University of Technology, 2018.

Token Structure: Copy Token

Tokens vary in length.
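
To make the variable token length concrete, the sketch below computes the size of one token from its first (tag) byte. It follows the public Snappy format description rather than anything on these slides; the function name `token_size` is hypothetical.

```python
def token_size(buf: bytes, pos: int = 0) -> int:
    """Return the total size in bytes of the Snappy token starting at buf[pos]."""
    tag = buf[pos]
    kind = tag & 0x03                  # low two bits select the token type
    if kind == 0:                      # literal token
        n = tag >> 2                   # length - 1, or 60..63 as an escape
        if n < 60:
            return 1 + (n + 1)         # tag byte + literal data
        extra = n - 59                 # 1..4 extra bytes hold (length - 1)
        length = int.from_bytes(buf[pos + 1:pos + 1 + extra], "little") + 1
        return 1 + extra + length
    # Copy tokens: tag byte plus a 1-, 2- or 4-byte offset field.
    return {1: 2, 2: 3, 3: 5}[kind]


# Literal "ABCDEFGH": tag = (8 - 1) << 2 = 0x1C, followed by 8 data bytes.
print(token_size(bytes([0x1C]) + b"ABCDEFGH"))  # 9
# Copy with offset 8, length 4 in the short form: tag 0x01, offset byte 0x08.
print(token_size(bytes([0x01, 0x08])))          # 2
```

The size is only known after the tag byte is inspected, so the start of the next token depends on decoding the current one. This is what makes parsing several tokens per cycle hard.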

Prior Solutions: Decode all possibilities

Token Dependency

• Read – Read
  • No dependency
• Write – Write
  • Never write the same bytes
  • No dependency
• Write – Read
  • Forwarding

[Figure: Token 1 to Token 4, each with a read and a write]

Token Dependency – Address Conflict

• Write – Write
  • Same block, same line
• Write – Write
  • Same block, different lines
• Read – Read

[Figure: 16 BRAMs]
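
The "same block, same line" and "same block, different lines" cases can be illustrated by mapping history addresses onto the 16 BRAMs. The sketch rests on assumptions that are not in the slides: an 8-byte BRAM line, a line-interleaved address mapping, and "block" meaning one BRAM. The names `locate` and `classify` are hypothetical. It only classifies two accesses and does not describe how the design resolves a conflict.

```python
NUM_BRAMS = 16      # from the slides
BRAM_BYTES = 4096   # from the slides (4KB per BRAM)
LINE_BYTES = 8      # assumption: bytes per BRAM line


def locate(addr: int):
    """Map a byte address in the 64KB history to (bram, line)."""
    line_index = (addr % (NUM_BRAMS * BRAM_BYTES)) // LINE_BYTES
    # Assumption: consecutive lines are interleaved across the BRAMs.
    return line_index % NUM_BRAMS, line_index // NUM_BRAMS


def classify(addr_a: int, addr_b: int) -> str:
    """Classify two accesses by the BRAM block and line they touch."""
    bram_a, line_a = locate(addr_a)
    bram_b, line_b = locate(addr_b)
    if bram_a != bram_b:
        return "different blocks"
    if line_a == line_b:
        return "same block, same line"
    return "same block, different lines"


print(classify(0, 4))    # same block, same line
print(classify(0, 128))  # same block, different lines
print(classify(0, 8))    # different blocks
```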

Design Challenge

• Deal with different types of tokens
• Handle tokens of various sizes
• Parallelize token translation
• Keep up with the interface bandwidth

Token size: 4B, 2 tokens/cycle, 250MHz → 13 instances

token size × tokens/cycle × frequency × instances = OpenCAPI BW
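
The instance count follows from the formula once a target bandwidth is fixed. The slides do not state the OpenCAPI bandwidth; the 25 GB/s used below is an assumption, chosen because it is consistent with the 13 instances quoted.

```python
import math

TOKEN_SIZE_BYTES = 4     # from the slide
TOKENS_PER_CYCLE = 2     # from the slide
FREQUENCY_HZ = 250e6     # from the slide
OPENCAPI_BW = 25e9       # assumption: about 25 GB/s, not stated in the slides

# Bytes per second one instance handles: 4 B * 2 tokens/cycle * 250 MHz.
per_instance = TOKEN_SIZE_BYTES * TOKENS_PER_CYCLE * FREQUENCY_HZ
# Instances needed to reach the assumed interface bandwidth.
instances = math.ceil(OPENCAPI_BW / per_instance)

print(per_instance / 1e9, instances)  # 2.0 13
```

One instance handles 2 GB/s under these numbers, so 12.5 instances would be needed, which rounds up to 13.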

IDEA – from token-level to BRAM-level

[Figure: two rows of 16 BRAMs]

Architecture Overview

• Two-level Parser
• Parallel BRAM Operations
• Relaxed Execution Model

Parallel BRAM Operations & Relaxed Execution Model

Results

• Implementation Result (on XCKU15P)

Flip-Flops    LUTs          BRAMs        Frequency   Power
32K (3.32%)   46K (8.95%)   29 (2.95%)   250MHz      2.9W

• Benchmark
  • Alice’s Adventures in Wonderland: 85KB compressed (149KB uncompressed)

• Throughput

         FPGA (single decompressor)   CPU (Core i7 with one thread)
Input    3GB/s                        300MB/s
Output   5GB/s                        500MB/s

Published Paper: J. Fang, J. Chen, P. Hofstee, Z. Al-Ars, J. Hidders, A High-Bandwidth Snappy Decompressor in Reconfigurable Logic (CODES+ISSS 2018)

Open Source: https://github.com/ChenJianyunp/FPGA-Snappy-Decompressor.git

Contact Me: [email protected] (Jian Fang)