Hardware Based Compression in Ceph OSD with BTRFS

Weigang Li (weigang.li@.com), Tushar Gohad ([email protected])

Data Center Group, Intel Corporation

Credits

This work wouldn’t have been possible without contributions from –

 Reddy Chagam ([email protected])
 Brian Will ([email protected])
 Praveen Mosur ([email protected])
 Edward Pullin ([email protected])


Agenda

 Ceph
   A Quick Primer
   Storage Efficiency and Security Features
 Storage Workload Acceleration
   Software and Hardware Approaches
 Compression in Ceph OSD with BTRFS
   Compression in BTRFS and Ceph
   Hardware Acceleration with QAT
   PoC Implementation
   Performance Results
 Key Takeaways


Ceph

 Open-source, object-based scale-out storage system
 Software-defined, hardware-agnostic – runs on commodity hardware
 Object, block, and file support in a unified storage cluster
 Highly durable, available – replication, erasure coding
 Replicates and re-balances dynamically

Image source: http://ceph.com/ceph-storage

Ceph

 Scalability – CRUSH data placement, no single point of failure
 Enterprise features – snapshots, cloning, mirroring
 Most popular block storage for OpenStack use cases
 10 years of hardening, vibrant community

Source: http://www.openstack.org/assets/survey/April-2016-User-Survey-Report.pdf

Ceph: Architecture

[Diagram: Ceph architecture – OSD daemons on commodity servers, each with a pluggable backend (Filestore on btrfs/POSIX, or Bluestore with a KV store) over a local disk; monitor (M) nodes maintain cluster state.]


Ceph: Storage Efficiency, Security
Erasure Coding, Compression, Encryption

[Diagram: Ceph clients (RGW object via S3/Swift, RBD block) talk to the Ceph cluster (RADOS, OSDs over disks). Erasure coding and compression run in the OSD backend – via the Filestore on BTRFS, or natively in Bluestore. Encryption options: client-side end-to-end encryption (E2EE), native dm-crypt/LUKS, or self-encrypting drives.]

Storage Workload Acceleration
Software and Hardware-based Approaches

[Diagram: spectrum of acceleration options by distance from the CPU core – on-core software using SIMD (ISA-L); then on-chip, QPI-attach, and PCIe-attach hardware: fixed-function ASICs (QAT), reconfigurable FPGAs, and GPGPUs. Moving away from the core trades ease of programming and application flexibility for offload capability, with increasing latency and coarser granularity.]


Software-based Acceleration with ISA-L
Intel® Intelligent Storage Acceleration Library

 Collection of optimized low-level storage functions
 Uses SIMD primitives to accelerate storage functions in software
 Primitives supported
   Compression – fast DEFLATE-compatible compression (igzip); see the sketch below
   Encryption – AES block ciphers (XTS, CBC, GCM)
   Erasure Coding – Reed-Solomon
   CRC (iSCSI, IEEE, T10DIF), RAID5/6
   Multi-buffer hashes (SHA1, SHA256, SHA512, MD5), Multi-hash
 OS agnostic: Linux, FreeBSD, etc.
 Commercially compatible, free, open source under BSD license
   https://github.com/01org/isa-l_crypto
   https://github.com/01org/isa-l
   https://software.intel.com/en-us/storage/ISA-L
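For illustration, a minimal sketch of ISA-L's one-shot DEFLATE ("igzip") compression path, assuming the isa-l package and its igzip_lib.h header are installed; the function name and buffer parameters are hypothetical.

```c
#include <isa-l/igzip_lib.h>

/* One-shot DEFLATE compression using ISA-L's SIMD-accelerated igzip.
 * Returns the compressed length, or -1 on error. */
static int isal_compress_once(unsigned char *in, unsigned int in_len,
                              unsigned char *out, unsigned int out_len)
{
    struct isal_zstream stream;

    isal_deflate_init(&stream);
    stream.next_in = in;
    stream.avail_in = in_len;
    stream.next_out = out;
    stream.avail_out = out_len;
    stream.end_of_stream = 1;       /* all input supplied in one call */

    if (isal_deflate(&stream) != COMP_OK || stream.avail_in != 0)
        return -1;                  /* error, or output buffer too small */
    return (int)stream.total_out;   /* compressed length */
}
```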


Hardware-based Acceleration
Intel® QuickAssist Technology

 Bulk Cryptography – security (symmetric encryption and authentication) for data in flight and at rest
 Public Key Cryptography – secure key establishment (asymmetric encryption, digital signatures, key exchange)
 Lossless Compression – for data in flight and at rest
 Form factors: chipset, PCI Express* plug-in card, or SoC – connects to the CPU via on-board PCI Express* lanes or on-chip interconnect

Open-source Software Support

Cryptography       OpenSSL libcrypto, Linux Kernel Crypto Framework
Data Compression   zlib (user API), BTRFS/ZFS (kernel), Hadoop, Databases

Hardware acceleration of compute-intensive workloads (cryptography and compression).

Support for Offloads in Upstream Ceph
Intel® ISA-L and QAT

 Erasure Coding
   ISA-L offload support for Reed-Solomon codes
   Supported since Hammer
 Compression
   Filestore
     QAT offload for BTRFS compression (kernel patch submission in progress)
   Bluestore
     ISA-L offload for zlib compression supported in upstream master
     QAT offload for zlib compression (work-in-progress)
 Encryption
   Client-side (E2EE)
     RADOS GW encryption with ISA-L and QAT offloads (work-in-progress)
     Qemu-RBD encryption with ISA-L and QAT offloads (under investigation)

Today’s Focus: Compression

More Data – 10x data growth from 2013 to 2020 [1]
 The digital universe is doubling every two years
 Data from the Internet of Things will grow 5x
 The percentage of data that is analyzed grows from 22% to 37%

Compression can save storage capacity.

[1] Source: April 2014, EMC* Digital Universe with Research & Analysis by IDC*


Compression: Cost

 Compressing a 1GB Calgary Corpus* file on one CPU core (with HT).
 Compression ratio: less is better (worked example below).
   cRatio = compressed size / original size
 Compression is CPU intensive: a better compression ratio requires more CPU time.

Compression tool    real (s)   user (s)   sys (s)   cRatio
lzo                 6.37       4.07       0.79      51%
gzip-1              22.75      22.09      0.64      38%
gzip-6              55.15      54.51      0.59      33%
bzip2               83.74      83.18      0.52      28%

* The Calgary corpus is a collection of text and binary data files, commonly used for comparing data compression algorithms.

Source as of August 2016: Intel internal measurements with dual E5-2699 v3 (18C, 2.3GHz, 145W), HT & Turbo Enabled, Fedora 22 64-bit, DDR4-128GB. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Any difference in system hardware or software design or configuration may affect actual performance. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. For more information go to http://www.intel.com/performance

Benefit of Hardware Acceleration

Hardware acceleration: less CPU load, better compression ratio.

Compression tool    real (s)   user (s)   sys (s)   cRatio
lzo                 6.37       4.07       0.79      51%
accel-1 *           4.01       0.49       1.31      40%
accel-6 **          8.01       0.45       1.22      38%
gzip-1              22.75      22.09      0.64      38%
gzip-6              55.15      54.51      0.59      33%
bzip2               83.74      83.18      0.52      28%

Compressing a 1GB Calgary Corpus file:
* Intel® QuickAssist Technology DH8955, level-1
** Intel® QuickAssist Technology DH8955, level-6

Source as of August 2016: Intel internal measurements with dual E5-2699 v3 (18C, 2.3GHz, 145W), HT & Turbo Enabled, Fedora 22 64-bit, 1 x DH8955 adaptor, DDR4-128GB. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Any difference in system hardware or software design or configuration may affect actual performance. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. For more information go to http://www.intel.com/performance

Transparent Compression in Ceph: BTRFS

 Copy-on-Write (CoW) filesystem for Linux.
 “Has the correct feature set and roadmap to serve Ceph in the long-term, and is recommended for testing, development, and any non-critical deployments… This compelling list of features makes btrfs the ideal choice for Ceph clusters”*
 Native compression support.
   Mount with “compress” or “compress-force” (see the sketch below).
   ZLIB / LZO supported.
   Compresses data in chunks of up to 128KB at a time.

* http://docs.ceph.com/docs/hammer/rados/configuration/filesystem-recommendations/
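As a minimal sketch of enabling this, the mount(2) call below turns on forced zlib compression for a btrfs OSD data partition; the device and mount-point paths are hypothetical.

```c
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Equivalent shell command (hypothetical paths):
     *   mount -o compress-force=zlib /dev/sdb1 /var/lib/ceph/osd/ceph-0
     * "compress=zlib" lets btrfs skip data it deems incompressible;
     * "compress-force=zlib" compresses every write. */
    if (mount("/dev/sdb1", "/var/lib/ceph/osd/ceph-0", "btrfs",
              0, "compress-force=zlib") != 0) {
        perror("mount");
        return 1;
    }
    return 0;
}
```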

BTRFS Compression Accelerated with Intel® QuickAssist Technology

 BTRFS currently supports LZO and ZLIB:
   Lempel-Ziv-Oberhumer (LZO) – a portable, lossless compression library that focuses on compression speed rather than compression ratio.
   ZLIB – lossless data compression based on the DEFLATE algorithm (LZ77 + Huffman coding); good compression ratio, but slow.
 Intel® QuickAssist Technology supports:
   DEFLATE: LZ77 compression followed by Huffman coding, with a GZIP or ZLIB header (the framing choice is illustrated below).
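For reference, a userspace zlib sketch of that framing choice: the same DEFLATE stream gets a ZLIB or GZIP header depending on the windowBits argument passed to deflateInit2(). This is illustrative only and is not the kernel code path; the function name and parameters are hypothetical.

```c
#include <string.h>
#include <zlib.h>

/* Compress in[0..in_len) into out, with a ZLIB or GZIP header.
 * Returns the compressed length, or -1 on error. */
static int deflate_buf(const unsigned char *in, size_t in_len,
                       unsigned char *out, size_t out_len, int gzip_header)
{
    z_stream zs;
    memset(&zs, 0, sizeof(zs));

    /* windowBits = 15 -> ZLIB header; 15 + 16 -> GZIP header. */
    int window_bits = gzip_header ? 15 + 16 : 15;
    if (deflateInit2(&zs, Z_BEST_SPEED, Z_DEFLATED, window_bits,
                     8, Z_DEFAULT_STRATEGY) != Z_OK)
        return -1;

    zs.next_in   = (unsigned char *)in;
    zs.avail_in  = in_len;
    zs.next_out  = out;
    zs.avail_out = out_len;

    int ret = deflate(&zs, Z_FINISH);
    deflateEnd(&zs);
    return ret == Z_STREAM_END ? (int)zs.total_out : -1;
}
```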


Hardware Compression in BTRFS

 BTRFS compresses the application's page buffers before writing them to the storage media.
 The Linux Kernel Crypto Framework (LKCF) selects the hardware engine for compression.
 Data compressed by hardware can be decompressed by the software library, and vice versa.

[Diagram: application writes pass from user space through the VFS page cache into BTRFS, which calls the Linux Kernel Crypto API – software LZO/ZLIB, or async compression through the Intel® QuickAssist Technology driver and hardware – before the compressed data is flushed to the storage media.]

Hardware Compression in BTRFS (Cont.)

 BTRFS submits an “async” compression job with an sg-list containing up to 32 x 4K pages (an input buffer of up to 128KB; the output buffer is pre-allocated).
 The BTRFS compression thread is put to sleep when the “async” compression API is called.
 The BTRFS compression thread is woken up when the hardware completes the compression job; a sketch of this submit-and-sleep pattern follows.
 The hardware can be fully utilized when multiple BTRFS compression threads run in parallel.

[Diagram: btrfs_compress_pages -> zlib_compress_pages_async -> Linux Kernel Crypto API -> Intel® QuickAssist Technology; uncompressed 4K pages are DMA'd to the device, compressed data is DMA'd back, and an interrupt drives the completion callback.]
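A minimal kernel-side sketch of that pattern, assuming a kernel that provides the acomp (asynchronous compression) crypto API with a "zlib-deflate" provider; the in-flight QAT patches described here predate the mainline acomp interface, so this only illustrates the submit-and-sleep flow, and the function name is hypothetical.

```c
#include <crypto/acompress.h>
#include <linux/scatterlist.h>

/* Compress one buffer via an async crypto provider; the calling thread
 * sleeps in crypto_wait_req() until the completion callback fires. */
static int compress_extent(void *src_buf, unsigned int src_len,
                           void *dst_buf, unsigned int dst_len)
{
    struct scatterlist src, dst;
    struct crypto_acomp *tfm;
    struct acomp_req *req;
    DECLARE_CRYPTO_WAIT(wait);
    int ret;

    /* Picks the highest-priority "zlib-deflate" provider – the hardware
     * driver when one is registered, software otherwise. */
    tfm = crypto_alloc_acomp("zlib-deflate", 0, 0);
    if (IS_ERR(tfm))
        return PTR_ERR(tfm);

    req = acomp_request_alloc(tfm);
    if (!req) {
        crypto_free_acomp(tfm);
        return -ENOMEM;
    }

    /* Real code would chain up to 32 x 4K pages; one entry for brevity. */
    sg_init_one(&src, src_buf, src_len);
    sg_init_one(&dst, dst_buf, dst_len);
    acomp_request_set_params(req, &src, &dst, src_len, dst_len);
    acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
                               crypto_req_done, &wait);

    /* Submit and sleep until the device's interrupt-driven callback runs. */
    ret = crypto_wait_req(crypto_acomp_compress(req), &wait);
    if (!ret)
        ret = req->dlen;    /* compressed length */

    acomp_request_free(req);
    crypto_free_acomp(tfm);
    return ret;
}
```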

Compression in Ceph OSD

Compute  Ceph OSD with BTRFS can support build- ComputeNode in compression: Node  Transparent, real-time compression in Network the filesystem level.  Reduce the amount of data written to local disk, and reduce disk I/O. OSD OSD  Hardware accelerator can be plugged in to free up OSDs’ CPU. BTRFS

Disk Disk

Intel® DH8955


Benchmark - Hardware Setup

[Diagram: benchmark setup.
Client (FIO, Linux): 2 x Xeon E5-2699 v3 (Haswell) @ 2.30GHz, 64GB DDR4, 40Gb NIC.
Server (Ceph cluster, Linux): 2 x Xeon E5-2699 v3 (Haswell) @ 2.30GHz, 128GB DDR4, LSI00300 HBA to a JBOD of 12 SSDs, 2 x NVMe, 2 x Intel® DH8955 PCIe plug-in cards.]

Benchmark - Ceph Configuration

MON  Deploy Ceph OSD on top of BTRFS as backend OSD-1 OSD-3 OSD-23 OSD-2 OSD-4 OSD-24 filesystem.  Deploy 2 OSDs on 1 SSD  24x OSDs in total. BTRFS …  2x NVMe for journal.  Data written to Ceph OSD

SSD-1 SSD-2 SSD-12 is compressed by Intel® QuickAssist Technology NVMe-1 (Intel® DH8955 plug-in Journal Intel® Intel® card). NVMe-2 DH8955 DH8955 Journal 21

Test Methodology

 64 FIO threads are started on the client; each writes / reads a 2GB file to / from the Ceph cluster (CephFS / RBD over LIBRADOS) through the network.
 Caches are dropped before the tests; for write tests, all files are synchronized to the OSDs' disks before the test completes.
 The average CPU load and disk utilization on the Ceph OSD nodes, and the FIO throughput, are measured.


Benchmark Configuration Details

Client
CPU        2 x Intel® Xeon E5-2699 v3 (Haswell) @ 2.30GHz (36 cores / 72 threads)
Memory     64GB
Network    40GbE, jumbo frames (MTU=8000)
Test tool  FIO 2.1.2, engine=libaio, bs=64KB, 64 threads

Ceph Cluster
CPU          2 x Intel® Xeon E5-2699 v3 (Haswell) @ 2.30GHz (36 cores / 72 threads)
Memory       128GB
Network      40GbE, jumbo frames (MTU=8000)
HBA          LSI00300
OS           Fedora 22 (kernel 4.1.3)
OSD          24 x OSD, 2 per SSD (S3700), no replica; 2 x NVMe (P3700) for journal; 2400 PGs
Accelerator  Intel® QuickAssist Technology, 2 x Intel® QuickAssist Adapter 8955, dynamic compression, level-1
BTRFS ZLIB   software ZLIB, level-3

Sequential Write

60% disk saving (cRatio 40%) with hardware acceleration, at minimal CPU overhead.

Config    CPU Util (%)   cRatio (%)   Bandwidth (MB/s)
off       13.62%         100%         2910
accel *   15.25%         40%          2960
lzo       28.30%         50%          3003
zlib-3    90.95%         36%          1157

[Chart: OSD CPU utilization over time (seconds) for off / accel / lzo / zlib-3 during the sequential-write run.]

* Intel® QuickAssist Technology DH8955, level-1
** Dataset is random data generated by FIO

Source as of August 2016: Intel internal measurements with dual E5-2699 v3 (18C, 2.3GHz, 145W), HT & Turbo Enabled, Fedora 22 64-bit, kernel 4.1.3, 2 x DH8955 adaptors, DDR4-128GB. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Any difference in system hardware or software design or configuration may affect actual performance. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. For more information go to http://www.intel.com/performance

Sequential Read

Minimal CPU overhead for decompression.

Config    CPU Util (%)   Bandwidth (MB/s)
off       7.33%          2557
accel *   8.76%          2915
lzo       11.81%         3042
zlib-3    26.20%         2913

[Chart: OSD CPU utilization over time (seconds) for off / accel / lzo / zlib-3 during the sequential-read run.]

* Intel® QuickAssist Technology DH8955, level-1

Source as of August 2016: Intel internal measurements with dual E5-2699 v3 (18C, 2.3GHz, 145W), HT & Turbo Enabled, Fedora 22 64-bit, kernel 4.1.3, 2 x DH8955 adaptors, DDR4-128GB. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Any difference in system hardware or software design or configuration may affect actual performance. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. For more information go to http://www.intel.com/performance

Key Takeaways

 Data compression offers improved storage efficiency.
 Filesystem-level compression in the OSD is transparent to the Ceph software stack.
 Data compression is CPU intensive; a better compression ratio requires more CPU cost.
 Software and hardware offload methods are available for accelerating compression, including ISA-L, Intel® QuickAssist Technology, and FPGAs.
 Hardware offloading can greatly reduce CPU cost and optimize disk utilization and I/O in storage infrastructure.


Additional Sources of Information

 More information on Intel® QuickAssist Technology and Intel® QuickAssist software solutions:
   Software package and engine are available at 01.org: Intel QuickAssist Technology | 01.org
   For more details on Intel® QuickAssist Technology, visit: http://www.intel.com/quickassist
   Intel Network Builders: https://networkbuilders.intel.com/ecosystem
 Intel® QuickAssist Technology storage testimonials:
   IBM v7000Z w/ QuickAssist: http://www-03.ibm.com/systems/storage/disk/storwize_v7000/overview.html
   https://builders.intel.com/docs/networkbuilders/Accelerating-data-economics-IBM-flashSystem-and-Intel-quick-assist-technology.pdf
 Intel® QuickAssist Adapter for Servers: http://ark.intel.com/products/79483/Intel-QuickAssist-Adapter-8950
 DEFLATE Compressed Data Format Specification version 1.3: http://tools.ietf.org/html/rfc1951
 BTRFS: https://btrfs.wiki.kernel.org
 Ceph: http://ceph.com/

Additional Information


QAT Attach Options


Legal Disclaimer

 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations.

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel, Intel logo, Intel Core, Intel Inside, Intel Inside logo, Intel Ethernet, Intel QuickAssist Technology, Intel Flow Director, Intel Solid State Drives, Intel Intelligent Storage Acceleration Library, Itanium, Xeon, and Xeon Inside are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel® 64 architecture. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information.

No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology is a security technology under development by Intel and requires for operation a computer system with Intel® Virtualization Technology, an Intel Trusted Execution Technology-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel or other compatible measured virtual machine monitor. In addition, Intel Trusted Execution Technology requires the system to contain a TPMv1.2 as defined by the Trusted Computing Group and specific software for some uses. See http://www.intel.com/technology/security/ for more information.

Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM) and, for some uses, certain platform software enabled for it. Functionality, performance or other benefits will vary depending on hardware and software configurations and may require a BIOS update. Software applications may not be compatible with all operating systems. Please check with your application vendor.

* Other names and brands may be claimed as the property of others.

Other vendors are listed by Intel as a convenience to Intel's general customer base, but Intel does not make any representations or warranties whatsoever regarding quality, reliability, functionality, or compatibility of these devices. This list and/or these devices may be subject to change without notice.  Copyright © 2016, Intel Corporation. All rights reserved.

