Hardware Based Compression in Ceph OSD with BTRFS

Weigang Li ([email protected]) Brian Will ([email protected]) Praveen Mosur ([email protected]) Edward Pullin ([email protected])

Intel Corporation

2016 Storage Developer Conference. © Intel Corp. All Rights Reserved.

Agenda

- Motivation: value vs. cost
- Hardware based compression in Ceph OSD
- Benefit of hardware acceleration
- Compression algorithms
- Compression in BTRFS & Ceph
- PoC implementation
- Benchmark configuration & result


Compression: Why

More Data: 10x Data Growth from 2013 to 2020 [1]

- Digital universe doubling every two years
- Data from the Internet of Things will grow 5x
- % of data that is analyzed grows from 22% to 37%

Compression can save storage capacity.

[1] Source: April 2014, EMC* Digital Universe with Research & Analysis by IDC*


Compression: Cost

- 1GB Calgary Corpus* file compressed on one CPU core (HT).
- Compression ratio: less is better. cRatio = compressed size / original size.
- Compression is CPU intensive: a better compression ratio requires more CPU time.

| Compression tool | lzo -1 | gzip-1 | gzip-6 | bzip2 |
|------------------|--------|--------|--------|-------|
| real (s)         | 6.37   | 22.75  | 55.15  | 83.74 |
| user (s)         | 4.07   | 22.09  | 54.51  | 83.18 |
| sys (s)          | 0.79   | 0.64   | 0.59   | 0.52  |
| cRatio %         | 51%    | 38%    | 33%    | 28%   |

* The Calgary Corpus is a collection of text and binary data files, commonly used for comparing data compression algorithms.

Source as of August 2016: Intel internal measurements with dual E5-2699 v3 (18C, 2.3GHz, 145W), HT & Turbo Enabled, Fedora 22 64-bit, DDR4-128GB. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Any difference in system hardware or software design or configuration may affect actual performance. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. For more information go to http://www.intel.com/performance
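To make the level-versus-CPU trade-off above concrete, here is a minimal user-space sketch (not part of the original deck) that times zlib's compress2() at levels 1 and 6 on a file and reports the resulting cRatio; the file path is taken from the command line and the program links against -lz.

```c
/* Minimal sketch (not from the deck): time zlib at two compression
 * levels and report cRatio = compressed size / original size.
 * Build: cc ratio.c -lz */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zlib.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    fseek(f, 0, SEEK_END);
    long in_len = ftell(f);
    rewind(f);

    unsigned char *in = malloc(in_len);
    uLong bound = compressBound(in_len);
    unsigned char *out = malloc(bound);
    if (!in || !out || fread(in, 1, in_len, f) != (size_t)in_len) {
        fprintf(stderr, "read failed\n");
        return 1;
    }
    fclose(f);

    for (int level = 1; level <= 6; level += 5) {   /* levels 1 and 6 */
        uLongf out_len = bound;
        clock_t t0 = clock();
        if (compress2(out, &out_len, in, in_len, level) != Z_OK)
            return 1;
        double cpu_s = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("level %d: %.2f s CPU, cRatio %.0f%%\n",
               level, cpu_s, 100.0 * out_len / in_len);
    }

    free(in);
    free(out);
    return 0;
}
```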

Hardware Acceleration - Intel® QuickAssist Technology

Acceleration services:
- Bulk Cryptography: security (symmetric encryption and authentication) for data in flight and at rest
- Public Key Cryptography: secure key establishment (asymmetric encryption, digital signatures, key exchange)
- Compression: lossless data compression for data in flight and at rest

Form factors:
- Chipset: connects to the CPU via on-board PCI Express* lanes
- PCI Express* plug-in card: connects to the CPU via off-board PCI Express* lanes (slot)
- SoC: connects to the CPU via on-chip interconnect

Intel® QuickAssist Technology integrates hardware acceleration of compute intensive workloads (specifically, cryptography & compression) on Intel® Architecture Platforms.

Benefit of Hardware Acceleration

Less CPU load, better compression ratio.

| Compression tool | lzo  | accel-1 * | accel-6 ** | gzip-1 | gzip-6 | bzip2 |
|------------------|------|-----------|------------|--------|--------|-------|
| real (s)         | 6.37 | 4.01      | 8.01       | 22.75  | 55.15  | 83.74 |
| user (s)         | 4.07 | 0.49      | 0.45       | 22.09  | 54.51  | 83.18 |
| sys (s)          | 0.79 | 1.31      | 1.22       | 0.64   | 0.59   | 0.52  |
| cRatio %         | 51%  | 40%       | 38%        | 38%    | 33%    | 28%   |

Workload: compress a 1GB Calgary Corpus file.
* Intel® QuickAssist Technology DH8955, level 1
** Intel® QuickAssist Technology DH8955, level 6

Source as of August 2016: Intel internal measurements with dual E5-2699 v3 (18C, 2.3GHz, 145W), HT & Turbo Enabled, Fedora 22 64-bit, 1 x DH8955 adapter, DDR4-128GB. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Any difference in system hardware or software design or configuration may affect actual performance. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. For more information go to http://www.intel.com/performance

BTRFS Introduction

- Copy on Write (CoW) filesystem for Linux.
- “Has the correct feature set and roadmap to serve Ceph in the long-term, and is recommended for testing, development, and any non-critical deployments… This compelling list of features makes btrfs the ideal choice for Ceph clusters”*
- Native compression support:
  - Mount with “compress” or “compress-force” (see the mount sketch below).
  - ZLIB / LZO supported.
  - Compresses up to 128KB at a time.

* http://docs.ceph.com/docs/hammer/rados/configuration/filesystem-recommendations/
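For illustration (not from the deck), the following sketch enables transparent BTRFS compression at mount time through the mount(2) system call; the device path and mount point are placeholders, and the option string is the equivalent of mounting with `-o compress-force=zlib`.

```c
/* Minimal sketch (not from the deck): mount a BTRFS volume with
 * transparent compression enabled. Device and mount point are
 * placeholders; run as root. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* "compress=zlib" skips data that does not look compressible;
     * "compress-force=zlib" attempts to compress every extent. */
    if (mount("/dev/sdb1", "/var/lib/ceph/osd/osd-0", "btrfs", 0,
              "compress-force=zlib") != 0) {
        perror("mount");
        return 1;
    }
    return 0;
}
```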

Compression Algorithm

BTRFS currently supports LZO and ZLIB:
- Lempel-Ziv-Oberhumer (LZO) is a portable lossless data compression library that focuses on compression speed rather than compression ratio.
- ZLIB provides lossless data compression based on the DEFLATE compression algorithm (LZ77 + Huffman coding): good compression ratio, but slow.

Intel® QuickAssist Technology supports:
- DEFLATE: LZ77 compression followed by Huffman coding, with a GZIP or ZLIB header.
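A small user-space sketch (not part of the deck) of that header choice: the same DEFLATE stream can be wrapped with a ZLIB or a GZIP header depending on the windowBits value passed to zlib's deflateInit2(). The helper name and buffers are illustrative; link with -lz.

```c
/* Illustrative sketch (not from the deck): DEFLATE with a ZLIB or GZIP
 * wrapper, selected via windowBits.
 *   window_bits = 15      -> ZLIB header + Adler-32 trailer
 *   window_bits = 15 + 16 -> GZIP header + CRC-32 trailer */
#include <string.h>
#include <zlib.h>

static int deflate_buffer(const unsigned char *in, size_t in_len,
                          unsigned char *out, size_t out_cap,
                          int window_bits, size_t *out_len)
{
    z_stream zs;
    memset(&zs, 0, sizeof(zs));
    if (deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     window_bits, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return -1;

    zs.next_in   = (unsigned char *)in;
    zs.avail_in  = in_len;
    zs.next_out  = out;
    zs.avail_out = out_cap;

    int rc = deflate(&zs, Z_FINISH);   /* one-shot compression */
    *out_len = zs.total_out;
    deflateEnd(&zs);
    return rc == Z_STREAM_END ? 0 : -1;
}
```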


Hardware Compression in BTRFS

- BTRFS compresses the page buffers before writing to the storage media.
- The Linux Kernel Crypto Framework (LKCF) selects a hardware engine for compression (see the allocation sketch below).
- Data compressed by hardware can be decompressed by the software library, and vice versa.

[Diagram: an application's write passes through the system call layer, VFS and page cache into BTRFS (LZO / ZLIB); BTRFS submits async compress jobs through the Linux Kernel Crypto API to the Intel® QuickAssist Technology driver and hardware, and the compressed data is flushed to the storage media once the job is done.]
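A kernel-side sketch of that engine selection, assuming the acomp interface of the Linux Kernel Crypto Framework (the interface was merged upstream after this PoC, which ran on kernel 4.1.3, so the PoC's actual calls may differ):

```c
/* Sketch (assumed acomp API, not the PoC's exact code): asking the
 * crypto framework for a "deflate" transform lets it pick the
 * highest-priority registered implementation, e.g. a QuickAssist-backed
 * driver instead of the software deflate module. */
#include <crypto/acompress.h>

static struct crypto_acomp *btrfs_alloc_deflate_tfm(void)
{
    /* Hardware drivers register "deflate" with a higher cra_priority,
     * so they win the selection whenever the accelerator is present;
     * otherwise the software implementation is returned. */
    return crypto_alloc_acomp("deflate", 0, 0);
}
```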

Hardware Compression in BTRFS (Cont.)

- BTRFS submits an “async” compression job with an sg-list containing up to 32 x 4K pages.
- The BTRFS compression thread is put to sleep when the “async” compression API is called.
- The BTRFS compression thread is woken up when the hardware completes the compression job.
- The hardware can be fully utilized when multiple BTRFS compression threads run in parallel (see the submission sketch below).

[Diagram: btrfs_compress_pages calls zlib_compress_pages_async; uncompressed data (4K pages, input buffer up to 128KB) and a pre-allocated output buffer are handed to the Linux Kernel Crypto API, which DMAs the input and output to/from Intel® QuickAssist Technology; an interrupt triggers the callback while the submitting thread sleeps.]
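Building on that, a sketch of the async submission path the slide describes, again under the assumption of the upstream acomp API (callback and request signatures vary across kernel versions); the btrfs_-prefixed helper names here are illustrative, not the PoC's:

```c
/* Sketch (assumed acomp API): build a scatterlist over up to 32 x 4K
 * input pages (128KB), submit an async compress job, and sleep until
 * the hardware completion callback fires. */
#include <linux/completion.h>
#include <linux/scatterlist.h>
#include <crypto/acompress.h>

struct btrfs_acomp_ctx {
    struct completion done;
    int err;
};

static void btrfs_acomp_done(struct crypto_async_request *areq, int err)
{
    struct btrfs_acomp_ctx *ctx = areq->data;

    ctx->err = err;
    complete(&ctx->done);            /* wake the sleeping BTRFS thread */
}

static int btrfs_compress_pages_async(struct crypto_acomp *tfm,
                                      struct page **in_pages, int nr_pages,
                                      struct scatterlist *out_sg,
                                      unsigned int out_len)
{
    struct scatterlist in_sg[32];    /* up to 32 x 4K pages = 128KB */
    struct btrfs_acomp_ctx ctx;
    struct acomp_req *req;
    int i, ret;

    sg_init_table(in_sg, nr_pages);
    for (i = 0; i < nr_pages; i++)
        sg_set_page(&in_sg[i], in_pages[i], PAGE_SIZE, 0);

    req = acomp_request_alloc(tfm);
    if (!req)
        return -ENOMEM;

    init_completion(&ctx.done);
    acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
                               btrfs_acomp_done, &ctx);
    acomp_request_set_params(req, in_sg, out_sg,
                             nr_pages * PAGE_SIZE, out_len);

    ret = crypto_acomp_compress(req);
    if (ret == -EINPROGRESS || ret == -EBUSY) {
        /* hardware owns the job: sleep until the interrupt-driven callback */
        wait_for_completion(&ctx.done);
        ret = ctx.err;
    }

    acomp_request_free(req);
    return ret;
}
```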

Compression in Ceph OSD

- Ceph is a distributed object store and file system designed to provide excellent performance, reliability and scalability, but it does not currently support native compression.
- Ceph OSD with BTRFS can provide built-in compression:
  - Transparent, real-time compression at the filesystem level.
  - Reduces the amount of data written to the local disk, and reduces disk I/O.
  - A hardware accelerator can be plugged in to free up the OSDs' CPU.
  - No benefit to network bandwidth, since data is compressed only after it reaches the OSD.

[Diagram: compute nodes reach the OSD nodes over the network; each OSD runs on BTRFS backed by local disks, with an Intel® DH8955 accelerator in the OSD node.]

Benchmark - Hardware Setup

[Diagram: FIO client and Ceph cluster server, each running Linux on dual Intel® Xeon® CPU E5-2699 v3 (Haswell) @ 2.30GHz, connected over a 40Gb NIC. Client: 64GB DDR4. Server: 128GB DDR4, LSI00300 HBA attached to a JBOD of 12 SSDs, 2 x NVMe, and Intel® DH8955 plug-in cards on PCIe.]

Benchmark - Ceph Configuration

MON  Deploy Ceph OSD on top of BTRFS as backend OSD-1 OSD-3 OSD-23 OSD-2 OSD-4 OSD-24 filesystem.  Deploy 2 OSDs on 1 SSD  24x OSDs in total. BTRFS …  2x NVMe for journal.  Data written to Ceph OSD

SSD-1 SSD-2 SSD-12 is compressed by Intel® QuickAssist Technology NVMe-1 (Intel® DH8955 plug-in Journal Intel® Intel® card). NVMe-2 DH8955 DH8955 Journal 13

Test Methodology

- Start 64 FIO threads on the client; each writes / reads a 2GB file to / from the Ceph cluster through the network.
- Drop caches before tests. For write tests, all files are synchronized to the OSDs' disks before the tests complete (see the sketch below).
- The average CPU load and disk utilization on the Ceph OSDs and the FIO throughput are measured.

[Diagram: FIO threads on the client drive CephFS / RBD through LIBRADOS into RADOS, which is served by the Ceph OSDs and MON.]
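A minimal sketch (not from the deck) of the measurement hygiene described above: drop the page cache before a run and fsync() written files so the results reflect data that actually reached the OSD disks; paths are placeholders and the program must run as root.

```c
/* Sketch (not from the deck): pre-test cache drop and post-write sync. */
#include <fcntl.h>
#include <unistd.h>

static void drop_caches(void)
{
    int fd;

    sync();                              /* flush dirty pages first */
    fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
    if (fd >= 0) {
        write(fd, "3", 1);               /* free page cache, dentries, inodes */
        close(fd);
    }
}

static int sync_file(const char *path)
{
    int fd = open(path, O_WRONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);                      /* block until data is on stable storage */
    close(fd);
    return rc;
}
```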


Benchmark Configuration Details

Client

| CPU       | 2 x Intel® Xeon® CPU E5-2699 v3 (Haswell) @ 2.30GHz (36 cores / 72 threads) |
| Memory    | 64GB |
| Network   | 40GbE, jumbo frame: MTU=8000 |
| Test Tool | FIO 2.1.2, engine=libaio, bs=64KB, 64 threads |

Ceph Cluster

| CPU         | 2 x Intel® Xeon® CPU E5-2699 v3 (Haswell) @ 2.30GHz (36 cores / 72 threads) |
| Memory      | 128GB |
| Network     | 40GbE, jumbo frame: MTU=8000 |
| HBA         | LSI00300 |
| OS          | Fedora 22 (Kernel 4.1.3) |
| OSD         | 24 x OSD, 2 per SSD (S3700), no replica; 2 x NVMe (P3700) for journal; 2400 pgs |
| Accelerator | Intel® QuickAssist Technology, 2 x Intel® QuickAssist Adapter 8955, dynamic compression, level 1 |
| BTRFS ZLIB  | S/W ZLIB, level 3 |

Sequential Write

60% disk saving, with minimal CPU overhead.

|                  | off    | accel * | lzo    | zlib-3 |
|------------------|--------|---------|--------|--------|
| CPU Util (%)     | 13.62% | 15.25%  | 28.30% | 90.95% |
| cRatio (%)       | 100%   | 40%     | 50%    | 36%    |
| Bandwidth (MB/s) | 2910   | 2960    | 3003   | 1157   |

[Chart: OSD CPU utilization (%) over time (seconds) during the sequential write test for off, accel *, lzo and zlib-3.]

* Intel® QuickAssist Technology DH8955, level 1
** Dataset is random data generated by FIO.

Source as of August 2016: Intel internal measurements with dual E5-2699 v3 (18C, 2.3GHz, 145W), HT & Turbo Enabled, Fedora 22 64-bit, kernel 4.1.3, 2 x DH8955 adapter, DDR4-128GB. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Any difference in system hardware or software design or configuration may affect actual performance. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. For more information go to http://www.intel.com/performance

Sequential Read

Minimal CPU overhead for decompression.

|                  | off   | accel * | lzo    | zlib-3 |
|------------------|-------|---------|--------|--------|
| CPU Util (%)     | 7.33% | 8.76%   | 11.81% | 26.20% |
| Bandwidth (MB/s) | 2557  | 2915    | 3042   | 2913   |

[Chart: OSD CPU utilization (%) over time (seconds) during the sequential read test for off, accel *, lzo and zlib-3.]

* Intel® QuickAssist Technology DH8955, level 1

Source as of August 2016: Intel internal measurements with dual E5-2699 v3 (18C, 2.3GHz, 145W), HT & Turbo Enabled, Fedora 22 64-bit, kernel 4.1.3, 2 x DH8955 adapter, DDR4-128GB. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Any difference in system hardware or software design or configuration may affect actual performance. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. For more information go to http://www.intel.com/performance

Key Takeaways

- Data compression can reduce disk I/O and disk utilization.
- Data compression is CPU intensive; a better compression ratio requires more CPU.
- Hardware offload can greatly reduce CPU cost and optimize disk utilization & I/O in the storage infrastructure.
- Filesystem-level compression in the OSD is transparent to the Ceph software stack.


References

- DEFLATE Compressed Data Format Specification version 1.3: http://tools.ietf.org/html/rfc1951
- BTRFS: https://btrfs.wiki.kernel.org
- Ceph: http://ceph.com/
- More information on Intel® QuickAssist Technology & Intel® QuickAssist software solutions:
  - Software package and engine are available at 01.org: Intel QuickAssist Technology | 01.org
  - For more details on Intel® QuickAssist Technology visit: http://www.intel.com/quickassist
  - Intel Network Builders: https://networkbuilders.intel.com/ecosystem
- Intel® QuickAssist Technology storage testimonials:
  - IBM v7000Z w/ QuickAssist: http://www-03.ibm.com/systems/storage/disk/storwize_v7000/overview.html
  - https://builders.intel.com/docs/networkbuilders/Accelerating-data-economics-IBM-flashSystem-and-Intel-quick-assist-technology.pdf
- Intel® QuickAssist Adapter for Servers: http://ark.intel.com/products/79483/Intel-QuickAssist-Adapter-8950

Legal Disclaimer

 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel, Intel logo, Intel Core, Intel Inside, Intel Inside logo, Intel Ethernet, Intel QuickAssist Technology, Intel Flow Director, Intel Solid State Drives, Intel Intelligent Storage Acceleration Library, Itanium, Xeon, and Xeon Inside are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel® 64 architecture. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information.

No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology is a security technology under development by Intel and requires for operation a computer system with Intel® Virtualization Technology, an Intel Trusted Execution Technology-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel or other compatible measured virtual machine monitor. In addition, Intel Trusted Execution Technology requires the system to contain a TPMv1.2 as defined by the Trusted Computing Group and specific software for some uses. See http://www.intel.com/technology/security/ for more information.

Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM) and, for some uses, certain platform software enabled for it. Functionality, performance or other benefits will vary depending on hardware and software configurations and may require a BIOS update. Software applications may not be compatible with all operating systems. Please check with your application vendor.

* Other names and brands may be claimed as the property of others.

Other vendors are listed by Intel as a convenience to Intel's general customer base, but Intel does not make any representations or warranties whatsoever regarding quality, reliability, functionality, or compatibility of these devices. This list and/or these devices may be subject to change without notice.

Copyright © 2016, Intel Corporation. All rights reserved.

