Hardware-Based Compression in Ceph OSD with BTRFS
Weigang Li (weigang.li@intel.com) Tushar Gohad ([email protected])
Data Center Group Intel Corporation
2016 Storage Developer Conference. © Intel Corp. All Rights Reserved.

Credits
This work wouldn’t have been possible without contributions from –
Reddy Chagam ([email protected])
Brian Will ([email protected])
Praveen Mosur ([email protected])
Edward Pullin ([email protected])
Agenda
• Ceph – A Quick Primer
• Storage Efficiency and Security Features
• Storage Workload Acceleration – Software and Hardware Approaches
• Compression in Ceph OSD with BTRFS
  - Compression in BTRFS and Ceph
  - Hardware Acceleration with QAT
  - PoC Implementation
• Performance Results
• Key Takeaways
Ceph
• Open-source, object-based scale-out storage system
• Software-defined, hardware-agnostic – runs on commodity hardware
• Object, Block and File support in a unified storage cluster
• Highly durable, available – replication, erasure coding
• Replicates and re-balances dynamically
Image source: http://ceph.com/ceph-storage
Ceph
• Scalability – CRUSH data placement, no single point of failure
• Enterprise features – snapshots, cloning, mirroring
• Most popular block storage for OpenStack use cases
• 10 years of hardening, vibrant community
Source: http://www.openstack.org/assets/survey/April-2016-User-Survey-Report.pdf
Ceph: Architecture
[Architecture diagram: OSD daemons running on commodity servers alongside monitor (M) nodes. Each OSD sits on a backend – a POSIX filesystem (btrfs, xfs or ext4) or Bluestore with a key-value store – on top of a disk.]
Ceph: Storage Efficiency and Security
Erasure Coding, Compression, Encryption
[Diagram: Ceph clients – RGW object (S3/Swift), RBD block, E2EE – talk to a Ceph cluster of RADOS OSDs over disks. Erasure coding and compression apply at the OSD backend: via the BTRFS filesystem under Filestore, or natively in Bluestore. Encryption options: native dm-crypt/LUKS, or self-encrypting drives.]
Storage Workload Acceleration
Software and Hardware-based Approaches
[Diagram: acceleration options arranged by distance from the CPU core. Software-based: on-core SIMD (ISA-L). Fixed-function and reconfigurable hardware: on-chip, QPI-attach or PCIe-attach – fixed-function ASIC (QAT), reconfigurable FPGA, GPGPU. Trade-offs: ease of programming and application flexibility versus latency and granularity of offload.]
Software-based Acceleration with ISA-L
Intel® Intelligent Storage Acceleration Library
• Collection of optimized low-level storage functions
• Uses SIMD primitives to accelerate storage functions in software
• Primitives supported:
  - Compression – gzip
  - Encryption – AES block ciphers (XTS, CBC, GCM)
  - Erasure Coding – Reed-Solomon
  - CRC (iSCSI, IEEE, T10DIF), RAID5/6
  - Multi-buffer hashes (SHA1, SHA256, SHA512, MD5), multi-hash
• OS-agnostic: Linux, FreeBSD, etc.
• Commercially compatible, free, open source under the BSD license
• https://github.com/01org/isa-l
• https://github.com/01org/isa-l_crypto
• https://software.intel.com/en-us/storage/ISA-L
Hardware-based Acceleration
Intel® QuickAssist Technology
• Bulk Cryptography – security (symmetric encryption and authentication) for data in flight and at rest
• Public Key Cryptography – secure key establishment (asymmetric encryption, digital signatures, key exchange)
• Compression – lossless data compression for data in flight and at rest

Attach options: chipset or plug-in card (connects to the CPU via on-board PCI Express* lanes), or SoC (connects to the CPU via on-chip interconnect).
Open-source Software Support:
• Cryptography – OpenSSL libcrypto, Linux Kernel Crypto Framework
• Data Compression – zlib (user API), BTRFS/ZFS (kernel), Hadoop, databases
Hardware acceleration of compute-intensive workloads (cryptography and compression)
Support for Offloads in Upstream Ceph
Intel® ISA-L and QAT
Erasure Coding
• ISA-L offload support for Reed-Solomon codes – supported since Hammer
Compression
• Filestore: QAT offload for BTRFS compression (kernel patch submission in progress)
• Bluestore: ISA-L offload for zlib compression, supported in upstream master
• Bluestore: QAT offload for zlib compression (work in progress)
Encryption
• Client-side (E2EE): RADOS GW encryption with ISA-L and QAT offloads (work in progress)
• Client-side (E2EE): Qemu-RBD encryption with ISA-L and QAT offloads (under investigation)
Today’s Focus: Compression
More Data – 10x data growth from 2013 to 2020¹
• Digital universe doubling every two years
• Data from the Internet of Things will grow 5x
• The percentage of data that is analyzed grows from 22% to 37%

Compression can save storage capacity.
1 Source: April 2014, EMC* Digital Universe with Research & Analysis by IDC*
Compression: Cost
Compress a 1GB Calgary Corpus* file on one CPU core (HT).

Compression ratio – less is better: cRatio = compressed size / original size. Compression is CPU intensive; a better compression ratio requires more CPU time.

Compression tool | real (s) | user (s) | sys (s) | cRatio
lzo              |     6.37 |     4.07 |    0.79 |    51%
gzip-1           |    22.75 |    22.09 |    0.64 |    38%
gzip-6           |    55.15 |    54.51 |    0.59 |    33%
bzip2            |    83.74 |    83.18 |    0.52 |    28%

Source as of August 2016: Intel internal measurements with dual E5-2699 v3 (18C, 2.3GHz, 145W), HT & Turbo enabled, Fedora 22 64-bit, DDR4 128GB. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Any difference in system hardware or software design or configuration may affect actual performance. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. For more information go to http://www.intel.com/performance

* The Calgary corpus is a collection of text and binary data files, commonly used for comparing data compression algorithms.
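The ratio-versus-CPU trade-off in the table can be reproduced at small scale with stock gzip. A minimal sketch – the sample file below is repetitive generated text (the Calgary Corpus is not assumed to be installed), so the absolute ratios will differ from the table, but the level-1 versus level-6 relationship holds:

```shell
# Build a ~10 MB compressible sample file.
yes "The quick brown fox jumps over the lazy dog" | head -c 10000000 > sample.txt
orig=$(wc -c < sample.txt)

# Compress at a fast level (1) and a slower, tighter level (6) and report
# cRatio = compressed size / original size, as defined above.
for lvl in 1 6; do
  gzip -c -"$lvl" sample.txt > "sample.$lvl.gz"
  comp=$(wc -c < "sample.$lvl.gz")
  awk -v c="$comp" -v o="$orig" -v l="$lvl" \
    'BEGIN { printf "gzip -%s cRatio: %.2f%%\n", l, 100 * c / o }'
done
```

Prefixing each gzip invocation with `time` reproduces the real/user/sys columns as well.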
Benefit of Hardware Acceleration
Less CPU load, better compression ratio.

Compression tool | real (s) | user (s) | sys (s) | cRatio
lzo              |     6.37 |     4.07 |    0.79 |    51%
accel-1 *        |     4.01 |     0.49 |    1.31 |    40%
accel-6 **       |     8.01 |     0.45 |    1.22 |    38%
gzip-1           |    22.75 |    22.09 |    0.64 |    38%
gzip-6           |    55.15 |    54.51 |    0.59 |    33%
bzip2            |    83.74 |    83.18 |    0.52 |    28%

Compress 1GB Calgary Corpus file.
* Intel® QuickAssist Technology DH8955, level 1
** Intel® QuickAssist Technology DH8955, level 6

Source as of August 2016: Intel internal measurements with dual E5-2699 v3 (18C, 2.3GHz, 145W), HT & Turbo enabled, Fedora 22 64-bit, 1 x DH8955 adaptor, DDR4 128GB.
Transparent Compression in Ceph: BTRFS
• Copy-on-write (CoW) filesystem for Linux. “Has the correct feature set and roadmap to serve Ceph in the long-term, and is recommended for testing, development, and any non-critical deployments… This compelling list of features makes btrfs the ideal choice for Ceph clusters.”*
• Native compression support: mount with “compress” or “compress-force”.
• ZLIB / LZO supported.
• Compresses up to 128KB at a time.

* http://docs.ceph.com/docs/hammer/rados/configuration/filesystem-recommendations/
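In practice the mount options look like the following. This is configuration, not a runnable test (it requires root and a real block device); the device path and mount point are placeholders:

```shell
# Mount a btrfs OSD data partition with transparent zlib compression
# (device path and mount point are illustrative).
mount -o compress=zlib /dev/sdb1 /var/lib/ceph/osd/ceph-0

# Or force compression even for data btrfs would otherwise deem incompressible:
mount -o compress-force=zlib /dev/sdb1 /var/lib/ceph/osd/ceph-0

# Equivalent persistent entry in /etc/fstab:
# /dev/sdb1  /var/lib/ceph/osd/ceph-0  btrfs  compress=zlib  0 0
```

Substituting `compress=lzo` selects the faster, lower-ratio algorithm from the previous slide.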
BTRFS Compression Accelerated with Intel® QuickAssist Technology

BTRFS currently supports LZO and ZLIB:
• Lempel-Ziv-Oberhumer (LZO) is a portable, lossless compression library that favors compression speed over compression ratio.
• ZLIB provides lossless data compression based on the DEFLATE algorithm (LZ77 + Huffman coding): good compression ratio, but slow.

Intel® QuickAssist Technology supports:
• DEFLATE: LZ77 compression followed by Huffman coding, with a GZIP or ZLIB header.
Hardware Compression in BTRFS
• BTRFS compresses the page buffers before writing to the storage media.
• LKCF (Linux Kernel Crypto Framework) selects the hardware engine for compression.
• Data compressed by hardware can be decompressed by a software library, and vice versa.

[Diagram: application writes flow through VFS and the page cache into BTRFS, which calls the Linux Kernel Crypto API – LZO/ZLIB in software, or an async compress job handled by the Intel® QuickAssist Technology driver – before flushing compressed data to the storage media.]
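The interoperability point above holds because both producers emit standard DEFLATE streams. Without QAT hardware at hand, the same property can be sketched in software: streams produced with different settings (or by different implementations) all decompress with the same decoder.

```shell
# DEFLATE decompression does not depend on how the stream was produced:
# a level-1 stream (analogous to a fast hardware engine) and a level-9
# stream both decompress with the same software decoder.
printf 'ceph osd btrfs compression test\n' > original.txt
gzip -c -1 original.txt > fast.gz     # fast, lower-ratio stream
gzip -c -9 original.txt > tight.gz    # slow, higher-ratio stream

gunzip -c fast.gz  > out1.txt
gunzip -c tight.gz > out2.txt
cmp original.txt out1.txt && cmp original.txt out2.txt && echo "round-trip OK"
```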
Hardware Compression in BTRFS (Cont.)
• BTRFS submits an “async” compression job (btrfs_compress_pages → zlib_compress_pages_async) with a scatter-gather list containing up to 32 x 4K pages: an input buffer of up to 128KB of uncompressed data and a pre-allocated output buffer for the compressed data.
• The BTRFS compression thread is put to sleep when the “async” compression API is called, and is woken up when the hardware completes the compression job (DMA input/output, completion interrupt, Linux Kernel Crypto API callback).
• The hardware can be fully utilized when multiple BTRFS compression threads run in parallel.
Compression in Ceph OSD
Ceph OSD with BTRFS can support built-in compression:
• Transparent, real-time compression at the filesystem level.
• Reduces the amount of data written to local disk, and reduces disk I/O.
• A hardware accelerator can be plugged in to free up the OSDs’ CPU.

[Diagram: compute nodes reach OSD nodes over the network; each OSD stores data on disk through BTRFS, with an Intel® DH8955 accelerator attached.]
Benchmark - Hardware Setup
[Diagram: FIO client and Ceph cluster server connected by 40Gb NICs, each with dual Intel® Xeon® CPU E5-2699 v3 (Haswell) @ 2.30GHz. Client: DDR4 64GB. Server: DDR4 128GB, LSI00300 HBA to a JBOD of 12 SSDs, 2 x NVMe, and 2 x Intel® DH8955 plug-in cards.]
Benchmark - Ceph Configuration
• Deploy Ceph OSD on top of BTRFS as the backend filesystem.
• Deploy 2 OSDs per SSD – 24 OSDs in total.
• 2 x NVMe for journal.
• Data written to the Ceph OSDs is compressed by Intel® QuickAssist Technology (Intel® DH8955 plug-in cards).

[Diagram: MON plus OSD-1…OSD-24 on BTRFS over SSD-1…SSD-12; journals on NVMe-1/NVMe-2; two Intel® DH8955 cards.]
Test Methodology
• Start 64 FIO threads on the client; each writes / reads a 2GB file to / from the Ceph cluster through the network (CephFS / RBD via LIBRADOS over RADOS).
• Drop caches before tests.
• For write tests, all files are synchronized to the OSDs’ disks before the tests complete.
• The average CPU load and disk utilization on the Ceph OSDs, and the FIO throughput, are measured.
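A fio job file approximating this methodology might look as follows. This is a sketch: the mount point, `direct=1` and the job name are assumptions, since the deck only specifies libaio, bs=64KB, 64 threads and 2GB files:

```shell
# Write out a fio job file mirroring the test: 64 threads, each writing a
# 2GB file sequentially with 64KB blocks via libaio (paths are illustrative).
cat > seq-write.fio <<'EOF'
[global]
ioengine=libaio
direct=1
bs=64k
size=2g
directory=/mnt/cephfs

[seq-write]
rw=write
numjobs=64
EOF

# The read test would use rw=read instead; run with: fio seq-write.fio
cat seq-write.fio
```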
Benchmark Configuration Details
Client
• CPU: 2 x Intel® Xeon® CPU E5-2699 v3 (Haswell) @ 2.30GHz (36 cores, 72 threads)
• Memory: 64GB
• Network: 40GbE, jumbo frame (MTU=8000)
• Test tool: FIO 2.1.2, engine=libaio, bs=64KB, 64 threads

Ceph Cluster
• CPU: 2 x Intel® Xeon® CPU E5-2699 v3 (Haswell) @ 2.30GHz (36 cores, 72 threads)
• Memory: 128GB
• Network: 40GbE, jumbo frame (MTU=8000)
• HBA: LSI00300
• OS: Fedora 22 (kernel 4.1.3)
• OSD: 24 x OSD, 2 per SSD (S3700), no replica; 2 x NVMe (P3700) for journal; 2400 pgs
• Accelerator: Intel® QuickAssist Technology, 2 x Intel® QuickAssist Adapter 8955, dynamic compression, level 1
• BTRFS ZLIB: software ZLIB, level 3
Sequential Write
60% disk saving, with minimal CPU overhead.

Config | CPU Util (%) | cRatio (%) | Bandwidth (MB/s)
off    |       13.62% |       100% |             2910
accel* |       15.25% |        40% |             2960
lzo    |       28.30% |        50% |             3003
zlib-3 |       90.95% |        36% |             1157

[Chart: CPU utilization over time (~105 s) for off, accel*, lzo and zlib-3 during the sequential-write run.]

* Intel® QuickAssist Technology DH8955, level 1
** Dataset is random data generated by FIO

Source as of August 2016: Intel internal measurements with dual E5-2699 v3 (18C, 2.3GHz, 145W), HT & Turbo enabled, Fedora 22 64-bit, kernel 4.1.3, 2 x DH8955 adaptor, DDR4 128GB.
Sequential Read
Minimal CPU overhead for decompression.

Config | CPU Util (%) | Bandwidth (MB/s)
off    |        7.33% |             2557
accel* |        8.76% |             2915
lzo    |       11.81% |             3042
zlib-3 |       26.20% |             2913

[Chart: CPU utilization over time (~40 s) for off, accel*, lzo and zlib-3 during the sequential-read run.]

* Intel® QuickAssist Technology DH8955, level 1

Source as of August 2016: Intel internal measurements with dual E5-2699 v3 (18C, 2.3GHz, 145W), HT & Turbo enabled, Fedora 22 64-bit, kernel 4.1.3, 2 x DH8955 adaptor, DDR4 128GB.
Key Takeaways
• Data compression offers improved storage efficiency.
• Filesystem-level compression in the OSD is transparent to the Ceph software stack.
• Data compression is CPU intensive; a better compression ratio requires more CPU cost.
• Software and hardware offload methods are available for accelerating compression, including ISA-L, Intel® QuickAssist Technology and FPGA.
• Hardware offload can greatly reduce CPU cost and optimize disk utilization and I/O in the storage infrastructure.
Additional Sources of Information
• Intel® QuickAssist Technology software package and engine, available at 01.org: Intel QuickAssist Technology | 01.org
• More details on Intel® QuickAssist Technology: http://www.intel.com/quickassist
• Intel Network Builders: https://networkbuilders.intel.com/ecosystem
• Intel® QuickAssist Technology storage testimonials – IBM v7000Z with QuickAssist: http://www-03.ibm.com/systems/storage/disk/storwize_v7000/overview.html and https://builders.intel.com/docs/networkbuilders/Accelerating-data-economics-IBM-flashSystem-and-Intel-quick-assist-technology.pdf
• Intel® QuickAssist Adapter for Servers: http://ark.intel.com/products/79483/Intel-QuickAssist-Adapter-8950
• DEFLATE Compressed Data Format Specification version 1.3: http://tools.ietf.org/html/rfc1951
• BTRFS: https://btrfs.wiki.kernel.org
• Ceph: http://ceph.com/
Additional Information
QAT Attach Options
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations.
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
Intel, Intel logo, Intel Core, Intel Inside, Intel Inside logo, Intel Ethernet, Intel QuickAssist Technology, Intel Flow Director, Intel Solid State Drives, Intel Intelligent Storage Acceleration Library, Itanium, Xeon, and Xeon Inside are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel® 64 architecture. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information.
No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology is a security technology under development by Intel and requires for operation a computer system with Intel® Virtualization Technology, an Intel Trusted Execution Technology-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel or other compatible measured virtual machine monitor. In addition, Intel Trusted Execution Technology requires the system to contain a TPMv1.2 as defined by the Trusted Computing Group and specific software for some uses. See http://www.intel.com/technology/security/ for more information.
Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM) and, for some uses, certain platform software enabled for it. Functionality, performance or other benefits will vary depending on hardware and software configurations and may require a BIOS update. Software applications may not be compatible with all operating systems. Please check with your application vendor.
* Other names and brands may be claimed as the property of others.
Other vendors are listed by Intel as a convenience to Intel's general customer base, but Intel does not make any representations or warranties whatsoever regarding quality, reliability, functionality, or compatibility of these devices. This list and/or these devices may be subject to change without notice. Copyright © 2016, Intel Corporation. All rights reserved.