
Achieving Cost/Performance Balance Ratio Using Tiered Storage Caching Techniques: A Case Study with CephFS
Michael Poat, Jerome Lauret
Brookhaven National Laboratory

Background & Problem Statement
• STAR has implemented a Ceph distributed storage system, primarily using the POSIX-compliant CephFS for processing QA, recovering DAQ files, scratch space, and backup storage.
• Can fast SSDs speed up CephFS storage? The goal is a balance between IO performance and cost per GB, without breaking the bank.
• Ceph Cache Tiering is not a native feature of CephFS (it is only available with Ceph object storage). See M. Poat, J. Lauret, "Performance and Advanced Data Placement Techniques with Ceph's Distributed Storage System", J. Phys.: Conf. Ser. 223 (to be published).
• Can we implement a low-level caching mechanism, undetected by Ceph, that gives us the IO performance we desire?
• Three low-level disk caching techniques were investigated (flashcache, dm-cache, and bcache).
• Multiple disk configurations were implemented in CephFS, and single- and multi-thread IO performance tests were run to measure the performance impact.

Disk Caching Techniques
• In each technique the SSD acts as the cache and the HDD acts as the backing storage.

Bare Disk IOzone – 4096KB Multi-Thread Write
• A 1TB Mushkin MKNSSDRE1TB (SSD) and a 2TB Seagate ST2000NM0001 (HDD) were used; the RAID0 device is composed of 3 HDDs.
• The IO performance test writes 4096KB chunks (4096KB is the Ceph block size) as a function of the number of threads (x-axis), with throughput shown in MB/s.
• The SSD performs ~2-2.5x faster than the bare HDD.
• The SSD outperforms RAID0 at a low number of threads and reaches nearly the same performance at a high number of threads.
[Figure: bare disk IOzone 4096KB multi-thread write, MB/s vs. number of threads, for the Mushkin SSD, Seagate HDD, and 3-HDD RAID0.]
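To make the test methodology concrete, the sketch below reproduces the shape of such a multi-threaded, fixed-4096KB-chunk write test in Python (the real measurements were taken with IOzone). The target directory, file sizes, and thread counts are illustrative assumptions, not the exact IOzone parameters.

```python
# Minimal sketch of a multi-threaded fixed-chunk write test (IOzone-like).
# TARGET_DIR and the per-thread data volume are assumptions for illustration.
import os, time, threading

CHUNK = 4096 * 1024          # 4096KB chunk, matching the Ceph block size
CHUNKS_PER_THREAD = 256      # ~1GB written per thread
TARGET_DIR = "/mnt/testdisk" # assumed mount point of the device under test

def writer(idx):
    path = os.path.join(TARGET_DIR, f"iotest_{idx}.dat")
    buf = os.urandom(CHUNK)
    with open(path, "wb") as f:
        for _ in range(CHUNKS_PER_THREAD):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())     # make sure the data reaches the device

def run(threads):
    workers = [threading.Thread(target=writer, args=(i,)) for i in range(threads)]
    start = time.time()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    elapsed = time.time() - start
    mb = threads * CHUNKS_PER_THREAD * CHUNK / 1e6
    return mb / elapsed          # aggregate MB/s

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32, 48):   # thread counts on the x-axis
        print(f"{n:2d} threads: {run(n):7.1f} MB/s")
```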
bcache
• bcache is a kernel block-layer cache: one or more SSDs are mapped to one or more HDDs to act as a cache.
• The 1SSD:1HDD bcache device converges with bare-SSD performance at a low number of threads and drops off at a higher number of threads.
• The 1SSD:3HDD configuration (aggregate of all 3 bcache devices) shows very poor performance, which was not expected.
• IO is directed to the SSD; bcache slows down IO under heavy load when multiple bcache devices are created.
[Figure: bcache IOzone 4096KB multi-thread write, MB/s vs. number of threads, for bcache 1SSD:1HDD, bcache 1SSD:3HDD, the bare SSD, HDD, and RAID0.]

dm-cache
• dm-cache is the device-mapper caching technique: one or more SSDs can be mapped to one or more HDDs to act as a cache.
• dm-cache requires 3 logical volumes in total:
  o 'Metadata' and 'Cache' volumes on the SSD
  o 'Origin' volume on the HDD
• The dm-cache device is set to writeback mode.
• IO performance is similar for 1SSD:1HDD and 1SSD:3HDD: IO is written to the SSD, and under heavy load IO switches to writethrough on the backing HDDs.
[Figure: dm-cache IOzone 4096KB multi-thread write, MB/s vs. number of threads, for dm-cache 1SSD:1HDD, dm-cache 1SSD:3HDD, the bare SSD, HDD, and RAID0.]
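For reference, one possible way to build this dm-cache layout is through LVM, with the 'Cache' and 'Metadata' volumes on the SSD and the 'Origin' volume on the HDD, combined in writeback mode. This is a hedged sketch only: the device names, sizes, and volume-group name are assumptions, the exact options differ between LVM versions, and the commands are destructive.

```python
# Hedged sketch of an LVM-based dm-cache setup (metadata + cache on the SSD,
# origin on the HDD, writeback mode). Devices, sizes and VG name are assumed.
import subprocess

SSD, HDD, VG = "/dev/sdb", "/dev/sdc", "vg_osd0"   # assumptions for illustration

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# One volume group holding both the SSD and the backing HDD
run(["pvcreate", SSD, HDD])
run(["vgcreate", VG, SSD, HDD])

# 'Origin' volume on the HDD, 'Cache' and 'Metadata' volumes on the SSD
run(["lvcreate", "-n", "origin", "-l", "100%PVS", VG, HDD])
run(["lvcreate", "-n", "cache", "-L", "900G", VG, SSD])
run(["lvcreate", "-n", "cache_meta", "-L", "1G", VG, SSD])

# Combine cache + metadata into a writeback cache pool, then attach it to the
# origin volume; the resulting logical volume is what the Ceph OSD would use.
run(["lvconvert", "--yes", "--type", "cache-pool", "--cachemode", "writeback",
     "--poolmetadata", f"{VG}/cache_meta", f"{VG}/cache"])
run(["lvconvert", "--yes", "--type", "cache", "--cachepool", f"{VG}/cache",
     f"{VG}/origin"])
```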

Analysis Procedures
• Single- and multi-thread IOzone performance tests were run across all devices for each caching technique.
• bcache and dm-cache configurations were implemented (flashcache was excluded: it is not supported by Scientific Linux and is no longer natively maintained).
• bcache, dm-cache, bare-HDD, and 3x HDD RAID0 CephFS clusters were benchmarked with single-thread IOzone tests and multi-threaded, multi-client IOzone tests.

STAR Ceph Distributed Storage System Configuration
• 30 nodes with 4x 2TB SAS HDDs each.
• Replication 3, for a total of 80TB of redundant, fault-tolerant storage (30 nodes x 4 x 2TB = 240TB raw / 3 replicas).
• The primary use of Ceph is to leverage the POSIX-compliant CephFS (NFS-like) mountable storage for users.
• Applicable uses: processing QA, recovering DAQ files, scratch space, backup store.
• On each node one HDD was replaced with an SSD. The SSD was partitioned into 3 and each partition was overlaid on top of a remaining HDD (bcache/dm-cache), leaving 3 OSDs per node. While sacrificing disk space, we hoped that directing all I/O through the SSD would result in better overall I/O performance.
• The stock 4-HDD Ceph node configuration was thus transitioned to clusters with 1SSD:3HDD, 1SSD:RAID0, and standalone RAID0 configurations.

  Drive                            Size  Cost  Cost per TB
  Seagate ST2000NM0001 (SAS HDD)   2TB   $120  $60
  Mushkin MKNSSDRE1TB (SSD)        1TB   $270  $270

Data Placement Techniques: Findings
• Ceph has built-in performance-based data placement techniques: OSD pool mapping, primary affinity, OSD journals on SSDs, and cache tiering. These approaches were applied in our past ACAT 2016 work.

CephFS Performance Results
• Single-thread dd write performance and multi-client, multi-thread write performance (100 concurrent IO threads into CephFS) were measured.
• Four CephFS configurations were compared: 'Stock' 4HDD, bcache 1SSD:3HDD, dm-cache 1SSD:3HDD, and RAID0:
  o 'Stock' 4HDD – the best-performing cluster configuration.
  o bcache 1SSD:3HDD – the worst-performing cluster.
  o dm-cache 1SSD:3HDD – performance peak similar to the 'Stock' cluster, due to the switch-over to writethrough under heavy IO.
  o RAID0 – shown as a reference.
• The 'Stock' cluster and RAID0 show similar performance; the 'Stock' cluster performs the best and most consistently.
• bcache is the least performant; dm-cache switches to writethrough under load, so there is no IO gain. Both show poor performance compared to the HDD-only clusters.
• With the 1SSD:3HDD configuration, 1 SSD cannot outperform the aggregate of 3 HDDs, which is detrimental to performance.
[Figures: single-client dd write into CephFS and 10 clients x 10 dd writes into CephFS, MB/s (per host and total) vs. block size (MB), for the Stock cluster, RAID0, bcache/dm-cache 1SSD:3HDD, and bcache/dm-cache 1SSD:RAID0.]

• Four CephFS clusters of 20 OSDs each were then compared (HDD, SSD, dm-cache 1SSD:1HDD, and RAID0); all Ceph OSD journals were subsequently moved onto a separate HDD (except for RAID0).
• 10 clients each wrote 4096KB IOzone chunks with a thread range of 1-10 per client; performance is shown as the aggregate. With the OSD journals moved to a separate HDD:
  o SSD Ceph cluster – ~3x performance increase
  o HDD Ceph cluster – ~2x performance increase
  o dm-cache cluster – ~0.5x performance increase
• dm-cache performance ends up above both the SSD and HDD clusters, while the expectation was that it would fall between the two; the dm-cache response to the journal flush may be the reason for this out-of-bound performance.
• An SSD Ceph cluster with external journals means faster IO: the journal write must 'flush' before proceeding to the filesystem write, and the cost of that flush is drive dependent (ATA_CMD_FLUSH handling).
• For the bare SSD Ceph cluster, the lack of PLP (Power Loss Protection) makes the journal flush to the filesystem add latency.
• The SSD outperforms the HDD in bare tests, yet in the Ceph context the performance is flipped; are the Ceph OSD journals the bottleneck for SSD vs. HDD?
[Figures: 10-client aggregate IOzone 4096KB writes into 20-OSD clusters (HDD, SSD, dm-cache 1SSD:1HDD, RAID0, plus a 60-OSD HDD cluster for reference), with and without separate HDD journals, MB/s vs. threads per client.]
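The journal-flush behaviour discussed above can be probed outside of Ceph with a small synchronous-write test: each journal write must be flushed to stable media, and drives without power-loss protection pay for that flush in hardware. A minimal sketch, assuming a test path on the drive of interest:

```python
# Measure synchronous-write (flush) latency, the operation a Ceph OSD journal
# stresses. TEST_FILE is an assumption; point it at the filesystem under test.
import os, time

TEST_FILE = "/mnt/testdisk/flush_test.dat"
BLOCK = 4096
ITERATIONS = 1000

fd = os.open(TEST_FILE, os.O_WRONLY | os.O_CREAT, 0o644)
buf = os.urandom(BLOCK)
latencies = []
for _ in range(ITERATIONS):
    t0 = time.perf_counter()
    os.write(fd, buf)
    os.fsync(fd)                 # forces the device flush (ATA_CMD_FLUSH path)
    latencies.append(time.perf_counter() - t0)
os.close(fd)

latencies.sort()
print(f"median fsync latency: {latencies[len(latencies)//2]*1e3:.2f} ms")
print(f"99th percentile:      {latencies[int(len(latencies)*0.99)]*1e3:.2f} ms")
```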

Cost Analysis
• The current cluster of 120 2TB HDDs costs $14,400. Replacing it with 120 2TB consumer SSDs would cost $64,800; with 120 2TB enterprise SSDs, $126,000.
• In the bare tests, the consumer-grade SSD carries a 4.5x cost impact for only a 2.25x performance increase (one must test inside Ceph before any large purchase).
• Comparing the aggregate performance of each node configuration below, the cost per MB/s is $1.20 across the board; the problem is not the cost per MB/s but the cost per TB.
• Adding 1 enterprise SSD in front of 3 HDDs per node may positively impact performance, but the cost over the HDD-only cluster is $31,500 (~2x the base cost).
• If drive slots are available, a better solution is to add a small (64GB) fast SSD per HDD to act as a cache, for a cost per node of $480 + $135 = $615.

  Config                                  Avg. speed @ 4096KB  Cost   Cost per MB/s  Cost per TB  Total space w/ 4 slots
  2TB HDD                                 100 MB/s             $120   $1.20          $60          8TB
  2TB consumer SSD                        225 MB/s             $540   $2.40          $270         8TB
  2TB enterprise SSD                      540 MB/s             $1050  $1.95          $525         8TB
  dm-cache w/ consumer SSD + journal HDD  200 MB/s             $480   $2.40          $240         4TB

  Node configuration   Cumulative speed  Cumulative cost  Cost per MB/s  Cost per TB
  4x HDD               400 MB/s          $480             $1.20          $60
  1x SSD + 3x HDD      525 MB/s          $630             $1.20          $90
  2x SSD + 2x HDD      650 MB/s          $780             $1.20          $130
  3x SSD + 1x HDD      775 MB/s          $930             $1.20          $186
  4x SSD               900 MB/s          $1080            $1.20          $270
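The cost columns in the tables above follow directly from the unit prices and measured speeds; a small sketch of that arithmetic, using the figures quoted in this poster, is:

```python
# Cost arithmetic behind the tables above, using the quoted prices and
# average 4096KB write speeds.
drives = {
    # name: (avg speed MB/s, unit cost $, capacity TB)
    "2TB SAS HDD":        (100, 120, 2),
    "2TB consumer SSD":   (225, 540, 2),   # 2x 1TB Mushkin MKNSSDRE1TB
    "2TB enterprise SSD": (540, 1050, 2),
}

print("Per drive and per 120-drive cluster:")
for name, (speed, cost, tb) in drives.items():
    # note: the poster rounds $1050 / 540 MB/s to $1.95/MB/s
    print(f"  {name:20s} ${cost/speed:4.2f}/MB/s  ${cost/tb:3.0f}/TB  "
          f"120-drive cluster: ${120*cost:,}")

print("Per 4-slot node (1TB Mushkin SSD vs. 2TB SAS HDD):")
ssd_speed, ssd_cost, ssd_tb = 225, 270, 1
hdd_speed, hdd_cost, hdd_tb = 100, 120, 2
for n_ssd in range(5):
    n_hdd = 4 - n_ssd
    speed = n_ssd * ssd_speed + n_hdd * hdd_speed
    cost = n_ssd * ssd_cost + n_hdd * hdd_cost
    space = n_ssd * ssd_tb + n_hdd * hdd_tb
    print(f"  {n_ssd}x SSD + {n_hdd}x HDD: {speed:3d} MB/s  ${cost:4d}  "
          f"${cost/speed:4.2f}/MB/s  ${cost/space:3.0f}/TB")
```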
Conclusion
• While bare tests show a performance gain using SSDs over HDDs, CephFS performs best with the 'Stock' 4-HDD (120 OSD) configuration.
• Disk caching with 1 SSD in front of 3 HDDs does not result in a positive performance impact.
• RAID0 shows good performance in the standalone test; in the Ceph context the number of OSDs matters most, so RAID0 is not beneficial.
• Not all SSDs are the same: featureless SSDs may perform worse than HDDs in Ceph due to the journaling flush-sync. The Mushkin drives we used perform poorly in Ceph.
• dm-cache shows more stable performance than bcache; however, dm-cache would not perform well with our SSDs unless the journal is offloaded to a separate device.
• Cheap SSDs cannot help with the performance gain: enterprise models (with PLP) must be considered for a performance increase, and the cost of that upgrade is significant.
• Our effort to replace one HDD per node with a fast drive (SSD) coupled with a caching mechanism (bcache/dm-cache) did not obtain the performance we sought. A performance gain is possible, but at what cost?