How to Teach an Old Dog New Object Store Tricks

USENIX HotStorage ’18

Eunji Lee1, Youil Han1, Suli Yang2, Andrea C. Arpaci-Dusseau2, Remzi H. Arpaci-Dusseau2

1 Chungbuk National University, 2 University of Wisconsin–Madison

Data-Service Platforms
• Layering
  • Abstract away underlying details
  • Reuse of existing software
  • Agility: development, operation, and maintenance

Ecosystem of Data-Service Platforms
• Layering benefits are often at odds with efficiency
• Local file system
  • Bottom layer of modern storage platforms
  • Portability, extensibility, ease of development

[Diagram: three representative stacks, each built on a local file system (Ext4, XFS, Btrfs): distributed data stores (Dynamo, MongoDB) over key-value stores (RocksDB, BerkeleyDB); distributed data stores (HBase, BigTable) over distributed file systems (HDFS, GFS); and distributed data stores over object stores (Ceph)]

Local File System
• Not intended to serve as an underlying storage engine
• Mismatches between the two layers:
  • System-wide optimization: ignores demands from individual applications
  • Little control over file system internals: suffers from degraded QoS
  • Lack of required operations: no atomic update, no data movement or reorganization, no additional user-level metadata
• Result: out-of-control and sub-optimal performance

Current Solutions
• Bypass the file system
  • Key-value stores, object stores, ...
  • But this relinquishes the benefits of the file system

• Extend the file system interface
  • Add new features to the POSIX APIs
  • But evolution is slow and conservative: stable maintenance is favored over application-specific optimizations
[Illustration: a file-system "ID card": Ext2/3/4, born 1993]

Our Approach
• Use the file system as it is, but in a different manner!
• Design patterns for user-level data platforms that
  • take advantage of the file system, and
  • minimize the negative effects of the mismatches

Contents
• Motivation
• Problem Analysis
• SwimStore
• Performance Evaluation
• Conclusion

Data-Service Platform Taxonomy
• What is the best way to store objects atop a file system?
• Two design patterns:
  • Mapping: "object as a file"
  • Packing: "multiple objects in a file"

Case Study: Ceph
• Backend object store engines inside each OSD:
  • FileStore: mapping
  • KStore: packing
  • BlueStore

[Diagram: Ceph architecture: RGW, RBD, and CephFS over RADOS; each OSD runs a backend object store (FileStore, KStore, or BlueStore) on top of a file system and the storage devices]

Mapping vs. Packing

[Diagram: FileStore (mapping) stores each object as its own file ("object as a file") and keeps an object-store log; KStore (packing) packs multiple objects into files managed by an LSM-tree ("multiple objects in a file"), also alongside a log. The two write paths are contrasted in the sketch below.]
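To make the contrast concrete, here is a minimal sketch (hypothetical helper functions, not Ceph code) of how a write reaches the file system under each pattern: mapping creates one file per object, while packing appends the object to a shared log file and tracks its location in an index.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <map>
#include <string>
#include <vector>

// Mapping ("object as a file"): every object gets its own file,
// so the file system maintains one inode (plus metadata) per object.
void put_mapped(const std::string& dir, const std::string& key,
                const std::vector<char>& data) {
    std::string path = dir + "/" + key;                  // object name == file name
    int fd = ::open(path.c_str(), O_CREAT | O_WRONLY | O_TRUNC, 0644);
    ::pwrite(fd, data.data(), data.size(), 0);
    ::fsync(fd);                                         // persists data and new metadata
    ::close(fd);
}

// Packing ("multiple objects in a file"): objects are appended to a shared
// log file and an index maps key -> (offset, length). A real packing store
// (e.g., an LSM-tree) also persists the index and compacts the log.
struct Extent { off_t offset; size_t length; };
std::map<std::string, Extent> g_index;

void put_packed(int log_fd, const std::string& key,
                const std::vector<char>& data) {
    off_t off = ::lseek(log_fd, 0, SEEK_END);            // append at the log tail
    ::pwrite(log_fd, data.data(), data.size(), off);
    ::fsync(log_fd);
    g_index[key] = {off, data.size()};                   // remember where the object lives
}
```

Mapping pays per-file metadata cost on every object; packing amortizes that cost over one file but must later compact the log. That trade-off is exactly what the following experiments measure.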

Experimental Setup
• Ceph 12.01
• Amazon EC2 cluster
  • Xeon quad-core, 32 GB DRAM, 256 GB SSD x 2
  • Ubuntu Server 16.04
• File system: XFS (recommended for Ceph)
• Backends: FileStore, KStore
• Benchmark: RADOS
• Metrics: IOPS, throughput, write traffic

Performance
• Small writes (4 KB)
  • KStore performs better than FileStore by 1.5x
  • Write amplification caused by file metadata
[Figure: average IOPS and a write-traffic breakdown (original write, filesystem, compaction, logging) for KStore and FileStore at 4 KB and 1 MB, normalized to the original write traffic; amplification ranges from 2.1x to 8.8x. Original write traffic: KStore 864 MB (4 KB) and 2.4 GB (1 MB); FileStore 332 MB (4 KB) and 3.8 GB (1 MB)]

Performance
• Large writes (1 MB)
  • FileStore outperforms KStore by 1.6x
  • Write amplification caused by compaction
[Figure: the same IOPS and write-traffic breakdown as above]

Performance
• Lack of atomic update support in file systems
• Double-write penalty of logging
  • Halves bandwidth for large writes
[Figure: the same write-traffic breakdown as above]

QoS
• FileStore
  • Periodic flush with buffered I/O
  • Transaction entanglement
[Figure: FileStore throughput over time (XFS, 4 KB and 1 MB writes) against background write traffic; dirty data accumulates in the page cache and foreground throughput collapses whenever it is flushed to storage]

QoS
• KStore
  • User-level cache
  • Frequent compaction: write amplification caused by merges
[Figure: KStore throughput over time (XFS, 4 KB and 1 MB writes); throughput is consistently poor, around 40 MB/s for 1 MB writes]

Summary
• Performance penalties of file systems:
  • Small objects suffer severely from write amplification caused by file-system metadata.
  • Large writes are sensitive to increased write traffic: logging in both architectures, and frequent compaction in the packing architecture.
  • Buffered I/O and the out-of-control flush mechanism of file systems make it challenging to support QoS.

Contents
• Motivation
• Problem Analysis
• SwimStore
• Performance Evaluation
• Conclusion

SwimStore
• Shadowing with Immutable Metadata Store
• Provides consistently excellent performance for all object sizes while running over a file system

SwimStore
• Strategy 1: In-file shadowing (see the sketch below)
• Problems it addresses:
  • Filesystem metadata overhead
  • Double-write penalty
  • Performance fluctuation
  • Compaction cost
[Figure: a new version A' of an object is written with direct I/O into a free region of the same file that already holds objects A and B; the indexing log records (key, offset, length)]
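A minimal sketch of the in-file shadowing write path, with hypothetical names (Container, put) and the index kept in memory for brevity: the new version is never written in place; it goes to a free region of the container file in a single synchronous direct write, and only the small (key, offset, length) record is updated. O_DIRECT additionally requires aligned buffers, lengths, and offsets, which is omitted here.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <map>
#include <string>

struct Extent { off_t offset; size_t length; };

// Hypothetical container: one pre-created file plus an index.
// SwimStore persists the index in an LSM-tree; this sketch keeps it in memory.
class Container {
public:
    explicit Container(const char* path)
        : fd_(::open(path, O_WRONLY | O_DIRECT)), tail_(0) {}

    // Shadow write: never overwrite in place. Write the new version to a free
    // region and re-point the index entry; the old extent becomes garbage
    // that is reclaimed later by hole punching.
    bool put(const std::string& key, const void* buf, size_t len) {
        off_t target = tail_;                             // free space chosen by the allocator
        if (::pwrite(fd_, buf, len, target) != (ssize_t)len) return false;
        index_[key] = {target, len};                      // the (key, offset, length) record
        tail_ += (off_t)len;
        return true;
    }

private:
    int fd_;
    off_t tail_;
    std::map<std::string, Extent> index_;
};
```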

SwimStore
• Strategy 1: In-file shadowing
[Figure: write paths compared: FileStore writes A' twice, first synchronously with direct I/O to a raw-device log and then asynchronously with buffered I/O to the file; SwimStore shadow-writes A' once, directly into the file]

SwimStore
• But user-facing latency increases!
• File system access is slower than raw device access
  • File system metadata (e.g., allocation bitmaps)
  • Transaction entanglement
[Figure: a synchronous direct-I/O write of A' into the file also involves multiple file-system metadata blocks]

SwimStore
• Strategy 2: Metadata-immutable container
• Create a container file and allocate its space in advance (see the sketch below), so that writes no longer modify file-system metadata
[Figure: normalized 4 KB write latency for a per-file layout, a single shared file, the metadata-immutable container, and a raw device; the container brings synchronous direct-I/O write latency down close to raw-device level]
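One way to realize a metadata-immutable container, sketched under the assumption that the file system supports posix_fallocate (XFS and ext4 do): create the container once, reserve all of its blocks up front, and sync the metadata a single time; subsequent direct-I/O writes then land in already-allocated space and trigger no block allocation or file-size updates. Function and parameter names are illustrative.

```cpp
#include <fcntl.h>
#include <unistd.h>

// Create a container file and reserve its space in advance so that later
// writes never allocate blocks or extend the file, i.e., its file-system
// metadata stays effectively immutable.
int create_container(const char* path, off_t container_bytes) {
    int fd = ::open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0) return -1;
    if (::posix_fallocate(fd, 0, container_bytes) != 0) {  // preallocate every block
        ::close(fd);
        return -1;
    }
    ::fsync(fd);   // persist the (now final) metadata once, at creation time
    ::close(fd);
    // Reopen with O_DIRECT: data goes straight to the preallocated blocks,
    // bypassing the page cache; each write hits already-allocated space.
    return ::open(path, O_RDWR | O_DIRECT);
}
```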

SwimStore
• Strategy 3: Hole-punching with buddy-like allocation
• The shadowing technique requires recycling the space of obsolete data

SwimStore
• Strategy 3: Hole-punching with buddy-like allocation
• Opportunities:
  (+) The file system offers an effectively infinite logical address space
  (+) The file system provides physical space reclamation via hole punching (see the sketch below)
[Figure: punching holes over obsolete extents frees their physical blocks while the container file's logical layout stays intact]
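Hole punching on Linux is a single fallocate call; a minimal sketch, assuming a file system that supports FALLOC_FL_PUNCH_HOLE (XFS, ext4, Btrfs):

```cpp
#include <fcntl.h>      // fallocate, FALLOC_FL_* (needs _GNU_SOURCE; g++ defines it by default)
#include <sys/types.h>

// Return the physical blocks of an obsolete extent to the file system.
// FALLOC_FL_KEEP_SIZE is mandatory with FALLOC_FL_PUNCH_HOLE, so the file's
// logical size (the container's "infinite" address space) is preserved.
int reclaim_extent(int container_fd, off_t offset, off_t length) {
    return ::fallocate(container_fd,
                       FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                       offset, length);
}
```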

SwimStore
• Strategy 3: Hole-punching with buddy-like allocation
• Holes that are too small severely fragment the space
[Figure: a new object that is logically contiguous ends up scattered across many small physical holes]

SwimStore
• Strategy 3: Hole-punching with buddy-like allocation
• Freed extents are managed in power-of-two size classes (2^0, 2^1, ..., 2^n), as in the sketch below
  • GC for small holes
  • Hole-punching for large holes
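A sketch of that policy under assumed parameters (the 64 KiB punch threshold and the number of size classes are illustrative, not SwimStore's actual values; reclaim_extent is the punch-hole helper from the previous sketch): extents large enough to be worth a punch are reclaimed immediately, while small ones are queued per size class so that GC can later copy live neighbors away and punch one larger, contiguous hole.

```cpp
#include <sys/types.h>
#include <vector>

int reclaim_extent(int fd, off_t offset, off_t length);   // punch-hole helper (previous sketch)

struct FreeExtent { int fd; off_t offset; off_t length; };

// Illustrative policy: extents of at least 64 KiB are punched right away;
// smaller ones would fragment the file system, so they are deferred to GC.
constexpr off_t kPunchThreshold = 64 * 1024;
constexpr int kNumClasses = 20;                            // size classes 2^0 .. 2^19 bytes

std::vector<FreeExtent> gc_queue[kNumClasses];             // one free list per size class

int size_class(off_t length) {                             // largest i with 2^i <= length
    int cls = 0;
    while (cls + 1 < kNumClasses && (off_t(1) << (cls + 1)) <= length) ++cls;
    return cls;
}

void release_extent(const FreeExtent& e) {
    if (e.length >= kPunchThreshold) {
        reclaim_extent(e.fd, e.offset, e.length);          // large hole: punch immediately
    } else {
        gc_queue[size_class(e.length)].push_back(e);       // small hole: defer to GC, which
                                                           // later consolidates and punches a
                                                           // larger contiguous region
    }
}
```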

SwimStore Architecture
[Diagram: a pool of container files holds object data; metadata (indexing, attributes, etc.) is kept in an LSM-tree (LevelDB); an intent log records metadata and checksums. A minimal metadata-write sketch follows.]
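Object data stays in the container pool, but each shadow write also produces a small index record, which is what the LSM-tree is for. Below is a minimal sketch using the stock LevelDB API; the key prefix and value encoding are made up for illustration and are not SwimStore's on-disk format.

```cpp
#include <leveldb/db.h>
#include <cstdint>
#include <string>

// Persist the index record produced by a shadow write:
// object key -> (container id, offset, length).
bool record_object(leveldb::DB* db, const std::string& object_key,
                   uint32_t container_id, uint64_t offset, uint64_t length) {
    std::string value = std::to_string(container_id) + ":" +
                        std::to_string(offset) + ":" +
                        std::to_string(length);             // illustrative encoding
    leveldb::WriteOptions opts;
    opts.sync = true;                                       // metadata durable with the data
    return db->Put(opts, "idx/" + object_key, value).ok();
}

// Usage: open the metadata store once per object-store instance.
//   leveldb::DB* db = nullptr;
//   leveldb::Options options;
//   options.create_if_missing = true;
//   leveldb::DB::Open(options, "/path/to/swimstore-meta", &db);
```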

Contents
• Motivation
• Problem Analysis
• SwimStore
• Performance Evaluation
• Conclusion

Experimental Setup
• Ceph 12.01; SwimStore implemented in 12K lines of C++
• Amazon EC2 cluster
  • Intel Xeon quad-core, 32 GB DRAM, 256 GB SSD x 2
  • Ubuntu Server 16.04
• File system: XFS (recommended for Ceph)
• Backends: FileStore, KStore, BlueStore, SwimStore
• Benchmark: RADOS
• Metrics: IOPS, throughput, write traffic

Performance Evaluation
• IOPS
[Figure: IOPS for 4 KB to 1 MB writes, normalized to FileStore (FileStore baseline: 1454, 1472, 881, 243, and 67 ops/s at 4 KB, 16 KB, 64 KB, 256 KB, and 1 MB). For small writes, SwimStore is 2.5x better than FileStore, 1.6x better than BlueStore, and 1.1x better than KStore; for large writes, 1.8x better than FileStore and 3.1x better than KStore]

Performance Evaluation
• Write traffic
[Figure: write traffic for 4 KB to 1 MB writes, normalized to SwimStore (SwimStore baseline: 1129, 3020, 5370, 7342, and 7342 MB), comparing FileStore, BlueStore, KStore, and SwimStore]

Contents
• Motivation
• Problem Analysis
• Solution
• Performance Evaluation
• Conclusion

Conclusion
• Explored design patterns for building an object store atop a local file system
• SwimStore: a new backend object store
  • In-file shadowing
  • Metadata-immutable container
  • Hole-punching with buddy-like allocation
• Provides high performance with little performance variation
• Retains all the benefits of the file system

Thank you
