How to Teach an Old File System Dog New Object Store Tricks
USENIX HotStorage ’18
Eunji Lee1, Youil Han1, Suli Yang2, Andrea C. Arpaci-Dusseau2, Remzi H. Arpaci-Dusseau2
1 Chungbuk National University  2 University of Wisconsin-Madison
Data-service Platforms
• Layering
  • Abstracts away underlying details
  • Reuse of existing software
  • Agility: development, operation, and maintenance
Ecosystem of Data-Service Platforms
• Layering is often at odds with efficiency
• Local file system
  • Bottom layer of modern storage platforms
  • Portability, extensibility, ease of development
[Layer diagram: representative data-service stacks]
• Distributed data store (Dynamo, MongoDB) → Key-value store (RocksDB, BerkeleyDB) → Local file system (Ext4, XFS, BtrFS)
• Distributed data store (HBase, BigTable) → Distributed file system (HDFS, GFS) → Local file system (Ext4, XFS, BtrFS)
• Distributed data store (Ceph) → Object store daemon (Ceph) → Local file system (Ext4, XFS, BtrFS)
Local File System
• Not intended to serve as an underlying storage engine: mismatch between the two layers
  • System-wide optimization: ignores demands from individual applications
  • Little control over file system internals: suffers from degraded QoS
  • Lack of required operations: no atomic operations, no data movement or reorganization, no additional user-level metadata
• Result: out-of-control and sub-optimal performance

Current Solutions
• Bypass the file system
  • Key-value store, object store, database
  • But this relinquishes file system benefits
• Extend file system interfaces
  • Add new features to POSIX APIs
  • Slow and conservative evolution: stable maintenance is favored over application-specific optimizations (e.g., Ext2/3/4, born in 1993)

Our Approach
• Use a file system as it is, but in a different manner!
• Design patterns for user-level data platforms
  • Take advantage of the file system
  • Minimize the negative effects of mismatches
Contents
• Motivation
• Problem Analysis
• SwimStore
• Performance Evaluation
• Conclusion
Data-service Platform Taxonomy
• What is the best way to store objects atop a file system?
• Mapping: "Object as a file"
• Packing: "Multiple objects in a file"
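The two layouts above can be sketched in a few lines. This is an illustrative sketch only (the function and class names are mine, not Ceph's API): "mapping" gives each object its own file and lets the file system track its location, while "packing" appends many objects into one shared file and tracks (offset, length) per key itself.

```python
import os

def put_mapped(root, key, data):
    # Mapping: one file per object; the file system owns the index.
    with open(os.path.join(root, key), "wb") as f:
        f.write(data)

def get_mapped(root, key):
    with open(os.path.join(root, key), "rb") as f:
        return f.read()

class PackedStore:
    # Packing: objects share one append-only file; the store keeps
    # its own index of (offset, length) per key.
    def __init__(self, path):
        self.path, self.index = path, {}

    def put(self, key, data):
        off = os.path.getsize(self.path) if os.path.exists(self.path) else 0
        with open(self.path, "ab") as f:
            f.write(data)                 # append at the current end
        self.index[key] = (off, len(data))

    def get(self, key):
        off, length = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(off)
            return f.read(length)
```

The trade-off the talk measures falls out of this shape: mapping pays per-file metadata costs, packing pays for maintaining and compacting its own index and log.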
Case Study: Ceph
• Backend object store engines
  • FileStore: mapping
  • KStore: packing
  • BlueStore
[Figure: Ceph architecture. Clients access RGW, RBD, and CephFS atop RADOS; each OSD runs a backend object store (FileStore, KStore, or BlueStore) over a local file system and the storage device.]

Mapping vs. Packing
• FileStore (Mapping): "Object as a File"
  • Each object is stored in its own file (File A, File B, ...), plus an object store log
• KStore (Packing): "Multiple Objects in a File"
  • Objects are packed into the files of an LSM tree, plus an object store log

Experimental Setup
• Ceph 12.01
• Amazon EC2 clusters: Intel Xeon quad-core, 32GB DRAM, 2 x 256GB SSD, Ubuntu Server 16.04
• File system: XFS (recommended in Ceph)
• Backends: FileStore, KStore
• Benchmark: Rados
• Metrics: IOPS, throughput, write traffic
Performance
• Small writes (4KB): KStore performs better than FileStore by 1.5x
  • FileStore suffers write amplification from file system metadata
• Large writes (1MB): FileStore outperforms KStore by 1.6x
  • KStore suffers write amplification from frequent compaction
• Lack of atomic update support in file systems
  • Both backends pay the double-write penalty of logging, which halves bandwidth for large writes
[Figure: average IOPS and write traffic breakdown (original write, logging, compaction, filesystem), normalized to the original write traffic. Original write traffic: KStore 864 MB (4KB) and 2.4 GB (1MB); FileStore 332 MB (4KB) and 3.8 GB (1MB). Total traffic ranges from 2.1x to 8.8x the original writes.]

QoS
• FileStore
  • Periodic flush with buffered I/O: dirty data accumulates in the page cache and bursts to storage
  • Transaction entanglement between foreground writes and background flushes
  [Figure: FileStore throughput over time (XFS, 4KB and 1MB writes): foreground throughput collapses whenever background flushing kicks in.]
• KStore
  • User-level cache plus frequent compaction: write amplification by merge
  [Figure: KStore throughput over time (XFS, 4KB and 1MB writes): 1MB throughput is consistently poor at around 40 MB/s.]

Summary
• Performance penalties of file systems
  • Small objects seriously suffer from write amplification caused by file system metadata
  • Large writes are sensitive to increased write traffic: logging in both architectures, plus frequent compaction in the packing architecture
  • Buffered I/O and the file system's out-of-control flush mechanism make it challenging to support QoS
Contents
• Motivation
• Problem Analysis
• SwimStore
• Performance Evaluation
• Conclusion
SwimStore
• SwimStore: Shadowing with Immutable Metadata Store
• Provides consistently excellent performance for all object sizes while running over a file system
SwimStore Strategy 1: In-file Shadowing
• Problems addressed
  • Filesystem metadata overhead
  • Double-write penalty
  • Performance fluctuation
  • Compaction cost
• A new version A' of an object is written with direct I/O to a fresh offset in the same file; an indexing log records (key, offset, length)
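The in-file shadowing idea can be sketched as follows. This is a minimal illustration under my own naming (it is not SwimStore's actual code): an update never overwrites in place; the new version is appended at a fresh offset in the same container file, the index maps key to (offset, length), and the old extent becomes garbage to reclaim later.

```python
import os

class ShadowFile:
    def __init__(self, path):
        self.fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
        self.end = 0          # next free offset in the container file
        self.index = {}       # indexing log: key -> (offset, length)
        self.stale = []       # obsolete extents awaiting reclamation

    def write(self, key, data):
        off = self.end
        os.pwrite(self.fd, data, off)           # shadow write: new location
        self.end += len(data)
        if key in self.index:
            self.stale.append(self.index[key])  # old version is now garbage
        self.index[key] = (off, len(data))      # atomic switch via the index

    def read(self, key):
        off, length = self.index[key]
        return os.pread(self.fd, length, off)
```

Because the old version stays intact until the index entry is replaced, an update is naturally atomic without a separate data log, which is exactly the double-write penalty the slide lists as a problem.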
SwimStore Strategy 1: In-file Shadowing (vs. FileStore)
[Figure: FileStore logs A' synchronously with direct I/O to a raw-device log, then writes it into file A asynchronously with buffered I/O through the file system; SwimStore writes A' synchronously with direct I/O into the file itself, with no separate data log.]
But user-facing latency increases!
• File system access is slower than raw device access
  • File system metadata updates (e.g., inode, allocation bitmap)
  • Transaction entanglement
[Figure: a synchronous direct-I/O write of A' into file A drags file system metadata updates (m) into the critical path.]
SwimStore Strategy 2: Metadata-Immutable Container
• Create a container file and allocate its space in advance, so its file metadata never changes afterward
[Figure: normalized 4KB write latency. Per-file and single-file designs: 1 and 0.86; raw device and metadata-immutable container: both 0.4. The container matches raw-device latency.]
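A metadata-immutable container can be sketched with standard POSIX calls (an illustrative sketch; the container size and function names are my assumptions, and `posix_fallocate` here stands in for whatever preallocation the real system uses): the container's size and block allocation are fixed once at creation, so subsequent object writes touch only data blocks and never dirty size or bitmap metadata.

```python
import os

CONTAINER_SIZE = 8 << 20   # 8 MiB, an assumed container size

def create_container(path, size=CONTAINER_SIZE):
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    os.posix_fallocate(fd, 0, size)   # allocate every block up front
    os.fsync(fd)                      # persist metadata once, at creation
    return fd

def write_object(fd, offset, data):
    # The write lands inside the preallocated region: file size and the
    # allocation bitmap are untouched, only data blocks change.
    os.pwrite(fd, data, offset)
```

With allocation done in advance, a synchronous write no longer drags a file system metadata transaction into its critical path, which is how the latency gap to the raw device closes.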
SwimStore Strategy 3: Hole-Punching with Buddy-like Allocation
• The shadowing technique requires recycling the space of obsolete data
• Opportunities
  (+) The filesystem offers an effectively infinite logical address space
  (+) The filesystem provides physical space reclamation with punch-hole
• Caveat: holes that are too small severely fragment the physical space (logical addresses stay contiguous while physical blocks scatter)
• Buddy-like allocation over power-of-two size classes (2^0, 2^1, ..., 2^n blocks)
  • GC for small holes
  • Hole-punching for large holes
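The reclamation policy above can be sketched as a simple classifier (illustrative only; the block size, the threshold class, and the function names are my assumptions, and the `punch` callback stands in for a real `fallocate(FALLOC_FL_PUNCH_HOLE)` call): freed extents are binned into power-of-two size classes, large ones are punched out directly, and small ones are queued for GC so they can be copied out and merged into bigger holes.

```python
BLOCK = 4096
PUNCH_CLASS = 6          # assumed threshold: extents of >= 2^6 blocks (256 KiB)

def size_class(nblocks):
    # Index of the largest power-of-two class that fits the extent.
    return max(0, nblocks.bit_length() - 1)

def reclaim(extents, punch):
    """extents: list of (offset, nblocks); punch: callback for large holes.
    Returns the small extents left for GC."""
    gc_queue = []
    for off, nblocks in extents:
        if size_class(nblocks) >= PUNCH_CLASS:
            punch(off, nblocks * BLOCK)      # cheap physical reclamation
        else:
            gc_queue.append((off, nblocks))  # too small: fragmentation risk
    return gc_queue
```

The split keeps punch-hole calls confined to extents large enough that the freed blocks stay mostly contiguous, while GC absorbs the fragmentation cost of the small ones.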
SwimStore Architecture
[Figure: a pool of container files holds object data; an LSM-tree (LevelDB) stores metadata (indexing, attributes, etc.); an intent log records metadata and checksums.]

Contents
• Motivation
• Problem Analysis
• SwimStore
• Performance Evaluation
• Conclusion
Experimental Setup
• Ceph 12.01; SwimStore implemented in 12K lines of C++
• Amazon EC2 clusters: Intel Xeon quad-core, 32GB DRAM, 2 x 256GB SSD, Ubuntu Server 16.04
• File system: XFS (recommended in Ceph)
• Backends: FileStore, KStore, BlueStore, SwimStore
• Benchmark: Rados
• Metrics: IOPS, throughput, write traffic
Performance Evaluation
• IOPS (normalized to FileStore; FileStore absolute IOPS for 4KB, 16KB, 64KB, 256KB, 1MB: 1454, 1472, 881, 243, 67 ops/s)
  • Small writes: SwimStore is 2.5x better than FileStore, 1.6x better than BlueStore, 1.1x better than KStore
  • Large writes: SwimStore is 1.8x better than FileStore, 3.1x better than KStore
• Write traffic
[Figure: write traffic normalized to SwimStore across IO sizes 4KB-1MB (SwimStore absolute traffic: 1129 MB, 3020 MB, 5370 MB, 7342 MB, 7342 MB); the other backends generate up to roughly 5x SwimStore's traffic.]
Contents
• Motivation
• Problem Analysis
• Solution
• Performance Evaluation
• Conclusion
Conclusion
• Explored design patterns for building an object store atop a local file system
• SwimStore: a new backend object store
  • In-file shadowing
  • Metadata-immutable container
  • Hole-punching with buddy-like allocation
• Provides high performance with little performance variation
• Retains all the benefits of the file system
Thank you