How to Teach an Old Dog New Object Store Tricks

USENIX HotStorage ’18

Eunji Lee1, Youil Han1, Suli Yang2, Andrea C. Arpaci-Dusseau2, Remzi H. Arpaci-Dusseau2

1 Chungbuk National University, 2 University of Wisconsin–Madison

Data-Service Platforms
• Layering
  • Abstract away underlying details
  • Reuse of existing software
  • Agility: development, operation, and maintenance

Ecosystem of Data-Service Platforms
• Layering benefits are often at odds with efficiency
• Local file system
  • Bottom layer of modern storage platforms
  • Portability, extensibility, ease of development

[Diagram: three representative stacks, each built on a local file system (Ext4, XFS, Btrfs): distributed data stores (Dynamo, MongoDB) over key-value stores (RocksDB, BerkeleyDB); distributed data stores (HBase, BigTable) over distributed file systems (HDFS, GFS); and distributed data stores over object stores (Ceph)]

Local File System
• Not intended to serve as an underlying storage engine
• Mismatches between the two layers:
  • System-wide optimization: ignores demands from individual applications
  • Little control over file system internals: suffers from degraded QoS
  • Lack of required operations: no atomic update, no data movement or reorganization, no additional user-level metadata
• Result: out-of-control and sub-optimal performance

Current Solutions
• Bypass the file system
  • Key-value stores, object stores, ...
  • But this relinquishes the benefits of the file system

• Extend the file system interface
  • Add new features to the POSIX APIs
  • But evolution is slow and conservative: stable maintenance is favored over application-specific optimizations
[Illustration: a file-system "ID card": Ext2/3/4, born 1993]

Our Approach
• Use the file system as it is, but in a different manner!
• Design patterns for user-level data platforms that
  • take advantage of the file system, and
  • minimize the negative effects of the mismatches

Contents
• Motivation
• Problem Analysis
• SwimStore
• Performance Evaluation
• Conclusion

Data-Service Platform Taxonomy
• What is the best way to store objects atop a file system?
• Two design patterns:
  • Mapping: "object as a file"
  • Packing: "multiple objects in a file"

Case Study: Ceph
• Backend object store engines inside each OSD:
  • FileStore: mapping
  • KStore: packing
  • BlueStore

[Diagram: Ceph architecture: RGW, RBD, and CephFS over RADOS; each OSD runs a backend object store (FileStore, KStore, or BlueStore) on top of a file system and the storage devices]

Mapping vs. Packing

[Diagram: FileStore (mapping) stores each object as its own file ("object as a file") and keeps an object-store log; KStore (packing) packs multiple objects into files managed by an LSM-tree ("multiple objects in a file"), also alongside a log. The two write paths are contrasted in the sketch below.]
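To make the contrast concrete, here is a minimal sketch (hypothetical helper functions, not Ceph code) of how a write reaches the file system under each pattern: mapping creates one file per object, while packing appends the object to a shared log file and tracks its location in an index.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <map>
#include <string>
#include <vector>

// Mapping ("object as a file"): every object gets its own file,
// so the file system maintains one inode (plus metadata) per object.
void put_mapped(const std::string& dir, const std::string& key,
                const std::vector<char>& data) {
    std::string path = dir + "/" + key;                  // object name == file name
    int fd = ::open(path.c_str(), O_CREAT | O_WRONLY | O_TRUNC, 0644);
    ::pwrite(fd, data.data(), data.size(), 0);
    ::fsync(fd);                                         // persists data and new metadata
    ::close(fd);
}

// Packing ("multiple objects in a file"): objects are appended to a shared
// log file and an index maps key -> (offset, length). A real packing store
// (e.g., an LSM-tree) also persists the index and compacts the log.
struct Extent { off_t offset; size_t length; };
std::map<std::string, Extent> g_index;

void put_packed(int log_fd, const std::string& key,
                const std::vector<char>& data) {
    off_t off = ::lseek(log_fd, 0, SEEK_END);            // append at the log tail
    ::pwrite(log_fd, data.data(), data.size(), off);
    ::fsync(log_fd);
    g_index[key] = {off, data.size()};                   // remember where the object lives
}
```

Mapping pays per-file metadata cost on every object; packing amortizes that cost over one file but must later compact the log. That trade-off is exactly what the following experiments measure.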

Experimental Setup
• Ceph 12.01
• Amazon EC2 cluster
  • Xeon quad-core, 32 GB DRAM, 256 GB SSD x 2
  • Ubuntu Server 16.04
• File system: XFS (recommended for Ceph)
• Backends: FileStore, KStore
• Benchmark: RADOS
• Metrics: IOPS, throughput, write traffic

Performance
• Small writes (4 KB)
  • KStore performs better than FileStore by 1.5x
  • Write amplification caused by file metadata
[Figure: average IOPS and a write-traffic breakdown (original write, filesystem, compaction, logging) for KStore and FileStore at 4 KB and 1 MB, normalized to the original write traffic; amplification ranges from 2.1x to 8.8x. Original write traffic: KStore 864 MB (4 KB) and 2.4 GB (1 MB); FileStore 332 MB (4 KB) and 3.8 GB (1 MB)]

Performance
• Large writes (1 MB)
  • FileStore outperforms KStore by 1.6x
  • Write amplification caused by compaction
[Figure: the same IOPS and write-traffic breakdown as above]

Performance
• Lack of atomic update support in file systems
• Double-write penalty of logging
  • Halves bandwidth for large writes
[Figure: the same write-traffic breakdown as above]

QoS
• FileStore
  • Periodic flush with buffered I/O
  • Transaction entanglement
[Figure: FileStore throughput over time (XFS, 4 KB and 1 MB writes) against background write traffic; dirty data accumulates in the page cache and foreground throughput collapses whenever it is flushed to storage]

QoS
• KStore
  • User-level cache
  • Frequent compaction: write amplification caused by merges
[Figure: KStore throughput over time (XFS, 4 KB and 1 MB writes); throughput is consistently poor, around 40 MB/s for 1 MB writes]

Summary
• Performance penalties of file systems:
  • Small objects suffer severely from write amplification caused by file-system metadata.
  • Large writes are sensitive to increased write traffic: logging in both architectures, and frequent compaction in the packing architecture.
  • Buffered I/O and the out-of-control flush mechanism of file systems make it challenging to support QoS.

Contents
• Motivation
• Problem Analysis
• SwimStore
• Performance Evaluation
• Conclusion

SwimStore
• Shadowing with Immutable Metadata Store
• Provides consistently excellent performance for all object sizes while running over a file system

SwimStore
• Strategy 1: In-file shadowing (see the sketch below)
• Problems it addresses:
  • Filesystem metadata overhead
  • Double-write penalty
  • Performance fluctuation
  • Compaction cost
[Figure: a new version A' of an object is written with direct I/O into a free region of the same file that already holds objects A and B; the indexing log records (key, offset, length)]
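A minimal sketch of the in-file shadowing write path, with hypothetical names (Container, put) and the index kept in memory for brevity: the new version is never written in place; it goes to a free region of the container file in a single synchronous direct write, and only the small (key, offset, length) record is updated. O_DIRECT additionally requires aligned buffers, lengths, and offsets, which is omitted here.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <map>
#include <string>

struct Extent { off_t offset; size_t length; };

// Hypothetical container: one pre-created file plus an index.
// SwimStore persists the index in an LSM-tree; this sketch keeps it in memory.
class Container {
public:
    explicit Container(const char* path)
        : fd_(::open(path, O_WRONLY | O_DIRECT)), tail_(0) {}

    // Shadow write: never overwrite in place. Write the new version to a free
    // region and re-point the index entry; the old extent becomes garbage
    // that is reclaimed later by hole punching.
    bool put(const std::string& key, const void* buf, size_t len) {
        off_t target = tail_;                             // free space chosen by the allocator
        if (::pwrite(fd_, buf, len, target) != (ssize_t)len) return false;
        index_[key] = {target, len};                      // the (key, offset, length) record
        tail_ += (off_t)len;
        return true;
    }

private:
    int fd_;
    off_t tail_;
    std::map<std::string, Extent> index_;
};
```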

SwimStore
• Strategy 1: In-file shadowing
[Figure: write paths compared: FileStore writes A' twice, first synchronously with direct I/O to a raw-device log and then asynchronously with buffered I/O to the file; SwimStore shadow-writes A' once, directly into the file]

SwimStore
• But user-facing latency increases!
• File system access is slower than raw device access
  • File system metadata (e.g., allocation bitmaps)
  • Transaction entanglement
[Figure: a synchronous direct-I/O write of A' into the file also involves multiple file-system metadata blocks]

SwimStore
• Strategy 2: Metadata-immutable container
• Create a container file and allocate its space in advance (see the sketch below), so that writes no longer modify file-system metadata
[Figure: normalized 4 KB write latency for a per-file layout, a single shared file, the metadata-immutable container, and a raw device; the container brings synchronous direct-I/O write latency down close to raw-device level]
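One way to realize a metadata-immutable container, sketched under the assumption that the file system supports posix_fallocate (XFS and ext4 do): create the container once, reserve all of its blocks up front, and sync the metadata a single time; subsequent direct-I/O writes then land in already-allocated space and trigger no block allocation or file-size updates. Function and parameter names are illustrative.

```cpp
#include <fcntl.h>
#include <unistd.h>

// Create a container file and reserve its space in advance so that later
// writes never allocate blocks or extend the file, i.e., its file-system
// metadata stays effectively immutable.
int create_container(const char* path, off_t container_bytes) {
    int fd = ::open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0) return -1;
    if (::posix_fallocate(fd, 0, container_bytes) != 0) {  // preallocate every block
        ::close(fd);
        return -1;
    }
    ::fsync(fd);   // persist the (now final) metadata once, at creation time
    ::close(fd);
    // Reopen with O_DIRECT: data goes straight to the preallocated blocks,
    // bypassing the page cache; each write hits already-allocated space.
    return ::open(path, O_RDWR | O_DIRECT);
}
```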

SwimStore
• Strategy 3: Hole-punching with buddy-like allocation
• The shadowing technique requires recycling the space of obsolete data

SwimStore
• Strategy 3: Hole-punching with buddy-like allocation
• Opportunities:
  (+) The file system offers an effectively infinite logical address space
  (+) The file system provides physical space reclamation via hole punching (see the sketch below)
[Figure: punching holes over obsolete extents frees their physical blocks while the container file's logical layout stays intact]
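Hole punching on Linux is a single fallocate call; a minimal sketch, assuming a file system that supports FALLOC_FL_PUNCH_HOLE (XFS, ext4, Btrfs):

```cpp
#include <fcntl.h>      // fallocate, FALLOC_FL_* (needs _GNU_SOURCE; g++ defines it by default)
#include <sys/types.h>

// Return the physical blocks of an obsolete extent to the file system.
// FALLOC_FL_KEEP_SIZE is mandatory with FALLOC_FL_PUNCH_HOLE, so the file's
// logical size (the container's "infinite" address space) is preserved.
int reclaim_extent(int container_fd, off_t offset, off_t length) {
    return ::fallocate(container_fd,
                       FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                       offset, length);
}
```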

SwimStore
• Strategy 3: Hole-punching with buddy-like allocation
• Holes that are too small severely fragment the space
[Figure: a new object that is logically contiguous ends up scattered across many small physical holes]

SwimStore
• Strategy 3: Hole-punching with buddy-like allocation
• Freed extents are managed in power-of-two size classes (2^0, 2^1, ..., 2^n), as in the sketch below
  • GC for small holes
  • Hole-punching for large holes
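A sketch of that policy under assumed parameters (the 64 KiB punch threshold and the number of size classes are illustrative, not SwimStore's actual values; reclaim_extent is the punch-hole helper from the previous sketch): extents large enough to be worth a punch are reclaimed immediately, while small ones are queued per size class so that GC can later copy live neighbors away and punch one larger, contiguous hole.

```cpp
#include <sys/types.h>
#include <vector>

int reclaim_extent(int fd, off_t offset, off_t length);   // punch-hole helper (previous sketch)

struct FreeExtent { int fd; off_t offset; off_t length; };

// Illustrative policy: extents of at least 64 KiB are punched right away;
// smaller ones would fragment the file system, so they are deferred to GC.
constexpr off_t kPunchThreshold = 64 * 1024;
constexpr int kNumClasses = 20;                            // size classes 2^0 .. 2^19 bytes

std::vector<FreeExtent> gc_queue[kNumClasses];             // one free list per size class

int size_class(off_t length) {                             // largest i with 2^i <= length
    int cls = 0;
    while (cls + 1 < kNumClasses && (off_t(1) << (cls + 1)) <= length) ++cls;
    return cls;
}

void release_extent(const FreeExtent& e) {
    if (e.length >= kPunchThreshold) {
        reclaim_extent(e.fd, e.offset, e.length);          // large hole: punch immediately
    } else {
        gc_queue[size_class(e.length)].push_back(e);       // small hole: defer to GC, which
                                                           // later consolidates and punches a
                                                           // larger contiguous region
    }
}
```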

SwimStore Architecture
[Diagram: a pool of container files holds object data; metadata (indexing, attributes, etc.) is kept in an LSM-tree (LevelDB); an intent log records metadata and checksums. A minimal metadata-write sketch follows.]
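Object data stays in the container pool, but each shadow write also produces a small index record, which is what the LSM-tree is for. Below is a minimal sketch using the stock LevelDB API; the key prefix and value encoding are made up for illustration and are not SwimStore's on-disk format.

```cpp
#include <leveldb/db.h>
#include <cstdint>
#include <string>

// Persist the index record produced by a shadow write:
// object key -> (container id, offset, length).
bool record_object(leveldb::DB* db, const std::string& object_key,
                   uint32_t container_id, uint64_t offset, uint64_t length) {
    std::string value = std::to_string(container_id) + ":" +
                        std::to_string(offset) + ":" +
                        std::to_string(length);             // illustrative encoding
    leveldb::WriteOptions opts;
    opts.sync = true;                                       // metadata durable with the data
    return db->Put(opts, "idx/" + object_key, value).ok();
}

// Usage: open the metadata store once per object-store instance.
//   leveldb::DB* db = nullptr;
//   leveldb::Options options;
//   options.create_if_missing = true;
//   leveldb::DB::Open(options, "/path/to/swimstore-meta", &db);
```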

Contents
• Motivation
• Problem Analysis
• SwimStore
• Performance Evaluation
• Conclusion

Experimental Setup
• Ceph 12.01; SwimStore implemented in 12K lines of C++
• Amazon EC2 cluster
  • Intel Xeon quad-core, 32 GB DRAM, 256 GB SSD x 2
  • Ubuntu Server 16.04
• File system: XFS (recommended for Ceph)
• Backends: FileStore, KStore, BlueStore, SwimStore
• Benchmark: RADOS
• Metrics: IOPS, throughput, write traffic

Performance Evaluation
• IOPS
[Figure: IOPS for 4 KB to 1 MB writes, normalized to FileStore (FileStore baseline: 1454, 1472, 881, 243, and 67 ops/s at 4 KB, 16 KB, 64 KB, 256 KB, and 1 MB). For small writes, SwimStore is 2.5x better than FileStore, 1.6x better than BlueStore, and 1.1x better than KStore; for large writes, 1.8x better than FileStore and 3.1x better than KStore]

Performance Evaluation
• Write traffic
[Figure: write traffic for 4 KB to 1 MB writes, normalized to SwimStore (SwimStore baseline: 1129, 3020, 5370, 7342, and 7342 MB), comparing FileStore, BlueStore, KStore, and SwimStore]

Contents
• Motivation
• Problem Analysis
• Solution
• Performance Evaluation
• Conclusion

Conclusion
• Explored design patterns for building an object store atop a local file system
• SwimStore: a new backend object store
  • In-file shadowing
  • Metadata-immutable container
  • Hole-punching with buddy-like allocation
• Provides high performance with little performance variation
• Retains all the benefits of the file system

Thank you
