BabuDB: Fast and Efficient File System Metadata Storage

Jan Stender, Björn Kolbeck, Felix Hupfeld, Mikael Högqvist
Zuse Institute Berlin; Google GmbH Zurich
SNAPI 2010 · Jan Stender

Motivation
– Modern parallel / distributed file systems:
  – Huge numbers of files and directories
  – Many storage servers but few metadata servers
– Examples:
  – Lustre, Panasas Active Scale, Google File System
– Metadata access is critical to system performance
  – ~75% of all file system calls are metadata accesses
  – Metadata servers are bottlenecks

Motivation (continued)
– B-tree-like data structures are used for metadata storage
  – ZFS, btrfs, Lustre, PVFS2
– Downsides:
  – Hard to implement and test, high code complexity
  – Multi-version B-trees are even more complex
  – On-disk re-balancing is expensive

BabuDB
– Key-value store
– FS metadata: key-value pairs stored in DB indices

BabuDB: Index
[figure]

Example
[figure sequence: Insertions, Lookups, Deletions, Range Lookups, Checkpoints]

On-disk Index
– Sorted by keys
– Block index in RAM, blocks mmap'ed

BabuDB: Related Work
– Inspired by log-structured merge trees (LSM-trees)
  – Only one on-disk index
  – No "rolling merge"
– Made popular by Google Bigtable
  – Insert/lookup/merge similar to those of Bigtable's tablets

BabuDB: Metadata Mapping
– Mapping a hierarchical directory tree to a flat database index:
[figure]

BabuDB: Advantages
– Why BabuDB for file system metadata?
– Short-lived files
  ▪ 50% of all files are deleted within 5 minutes
– Atomic file system operations without locking or transactions
  ▪ e.g. rename
– Directory content in contiguous disk regions
  ▪ Efficient readdir + stat
– Snapshots
  ▪ No need for multi-version data structures

BabuDB: Evaluation
– Linux kernel build
  – ~10M calls: 44% stat, 40% open, 15% readlink, 1% others
  [Bar chart "Kernel build": run time in seconds, BabuDB vs. ext4; y-axis 0 to 2000]
– Dovecot mail server + imaptest
  – ~2M calls: 51% stat, 48% open, 1% others
  [Bar chart "Dovecot test": run time in seconds, BabuDB vs. ext4; y-axis 0 to 400]

BabuDB: Evaluation
– Listing directory content
[figure]

Summary
– BabuDB is ...
  – an efficient key-value store
  – optimized for file system metadata but also suitable for other purposes
  – suitable for large-scale databases
  – available for Java and C++ under BSD license
  – used in the XtreemFS metadata server
http://babudb.googlecode.com
http://www.xtreemfs.org

Thank you for your attention!
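The example slides (insertions, lookups, deletions, range lookups, checkpoints) lost their figures, so the following is only a minimal Python sketch of the LSM-tree-inspired idea named under Related Work: one immutable sorted index plus an in-memory overlay of recent changes. It is not BabuDB's actual API or on-disk format. The class `OverlayIndex`, its method names, the tombstone marker, and the `(parent_id, name)` key layout are all assumptions made for illustration; the slides do not give the real key format.

```python
import bisect

_TOMBSTONE = object()  # assumed marker for a deleted key in the overlay


class OverlayIndex:
    """Hypothetical LSM-style index: one immutable sorted base + a mutable overlay."""

    def __init__(self):
        self.base_keys = []  # sorted keys of the immutable ("on-disk") index
        self.base_vals = []  # values aligned with base_keys
        self.overlay = {}    # recent inserts and tombstones, held in memory

    def insert(self, key, value):
        # Writes never touch the sorted base; they only land in the overlay.
        self.overlay[key] = value

    def delete(self, key):
        # A deletion is recorded as a tombstone that hides any base entry.
        self.overlay[key] = _TOMBSTONE

    def lookup(self, key):
        # The overlay is consulted first because it holds the newest state.
        if key in self.overlay:
            value = self.overlay[key]
            return None if value is _TOMBSTONE else value
        i = bisect.bisect_left(self.base_keys, key)
        if i < len(self.base_keys) and self.base_keys[i] == key:
            return self.base_vals[i]
        return None

    def range_lookup(self, start, end):
        """Return sorted (key, value) pairs with start <= key < end."""
        lo = bisect.bisect_left(self.base_keys, start)
        hi = bisect.bisect_left(self.base_keys, end)
        merged = dict(zip(self.base_keys[lo:hi], self.base_vals[lo:hi]))
        for key, value in self.overlay.items():
            if start <= key < end:
                merged[key] = value  # overlay entries shadow base entries
        pairs = sorted(merged.items(), key=lambda kv: kv[0])
        return [(k, v) for k, v in pairs if v is not _TOMBSTONE]

    def checkpoint(self):
        # Merge the overlay into a fresh sorted base and drop tombstones.
        merged = dict(zip(self.base_keys, self.base_vals))
        merged.update(self.overlay)
        live = sorted(
            ((k, v) for k, v in merged.items() if v is not _TOMBSTONE),
            key=lambda kv: kv[0],
        )
        self.base_keys = [k for k, _ in live]
        self.base_vals = [v for _, v in live]
        self.overlay = {}


# Usage with the assumed (parent_id, name) key layout: all entries of one
# directory sort next to each other, so listing a directory is a range lookup.
index = OverlayIndex()
index.insert((1, "README"), {"size": 120})
index.insert((1, "src"), {"size": 0})
index.insert((2, "main.c"), {"size": 900})
index.checkpoint()
index.delete((1, "README"))
entries_of_dir_1 = index.range_lookup((1, ""), (2, ""))
```

With keys ordered by parent directory first, the entries of one directory end up adjacent in the sorted index, which is one plausible reading of the "directory content in contiguous disk regions" advantage listed on the slides.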

Background: XtreemFS
– XtreemFS: a distributed replicated Internet file system
  – part of the XtreemOS research project
  – developed since 2006 by partners from Germany, Spain and Italy
– Object-based architecture:
  – MRC stores metadata
  – OSDs store pure file content as objects
  – Clients provide POSIX file system interface
www.xtreemfs.org

The XtreemOS Project
– Research project funded by the European Commission
  – 19 partners from Europe and China
– XtreemFS is the data management component
  – developed by ZIB, NEC HPC Europe, Barcelona Supercomputing Center and ICAR-CNR Italy
  – ~3 years of development
  – first public release in August 2008

XtreemFS: Overview
– What is XtreemFS?
  – a distributed and replicated POSIX-compliant file system
  – off-the-shelf servers
  – no expensive hardware
  – servers in Java, run on Linux / OS X / Solaris
  – client in C, runs on Linux / OS X / Windows
  – secure (X.509 and SSL)
  – easy to install and maintain
  – open source (GPL)

File System Landscape
– PC: ext3, ZFS, NTFS
– Network FS / Centralized: NFS, SMB, AFS/Coda
– Cluster FS / Data Center: Lustre, Panasas, GPFS, CEPH...
– Internet: Grid File System, GDM, GFarm, "gridftp"
