Babudb: Fast and Efficient File System Metadata Storage

Babudb: Fast and Efficient File System Metadata Storage

BabuDB: Fast and Efficient File System Metadata Storage Jan Stender, Björn Kolbeck, Felix Hupfeld Mikael Högqvist Zuse Institute Berlin Google GmbH Zurich SNAPI 2010 · Jan Stender Motivation – Modern parallel / distributed file systems: – Huge numbers of files and directories – Many storage servers but few metadata servers – Examples: – Lustre, Panasas Active Scale, Google File System – Metadata access critical wrt. system performance – ~75% of all file system calls are metadata accesses – Metadata servers are bottlenecks SNAPI 2010 · Jan Stender Motivation – B-tree-like data structures used for metadata storage – ZFS, btrfs, Lustre, PVFS2 – Downsides: – Hard to implement and test, high code complexity – Multi-version B-trees even more complex – On-disk re-balancing expensive SNAPI 2010 · Jan Stender BabuDB – Key-value store – FS metadata: key-value pairs stored in DB indices SNAPI 2010 · Jan Stender BabuDB: Index SNAPI 2010 · Jan Stender Example SNAPI 2010 · Jan Stender Example: Insertions SNAPI 2010 · Jan Stender Example: Insertions SNAPI 2010 · Jan Stender Example: Lookups SNAPI 2010 · Jan Stender Example: Lookups SNAPI 2010 · Jan Stender Example: Lookups SNAPI 2010 · Jan Stender Example: Lookups SNAPI 2010 · Jan Stender Example: Deletions SNAPI 2010 · Jan Stender Example: Deletions SNAPI 2010 · Jan Stender Example: Deletions SNAPI 2010 · Jan Stender Example: Deletions SNAPI 2010 · Jan Stender Example: Range Lookups SNAPI 2010 · Jan Stender Example: Range Lookups SNAPI 2010 · Jan Stender Example: Range Lookups SNAPI 2010 · Jan Stender Example: Range Lookups SNAPI 2010 · Jan Stender Example: Checkpoints SNAPI 2010 · Jan Stender Example: Checkpoints SNAPI 2010 · Jan Stender Example: Checkpoints SNAPI 2010 · Jan Stender Example: Checkpoints SNAPI 2010 · Jan Stender On-disk Index – Sorted by Keys – Block index in RAM, blocks mmap'ed SNAPI 2010 · Jan Stender BabuDB: Related Work – Inspired by log-structured merge trees (LSM-trees) – Only one on-disk index – No „rolling merge“ – Made popular by Google Bigtable – Insert/lookup/merge similar as in Bigtable's Tablets SNAPI 2010 · Jan Stender BabuDB: Metadata Mapping – Mapping a hierarchical directory tree to a flat database index: SNAPI 2010 · Jan Stender BabuDB: Advantages – Why BabuDB for File System Metadata? – Short-lived files ▪ 50% of all files deleted within 5 minutes – Atomic file system operations w/o locking or transactions ▪ e.g. rename – Directory content in contiguous disk regions ▪ Efficient readdir + stat – Snapshots ▪ No need for multi-version data structures SNAPI 2010 · Jan Stender BabuDB: Evaluation – Linux kernel build 2000 1800 1600 – ~10M calls: 44% stat, 1400 s 1200 d 40% open, 15% n 1000 BabuDB o 800 c ext4 e 600 readlink, 1% others s 400 200 0 Kernel build – Dovecot mail server 400 350 + imaptest 300 s d 250 n o 200 BabuDB – ~2M calls: 51% stat, c e ext4 s 150 48% open, 1% others 100 50 0 Dovecot test SNAPI 2010 · Jan Stender BabuDB: Evaluation – Listing directory content SNAPI 2010 · Jan Stender Summary – BabuDB is ... – an efficient key-value store – optimized for file system metadata but also suitable for other purposes http://babudb.googlecode.com – suitable for large-scale databases – available for Java and C++ under BSD license – used in the XtreemFS metadata server http://www.xtreemfs.org SNAPI 2010 · Jan Stender Thank you for your attention! SNAPI 2010 · Jan Stender Background: XtreemFS – XtreemFS: a distributed replicated Internet file system – part of the XtreemOS research project – developed since 2006 by partners from Germany, Spain and Italy – Object-based architecture: – MRC stores metadata – OSDs store pure file content as objects – Clients provide POSIX file system interface www.xtreemfs.org SNAPI 2010 · Jan Stender The XtreemOS Project – Research project funded by the European Commission – 19 partners from Europe and China – XtreemFS is the data management component – developed by ZIB, NEC HPC Europe, Barcelona Supercomputing Center and ICAR-CNR Italy – ~ 3 years of development – first public release in August 2008 SNAPI 2010 · Jan Stender XtreemFS: Overview – What is XtreemFS? – a distributed and replicated POSIX compliant file system – off-the-shelve Servers – no expensive hardware – servers in Java, runs on Linux / OS X / Solaris – client in C, runs on Linux / OS X / Windows – secure (X.509 and SSL) – easy to install and maintain – open source (GPL) SNAPI 2010 · Jan Stender File System Landscape Internet Cluster FS/ Data Center Network FS/ Centralized PC ext3, ZFS, NFS, SMB Lustre, Panasas, Grid File System GDM NTFS AFS/Coda GPFS, CEPH... GFarm "gridftp" SNAPI 2010 · Jan Stender.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    36 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us