Experiences on File Systems
Which is the best for you?

Jakob Blomer CERN PH/SFT

CHEP 2015 Okinawa, Japan


Why Distributed File Systems

Physics experiments store their files in a variety of file systems for a good reason

∙ The file system interface is portable: we can take our local analysis application and run it anywhere on a big data set
∙ The file system as a storage abstraction is a sweet spot between data flexibility and data organization


We like file systems for their rich and standardized interface. We struggle to find an optimal implementation of that interface.

Shortlist of Distributed File Systems

[Slide shows logos of the surveyed distributed file systems, among them the Quantcast File System and CernVM-FS]

Agenda

1 What do we want from a distributed file system?

2 Sorting and searching the file system landscape

3 Technology trends and future challenges

Can one size fit all?

[Radar chart comparing data classes along several dimensions. Data classes: home folders, physics data (recorded, simulated, analysis results), software binaries, scratch area. Dimensions: change frequency, throughput (MB/s), IOPS, mean file size, volume, data value, confidentiality, cache hit rate, redundancy. Data are illustrative.]


Depending on the use case, the dimensions span orders of magnitude

POSIX Interface

File system operations

∙ Essential: create(), unlink(), stat(), open(), close(), read(), write(), seek()
∙ Often difficult for DFSs: file locks, write-through, atomic rename(), file ownership, extended attributes, unlinking opened files, symbolic links and hard links, device files, IPC files
∙ Missing: physical file location, file properties, file temperature, ...

∙ No DFS is fully POSIX compliant
∙ It must provide just enough to not break applications
∙ Often this can only be discovered by testing (see the probe sketch below)
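Because compliance gaps usually only show up under test, a small probe along the following lines can be useful. This is an illustrative sketch, not from the talk; the mount point /mnt/dfs and the probe file names are placeholders.

#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(int argc, char **argv) {
    const char *dir = (argc > 1) ? argv[1] : "/mnt/dfs";   /* placeholder mount point */
    char a[4096], b[4096];
    snprintf(a, sizeof(a), "%s/.posix_probe_a", dir);
    snprintf(b, sizeof(b), "%s/.posix_probe_b", dir);

    int fd = open(a, O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Advisory whole-file lock: often unsupported or a no-op on network file systems */
    if (flock(fd, LOCK_EX | LOCK_NB) != 0)
        perror("flock");
    else
        puts("flock: ok");

    /* POSIX requires rename() to atomically replace the target */
    if (rename(a, b) != 0)
        perror("rename");
    else
        puts("rename: ok");

    close(fd);
    unlink(b);
    return 0;
}

On a local disk file system both checks typically pass; on some network and FUSE-based mounts flock() may fail or silently do nothing.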


Application-Defined File Systems?

Mounted file system:

    FILE *f = fopen("susy.dat", "r");
    while (...) {
        fread(...);
        ...
    }
    fclose(f);

∙ Application independent from the file system
∙ Allows for standard tools (ls, grep, ...)
∙ System administrator selects the file system

File system library:

    hdfsFS fs = hdfsConnect("default", 0);
    hdfsFile f = hdfsOpenFile(fs, "susy.dat", ...);
    while (...) {
        hdfsRead(fs, f, ...);
        ...
    }
    hdfsCloseFile(fs, f);

∙ Performance-tuned API
∙ Requires code changes
∙ Application selects the file system

What we ideally want is an application-defined, mountable file system: Fuse, Parrot (sketch below)
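To make the FUSE route concrete, here is a minimal sketch of a mountable, application-defined file system written against the libfuse 2.x API. The file name susy.dat and its contents are placeholders; a real implementation would fetch data from a remote store instead of a static string. Standard tools (ls, cat, grep) then work on the mount point without any code changes in the application.

/* Minimal sketch: a read-only FUSE file system exposing a single file. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>

static const char *path_name = "/susy.dat";
static const char *contents  = "event data would go here\n";

static int fs_getattr(const char *path, struct stat *st) {
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, path_name) == 0) {
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = strlen(contents);
    } else {
        return -ENOENT;
    }
    return 0;
}

static int fs_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                      off_t offset, struct fuse_file_info *fi) {
    (void)offset; (void)fi;
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);
    filler(buf, path_name + 1, NULL, 0);
    return 0;
}

static int fs_open(const char *path, struct fuse_file_info *fi) {
    if (strcmp(path, path_name) != 0)
        return -ENOENT;
    if ((fi->flags & O_ACCMODE) != O_RDONLY)
        return -EACCES;
    return 0;
}

static int fs_read(const char *path, char *buf, size_t size, off_t offset,
                   struct fuse_file_info *fi) {
    (void)fi;
    size_t len = strlen(contents);
    if (strcmp(path, path_name) != 0)
        return -ENOENT;
    if ((size_t)offset >= len)
        return 0;
    if (offset + size > len)
        size = len - offset;
    memcpy(buf, contents + offset, size);
    return (int)size;
}

static struct fuse_operations fs_ops = {
    .getattr = fs_getattr,
    .readdir = fs_readdir,
    .open    = fs_open,
    .read    = fs_read,
};

int main(int argc, char *argv[]) {
    /* Usage: ./myfs <mountpoint> */
    return fuse_main(argc, argv, &fs_ops, NULL);
}

It typically builds with gcc myfs.c -o myfs `pkg-config fuse --cflags --libs` and mounts with ./myfs <mountpoint>.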

Agenda

1 What do we want from a distributed file system?

2 Sorting and searching the file system landscape

3 Technology trends and future challenges

[Map of distributed file systems grouped by target community; systems used in HEP are highlighted]

Personal files (privacy, sharing, sync):
∙ Dropbox/ownCloud, Tahoe-LAFS, AFS

Big Data (MapReduce workflows, commodity hardware, incremental scalability):
∙ HDFS, QFS, MooseFS, MapR FS, XtreemFS

General-purpose distributed file systems (ease of administration, high level of POSIX compliance):
∙ Ceph

Supercomputers (fast parallel writes; InfiniBand, Myrinet, ...):
∙ Lustre, GPFS, GlusterFS, (p)NFS, OrangeFS, Panasas, BeeGFS

Shared disk:
∙ OCFS2, GFS2

Used in HEP (tape access, WAN federation, software distribution, fault-tolerance):
∙ dCache, XRootD, CernVM-FS, EOS

File System Architecture


Examples: Hadoop File System, Quantcast File System

Object-based file system

[Diagram: clients send meta-data operations such as create() and delete() to a single meta-data head node and issue read()/write() directly to the data nodes; the head node can help in job scheduling]

Target: Incremental scaling, large & immutable files
Typical for Big Data applications
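For comparison with the mounted-file-system route, the libhdfs fragment shown earlier expands to roughly the following complete client. This is a sketch; error handling is abbreviated, the header location and link flags depend on the Hadoop installation, and the path /user/demo/susy.dat is a placeholder.

#include <stdio.h>
#include <fcntl.h>
#include "hdfs.h"   /* libhdfs C API shipped with Hadoop */

int main(void) {
    /* "default" picks up the namenode configured in core-site.xml */
    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs) { fprintf(stderr, "hdfsConnect failed\n"); return 1; }

    hdfsFile f = hdfsOpenFile(fs, "/user/demo/susy.dat", O_RDONLY, 0, 0, 0);
    if (!f) { fprintf(stderr, "hdfsOpenFile failed\n"); hdfsDisconnect(fs); return 1; }

    char buf[1 << 16];
    tSize n;
    while ((n = hdfsRead(fs, f, buf, sizeof(buf))) > 0) {
        /* process n bytes of event data here */
    }

    hdfsCloseFile(fs, f);
    hdfsDisconnect(fs);
    return 0;
}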

File System Architecture

Examples: Lustre, MooseFS, pNFS, XtreemFS

Parallel file system

[Diagram: clients send meta-data operations (create(), delete()) to a dedicated meta-data server and stripe read()/write() data traffic in parallel across many data servers]

Target: Maximum aggregated throughput, large files

Typical for High-Performance Computing

File System Architecture

Examples: Ceph, OrangeFS

Distributed meta-data

[Diagram: meta-data is itself distributed over several meta-data servers, removing the single head node; clients read()/write() data directly on the data servers]

Target: Avoid single point of failure and meta-data bottleneck

Modern general-purpose distributed file system

File System Architecture


Examples: GlusterFS

Symmetric, peer-to-peer

[Diagram: a distributed hash table maps hash(path_n) to the hosts that store path_n]

∙ Difficult to deal with node churn
∙ Slow lookup beyond the LAN
∙ In HEP we use caching and catalog-based data management

Target: Conceptual simplicity, inherently scalable
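The placement idea can be sketched in a few lines: each client hashes the path and maps the hash onto the list of known servers, so no central meta-data service is consulted. This is an illustration only, not GlusterFS's actual elastic hashing; server and path names are made up.

#include <stdint.h>
#include <stdio.h>

/* FNV-1a: a simple, non-cryptographic hash used here for illustration */
static uint64_t fnv1a(const char *s) {
    uint64_t h = 14695981039346656037ULL;
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;
    }
    return h;
}

int main(void) {
    const char *servers[] = { "node01", "node02", "node03", "node04" };  /* hypothetical hosts */
    const size_t n_servers = sizeof(servers) / sizeof(servers[0]);
    const char *paths[] = { "/data/run1/susy.dat", "/data/run2/higgs.dat" };

    for (size_t i = 0; i < sizeof(paths) / sizeof(paths[0]); i++) {
        size_t host = fnv1a(paths[i]) % n_servers;   /* hash(path) -> host */
        printf("%s -> %s\n", paths[i], servers[host]);
    }
    return 0;
}

The slide's caveats follow directly: with a plain modulo, adding or removing a server (node churn) remaps most paths, and every lookup assumes an up-to-date view of the server list, which is hard to maintain beyond a LAN.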

Agenda

1 What do we want from a distributed file system?

2 Sorting and searching the file system landscape

3 Technology trends and future challenges

Trends and Challenges


We are lucky: large data sets tend to be immutable everywhere
∙ For instance: media, backups, VM images, scientific data sets, ...
∙ Reflected in hardware: shingled magnetic recording drives
∙ Reflected in software: log-structured space management

We need to invest in scaling fault-tolerance and speed together with the capacity
∙ Replication becomes too expensive at the Petabyte and Exabyte scale → erasure codes
∙ Explicit use of SSDs (the Amazon Elastic File System, for instance, will be on SSD)
  ∙ For meta-data (high IOPS requirement)
  ∙ As a fast storage pool
  ∙ As a node-local cache

Log-Structured Data

Idea: Store all modifications in a change log

[Diagram: a Unix file system updates files in place on disk; a log-structured file system appends all modifications to a log on disk and finds them again through an index]
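A toy illustration of the principle (assuming a single writer and a flat key namespace): writes only ever append records to one log file, and a small in-memory index remembers the offset of the latest record per key, so writes are sequential and reads are one seek.

#include <stdio.h>
#include <string.h>

#define MAX_KEYS 1024

/* In-memory index: key -> offset of the latest record in the log */
static char idx_key[MAX_KEYS][64];
static long idx_off[MAX_KEYS];
static int  idx_n;

static long log_append(FILE *log, const char *key, const char *value) {
    fseek(log, 0, SEEK_END);
    long off = ftell(log);
    fprintf(log, "%s=%s\n", key, value);     /* record format: key=value\n */
    fflush(log);
    for (int i = 0; i < idx_n; i++)
        if (strcmp(idx_key[i], key) == 0) { idx_off[i] = off; return off; }
    if (idx_n < MAX_KEYS) {
        snprintf(idx_key[idx_n], sizeof(idx_key[0]), "%s", key);
        idx_off[idx_n++] = off;
    }
    return off;
}

static int log_read(FILE *log, const char *key, char *buf, size_t len) {
    for (int i = 0; i < idx_n; i++) {
        if (strcmp(idx_key[i], key) == 0) {
            fseek(log, idx_off[i], SEEK_SET);
            return fgets(buf, (int)len, log) != NULL;
        }
    }
    return 0;
}

int main(void) {
    FILE *log = fopen("store.log", "a+");    /* append-only data log */
    if (!log) return 1;
    log_append(log, "runlist", "run1,run2");
    log_append(log, "runlist", "run1,run2,run3");  /* update = new record, old one stays */
    char buf[256];
    if (log_read(log, "runlist", buf, sizeof(buf)))
        printf("latest: %s", buf);
    fclose(log);
    return 0;
}

After a crash the index can be rebuilt by scanning the log; the price is a cleaner that eventually reclaims space occupied by superseded records, as in Zebra's stripe cleaner.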

Used by:
∙ The Zebra experimental DFS
∙ Commercial filers (e.g. NetApp)
∙ Key-value and BLOB stores
∙ File systems for flash

Advantages:
∙ Minimal seek
∙ Fast and robust crash recovery
∙ Efficient allocation in DRAM, flash, and disks
∙ Ideal for immutable data

Bandwidth Lags Behind Capacity

[Chart: relative improvement (×1 to ×1000, log scale) of capacity and bandwidth for HDD, DRAM, and SSD between 1995 and 2015]

∙ Capacity and bandwidth of affordable storage scale at different paces
∙ It will be prohibitively costly to constantly "move data in and out" (see the example below)
∙ Ethernet bandwidth scaled similarly to capacity
∙ We should use the high bi-section bandwidth among worker nodes and make them part of the storage network
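As a back-of-the-envelope illustration with the numbers from the backup table: reading a 2015 8 TB archive drive end to end at about 195 MB/s takes roughly 8,000,000 MB / 195 MB/s ≈ 41,000 s, i.e. more than 11 hours, whereas a 1994 4.3 GB drive at 9 MB/s was scanned in about 8 minutes. Repeatedly moving entire data sets in and out therefore stops being an option as capacity outgrows bandwidth.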

Fault Tolerance at the Petascale

How to prevent data loss

1 Replication: simple, large storage overhead (similar to RAID 1)

2 Erasure codes: based on parity blocks (similar to RAID 5)

[Diagram: a file split into data blocks plus parity blocks]
∙ Requires extra compute power
∙ Distribute all blocks

Available commercially: GPFS, NEC Hydrastor, Scality RING, ...
Emerging in file systems: EOS, Ceph, QFS
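The simplest erasure code is the single XOR parity block familiar from RAID 5. The sketch below is illustrative only (production systems typically use Reed-Solomon style codes with several parity blocks); it shows how one parity block lets any single lost data block be rebuilt.

#include <stdio.h>
#include <string.h>

#define BLOCK 8   /* tiny block size for illustration */

/* parity = XOR of all data blocks */
static void compute_parity(unsigned char blocks[][BLOCK], int n, unsigned char *parity) {
    memset(parity, 0, BLOCK);
    for (int b = 0; b < n; b++)
        for (int i = 0; i < BLOCK; i++)
            parity[i] ^= blocks[b][i];
}

/* rebuild one lost block by XORing the parity with all surviving blocks */
static void rebuild(unsigned char blocks[][BLOCK], int n, int lost, const unsigned char *parity) {
    memcpy(blocks[lost], parity, BLOCK);
    for (int b = 0; b < n; b++)
        if (b != lost)
            for (int i = 0; i < BLOCK; i++)
                blocks[lost][i] ^= blocks[b][i];
}

int main(void) {
    unsigned char data[3][BLOCK] = { "block_a", "block_b", "block_c" };
    unsigned char parity[BLOCK];

    compute_parity(data, 3, parity);

    memset(data[1], 0, BLOCK);                    /* simulate losing the node holding block 1 */
    rebuild(data, 3, 1, parity);

    printf("recovered: %s\n", (char *)data[1]);   /* prints "block_b" */
    return 0;
}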

Engineering Challenges
∙ Fault detection
∙ De-correlation of failure domains instead of random placement
∙ Failure prediction, e.g. based on MTTF and Markov models

Fault Tolerance at the Petascale

Example from Facebook

∙ Replication factor 2.1 using erasure coding within every data center plus cross-data-center XOR encoding

[Figure from Muralidhar et al. (2014): geo-replicated XOR coding. A block A stored in datacenter 1 and a block B stored in datacenter 2 are XORed into a block A XOR B kept in datacenter 3, so the data survive the loss of any single datacenter.]

Muralidhar et al. (2014), Link

Towards a Better Implementation of File Systems

Workflows: Distributed File Systems
∙ Integration with applications
∙ Feature-rich: quotas, permissions, links
∙ Globally federated namespace

Building blocks for data provisioning: Object Stores
∙ Very good on local networks
∙ Flat namespace
∙ It turns out that certain types of data do not need a hierarchical namespace, e.g. cached objects, media, VM images

Conclusion

File Systems are at the Heart of our Distributed Computing
∙ Most convenient way of connecting our applications to data
∙ The hierarchical namespace is a natural way to organize data

We can tailor file systems to our needs using existing building blocks, with a focus on:

1 Wide area data federation

2 Multi-level storage hierarchy from tape to memory

3 Erasure coding

4 Coalescing storage and compute nodes

Thank you for your time!

Backup Slides

Milestones in Distributed File Systems
Biased towards open-source, production file systems

Timeline: 1983 AFS · 1985 NFS · 1995 Zebra · 2000 OceanStore · 2002 Venti · 2003 GFS · 2005 XRootD · 2007 Ceph

AFS (1983)

∙ Roaming home folders
∙ Identity tokens and access control lists (ACLs)
∙ Decentralized operation ("Cells")

"AFS was the first safe and efficient distributed computing system, available [. . . ] on campus. It was a clear precursor to the Dropbox-like software packages today. [. . . ] [It] allowed students (like Drew Houston and Arash Ferdowsi) access to all their stuff from any connected computer."
http://www.wired.com/2011/12/backdrop-dropbox

NFS (1985)

Client-server

∙ Focus on portability
∙ Separation of protocol and implementation
∙ Stateless servers
∙ Fast crash recovery

Sandberg, Goldberg, Kleiman, Walsh, Lyon (1985)

Zebra (1995)

∙ Striping and parity
∙ Redundant array of inexpensive nodes (RAIN)
∙ Log-structured data

[Figure from Hartman and Ousterhout (1995): per-client striping in Zebra; each client batches its new file data into a single append-only log and stripes the log, together with parity, across the storage servers. A file manager handles meta-data and a stripe cleaner reclaims unused space.]

OceanStore (2000)

Peer-to-peer

∙ "Global scale": 10^10 users, 10^14 files
∙ Untrusted infrastructure
∙ Based on a peer-to-peer overlay network
∙ Nomadic data through aggressive caching
∙ Foundation for today's decentralized Dropbox replacements

Kubiatowicz et al. (2000)

[Figure from Kubiatowicz et al. (2000): the OceanStore system as a multitude of highly connected "pools" among which data flows freely; clients connect to one or more pools, perhaps intermittently.]

Venti (2002)

Archival storage

∙ De-duplication through content-addressable storage
∙ Content hashes provide intrinsic file integrity
∙ Merkle trees verify the file system hierarchy

Quinlan and Dorward (2002)

[Figure from Quinlan and Dorward (2002): storage size of two Plan 9 file servers (bootes and emelie) over time, and the ratio of archival to active data with and without Venti.]

GFS (2003)

Object-based

∙ Co-designed for map-reduce
∙ Coalesce storage and compute nodes
∙ Serialize data access

[Figure: GFS/map-reduce architecture, from the Google AppEngine documentation]

XRootD (2005)

∙ Namespace delegation
∙ Global tree of redirectors
∙ Flexibility through decomposition into pluggable components
∙ Namespace independent from data access

[Figure: a client's open() is redirected down a tree of xrootd/cmsd servers; with 64-way fan-out, one level covers 64^1 = 64 servers, two levels 64^2 = 4096 servers]

Hanushevsky (2013)

Ceph (2007)

Ceph File System and RADOS

∙ Parallel, distributed meta-data
∙ Peer-to-peer file system at the cluster scale
∙ Data placement across failure domains
∙ Adaptive workload distribution

Weil (2007)

[Figure and table from Weil (2007): CRUSH descends a cluster map hierarchy of rows, cabinets, and shelves of disks, e.g. take(root), select(1,row), select(3,cabinet), select(1,disk), emit, to distribute three replicas across three cabinets in the same row.]

Hardware Bandwidth and Capacity Numbers

Method and entries marked † from Patterson (2004) Link

Year | HDD capacity | HDD bandwidth | SSD (MLC flash) capacity | SSD bandwidth (r/w) | DRAM capacity | DRAM bandwidth | Ethernet bandwidth
1993 | | | | | 16 Mibit/chip† | 267 MiB/s† |
1994 | 4.3 GB† | 9 MB/s† | | | | |
1995 | | | | | | | 100 Mbit/s†
2003 | 73.4 GB† | 86 MB/s† | | | | | 10 Gbit/s†
2004 | | | | | 512 Mibit/chip | 3.2 GiB/s |
2008 | | | 160 GB | 250/100 MB/s [3] | | |
2012 | | | 800 GB | 500/460 MB/s [4] | | |
2014 | 6 TB | 220 MB/s [1] | 2 TB | 2.8/1.9 GB/s [5] | 8 Gibit/chip | 25.6 GiB/s | 100 Gbit/s
2015 | 8 TB | 195 MB/s [2] | | | | |
Increase | ×1860 | ×24 | ×12.5 | ×11.2 | ×512 | ×98 | ×1000

[1] http://www.storagereview.com/seagate_enterprise_capacity_6tb_35_sas_hdd_review_v4
[2] http://www.storagereview.com/seagate_archive_hdd_review_8tb
[3] http://www.storagereview.com/intel_x25-m_ssd_review
[4] http://www.storagereview.com/intel_ssd_dc_s3700_series_enterprise_ssd_review
[5] http://www.tomshardware.com/reviews/intel-ssd-dc-p3700-nvme,3858.html

HDD: Seagate ST15150 (1994)†, Seagate 373453 (2004)†, Seagate ST6000NM0034 (2014), Seagate ST8000AS0012 [SMR] (2015)
SSD: X25-M (2008), Intel SSD DC S3700 (2012), Intel SSD DC P3700 (2014)
DRAM: Fast DRAM (1993)†, DDR2-400 (2004), DDR4-3200 (2014)
Ethernet: Fast Ethernet IEEE 802.3u (1995)†, 10 GbitE IEEE 802.3ae (2003)†, 100 GbitE IEEE 802.3bj (2014)


Data Integrity and File System Snapshots

[Merkle tree over the file system namespace: the root (/) has contents A, B and stores hash(h_A, h_B); directory /A has contents α, β and stores hash(h_α, h_β) =: h_A; the empty directory /B stores hash(∅) =: h_B; the leaves /A/α and /A/β store hash(data_α) =: h_α and hash(data_β) =: h_β]

∙ A hash tree with a cryptographic hash function provides a secure identifier for sub trees
∙ It is easy to sign a small hash value (data authenticity)
∙ Efficient calculation of changes (fast replication)
∙ Bonus: versioning and data de-duplication
∙ Full potential together with content-addressable storage
∙ Self-verifying data chunks, trivial to distribute and cache
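A minimal sketch of the hash-tree construction for the pictured namespace; a toy 64-bit FNV-1a hash stands in for the cryptographic hash (e.g. SHA-1 as in Venti) that a real system would use.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy hash standing in for a cryptographic hash function */
static uint64_t h_bytes(const void *data, size_t len) {
    const unsigned char *p = data;
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

int main(void) {
    /* Leaves: hash of the file contents */
    const char *data_alpha = "contents of /A/alpha";
    const char *data_beta  = "contents of /A/beta";
    uint64_t h_alpha = h_bytes(data_alpha, strlen(data_alpha));
    uint64_t h_beta  = h_bytes(data_beta,  strlen(data_beta));

    /* Directory /A: hash over the hashes of its entries */
    uint64_t pair_a[2] = { h_alpha, h_beta };
    uint64_t h_a = h_bytes(pair_a, sizeof(pair_a));

    /* Empty directory /B: hash over nothing */
    uint64_t h_b = h_bytes("", 0);

    /* Root: hash over the hashes of /A and /B. Signing this one value
       authenticates the entire tree; changing any file changes the root. */
    uint64_t pair_root[2] = { h_a, h_b };
    uint64_t h_root = h_bytes(pair_root, sizeof(pair_root));

    printf("root hash: %016llx\n", (unsigned long long)h_root);
    return 0;
}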

System Integration

System-aided

∙ Kernel-level file system: fast, but intrusive
∙ Virtual NFS server: easy deployment, but limited by NFS semantics

Unprivileged options (mostly)

∙ Library (e.g. libhdfs, libXrd, ...): tuned API, but not transparent
∙ Interposition systems:
  ∙ Pre-loaded libraries
  ∙ ptrace based
  ∙ FUSE ("file system in user space")

[Diagram: the application's open() goes through libc into the OS kernel as a system call; with FUSE, the kernel forwards it as an upcall to a user-space file system process]
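As an illustration of the pre-loaded-library option, a minimal Linux/glibc sketch that interposes open(). It only logs the call here, but the same hook is where an interposition system could redirect paths to a remote-storage client library.

/* Build: gcc -shared -fPIC interpose.c -o interpose.so -ldl
   Use:   LD_PRELOAD=./interpose.so ls /some/path              */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>

int open(const char *path, int flags, ...) {
    static int (*real_open)(const char *, int, ...) = NULL;
    if (!real_open)
        real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

    mode_t mode = 0;
    if (flags & O_CREAT) {               /* open() takes a mode only with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }

    fprintf(stderr, "[interpose] open(%s)\n", path);
    /* A real interposition layer would rewrite 'path' or hand the call
       to a remote-storage client library here. */
    return real_open(path, flags, mode);
}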

Bibliography I

Survey Articles

Satyanarayanan, M. (1990). A survey of distributed file systems. Annual Review of Computer Science, 4(1):73–104.

Guan, P., Kuhl, M., Li, Z., and Liu, X. (2000). A survey of distributed file systems. University of California, San Diego.

Thanh, T. D., Mohan, S., Choi, E., Kim, S., and Kim, P. (2008). A taxonomy and survey on distributed file systems. In Proc. Int. Conf. on Networked Computing and Advanced Information Management (NCM'08), pages 144–149.

Peddemors, A., Kuun, C., Spoor, R., Dekkers, P., and den Besten, C. (2010). Survey of technologies for wide area distributed storage. Technical report, SURFnet.

Depardon, B., Séguin, C., and Mahec, G. L. (2013). Analysis of six distributed file systems. Technical Report hal-00789086, Université de Picardie Jules Verne.

Bibliography II

Donvito, G., Marzulli, G., and Diacono, D. (2014). Testing of several distributed file systems (HDFS, Ceph and GlusterFS) for supporting the HEP experiment analysis. Journal of Physics: Conference Series, 513.

File Systems

Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., and Lyon, B. (1985). Design and implementation of the Sun Network Filesystem. In Proc. of the Summer USENIX Conference, pages 119–130.

Morris, J. H., Satyanarayanan, M., Conner, M. H., Howard, J. H., Rosenthal, D. S. H., and Smith, F. D. (1986). Andrew: A distributed personal computing environment. Communications of the ACM, 29(3):184–201.

Hartman, J. H. and Ousterhout, J. K. (1995). The Zebra striped network file system. ACM Transactions on Computer Systems, 13(3):274–310.

Bibliography III

Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., and Zhao, B. (2000). OceanStore: An architecture for global-scale persistent storage. ACM SIGPLAN Notices, 35(11):190–201.

Quinlan, S. and Dorward, S. (2002). Venti: A new approach to archival storage. In Proc. of the 1st USENIX Conf. on File and Storage Technologies (FAST'02), pages 89–102.

Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. ACM SIGOPS Operating Systems Review, 37(5):29–43.

Schwan, P. (2003). Lustre: Building a file system for 1,000-node clusters. In Proc. of the 2003 Linux Symposium, pages 380–386.

Dorigo, A., Elmer, P., Furano, F., and Hanushevsky, A. (2005). XROOTD: A highly scalable architecture for data access. WSEAS Transactions on Computers, 4(4):348–353.

Bibliography IV

Weil, S. A. (2007). Ceph: reliable, scalable, and high-performance distributed storage. PhD thesis, University of California Santa Cruz.
