A Survey of Distributed Technology

Jakob Blomer CERN PH/SFT and Stanford University

ACAT 2014 Prague

Agenda

Motivation: physics experiments store their data in distributed file systems.
In High Energy Physics:
∙ Global federation of file systems
∙ Hundreds of petabytes of data
∙ Hundreds of millions of objects

Outline

1 Usage of distributed file systems

2 Survey and taxonomy

3 Critical areas in distributed file systems for physics applications

4 Developments and future challenges

Distributed File Systems

A distributed file system (DFS) provides

1 persistent storage

2 of opaque data (files)

3 in a hierarchical namespace that is shared among networked nodes

∙ Files survive the lifetime of processes and nodes
∙ POSIX-like interface: open(), close(), read(), write(), ...
∙ Typically transparent to applications
∙ Data model and interface distinguish a DFS from a distributed (No-)SQL database or a distributed key-value store

Popular DFSs: AFS, Ceph, CernVM-FS, dCache, EOS, FhGFS, GlusterFS, GPFS, HDFS, Lustre, MooseFS, NFS, PanFS, XrootD

Use Cases and Demands

Data classes:
∙ Home folders
∙ Physics data: recorded, simulated, analysis results
∙ Software binaries
∙ Scratch area

[Figure: radar chart comparing the data classes along the dimensions change frequency, request rate (MB/s), request rate (IOPS), mean file size, volume, value, confidentiality, cache hit rate, and redundancy; data are illustrative]

Depending on the use case, the dimensions span orders of magnitude (logarithmic axes)

Distributed File Systems and Use Cases

> ls .
event_sample.root
analysis.C

File system: "please take special care of this file!"

> ls /
/home
/data
/software
/scratch        (implicit QoS)

∙ It is difficult to perform well under usage characteristics that differ by 4 orders of magnitude
∙ File system performance is highly susceptible to characteristics of individual applications
∙ There is no interface to specify quality of service (QoS) for a particular file
→ We will deal with a number of DFSs for the foreseeable future

POSIX Compliance

Essential file system operations: create(), unlink(), stat(), open(), close(), read(), write(), seek()

Difficult for DFSs: file locks, atomic rename(), open unlinked files
Impossible for DFSs: hard links, device files, IPC files

∙ No DFS is fully POSIX compliant
∙ It must provide just enough to not break applications
∙ Field test necessary
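The "field test" point can be made concrete: the sketch below probes two of the fragile semantics listed above on whatever file system backs a given directory. It is an illustrative check, not part of any DFS test suite, and `probe_posix_semantics` is a hypothetical helper name.

```python
import os
import tempfile

# Probe two POSIX behaviors that distributed file systems often break:
# atomic rename() over an existing file, and reading a file after unlink().
def probe_posix_semantics(directory):
    results = {}

    # Atomic rename: rename() must replace the destination in one step.
    src = os.path.join(directory, "new_version")
    dst = os.path.join(directory, "current")
    with open(src, "w") as f:
        f.write("v2")
    with open(dst, "w") as f:
        f.write("v1")
    os.rename(src, dst)  # POSIX requires atomic replacement of dst
    with open(dst) as f:
        results["rename"] = (f.read() == "v2")

    # Open unlinked file: data must stay readable until the last close.
    path = os.path.join(directory, "scratch_file")
    with open(path, "w") as f:
        f.write("payload")
    f = open(path)
    os.unlink(path)                                 # remove the name ...
    results["unlinked"] = (f.read() == "payload")   # ... data still readable
    f.close()
    return results

with tempfile.TemporaryDirectory() as d:
    print(probe_posix_semantics(d))
```

On a local POSIX file system both probes succeed; on a network file system either one may fail, which is exactly why a field test is necessary.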

Architecture Sketches

Network shares, client-server

[Diagram: several clients mount /share from a single server]

Goals: simplicity, separate storage from application
Example: NFS 3

Architecture Sketches

Namespace delegation

[Diagram: the /physics namespace with the /physics/ams subtree delegated to another server]

Goals: scaling network shares, decentralized administration
Example: AFS

Architecture Sketches

Object-based file system

[Diagram: clients send create()/delete() to a meta-data server and read()/write() directly to the data servers]

Goals: separate meta-data from data, incremental scaling
Example:

Architecture Sketches

Parallel file system

[Diagram: as in the object-based layout, but each file is striped across several data servers]

Goals: maximum throughput, optimized for large files
Example: Lustre

Architecture Sketches

Distributed meta-data

[Diagram: object-based layout in which the meta-data service itself is distributed across several servers]

Goals: avoid single point of failure and meta-data bottleneck
Example: Ceph

Architecture Sketches

Symmetric, peer-to-peer

[Diagram: a distributed hash table maps hash(path) to the hosts storing that path]

Goals: conceptual simplicity, inherently scalable
Difficult to deal with node churn, long lookup beyond LAN
Example: GlusterFS

Milestones in Distributed File Systems
Biased towards open-source, production file systems

1983 AFS · 1985 NFS · 1995 Zebra · 2000 OceanStore · 2002 Venti · 2003 GFS · 2005 XRootD · 2007 Ceph

AFS (Andrew File System)
∙ Client-server
∙ Roaming home folders
∙ Identity tokens and access control lists (ACLs)
∙ Decentralized operation ("Cells")

"AFS was the first safe and efficient distributed computing system, available [. . . ] on campus. It was a clear precursor to the Dropbox-like software packages today. [. . . ] [It] allowed students (like Drew Houston and Arash Ferdowsi) access to all their stuff from any connected computer."
http://www.wired.com/2011/12/backdrop-dropbox

NFS
∙ Client-server
∙ Focus on portability
∙ Separation of protocol and implementation
∙ Stateless servers
∙ Fast crash recovery

Sandberg, Goldberg, Kleiman, Walsh, Lyon (1985)

Zebra File System
∙ Striping and parity
∙ Redundant array of inexpensive nodes (RAIN)
∙ Log-structured data

[Figure: per-client striping: each client forms its new file data into a single append-only log and stripes this log across the storage servers; parity is computed for the log, not for individual files]

Hartman, Ousterhout (1995)

OceanStore
∙ Peer-to-peer
∙ "Global Scale": 10^10 users, 10^14 files
∙ Untrusted infrastructure
∙ Based on a peer-to-peer overlay network
∙ Nomadic data through aggressive caching
∙ Foundation for today's decentralized dropbox replacements

Kubiatowicz et al. (2000)
Venti
∙ Archival storage
∙ De-duplication through content-addressable storage
∙ Content hashes provide intrinsic file integrity
∙ Merkle trees verify the file system hierarchy

[Figure: storage growth of two Plan 9 file servers (bootes and emelie); stored on Venti, the cost of retaining daily snapshots is almost zero]

Quinlan and Dorward (2002)

Google File System (GFS)
∙ Object-based
∙ Co-designed for map-reduce
∙ Coalesce storage and compute nodes
∙ Serialize data access

[Figure source: Google AppEngine documentation]

XRootD
∙ Namespace delegation: a client's open() is redirected through a global tree of redirectors (64 nodes at the first level, 64^2 = 4096 at the second), each an xrootd/cmsd pair
∙ Flexibility through decomposition into pluggable components
∙ Namespace independent from data access

Hanushevsky (2013)

Ceph File System and RADOS
∙ Parallel, distributed meta-data
∙ Peer-to-peer file system at the cluster scale
∙ Data placement across failure domains
∙ Adaptive workload distribution

[Figure: CRUSH placement: a rule walks a cluster map of rows, cabinets, and disks; e.g. take(root), select(1,row), select(3,cabinet), select(1,disk), emit distributes three replicas across three cabinets in the same row]

Weil (2007)
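The CRUSH rule above can be sketched in code: at each level of the cluster map, a rule pseudo-randomly selects n distinct children per input item, so replicas land in distinct failure domains without any lookup table. The cluster map and helper names below are hypothetical, and the real CRUSH algorithm (bucket types, weights) is considerably more elaborate.

```python
import hashlib

# CRUSH-style hierarchical placement sketch.
def _choose(children, key, attempt):
    i = int(hashlib.sha256(f"{key}/{attempt}".encode()).hexdigest(), 16)
    return children[i % len(children)]

def select(cluster, items, n, obj):
    """For each input item, pick n distinct children, pseudo-randomly
    but deterministically per object."""
    out = []
    for item in items:
        children = sorted(cluster[item])
        picked, attempt = [], 0
        while len(picked) < n:
            c = _choose(children, f"{obj}@{item}", attempt)
            if c not in picked:      # collision: retry with a fresh draw
                picked.append(c)
            attempt += 1
        out.extend(picked)
    return out

# Hypothetical map: take(root) -> select(1,row) -> select(3,cabinet)
#                   -> select(1,disk) -> emit
cluster = {
    "root": ["row1", "row2"],
    "row1": ["cab11", "cab12", "cab13"], "row2": ["cab21", "cab22", "cab23"],
    "cab11": ["d1", "d2"],  "cab12": ["d3", "d4"],   "cab13": ["d5", "d6"],
    "cab21": ["d7", "d8"],  "cab22": ["d9", "d10"],  "cab23": ["d11", "d12"],
}
row = select(cluster, ["root"], 1, "objA")
cabs = select(cluster, row, 3, "objA")
disks = select(cluster, cabs, 1, "objA")  # three replicas, three cabinets
print(disks)
```

Because placement is a pure function of the object name and the cluster map, any client can compute where a replica lives, which is what makes the meta-data path scalable.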

A Taxonomy

[Mind map of techniques found in distributed file systems:]
∙ Scalability: namespace delegation, distributed hash tables, workload adaptation
∙ Fault-tolerance: consensus, replication, erasure codes
∙ Security, integrity: checksumming, encryption, Merkle trees
∙ Efficiency: caching, striping, log-structured data, content-addressable storage, de-duplication, snapshots

Critical Areas in DFSs for Physics Applications

Fault-Tolerance
Fundamental problems as the number of components grows:

1 Faults are the norm

2 Faults are often correlated

3 No safe way to distinguish temporary unavailability from faults

Bandwidth Utilization
∙ Data structures that work throughout the memory hierarchy
∙ Efficient writing of small files: analysis result merging, meta-data

Fault Tolerance and Data Reliability

Data Reliability Techniques
∙ Replication: simple and fast but large storage overhead; trend from random placement to "de-correlation"
∙ Erasure codes: any n + ε out of n + k blocks reconstruct the data; different codes offer different trade-offs between computational complexity and storage overhead
∙ Checksums: detect silent data corruption

Engineering Challenges
∙ Fault detection
∙ Automatic and fast recovery
∙ Failure prediction, e. g. based on MTTF and Markov models
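The trade-off space can be illustrated at its cheapest corner: single-parity XOR, an erasure code with k = 1. It is trivial to compute but tolerates only one lost block; production systems typically use e.g. Reed-Solomon codes for larger k. A toy sketch:

```python
# Toy erasure code: n data blocks plus one XOR parity block (k = 1).
def encode(blocks):
    """Append one parity block: the XOR of all (equal-length) data blocks."""
    parity = bytes(len(blocks[0]))
    for b in blocks:
        parity = bytes(x ^ y for x, y in zip(parity, b))
    return blocks + [parity]

def reconstruct(stored):
    """Recover the single missing block (marked None) as the XOR of the
    surviving blocks, then return the n data blocks."""
    missing = stored.index(None)
    acc = bytes(len(next(b for b in stored if b is not None)))
    for b in stored:
        if b is not None:
            acc = bytes(x ^ y for x, y in zip(acc, b))
    out = list(stored)
    out[missing] = acc
    return out[:-1]  # drop the parity block

data = [b"aaaa", b"bbbb", b"cccc"]   # n = 3 data blocks
stored = encode(data)                # n + k = 4 blocks on 4 nodes
stored[1] = None                     # one node fails
assert reconstruct(stored) == data
```

Storage overhead here is 1/n versus 2x or 3x for replication, which is exactly the trade against the (here minimal) reconstruction cost.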

Fault Tolerance and Data Reliability

Example from Google data centers
∙ Statistical separation of temporary and permanent faults
∙ Less than 10 % of node unavailability events last longer than 15 minutes, so recovery typically starts only after a 15-minute wait

[Figure: cumulative distribution function of the duration of node unavailability periods]

Ford et al. (2010)

Log-Structured Data

Idea: store all modifications in a change log; use the full hardware bandwidth even for small objects

[Diagram: Unix file system (directory → inode → file on disk) vs. log-structured file system (directory → index → append-only log on disk)]

Used by
∙ Zebra experimental DFS
∙ Commercial filers (e. g. NetApp)
∙ Key-value stores
∙ File systems for …

Advantages
∙ Minimal seek, no in-place updates
∙ Fast and robust crash recovery
∙ Efficient allocation in DRAM, flash, and disks
∙ Applicable for merging, meta-data
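The scheme sketched above, writes append to a single log while an in-memory index maps keys to log offsets, can be shown in a few lines. Recovery is a sequential replay of the log, which is what makes crashes cheap. A minimal sketch with a made-up record format (`LogStore` is hypothetical; real systems add checksums, segments, and a cleaner):

```python
import io

# Minimal log-structured key-value store.
class LogStore:
    def __init__(self):
        self.log = io.BytesIO()   # stand-in for an append-only file
        self.index = {}           # key -> offset of the latest record

    def put(self, key: str, value: bytes):
        offset = self.log.seek(0, io.SEEK_END)   # append, never overwrite
        rec = key.encode() + b"\x00" + value
        self.log.write(len(rec).to_bytes(4, "big") + rec)
        self.index[key] = offset

    def get(self, key: str) -> bytes:
        self.log.seek(self.index[key])
        n = int.from_bytes(self.log.read(4), "big")
        _, value = self.log.read(n).split(b"\x00", 1)
        return value

    @classmethod
    def recover(cls, raw: bytes):
        """Rebuild the index by replaying the log sequentially."""
        store = cls()
        store.log = io.BytesIO(raw)
        pos = 0
        while pos < len(raw):
            n = int.from_bytes(raw[pos:pos + 4], "big")
            key, _ = raw[pos + 4:pos + 4 + n].split(b"\x00", 1)
            store.index[key.decode()] = pos      # later records win
            pos += 4 + n
        return store

s = LogStore()
s.put("a", b"v1")
s.put("a", b"v2")   # an update is a new record; the old one stays in the log
r = LogStore.recover(s.log.getvalue())
assert r.get("a") == b"v2"
```

Note how many small writes become one sequential stream, which is why the technique applies to the small-file merging problem mentioned earlier.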

Do we get what we need?

Typical physics experiment cluster
∙ Up to 1 000 standard nodes
∙ 1 GbE or 10 GbE network, high bisection bandwidth

Goals for a DFS for analysis applications
∙ At least 90 % of the disk capacity available
∙ At least 50 % of the maximum aggregated throughput
∙ Fault-tolerant to a small number of disk/node failures
∙ Symmetric architecture, fully decentralized

Bruce Allen, MPI for Gravitational Physics (2009) Link

19 / 23

Complexity and Decomposition in Distributed File Systems

For a DFS: At least 5 years from inception to widespread adoption

Complexity
∙ Open source DFSs comprise some 100 kLOC to 750 kLOC
∙ It takes a community effort to stabilize them
∙ Once established, it can become prohibitively expensive to move to a different DFS (data lock-in)

20 / 23

Decomposition
∙ Good track record of “outsourcing” tasks, e. g. authentication (Kerberos), distributed coordination (ZooKeeper)
∙ Ongoing: separation of namespace and data access
∙ Increases the number of standards and interfaces and temporarily increases the effort on the development side
⊕ Faster adaptation to a changing computing landscape

Distributed File Systems in the Exascale

∙ “Exascale” computing (EB of data, 10^18 ops/s) envisaged by 2020
∙ Storage capacity and bandwidth scale at different pace
∙ It will be more difficult or impossible to constantly “move data in and out”
∙ Ethernet bandwidth scaled similarly to capacity
∙ Yielding segregation of storage and computing to a symmetric DFS can be part of a solution

[Chart: relative increase (×1 to ×1000) of bandwidth and capacity for HDD, DRAM, and Ethernet, 1994–2014]

21 / 23
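The capacity/bandwidth gap in the chart can be made concrete with the HDD figures from the backup slide: the time to read one disk end-to-end grew from minutes to hours over those two decades.

```python
# Capacity has grown much faster than bandwidth, so the time to stream a
# full disk keeps increasing (HDD figures from the backup slide).
hdd_1994 = {"capacity_gb": 4.3,    "bandwidth_mb_s": 9}
hdd_2014 = {"capacity_gb": 6000.0, "bandwidth_mb_s": 220}

def full_scan_hours(disk):
    """Hours needed to read the whole disk at its sequential bandwidth."""
    return disk["capacity_gb"] * 1000 / disk["bandwidth_mb_s"] / 3600

print(round(full_scan_hours(hdd_1994), 2))  # 0.13 h, about 8 minutes
print(round(full_scan_hours(hdd_2014), 2))  # 7.58 h
```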



It will be more difficult or Increase ∙ impossible to constantly “move data in and out” ×10 Ethernet bandwidth scaled Bandwidth ∙ similarly to capacity ∙  Capacity Yielding segregation of ∙ ×1 storage and computing to 1994 2004 2014 a symmetric DFS can be Year part of a solution 21 / 23 Checkpointing is the state-of-the-art technique for delivering requirements are modeled to include 7 days of computing on the fault tolerance for high-performance computing (HPC) on large- entire HEC system. By 1M nodes, checkpointing with parallel file scale systems. It has been shown that writing the state of a process systems (red line with circle symbol) has a complete performance to persistent storage is the largest contributor to the performance collapse (as was predicted from Figure 5). Note that the overhead of checkpointing [12]. Let's assume that parallel file hypothetical distributed file system (blue line with hollow circle systems are expected to continue to improve linearly in symbol) could still have 90%+ uptime even at exascale levels. throughput with the growing number of nodes (an optimistic assumption); however, note that per node memory (the amount of 3. RADICAL NEW VISION state that is to be saved) will likely continue to grow We believe the HEC community needs to develop both the exponentially. Assume the MTTF is modeled after the BlueGene theoretical and practical aspects of building efficient and scalable with a claimed 1000 years MTTF per node (another optimistic distributed storage for HEC systems that will scale four orders of assumption). Figure 5 shows the expectedDistributed MTTF of about File 7 days Systems magnitude in in the concurrency. 
Exascale We believe that emerging distributed for a 0.8 petaflop IBM BlueGene/P with 64K nodes, while the file systems could co-exist with existing parallel file systems, and checkpointing overhead is about 34 minutes; at 1M nodes (~1 could be optimized to support a subset of workloads critical for exaflop), the MTTF would be only 8.4 hours, while the HPC“Exascale” and MTC computing workloads at exascale. There have been other ∙ 18 checkpointing overheadExample would be over from 9 HPChours. distributed(EB of filedata, systems 10 ops/s) proposed in the past, such as Google's GFS [13] and Yahoo's HDFS [14]; however these have not been 1000 envisaged by 2020 SystemMTTF(hours) widely adopted in HEC due to the workloads, data access patterns, CheckpointOverhead(hours) and supported interfaces (POSIX [15]) being quite different.

(hours) Current file systems lack scalable distributed metadata 100 management,Storage capacity and well anddefined interfaces to expose data locality

(hours) ∙for general computing frameworks that are not constrained by the bandwidth scale at map-reduce model to allow data-aware job scheduling with batch 10 Overhead different pace schedulers (e.g. PBS [16], SGE [17], Condor [18], Falkon [19]). MTTF WeIt believe will be that more future difficult HEC systems or should be designed with non- 1 ∙volatile memory (e.g. solid state memory [20], phase change

System impossible to constantly memory [21]) on every compute node (Figure 7); every compute “move data in and out” node would actively participate in the metadata and data Checkpointing 0.1 management,Ethernet bandwidth leveraging the scaled abundance of computational power ∙many-core processors will have and the many orders of magnitude highersimilarly bisection to capacity bandwidth in multi-dimensional torus networks as compared to available cost effective bandwidth into remote SystemScale(#ofNodes) networkYielding persistent segregation storage. of This shift in large-scale storage Figure 5: Expected checkpointing cost and MTTF towards ∙architecturestorage and design computing will lead to to improving application Raicu, Foster, Beckmanexascale (2011) Link performancea symmetric and DFS scalability can be for some of the most demanding applications. This shift in design is controversial as it requires Simulations results (see Figure 6) from 1K nodes to 2M nodes part of a solution storage architectures in HEC to be redefined from their traditional (simulating HEC from 2000 to 2019) show the application uptime architectures of the past several decades. This new approach was collapse for capability HEC systems. Today, 20% or more of the 21 / 23 not feasible up to recently due to the unreliable spinning disk computing capacity in a large HPC system is wasted due to technology [22] that has dominated the persistent storage space failures and recoveries [12], which is confirmed by the simulation since the dawn of supercomputing. However, the advancements in results. solid-state memory (with MTTF of over two million hours [23]) 100% have opened up opportunities to rethink storage systems for HEC, 90% 2000 2009 distributing storage on every compute node, without sacrificing BG/L BG/P node reliability or power consumption. 
The benefits of this new 80% 1024nodes 73,728nodes % architecture lies in its enabling of some workloads to scale near- 70% 2007 linearly with systems scales by leveraging data locality and the BG/L 60% 106,496nodes full network bisection bandwidth. Uptime 50% 40% 30%

Application 20% NoCheckpointing 2019 10% CheckpointingtoParallelFileSystem CheckpointingtoDistributedFileSystem ~1,000,000nodes 0%

Scale(#ofnodes) Figure 6: Simulation application uptime towards exascale The distributed file system shown in the simulation results is a hypothetical file system that could scale near linearly by Figure 7: Proposed storage architecture with persistent local preserving the data locality in checkpointing. The application storage on every compute node
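The failure model quoted in the excerpt can be sketched directly: assuming independent node failures, the system-level MTTF shrinks linearly with node count. With the excerpt's 1000-year per-node MTTF this simple model lands in the same range as the quoted figures (8.4 hours at ~1M nodes; the paper's machine model differs slightly).

```python
HOURS_PER_YEAR = 365.25 * 24

def system_mttf_hours(node_mttf_years, nodes):
    # With independent, exponentially distributed node failures, the
    # expected time to the first failure anywhere in the system shrinks
    # linearly with the number of nodes.
    return node_mttf_years * HOURS_PER_YEAR / nodes

# 1000-year per-node MTTF, as assumed in the excerpt.
print(round(system_mttf_hours(1000, 64 * 1024)))     # ~134 h, a few days
print(round(system_mttf_hours(1000, 1_000_000), 1))  # ~8.8 h at ~1M nodes
```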

The Road Ahead


Data Sharing: Distributed File Systems
∙ Seamless integration with applications
∙ Feature-rich: quotas, permissions, links
∙ Globally federated namespace

Building Blocks

Data Provisioning: Key-Value and Blob Stores
∙ Very good scalability on local networks
∙ Flat namespace + search facility
∙ Turns out that certain types of data don’t need a hierarchical namespace, e. g. cached objects, media, VM images

22 / 23


Conclusion

1 Distributed file systems are here to stay
∙ physics data processing applications use file systems
∙ the hierarchical namespace is a natural way to organize data

2 Hard disks become data silos
∙ We need to focus on optimal bandwidth utilization
∙ Once written, we have to leave data where they are → storage and compute nodes coalesce

3 Every now and then, a new file system comes along — they are all assembled from the same technology toolbox
∙ We need solidly engineered building blocks from this toolbox
∙ We need to validate early with our real application workload

23 / 23

http://lonelychairsatcern.tumblr.com

Backup Slides

Source of Hardware Bandwidth and Capacity Numbers

Method and entries marked † from Patterson (2004) Link

Year     | HDD Capacity | HDD Bandwidth | DRAM Capacity  | DRAM Bandwidth | Ethernet Bandwidth
1993     |              |               | 16 Mibit/chip† | 267 MiB/s†     |
1994     | 4.3 GB†      | 9 MB/s†       |                |                |
1995     |              |               |                |                | 100 Mbit/s†
2003     | 73.4 GB†     | 86 MB/s†      |                |                | 10 Gbit/s†
2004     |              |               | 512 Mibit/chip | 3.2 GiB/s      |
2014     | 6 TB         | 220 MB/s‡     | 8 Gibit/chip   | 25.6 GiB/s     | 100 Gbit/s
Increase | ×1395        | ×24           | ×512           | ×98            | ×1000

‡ http://www.storagereview.com/seagate_enterprise_capacity_6tb_35_sas_hdd_review_v4

HDD: Seagate ST15150 (1994)†, Seagate 373453 (2004)†, Seagate ST6000NM0034 (2014)
DRAM: Fast Page DRAM (1993)†, DDR2-400 (2004), DDR4-3200 (2014)
Ethernet: Fast Ethernet IEEE 802.3u (1995)†, 10 GbitE IEEE 802.3ae (2003)†, 100 GbitE IEEE 802.3bj (2014)
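The "Increase" row can be recomputed from the endpoint values above, as a quick consistency check of the extracted numbers:

```python
# Ratios between the newest and oldest entry of each column.
ratios = {
    "HDD capacity":   6e12 / 4.3e9,       # 6 TB / 4.3 GB
    "HDD bandwidth":  220 / 9,            # 220 MB/s / 9 MB/s
    "DRAM capacity":  8 * 1024 / 16,      # 8 Gibit/chip / 16 Mibit/chip
    "DRAM bandwidth": 25.6 * 1024 / 267,  # 25.6 GiB/s / 267 MiB/s
    "Ethernet":       100e9 / 100e6,      # 100 Gbit/s / 100 Mbit/s
}
for name, r in ratios.items():
    print(f"{name}: x{r:.0f}")  # matches the Increase row of the table
```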

High-end SSD example: Hitachi FlashMAX III (2014), ≈2 TB, ≈2 GB/s

26 / 23


Data Integrity and File System Snapshots

∙ Hash tree with cryptographic hash function provides secure identifier for sub trees
∙ It is easy to sign a small hash value (data authenticity)
∙ Efficient calculation of changes (fast replication)
∙ Bonus: versioning and data de-duplication
∙ Full potential together with content-addressable storage
∙ Self-verifying data chunks, trivial to distribute and cache

[Merkle tree diagram: the root directory / stores hash(h_A, h_B) over its entries A and B; directory /A stores hash(h_α, h_β) =: h_A over files α and β; the empty directory /B stores hash(∅) =: h_B; the leaves /A/α and /A/β store hash(data_α) =: h_α and hash(data_β) =: h_β]

⊕ Content-addressable storage
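The hash tree on this slide can be written down directly. This sketch uses SHA-256 and invented leaf contents, following the slide's structure (directory /A with files α and β, empty directory /B):

```python
import hashlib

def h(*parts: bytes) -> bytes:
    """Cryptographic hash over the concatenation of the inputs."""
    return hashlib.sha256(b"".join(parts)).digest()

# Leaves: contents of files /A/alpha and /A/beta.
h_alpha = h(b"data_alpha")
h_beta = h(b"data_beta")

h_A = h(h_alpha, h_beta)  # directory hash over its children's hashes
h_B = h()                 # hash of the empty directory /B
root = h(h_A, h_B)        # signing this one value authenticates everything

# Changing any leaf changes the root: the basis for cheap change
# detection, snapshots, and de-duplication of identical subtrees.
assert root != h(h(h(b"data_alpha'"), h_beta), h_B)
```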

27 / 23

Bibliography I

Survey Articles

Satyanarayanan, M. (1990). A survey of distributed file systems. Annual Review of Computer Science, 4(1):73–104.

Guan, P., Kuhl, M., Li, Z., and Liu, X. (2000). A survey of distributed file systems. University of California, San Diego.

Agarwal, P. and Li, H. C. (2003). A survey of secure, fault-tolerant distributed file systems. URL: http://www.cs.utexas.edu/users/browne/cs395f2003/projects/LiAgarwalReport.pdf.

Thanh, T. D., Mohan, S., Choi, E., Kim, S., and Kim, P. (2008). A taxonomy and survey on distributed file systems. In Proc. int. conf. on Networked Computing and Advanced Information Management (NCM’08), pages 144–149.

Depardon, B., Séguin, C., and Mahec, G. L. (2013). Analysis of six distributed file systems. Technical Report hal-00789086, Université de Picardie Jules Verne.

28 / 23

Bibliography II

Donvito, G., Marzulli, G., and Diacono, D. (2014). Testing of several distributed file-systems (HDFS, Ceph and GlusterFS) for supporting the HEP experiment analysis. Journal of Physics: Conference Series, 513.

File Systems

Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., and Lyon, B. (1985). Design and implementation of the Sun network filesystem. In Proc. of the Summer USENIX conference, pages 119–130.

Morris, J. H., Satyanarayanan, M., Conner, M. H., Howard, J. H., Rosenthal, D. S. H., and Smith, F. D. (1986). Andrew: A distributed personal computing environment. Communications of the ACM, 29(3):184–201.

Hartman, J. H. and Ousterhout, J. K. (1995). The Zebra striped network file system. ACM Transactions on Computer Systems, 13(3):274–310.

29 / 23

Bibliography III

Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., and Zhao, B. (2000). OceanStore: An architecture for global-scale persistent storage. ACM SIGPLAN Notices, 35(11):190–201.

Quinlan, S. and Dorward, S. (2002). Venti: a new approach to archival storage. In Proc. of the 1st USENIX Conf. on File and Storage Technologies (FAST’02), pages 89–102.

Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. ACM SIGOPS Operating Systems Review, 37(5):29–43.

Schwan, P. (2003). Lustre: Building a file system for 1,000-node clusters. In Proc. of the 2003 Linux Symposium, pages 380–386.

Dorigo, A., Elmer, P., Furano, F., and Hanushevsky, A. (2005). XROOTD - a highly scalable architecture for data access. WSEAS Transactions on Computers, 4(4):348–353.

30 / 23

Bibliography IV

Weil, S. A. (2007). Ceph: reliable, scalable, and high-performance distributed storage. PhD thesis, University of California Santa Cruz.

31 / 23