
A Survey of Distributed File System Technology
Jakob Blomer
CERN PH/SFT and Stanford University
ACAT 2014, Prague

Agenda

Motivation: physics experiments store their data in distributed file systems. In High Energy Physics this means
∙ a global federation of file systems,
∙ hundreds of petabytes of data,
∙ hundreds of millions of objects.

Outline
1 Usage of distributed file systems
2 Survey and taxonomy
3 Critical areas in distributed file systems for physics applications
4 Developments and future challenges

Distributed File Systems

A distributed file system (DFS) provides
1 persistent storage
2 of opaque data (files)
3 in a hierarchical namespace that is shared among networked nodes.

∙ Files survive the lifetime of processes and nodes
∙ POSIX-like interface: open(), close(), read(), write(), ...
∙ Typically transparent to applications (see the sketch below)
∙ The data model and interface distinguish a DFS from a distributed (No-)SQL database or a distributed key-value store

Popular DFSs: AFS, Ceph, CernVM-FS, dCache, EOS, FhGFS, GlusterFS, GPFS, HDFS, Lustre, MooseFS, NFS, PanFS, XRootD
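Transparency means that an application uses exactly the same system calls whether a path lives on local disk or on a network mount. A minimal sketch in Python, assuming a hypothetical DFS mount point /dfs (the path is illustrative; the calls are the standard POSIX ones exposed by the os module):

    import os

    # Hypothetical path on a DFS mount; the code is identical for a
    # local path such as /tmp/event_sample.root.
    path = "/dfs/physics/event_sample.root"

    # Plain POSIX calls: open(), write(), close()
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    os.write(fd, b"sample payload")
    os.close(fd)

    # ... and read() back
    fd = os.open(path, os.O_RDONLY)
    data = os.read(fd, 4096)
    os.close(fd)

    # stat() may translate into a round trip to a meta-data server
    # instead of a local disk lookup, but the interface is unchanged.
    info = os.stat(path)
    print(info.st_size, "bytes")

Whether these calls are served by a local disk, an NFS share, or a parallel file system is invisible to the program; only latency, throughput, and failure modes differ.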
Use Cases and Demands

Data classes:
∙ Home folders
∙ Physics data: recorded, simulated, analysis results
∙ Software binaries
∙ Scratch area

[Radar charts comparing the data classes along the dimensions change frequency, request rate (MB/s), request rate (IOPS), mean file size, value, volume, confidentiality, cache hit rate, and redundancy; data are illustrative]

Depending on the use case, these dimensions span orders of magnitude (logarithmic axes).

Distributed File Systems and Use Cases

∙ It is difficult to perform well under usage characteristics that differ by 4 orders of magnitude
∙ File system performance is highly susceptible to the characteristics of individual applications
∙ There is no interface to specify quality of service (QoS) for a particular file:

    > ls .
    event_sample.root
    analysis.C

  File system: "please take special care of this file!"

∙ Instead, QoS is implicit in the directory layout:

    > ls /
    /home
    /data
    /software
    /scratch

∙ We will deal with a number of DFSs for the foreseeable future

POSIX Compliance

File system operations:
∙ Essential: create(), unlink(), stat(), open(), close(), read(), write(), seek()
∙ Difficult for DFSs: file locks, atomic rename(), open unlinked files, hard links
∙ Impossible for DFSs: device files, IPC files

∙ No DFS is fully POSIX compliant
∙ It must provide just enough not to break applications
∙ Field test necessary

Architecture Sketches

Network shares, client-server
∙ Goals: simplicity, separate storage from application
∙ Example: NFS 3

Namespace delegation (e.g. /physics with a delegated /physics/ams subtree)
∙ Goals: scaling network shares, decentralized administration
∙ Example: AFS

Object-based file system (meta-data servers handle create() and delete(), data servers handle read() and write())
∙ Goals: separate meta-data from data, incremental scaling
∙ Example: Google File System

Parallel file system
∙ Goals: maximum throughput, optimized for large files
∙ Example: Lustre

Distributed meta-data
∙ Goals: avoid single point of failure and meta-data bottleneck
∙ Example: Ceph

Symmetric, peer-to-peer: a distributed hash table maps hash(path) to the hosts storing that path (see the sketch below)
∙ Goals: conceptual simplicity, inherently scalable
∙ Difficult to deal with node churn; long lookups beyond the LAN
∙ Example: GlusterFS
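A minimal sketch of the peer-to-peer placement idea in Python, assuming a hypothetical set of storage hosts and using consistent hashing on a ring (the host names and the hashing scheme are illustrative; GlusterFS uses its own elastic hashing algorithm):

    import bisect
    import hashlib

    def ring_hash(key: str) -> int:
        """Map a string to a position on a 64-bit hash ring."""
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    class ConsistentHashRing:
        """Place each path on the first host at or after hash(path) on the ring."""

        def __init__(self, hosts, vnodes_per_host=64):
            # Virtual nodes smooth out the load across hosts.
            self._ring = sorted(
                (ring_hash(f"{host}#{i}"), host)
                for host in hosts
                for i in range(vnodes_per_host)
            )
            self._keys = [k for k, _ in self._ring]

        def lookup(self, path: str) -> str:
            idx = bisect.bisect(self._keys, ring_hash(path)) % len(self._ring)
            return self._ring[idx][1]

    # Hypothetical storage hosts: any client can compute the placement
    # locally, without contacting a meta-data server.
    ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
    print(ring.lookup("/physics/ams/event_sample.root"))

Because the lookup is a purely local computation, the design scales without a central meta-data service; the price is that node churn remaps part of the namespace and that multi-hop lookups become slow outside a LAN.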
Milestones in Distributed File Systems
(biased towards open-source, production file systems)

1983 AFS, 1985 NFS, 1995 Zebra, 2000 OceanStore, 2002 Venti, 2003 GFS, 2005 XRootD, 2007 Ceph

Andrew File System (AFS, 1983)
∙ Client-server
∙ Roaming home folders
∙ Identity tokens and access control lists (ACLs)
∙ Decentralized operation ("cells")

"AFS was the first safe and efficient distributed computing system, available [...] on campus. It was a clear precursor to the Dropbox-like software packages today. [...] [It] allowed students (like Drew Houston and Arash Ferdowsi) access to all their stuff from any connected computer."
http://www.wired.com/2011/12/backdrop-dropbox

Network File System (NFS, 1985)
∙ Client-server
∙ Focus on portability
∙ Separation of protocol and implementation
∙ Stateless servers
∙ Fast crash recovery
Sandberg, Goldberg, Kleiman, Walsh, Lyon (1985)

Zebra File System (1995)
∙ Striping and parity (see the sketch below)
∙ Redundant array of inexpensive nodes (RAIN)
∙ Log-structured data
Hartman, Ousterhout (1995)

[Figures from the Zebra paper: a schematic of clients, storage servers, file manager, and stripe cleaner, and an illustration of per-client striping. Each client batches its new file data into a single append-only log and stripes that log across the storage servers; parity is computed per log stripe, not per file. The stripe cleaner reclaims unused space, analogous to the cleaner in a log-structured file system.]
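To make the Zebra idea concrete, here is a minimal Python sketch of per-client log striping with XOR parity, under assumed parameters (three data fragments plus one parity fragment per stripe, a fixed fragment size); it illustrates the scheme, not Zebra's actual implementation:

    from functools import reduce

    FRAGMENT_SIZE = 4   # bytes per fragment; tiny, for illustration only
    STRIPE_WIDTH = 3    # data fragments per stripe; parity adds one more

    def xor_parity(fragments):
        """Parity fragment: byte-wise XOR across the data fragments."""
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*fragments))

    def stripe_log(log: bytes):
        """Cut a client's append-only log into stripes of data + parity fragments."""
        stripes = []
        stripe_bytes = FRAGMENT_SIZE * STRIPE_WIDTH
        for offset in range(0, len(log), stripe_bytes):
            chunk = log[offset:offset + stripe_bytes].ljust(stripe_bytes, b"\0")
            fragments = [chunk[i:i + FRAGMENT_SIZE]
                         for i in range(0, stripe_bytes, FRAGMENT_SIZE)]
            stripes.append(fragments + [xor_parity(fragments)])
        return stripes

    # The log batches data from several small files; parity is computed
    # per stripe of the log, not per file.
    log = b"file-A-data" + b"file-B" + b"file-C-more-data"
    for i, stripe in enumerate(stripe_log(log)):
        print(f"stripe {i}: {len(stripe) - 1} data fragments + 1 parity fragment")

Any single lost fragment of a stripe can be reconstructed by XOR-ing the remaining ones, which is what allows a redundant array of inexpensive nodes to tolerate the failure of one storage server per stripe.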