A Survey of Distributed Technology

Jakob Blomer CERN PH/SFT and Stanford University

ACAT 2014 Prague

Agenda

Motivation: physics experiments store their data in distributed file systems.
In High Energy Physics:
∙ Global federation of file systems
∙ Hundreds of petabytes of data
∙ Hundreds of millions of objects

Outline

1 Usage of distributed file systems

2 Survey and taxonomy

3 Critical areas in distributed file systems for physics applications

4 Developments and future challenges

Distributed File Systems

A distributed file system (DFS) provides

1 persistent storage

2 of opaque data (files)

3 in a hierarchical namespace that is shared among networked nodes

∙ Files survive the lifetime of processes and nodes
∙ POSIX-like interface: open(), close(), read(), write(), ...
∙ Typically transparent to applications
∙ Data model and interface distinguish a DFS from a distributed (No-)SQL database or a distributed key-value store

Popular DFSs: AFS, Ceph, CernVM-FS, dCache, EOS, FhGFS, GlusterFS, GPFS, HDFS, Lustre, MooseFS, NFS, PanFS, XrootD

Use Cases and Demands

Data classes:
∙ Home folders
∙ Physics data: recorded, simulated, analysis results
∙ Software binaries
∙ Scratch area

[Figure: radar chart comparing the data classes along the dimensions change frequency, request rate (MB/s), request rate (IOPS), mean file size, volume, value, confidentiality, cache hit rate, and redundancy; data are illustrative]

Depending on the use case, the dimensions span orders of magnitude (logarithmic axes)

Distributed File Systems and Use Cases

> ls .
event_sample.root
analysis.C

File system: "please take special care of this file!"

> ls /
/home
/data
/software
/scratch        (implicit QoS)

∙ It is difficult to perform well under usage characteristics that differ by 4 orders of magnitude
∙ File system performance is highly susceptible to characteristics of individual applications
∙ There is no interface to specify quality of service (QoS) for a particular file
→ We will deal with a number of DFSs for the foreseeable future

POSIX Compliance

Essential file system operations: create(), unlink(), stat(), open(), close(), read(), write(), seek()

Difficult for DFSs: file locks, atomic rename(), open unlinked files
Impossible for DFSs: hard links, device files, IPC files

∙ No DFS is fully POSIX compliant
∙ It must provide just enough to not break applications
∙ Field test necessary
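The "field test" point can be made concrete: the sketch below probes two of the fragile semantics listed above on whatever file system backs a given directory. It is an illustrative check, not part of any DFS test suite, and `probe_posix_semantics` is a hypothetical helper name.

```python
import os
import tempfile

# Probe two POSIX behaviors that distributed file systems often break:
# atomic rename() over an existing file, and reading a file after unlink().
def probe_posix_semantics(directory):
    results = {}

    # Atomic rename: rename() must replace the destination in one step.
    src = os.path.join(directory, "new_version")
    dst = os.path.join(directory, "current")
    with open(src, "w") as f:
        f.write("v2")
    with open(dst, "w") as f:
        f.write("v1")
    os.rename(src, dst)  # POSIX requires atomic replacement of dst
    with open(dst) as f:
        results["rename"] = (f.read() == "v2")

    # Open unlinked file: data must stay readable until the last close.
    path = os.path.join(directory, "scratch_file")
    with open(path, "w") as f:
        f.write("payload")
    f = open(path)
    os.unlink(path)                                 # remove the name ...
    results["unlinked"] = (f.read() == "payload")   # ... data still readable
    f.close()
    return results

with tempfile.TemporaryDirectory() as d:
    print(probe_posix_semantics(d))
```

On a local POSIX file system both probes succeed; on a network file system either one may fail, which is exactly why a field test is necessary.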

Architecture Sketches

Network shares, client-server

[Diagram: several clients mount /share from a single server]

Goals: simplicity, separate storage from application
Example: NFS 3

Architecture Sketches

Namespace delegation

[Diagram: the /physics namespace with the /physics/ams subtree delegated to another server]

Goals: scaling network shares, decentralized administration
Example: AFS

Architecture Sketches

Object-based file system

[Diagram: clients send create()/delete() to a meta-data server and read()/write() directly to the data servers]

Goals: separate meta-data from data, incremental scaling
Example:

Architecture Sketches

Parallel file system

[Diagram: as in the object-based layout, but each file is striped across several data servers]

Goals: maximum throughput, optimized for large files
Example: Lustre

Architecture Sketches

Distributed meta-data

[Diagram: object-based layout in which the meta-data service itself is distributed across several servers]

Goals: avoid single point of failure and meta-data bottleneck
Example: Ceph

Architecture Sketches

Symmetric, peer-to-peer

[Diagram: a distributed hash table maps hash(path) to the hosts storing that path]

Goals: conceptual simplicity, inherently scalable
Difficult to deal with node churn, long lookup beyond LAN
Example: GlusterFS

Milestones in Distributed File Systems
Biased towards open-source, production file systems

1983 AFS · 1985 NFS · 1995 Zebra · 2000 OceanStore · 2002 Venti · 2003 GFS · 2005 XRootD · 2007 Ceph

AFS (Andrew File System)
∙ Client-server
∙ Roaming home folders
∙ Identity tokens and access control lists (ACLs)
∙ Decentralized operation ("Cells")

"AFS was the first safe and efficient distributed computing system, available [. . . ] on campus. It was a clear precursor to the Dropbox-like software packages today. [. . . ] [It] allowed students (like Drew Houston and Arash Ferdowsi) access to all their stuff from any connected computer."
http://www.wired.com/2011/12/backdrop-dropbox

NFS
∙ Client-server
∙ Focus on portability
∙ Separation of protocol and implementation
∙ Stateless servers
∙ Fast crash recovery

Sandberg, Goldberg, Kleiman, Walsh, Lyon (1985)

Zebra File System
∙ Striping and parity
∙ Redundant array of inexpensive nodes (RAIN)
∙ Log-structured data

[Figure: per-client striping: each client forms its new file data into a single append-only log and stripes this log across the storage servers; parity is computed for the log, not for individual files]

Hartman, Ousterhout (1995)

OceanStore
∙ Peer-to-peer
∙ "Global Scale": 10^10 users, 10^14 files
∙ Untrusted infrastructure
∙ Based on a peer-to-peer overlay network
∙ Nomadic data through aggressive caching
∙ Foundation for today's decentralized dropbox replacements

Kubiatowicz et al. (2000)
Venti
∙ Archival storage
∙ De-duplication through content-addressable storage
∙ Content hashes provide intrinsic file integrity
∙ Merkle trees verify the file system hierarchy

[Figure: storage growth of two Plan 9 file servers (bootes and emelie); stored on Venti, the cost of retaining daily snapshots is almost zero]

Quinlan and Dorward (2002)

Google File System (GFS)
∙ Object-based
∙ Co-designed for map-reduce
∙ Coalesce storage and compute nodes
∙ Serialize data access

[Figure source: Google AppEngine documentation]

XRootD
∙ Namespace delegation: a client's open() is redirected through a global tree of redirectors (64 nodes at the first level, 64^2 = 4096 at the second), each an xrootd/cmsd pair
∙ Flexibility through decomposition into pluggable components
∙ Namespace independent from data access

Hanushevsky (2013)

Ceph File System and RADOS
∙ Parallel, distributed meta-data
∙ Peer-to-peer file system at the cluster scale
∙ Data placement across failure domains
∙ Adaptive workload distribution

[Figure: CRUSH placement: a rule walks a cluster map of rows, cabinets, and disks; e.g. take(root), select(1,row), select(3,cabinet), select(1,disk), emit distributes three replicas across three cabinets in the same row]

Weil (2007)
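The CRUSH rule above can be sketched in code: at each level of the cluster map, a rule pseudo-randomly selects n distinct children per input item, so replicas land in distinct failure domains without any lookup table. The cluster map and helper names below are hypothetical, and the real CRUSH algorithm (bucket types, weights) is considerably more elaborate.

```python
import hashlib

# CRUSH-style hierarchical placement sketch.
def _choose(children, key, attempt):
    i = int(hashlib.sha256(f"{key}/{attempt}".encode()).hexdigest(), 16)
    return children[i % len(children)]

def select(cluster, items, n, obj):
    """For each input item, pick n distinct children, pseudo-randomly
    but deterministically per object."""
    out = []
    for item in items:
        children = sorted(cluster[item])
        picked, attempt = [], 0
        while len(picked) < n:
            c = _choose(children, f"{obj}@{item}", attempt)
            if c not in picked:      # collision: retry with a fresh draw
                picked.append(c)
            attempt += 1
        out.extend(picked)
    return out

# Hypothetical map: take(root) -> select(1,row) -> select(3,cabinet)
#                   -> select(1,disk) -> emit
cluster = {
    "root": ["row1", "row2"],
    "row1": ["cab11", "cab12", "cab13"], "row2": ["cab21", "cab22", "cab23"],
    "cab11": ["d1", "d2"],  "cab12": ["d3", "d4"],   "cab13": ["d5", "d6"],
    "cab21": ["d7", "d8"],  "cab22": ["d9", "d10"],  "cab23": ["d11", "d12"],
}
row = select(cluster, ["root"], 1, "objA")
cabs = select(cluster, row, 3, "objA")
disks = select(cluster, cabs, 1, "objA")  # three replicas, three cabinets
print(disks)
```

Because placement is a pure function of the object name and the cluster map, any client can compute where a replica lives, which is what makes the meta-data path scalable.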

A Taxonomy

[Mind map of techniques found in distributed file systems:]
∙ Scalability: namespace delegation, distributed hash tables, workload adaptation
∙ Fault-tolerance: consensus, replication, erasure codes
∙ Security, integrity: checksumming, encryption, Merkle trees
∙ Efficiency: caching, striping, log-structured data, content-addressable storage, de-duplication, snapshots

Critical Areas in DFSs for Physics Applications

Fault-Tolerance
Fundamental problems as the number of components grows:

1 Faults are the norm

2 Faults are often correlated

3 No safe way to distinguish temporary unavailability from faults

Bandwidth Utilization
∙ Data structures that work throughout the memory hierarchy
∙ Efficient writing of small files: analysis result merging, meta-data

Fault Tolerance and Data Reliability

Data Reliability Techniques
∙ Replication: simple and fast but large storage overhead; trend from random placement to "de-correlation"
∙ Erasure codes: any n + ε out of n + k blocks reconstruct the data; different codes offer different trade-offs between computational complexity and storage overhead
∙ Checksums: detect silent data corruption

Engineering Challenges
∙ Fault detection
∙ Automatic and fast recovery
∙ Failure prediction, e. g. based on MTTF and Markov models
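The trade-off space can be illustrated at its cheapest corner: single-parity XOR, an erasure code with k = 1. It is trivial to compute but tolerates only one lost block; production systems typically use e.g. Reed-Solomon codes for larger k. A toy sketch:

```python
# Toy erasure code: n data blocks plus one XOR parity block (k = 1).
def encode(blocks):
    """Append one parity block: the XOR of all (equal-length) data blocks."""
    parity = bytes(len(blocks[0]))
    for b in blocks:
        parity = bytes(x ^ y for x, y in zip(parity, b))
    return blocks + [parity]

def reconstruct(stored):
    """Recover the single missing block (marked None) as the XOR of the
    surviving blocks, then return the n data blocks."""
    missing = stored.index(None)
    acc = bytes(len(next(b for b in stored if b is not None)))
    for b in stored:
        if b is not None:
            acc = bytes(x ^ y for x, y in zip(acc, b))
    out = list(stored)
    out[missing] = acc
    return out[:-1]  # drop the parity block

data = [b"aaaa", b"bbbb", b"cccc"]   # n = 3 data blocks
stored = encode(data)                # n + k = 4 blocks on 4 nodes
stored[1] = None                     # one node fails
assert reconstruct(stored) == data
```

Storage overhead here is 1/n versus 2x or 3x for replication, which is exactly the trade against the (here minimal) reconstruction cost.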

Fault Tolerance and Data Reliability

Example from Google data centers
∙ Statistical separation of temporary and permanent faults
∙ Less than 10 % of node unavailability events last longer than 15 minutes, so recovery typically starts only after a 15-minute wait

[Figure: cumulative distribution function of the duration of node unavailability periods]

Ford et al. (2010)

Log-Structured Data

Idea: store all modifications in a change log; use the full hardware bandwidth even for small objects

[Diagram: Unix file system (directory → inode → file on disk) vs. log-structured file system (directory → index → append-only log on disk)]

Used by
∙ Zebra experimental DFS
∙ Commercial filers (e. g. NetApp)
∙ Key-value stores
∙ File systems for …

Advantages
∙ Minimal seek, no in-place updates
∙ Fast and robust crash recovery
∙ Efficient allocation in DRAM, flash, and disks
∙ Applicable for merging, meta-data
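The scheme sketched above, writes append to a single log while an in-memory index maps keys to log offsets, can be shown in a few lines. Recovery is a sequential replay of the log, which is what makes crashes cheap. A minimal sketch with a made-up record format (`LogStore` is hypothetical; real systems add checksums, segments, and a cleaner):

```python
import io

# Minimal log-structured key-value store.
class LogStore:
    def __init__(self):
        self.log = io.BytesIO()   # stand-in for an append-only file
        self.index = {}           # key -> offset of the latest record

    def put(self, key: str, value: bytes):
        offset = self.log.seek(0, io.SEEK_END)   # append, never overwrite
        rec = key.encode() + b"\x00" + value
        self.log.write(len(rec).to_bytes(4, "big") + rec)
        self.index[key] = offset

    def get(self, key: str) -> bytes:
        self.log.seek(self.index[key])
        n = int.from_bytes(self.log.read(4), "big")
        _, value = self.log.read(n).split(b"\x00", 1)
        return value

    @classmethod
    def recover(cls, raw: bytes):
        """Rebuild the index by replaying the log sequentially."""
        store = cls()
        store.log = io.BytesIO(raw)
        pos = 0
        while pos < len(raw):
            n = int.from_bytes(raw[pos:pos + 4], "big")
            key, _ = raw[pos + 4:pos + 4 + n].split(b"\x00", 1)
            store.index[key.decode()] = pos      # later records win
            pos += 4 + n
        return store

s = LogStore()
s.put("a", b"v1")
s.put("a", b"v2")   # an update is a new record; the old one stays in the log
r = LogStore.recover(s.log.getvalue())
assert r.get("a") == b"v2"
```

Note how many small writes become one sequential stream, which is why the technique applies to the small-file merging problem mentioned earlier.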

Do we get what we need?

Typical physics experiment cluster
∙ Up to 1 000 standard nodes
∙ 1 GbE or 10 GbE network, high bisection bandwidth

Goals for a DFS for analysis applications
∙ At least 90 % of the disk capacity available
∙ At least 50 % of the maximum aggregated throughput
∙ Fault-tolerant to a small number of disk/node failures
∙ Symmetric architecture, fully decentralized

Bruce Allen, MPI for Gravitational Physics (2009) Link

19 / 23

Complexity and Decomposition in Distributed File Systems

For a DFS: At least 5 years from inception to widespread adoption

Complexity
∙ Open source DFSs comprise some 100 kLOC to 750 kLOC
∙ It takes a community effort to stabilize them
∙ Once established, it can become prohibitively expensive to move to a different DFS (data lock-in)

20 / 23

Decomposition
∙ Good track record of “outsourcing” tasks, e. g. authentication (Kerberos), distributed coordination (ZooKeeper)
∙ Ongoing: separation of namespace and data access
∙ Increases the number of standards and interfaces and temporarily increases the effort on the development side
⊕ Faster adaptation to a changing computing landscape

Distributed File Systems in the Exascale

∙ “Exascale” computing (EB of data, 10^18 ops/s) envisaged by 2020
∙ Storage capacity and bandwidth scale at different pace
∙ It will be more difficult or impossible to constantly “move data in and out”
∙ Ethernet bandwidth scaled similarly to capacity
∙ Yielding segregation of storage and computing to a symmetric DFS can be part of a solution

[Chart: relative increase (×1 to ×1000) of bandwidth and capacity for HDD, DRAM, and Ethernet, 1994–2014]

21 / 23
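The capacity/bandwidth gap in the chart can be made concrete with the HDD figures from the backup slide: the time to read one disk end-to-end grew from minutes to hours over those two decades.

```python
# Capacity has grown much faster than bandwidth, so the time to stream a
# full disk keeps increasing (HDD figures from the backup slide).
hdd_1994 = {"capacity_gb": 4.3,    "bandwidth_mb_s": 9}
hdd_2014 = {"capacity_gb": 6000.0, "bandwidth_mb_s": 220}

def full_scan_hours(disk):
    """Hours needed to read the whole disk at its sequential bandwidth."""
    return disk["capacity_gb"] * 1000 / disk["bandwidth_mb_s"] / 3600

print(round(full_scan_hours(hdd_1994), 2))  # 0.13 h, about 8 minutes
print(round(full_scan_hours(hdd_2014), 2))  # 7.58 h
```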



It will be more difficult or Increase ∙ impossible to constantly “move data in and out” ×10 Ethernet bandwidth scaled Bandwidth ∙ similarly to capacity ∙  Capacity Yielding segregation of ∙ ×1 storage and computing to 1994 2004 2014 a symmetric DFS can be Year part of a solution 21 / 23 Checkpointing is the state-of-the-art technique for delivering requirements are modeled to include 7 days of computing on the fault tolerance for high-performance computing (HPC) on large- entire HEC system. By 1M nodes, checkpointing with parallel file scale systems. It has been shown that writing the state of a process systems (red line with circle symbol) has a complete performance to persistent storage is the largest contributor to the performance collapse (as was predicted from Figure 5). Note that the overhead of checkpointing [12]. Let's assume that parallel file hypothetical distributed file system (blue line with hollow circle systems are expected to continue to improve linearly in symbol) could still have 90%+ uptime even at exascale levels. throughput with the growing number of nodes (an optimistic assumption); however, note that per node memory (the amount of 3. RADICAL NEW VISION state that is to be saved) will likely continue to grow We believe the HEC community needs to develop both the exponentially. Assume the MTTF is modeled after the BlueGene theoretical and practical aspects of building efficient and scalable with a claimed 1000 years MTTF per node (another optimistic distributed storage for HEC systems that will scale four orders of assumption). Figure 5 shows the expectedDistributed MTTF of about File 7 days Systems magnitude in in the concurrency. 
Exascale We believe that emerging distributed for a 0.8 petaflop IBM BlueGene/P with 64K nodes, while the file systems could co-exist with existing parallel file systems, and checkpointing overhead is about 34 minutes; at 1M nodes (~1 could be optimized to support a subset of workloads critical for exaflop), the MTTF would be only 8.4 hours, while the HPC“Exascale” and MTC computing workloads at exascale. There have been other ∙ 18 checkpointing overheadExample would be over from 9 HPChours. distributed(EB of filedata, systems 10 ops/s) proposed in the past, such as Google's GFS [13] and Yahoo's HDFS [14]; however these have not been 1000 envisaged by 2020 SystemMTTF(hours) widely adopted in HEC due to the workloads, data access patterns, CheckpointOverhead(hours) and supported interfaces (POSIX [15]) being quite different.

(hours) Current file systems lack scalable distributed metadata 100 management,Storage capacity and well anddefined interfaces to expose data locality

(hours) ∙for general computing frameworks that are not constrained by the bandwidth scale at map-reduce model to allow data-aware job scheduling with batch 10 Overhead different pace schedulers (e.g. PBS [16], SGE [17], Condor [18], Falkon [19]). MTTF WeIt believe will be that more future difficult HEC systems or should be designed with non- 1 ∙volatile memory (e.g. solid state memory [20], phase change

System impossible to constantly memory [21]) on every compute node (Figure 7); every compute “move data in and out” node would actively participate in the metadata and data Checkpointing 0.1 management,Ethernet bandwidth leveraging the scaled abundance of computational power ∙many-core processors will have and the many orders of magnitude highersimilarly bisection to capacity bandwidth in multi-dimensional torus networks as compared to available cost effective bandwidth into remote SystemScale(#ofNodes) networkYielding persistent segregation storage. of This shift in large-scale storage Figure 5: Expected checkpointing cost and MTTF towards ∙architecturestorage and design computing will lead to to improving application Raicu, Foster, Beckmanexascale (2011) Link performancea symmetric and DFS scalability can be for some of the most demanding applications. This shift in design is controversial as it requires Simulations results (see Figure 6) from 1K nodes to 2M nodes part of a solution storage architectures in HEC to be redefined from their traditional (simulating HEC from 2000 to 2019) show the application uptime architectures of the past several decades. This new approach was collapse for capability HEC systems. Today, 20% or more of the 21 / 23 not feasible up to recently due to the unreliable spinning disk computing capacity in a large HPC system is wasted due to technology [22] that has dominated the persistent storage space failures and recoveries [12], which is confirmed by the simulation since the dawn of supercomputing. However, the advancements in results. solid-state memory (with MTTF of over two million hours [23]) 100% have opened up opportunities to rethink storage systems for HEC, 90% 2000 2009 distributing storage on every compute node, without sacrificing BG/L BG/P node reliability or power consumption. 
The benefits of this new 80% 1024nodes 73,728nodes % architecture lies in its enabling of some workloads to scale near- 70% 2007 linearly with systems scales by leveraging data locality and the BG/L 60% 106,496nodes full network bisection bandwidth. Uptime 50% 40% 30%

Application 20% NoCheckpointing 2019 10% CheckpointingtoParallelFileSystem CheckpointingtoDistributedFileSystem ~1,000,000nodes 0%

Scale(#ofnodes) Figure 6: Simulation application uptime towards exascale The distributed file system shown in the simulation results is a hypothetical file system that could scale near linearly by Figure 7: Proposed storage architecture with persistent local preserving the data locality in checkpointing. The application storage on every compute node
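The failure model quoted in the excerpt can be sketched directly: assuming independent node failures, the system-level MTTF shrinks linearly with node count. With the excerpt's 1000-year per-node MTTF this simple model lands in the same range as the quoted figures (8.4 hours at ~1M nodes; the paper's machine model differs slightly).

```python
HOURS_PER_YEAR = 365.25 * 24

def system_mttf_hours(node_mttf_years, nodes):
    # With independent, exponentially distributed node failures, the
    # expected time to the first failure anywhere in the system shrinks
    # linearly with the number of nodes.
    return node_mttf_years * HOURS_PER_YEAR / nodes

# 1000-year per-node MTTF, as assumed in the excerpt.
print(round(system_mttf_hours(1000, 64 * 1024)))     # ~134 h, a few days
print(round(system_mttf_hours(1000, 1_000_000), 1))  # ~8.8 h at ~1M nodes
```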

The Road Ahead


Data Sharing: Distributed File Systems
∙ Seamless integration with applications
∙ Feature-rich: quotas, permissions, links
∙ Globally federated namespace

Building Blocks

Data Provisioning: Key-Value and Blob Stores
∙ Very good scalability on local networks
∙ Flat namespace + search facility
∙ Turns out that certain types of data don’t need a hierarchical namespace, e. g. cached objects, media, VM images

22 / 23


Conclusion

1 Distributed file systems are here to stay
∙ physics data processing applications use file systems
∙ the hierarchical namespace is a natural way to organize data

2 Hard disks become data silos
∙ We need to focus on optimal bandwidth utilization
∙ Once written, we have to leave data where they are → storage and compute nodes coalesce

3 Every now and then, a new file system comes along — they are all assembled from the same technology toolbox
∙ We need solidly engineered building blocks from this toolbox
∙ We need to validate early with our real application workload

23 / 23

http://lonelychairsatcern.tumblr.com

Backup Slides

Source of Hardware Bandwidth and Capacity Numbers

Method and entries marked † from Patterson (2004) Link

Year     | HDD Capacity | HDD Bandwidth | DRAM Capacity  | DRAM Bandwidth | Ethernet Bandwidth
1993     |              |               | 16 Mibit/chip† | 267 MiB/s†     |
1994     | 4.3 GB†      | 9 MB/s†       |                |                |
1995     |              |               |                |                | 100 Mbit/s†
2003     | 73.4 GB†     | 86 MB/s†      |                |                | 10 Gbit/s†
2004     |              |               | 512 Mibit/chip | 3.2 GiB/s      |
2014     | 6 TB         | 220 MB/s‡     | 8 Gibit/chip   | 25.6 GiB/s     | 100 Gbit/s
Increase | ×1395        | ×24           | ×512           | ×98            | ×1000

‡ http://www.storagereview.com/seagate_enterprise_capacity_6tb_35_sas_hdd_review_v4

HDD: Seagate ST15150 (1994)†, Seagate 373453 (2004)†, Seagate ST6000NM0034 (2014)
DRAM: Fast Page DRAM (1993)†, DDR2-400 (2004), DDR4-3200 (2014)
Ethernet: Fast Ethernet IEEE 802.3u (1995)†, 10 GbitE IEEE 802.3ae (2003)†, 100 GbitE IEEE 802.3bj (2014)
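The "Increase" row can be recomputed from the endpoint values above, as a quick consistency check of the extracted numbers:

```python
# Ratios between the newest and oldest entry of each column.
ratios = {
    "HDD capacity":   6e12 / 4.3e9,       # 6 TB / 4.3 GB
    "HDD bandwidth":  220 / 9,            # 220 MB/s / 9 MB/s
    "DRAM capacity":  8 * 1024 / 16,      # 8 Gibit/chip / 16 Mibit/chip
    "DRAM bandwidth": 25.6 * 1024 / 267,  # 25.6 GiB/s / 267 MiB/s
    "Ethernet":       100e9 / 100e6,      # 100 Gbit/s / 100 Mbit/s
}
for name, r in ratios.items():
    print(f"{name}: x{r:.0f}")  # matches the Increase row of the table
```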

High-end SSD example: Hitachi FlashMAX III (2014), ≈2 TB, ≈2 GB/s

26 / 23


Data Integrity and File System Snapshots

∙ Hash tree with cryptographic hash function provides secure identifier for sub trees
∙ It is easy to sign a small hash value (data authenticity)
∙ Efficient calculation of changes (fast replication)
∙ Bonus: versioning and data de-duplication
∙ Full potential together with content-addressable storage
∙ Self-verifying data chunks, trivial to distribute and cache

[Merkle tree diagram: the root directory / stores hash(h_A, h_B) over its entries A and B; directory /A stores hash(h_α, h_β) =: h_A over files α and β; the empty directory /B stores hash(∅) =: h_B; the leaves /A/α and /A/β store hash(data_α) =: h_α and hash(data_β) =: h_β]

⊕ Content-addressable storage
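The hash tree on this slide can be written down directly. This sketch uses SHA-256 and invented leaf contents, following the slide's structure (directory /A with files α and β, empty directory /B):

```python
import hashlib

def h(*parts: bytes) -> bytes:
    """Cryptographic hash over the concatenation of the inputs."""
    return hashlib.sha256(b"".join(parts)).digest()

# Leaves: contents of files /A/alpha and /A/beta.
h_alpha = h(b"data_alpha")
h_beta = h(b"data_beta")

h_A = h(h_alpha, h_beta)  # directory hash over its children's hashes
h_B = h()                 # hash of the empty directory /B
root = h(h_A, h_B)        # signing this one value authenticates everything

# Changing any leaf changes the root: the basis for cheap change
# detection, snapshots, and de-duplication of identical subtrees.
assert root != h(h(h(b"data_alpha'"), h_beta), h_B)
```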

27 / 23

Bibliography I

Survey Articles

Satyanarayanan, M. (1990). A survey of distributed file systems. Annual Review of Computer Science, 4(1):73–104.

Guan, P., Kuhl, M., Li, Z., and Liu, X. (2000). A survey of distributed file systems. University of California, San Diego.

Agarwal, P. and Li, H. C. (2003). A survey of secure, fault-tolerant distributed file systems. URL: http://www.cs.utexas.edu/users/browne/cs395f2003/projects/LiAgarwalReport.pdf.

Thanh, T. D., Mohan, S., Choi, E., Kim, S., and Kim, P. (2008). A taxonomy and survey on distributed file systems. In Proc. int. conf. on Networked Computing and Advanced Information Management (NCM’08), pages 144–149.

Depardon, B., Séguin, C., and Mahec, G. L. (2013). Analysis of six distributed file systems. Technical Report hal-00789086, Université de Picardie Jules Verne.

28 / 23

Bibliography II

Donvito, G., Marzulli, G., and Diacono, D. (2014). Testing of several distributed file-systems (HDFS, Ceph and GlusterFS) for supporting the HEP experiment analysis. Journal of Physics: Conference Series, 513.

File Systems

Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., and Lyon, B. (1985). Design and implementation of the Sun network filesystem. In Proc. of the Summer USENIX conference, pages 119–130.

Morris, J. H., Satyanarayanan, M., Conner, M. H., Howard, J. H., Rosenthal, D. S. H., and Smith, F. D. (1986). Andrew: A distributed personal computing environment. Communications of the ACM, 29(3):184–201.

Hartman, J. H. and Ousterhout, J. K. (1995). The Zebra striped network file system. ACM Transactions on Computer Systems, 13(3):274–310.

29 / 23

Bibliography III

Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., and Zhao, B. (2000). OceanStore: An architecture for global-scale persistent storage. ACM SIGPLAN Notices, 35(11):190–201.

Quinlan, S. and Dorward, S. (2002). Venti: a new approach to archival storage. In Proc. of the 1st USENIX Conf. on File and Storage Technologies (FAST’02), pages 89–102.

Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. ACM SIGOPS Operating Systems Review, 37(5):29–43.

Schwan, P. (2003). Lustre: Building a file system for 1,000-node clusters. In Proc. of the 2003 Linux Symposium, pages 380–386.

Dorigo, A., Elmer, P., Furano, F., and Hanushevsky, A. (2005). XROOTD - a highly scalable architecture for data access. WSEAS Transactions on Computers, 4(4):348–353.

30 / 23

Bibliography IV

Weil, S. A. (2007). Ceph: reliable, scalable, and high-performance distributed storage. PhD thesis, University of California Santa Cruz.

31 / 23