Experiences on File Systems
Which is the best for you?

Jakob Blomer CERN PH/SFT

CHEP 2015 Okinawa, Japan


Why Distributed File Systems

Physics experiments store their files in a variety of file systems for a good reason

∙ The file system interface is portable: we can take our local analysis application and run it anywhere on a big data set
∙ The file system as a storage abstraction is a sweet spot between data flexibility and data organization


We like file systems for their rich and standardized interface. We struggle to find an optimal implementation of that interface.

Shortlist of Distributed File Systems

[Slide shows logos of the surveyed distributed file systems, among them the Quantcast File System and CernVM-FS]

Agenda

1 What do we want from a distributed file system?

2 Sorting and searching the file system landscape

3 Technology trends and future challenges

Can one size fit all?

[Radar chart comparing data classes along several dimensions. Data classes: home folders, physics data (recorded, simulated, analysis results), software binaries, scratch area. Dimensions: change frequency, throughput (MB/s), IOPS, mean file size, volume, data value, confidentiality, cache hit rate, redundancy. Data are illustrative.]


Depending on the use case, the dimensions span orders of magnitude

POSIX Interface

File system operations

∙ Essential: create(), unlink(), stat(), open(), close(), read(), write(), seek()
∙ Often difficult for DFSs: file locks, write-through, atomic rename(), file ownership, extended attributes, unlinking opened files, symbolic links and hard links, device files, IPC files
∙ Missing: physical file location, file properties, file temperature, ...

∙ No DFS is fully POSIX compliant
∙ It must provide just enough to not break applications
∙ Often this can only be discovered by testing (see the probe sketch below)
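Because compliance gaps usually only show up under test, a small probe along the following lines can be useful. This is an illustrative sketch, not from the talk; the mount point /mnt/dfs and the probe file names are placeholders.

#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(int argc, char **argv) {
    const char *dir = (argc > 1) ? argv[1] : "/mnt/dfs";   /* placeholder mount point */
    char a[4096], b[4096];
    snprintf(a, sizeof(a), "%s/.posix_probe_a", dir);
    snprintf(b, sizeof(b), "%s/.posix_probe_b", dir);

    int fd = open(a, O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Advisory whole-file lock: often unsupported or a no-op on network file systems */
    if (flock(fd, LOCK_EX | LOCK_NB) != 0)
        perror("flock");
    else
        puts("flock: ok");

    /* POSIX requires rename() to atomically replace the target */
    if (rename(a, b) != 0)
        perror("rename");
    else
        puts("rename: ok");

    close(fd);
    unlink(b);
    return 0;
}

On a local disk file system both checks typically pass; on some network and FUSE-based mounts flock() may fail or silently do nothing.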


Application-Defined File Systems?

Mounted file system:

    FILE *f = fopen("susy.dat", "r");
    while (...) {
        fread(...);
        ...
    }
    fclose(f);

∙ Application independent from the file system
∙ Allows for standard tools (ls, grep, ...)
∙ System administrator selects the file system

File system library:

    hdfsFS fs = hdfsConnect("default", 0);
    hdfsFile f = hdfsOpenFile(fs, "susy.dat", ...);
    while (...) {
        hdfsRead(fs, f, ...);
        ...
    }
    hdfsCloseFile(fs, f);

∙ Performance-tuned API
∙ Requires code changes
∙ Application selects the file system

What we ideally want is an application-defined, mountable file system: Fuse, Parrot (sketch below)
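To make the FUSE route concrete, here is a minimal sketch of a mountable, application-defined file system written against the libfuse 2.x API. The file name susy.dat and its contents are placeholders; a real implementation would fetch data from a remote store instead of a static string. Standard tools (ls, cat, grep) then work on the mount point without any code changes in the application.

/* Minimal sketch: a read-only FUSE file system exposing a single file. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>

static const char *path_name = "/susy.dat";
static const char *contents  = "event data would go here\n";

static int fs_getattr(const char *path, struct stat *st) {
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, path_name) == 0) {
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = strlen(contents);
    } else {
        return -ENOENT;
    }
    return 0;
}

static int fs_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                      off_t offset, struct fuse_file_info *fi) {
    (void)offset; (void)fi;
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);
    filler(buf, path_name + 1, NULL, 0);
    return 0;
}

static int fs_open(const char *path, struct fuse_file_info *fi) {
    if (strcmp(path, path_name) != 0)
        return -ENOENT;
    if ((fi->flags & O_ACCMODE) != O_RDONLY)
        return -EACCES;
    return 0;
}

static int fs_read(const char *path, char *buf, size_t size, off_t offset,
                   struct fuse_file_info *fi) {
    (void)fi;
    size_t len = strlen(contents);
    if (strcmp(path, path_name) != 0)
        return -ENOENT;
    if ((size_t)offset >= len)
        return 0;
    if (offset + size > len)
        size = len - offset;
    memcpy(buf, contents + offset, size);
    return (int)size;
}

static struct fuse_operations fs_ops = {
    .getattr = fs_getattr,
    .readdir = fs_readdir,
    .open    = fs_open,
    .read    = fs_read,
};

int main(int argc, char *argv[]) {
    /* Usage: ./myfs <mountpoint> */
    return fuse_main(argc, argv, &fs_ops, NULL);
}

It typically builds with gcc myfs.c -o myfs `pkg-config fuse --cflags --libs` and mounts with ./myfs <mountpoint>.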

Agenda

1 What do we want from a distributed file system?

2 Sorting and searching the file system landscape

3 Technology trends and future challenges

[Map of distributed file systems grouped by target community; systems used in HEP are highlighted]

Personal files (privacy, sharing, sync):
∙ Dropbox/ownCloud, Tahoe-LAFS, AFS

Big Data (MapReduce workflows, commodity hardware, incremental scalability):
∙ HDFS, QFS, MooseFS, MapR FS, XtreemFS

General-purpose distributed file systems (ease of administration, high level of POSIX compliance):
∙ Ceph

Supercomputers (fast parallel writes; InfiniBand, Myrinet, ...):
∙ Lustre, GPFS, GlusterFS, (p)NFS, OrangeFS, Panasas, BeeGFS

Shared disk:
∙ OCFS2, GFS2

Used in HEP (tape access, WAN federation, software distribution, fault-tolerance):
∙ dCache, XRootD, CernVM-FS, EOS

File System Architecture


Examples: Hadoop File System, Quantcast File System

Object-based file system

[Diagram: clients send meta-data operations such as create() and delete() to a single meta-data head node and issue read()/write() directly to the data nodes; the head node can help in job scheduling]

Target: Incremental scaling, large & immutable files
Typical for Big Data applications
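For comparison with the mounted-file-system route, the libhdfs fragment shown earlier expands to roughly the following complete client. This is a sketch; error handling is abbreviated, the header location and link flags depend on the Hadoop installation, and the path /user/demo/susy.dat is a placeholder.

#include <stdio.h>
#include <fcntl.h>
#include "hdfs.h"   /* libhdfs C API shipped with Hadoop */

int main(void) {
    /* "default" picks up the namenode configured in core-site.xml */
    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs) { fprintf(stderr, "hdfsConnect failed\n"); return 1; }

    hdfsFile f = hdfsOpenFile(fs, "/user/demo/susy.dat", O_RDONLY, 0, 0, 0);
    if (!f) { fprintf(stderr, "hdfsOpenFile failed\n"); hdfsDisconnect(fs); return 1; }

    char buf[1 << 16];
    tSize n;
    while ((n = hdfsRead(fs, f, buf, sizeof(buf))) > 0) {
        /* process n bytes of event data here */
    }

    hdfsCloseFile(fs, f);
    hdfsDisconnect(fs);
    return 0;
}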

File System Architecture

Examples: Lustre, MooseFS, pNFS, XtreemFS

Parallel file system

[Diagram: clients send meta-data operations (create(), delete()) to a dedicated meta-data server and stripe read()/write() data traffic in parallel across many data servers]

Target: Maximum aggregated throughput, large files

Typical for High-Performance Computing

File System Architecture

Examples: Ceph, OrangeFS

Distributed meta-data

[Diagram: meta-data is itself distributed over several meta-data servers, removing the single head node; clients read()/write() data directly on the data servers]

Target: Avoid single point of failure and meta-data bottleneck

Modern general-purpose distributed file system

File System Architecture


Examples: GlusterFS

Symmetric, peer-to-peer

[Diagram: a distributed hash table maps hash(path_n) to the hosts that store path_n]

∙ Difficult to deal with node churn
∙ Slow lookup beyond the LAN
∙ In HEP we use caching and catalog-based data management

Target: Conceptual simplicity, inherently scalable
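The placement idea can be sketched in a few lines: each client hashes the path and maps the hash onto the list of known servers, so no central meta-data service is consulted. This is an illustration only, not GlusterFS's actual elastic hashing; server and path names are made up.

#include <stdint.h>
#include <stdio.h>

/* FNV-1a: a simple, non-cryptographic hash used here for illustration */
static uint64_t fnv1a(const char *s) {
    uint64_t h = 14695981039346656037ULL;
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;
    }
    return h;
}

int main(void) {
    const char *servers[] = { "node01", "node02", "node03", "node04" };  /* hypothetical hosts */
    const size_t n_servers = sizeof(servers) / sizeof(servers[0]);
    const char *paths[] = { "/data/run1/susy.dat", "/data/run2/higgs.dat" };

    for (size_t i = 0; i < sizeof(paths) / sizeof(paths[0]); i++) {
        size_t host = fnv1a(paths[i]) % n_servers;   /* hash(path) -> host */
        printf("%s -> %s\n", paths[i], servers[host]);
    }
    return 0;
}

The slide's caveats follow directly: with a plain modulo, adding or removing a server (node churn) remaps most paths, and every lookup assumes an up-to-date view of the server list, which is hard to maintain beyond a LAN.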

Agenda

1 What do we want from a distributed file system?

2 Sorting and searching the file system landscape

3 Technology trends and future challenges

Trends and Challenges


We are lucky: large data sets tend to be immutable everywhere
∙ For instance: media, backups, VM images, scientific data sets, ...
∙ Reflected in hardware: shingled magnetic recording drives
∙ Reflected in software: log-structured space management

We need to invest in scaling fault-tolerance and speed together with the capacity
∙ Replication becomes too expensive at the Petabyte and Exabyte scale → erasure codes
∙ Explicit use of SSDs (the Amazon Elastic File System, for instance, will be on SSD)
  ∙ For meta-data (high IOPS requirement)
  ∙ As a fast storage pool
  ∙ As a node-local cache

Log-Structured Data

Idea: Store all modifications in a change log

[Diagram: a Unix file system updates files in place on disk; a log-structured file system appends all modifications to a log on disk and finds them again through an index]
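A toy illustration of the principle (assuming a single writer and a flat key namespace): writes only ever append records to one log file, and a small in-memory index remembers the offset of the latest record per key, so writes are sequential and reads are one seek.

#include <stdio.h>
#include <string.h>

#define MAX_KEYS 1024

/* In-memory index: key -> offset of the latest record in the log */
static char idx_key[MAX_KEYS][64];
static long idx_off[MAX_KEYS];
static int  idx_n;

static long log_append(FILE *log, const char *key, const char *value) {
    fseek(log, 0, SEEK_END);
    long off = ftell(log);
    fprintf(log, "%s=%s\n", key, value);     /* record format: key=value\n */
    fflush(log);
    for (int i = 0; i < idx_n; i++)
        if (strcmp(idx_key[i], key) == 0) { idx_off[i] = off; return off; }
    if (idx_n < MAX_KEYS) {
        snprintf(idx_key[idx_n], sizeof(idx_key[0]), "%s", key);
        idx_off[idx_n++] = off;
    }
    return off;
}

static int log_read(FILE *log, const char *key, char *buf, size_t len) {
    for (int i = 0; i < idx_n; i++) {
        if (strcmp(idx_key[i], key) == 0) {
            fseek(log, idx_off[i], SEEK_SET);
            return fgets(buf, (int)len, log) != NULL;
        }
    }
    return 0;
}

int main(void) {
    FILE *log = fopen("store.log", "a+");    /* append-only data log */
    if (!log) return 1;
    log_append(log, "runlist", "run1,run2");
    log_append(log, "runlist", "run1,run2,run3");  /* update = new record, old one stays */
    char buf[256];
    if (log_read(log, "runlist", buf, sizeof(buf)))
        printf("latest: %s", buf);
    fclose(log);
    return 0;
}

After a crash the index can be rebuilt by scanning the log; the price is a cleaner that eventually reclaims space occupied by superseded records, as in Zebra's stripe cleaner.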

Used by:
∙ The Zebra experimental DFS
∙ Commercial filers (e.g. NetApp)
∙ Key-value and BLOB stores
∙ File systems for flash

Advantages:
∙ Minimal seek
∙ Fast and robust crash recovery
∙ Efficient allocation in DRAM, flash, and disks
∙ Ideal for immutable data

Bandwidth Lags Behind Capacity

[Chart: relative improvement (×1 to ×1000, log scale) of capacity and bandwidth for HDD, DRAM, and SSD between 1995 and 2015]

∙ Capacity and bandwidth of affordable storage scale at different paces
∙ It will be prohibitively costly to constantly "move data in and out" (see the example below)
∙ Ethernet bandwidth scaled similarly to capacity
∙ We should use the high bi-section bandwidth among worker nodes and make them part of the storage network
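As a back-of-the-envelope illustration with the numbers from the backup table: reading a 2015 8 TB archive drive end to end at about 195 MB/s takes roughly 8,000,000 MB / 195 MB/s ≈ 41,000 s, i.e. more than 11 hours, whereas a 1994 4.3 GB drive at 9 MB/s was scanned in about 8 minutes. Repeatedly moving entire data sets in and out therefore stops being an option as capacity outgrows bandwidth.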

Fault Tolerance at the Petascale

How to prevent data loss

1 Replication: simple, large storage overhead (similar to RAID 1)

2 Erasure codes: based on parity blocks (similar to RAID 5)

[Diagram: a file split into data blocks plus parity blocks]
∙ Requires extra compute power
∙ Distribute all blocks

Available commercially: GPFS, NEC Hydrastor, Scality RING, ...
Emerging in file systems: EOS, Ceph, QFS
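The simplest erasure code is the single XOR parity block familiar from RAID 5. The sketch below is illustrative only (production systems typically use Reed-Solomon style codes with several parity blocks); it shows how one parity block lets any single lost data block be rebuilt.

#include <stdio.h>
#include <string.h>

#define BLOCK 8   /* tiny block size for illustration */

/* parity = XOR of all data blocks */
static void compute_parity(unsigned char blocks[][BLOCK], int n, unsigned char *parity) {
    memset(parity, 0, BLOCK);
    for (int b = 0; b < n; b++)
        for (int i = 0; i < BLOCK; i++)
            parity[i] ^= blocks[b][i];
}

/* rebuild one lost block by XORing the parity with all surviving blocks */
static void rebuild(unsigned char blocks[][BLOCK], int n, int lost, const unsigned char *parity) {
    memcpy(blocks[lost], parity, BLOCK);
    for (int b = 0; b < n; b++)
        if (b != lost)
            for (int i = 0; i < BLOCK; i++)
                blocks[lost][i] ^= blocks[b][i];
}

int main(void) {
    unsigned char data[3][BLOCK] = { "block_a", "block_b", "block_c" };
    unsigned char parity[BLOCK];

    compute_parity(data, 3, parity);

    memset(data[1], 0, BLOCK);                    /* simulate losing the node holding block 1 */
    rebuild(data, 3, 1, parity);

    printf("recovered: %s\n", (char *)data[1]);   /* prints "block_b" */
    return 0;
}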

Engineering Challenges
∙ Fault detection
∙ De-correlation of failure domains instead of random placement
∙ Failure prediction, e.g. based on MTTF and Markov models

Fault Tolerance at the Petascale

Example from Facebook

∙ Replication factor 2.1 using erasure coding within every data center plus cross-data-center XOR encoding

[Figure from Muralidhar et al. (2014): geo-replicated XOR coding. A block A stored in datacenter 1 and a block B stored in datacenter 2 are XORed into a block A XOR B kept in datacenter 3, so the data survive the loss of any single datacenter.]

Muralidhar et al. (2014), Link

Towards a Better Implementation of File Systems

Workflows: Distributed File Systems
∙ Integration with applications
∙ Feature-rich: quotas, permissions, links
∙ Globally federated namespace

Building blocks for data provisioning: Object Stores
∙ Very good on local networks
∙ Flat namespace
∙ It turns out that certain types of data do not need a hierarchical namespace, e.g. cached objects, media, VM images

Conclusion

File Systems are at the Heart of our Distributed Computing
∙ Most convenient way of connecting our applications to data
∙ The hierarchical namespace is a natural way to organize data

We can tailor file systems to our needs using existing building blocks, with a focus on:

1 Wide area data federation

2 Multi-level storage hierarchy from tape to memory

3 Erasure coding

4 Coalescing storage and compute nodes

Thank you for your time!

Backup Slides

Milestones in Distributed File Systems
Biased towards open-source, production file systems

Timeline: 1983 AFS · 1985 NFS · 1995 Zebra · 2000 OceanStore · 2002 Venti · 2003 GFS · 2005 XRootD · 2007 Ceph

AFS (1983)

∙ Roaming home folders
∙ Identity tokens and access control lists (ACLs)
∙ Decentralized operation ("Cells")

"AFS was the first safe and efficient distributed computing system, available [. . . ] on campus. It was a clear precursor to the Dropbox-like software packages today. [. . . ] [It] allowed students (like Drew Houston and Arash Ferdowsi) access to all their stuff from any connected computer."
http://www.wired.com/2011/12/backdrop-dropbox

NFS (1985)

Client-server

∙ Focus on portability
∙ Separation of protocol and implementation
∙ Stateless servers
∙ Fast crash recovery

Sandberg, Goldberg, Kleiman, Walsh, Lyon (1985)

Zebra (1995)

∙ Striping and parity
∙ Redundant array of inexpensive nodes (RAIN)
∙ Log-structured data

[Figure from Hartman and Ousterhout (1995): per-client striping in Zebra; each client batches its new file data into a single append-only log and stripes the log, together with parity, across the storage servers. A file manager handles meta-data and a stripe cleaner reclaims unused space.]

OceanStore (2000)

Peer-to-peer

∙ "Global scale": 10^10 users, 10^14 files
∙ Untrusted infrastructure
∙ Based on a peer-to-peer overlay network
∙ Nomadic data through aggressive caching
∙ Foundation for today's decentralized Dropbox replacements

Kubiatowicz et al. (2000)

[Figure from Kubiatowicz et al. (2000): the OceanStore system as a multitude of highly connected "pools" among which data flows freely; clients connect to one or more pools, perhaps intermittently.]

Venti (2002)

Archival storage

∙ De-duplication through content-addressable storage
∙ Content hashes provide intrinsic file integrity
∙ Merkle trees verify the file system hierarchy

Quinlan and Dorward (2002)

[Figure from Quinlan and Dorward (2002): storage size of two Plan 9 file servers (bootes and emelie) over time, and the ratio of archival to active data with and without Venti.]

GFS (2003)

Object-based

∙ Co-designed for map-reduce
∙ Coalesce storage and compute nodes
∙ Serialize data access

[Figure: GFS/map-reduce architecture, from the Google AppEngine documentation]

XRootD (2005)

∙ Namespace delegation
∙ Global tree of redirectors
∙ Flexibility through decomposition into pluggable components
∙ Namespace independent from data access

[Figure: a client's open() is redirected down a tree of xrootd/cmsd servers; with 64-way fan-out, one level covers 64^1 = 64 servers, two levels 64^2 = 4096 servers]

Hanushevsky (2013)

Ceph (2007)

Ceph File System and RADOS

∙ Parallel, distributed meta-data
∙ Peer-to-peer file system at the cluster scale
∙ Data placement across failure domains
∙ Adaptive workload distribution

Weil (2007)

[Figure and table from Weil (2007): CRUSH descends a cluster map hierarchy of rows, cabinets, and shelves of disks, e.g. take(root), select(1,row), select(3,cabinet), select(1,disk), emit, to distribute three replicas across three cabinets in the same row.]

Hardware Bandwidth and Capacity Numbers

Method and entries marked † from Patterson (2004) Link

Year | HDD capacity | HDD bandwidth | SSD (MLC flash) capacity | SSD bandwidth (r/w) | DRAM capacity | DRAM bandwidth | Ethernet bandwidth
1993 | | | | | 16 Mibit/chip† | 267 MiB/s† |
1994 | 4.3 GB† | 9 MB/s† | | | | |
1995 | | | | | | | 100 Mbit/s†
2003 | 73.4 GB† | 86 MB/s† | | | | | 10 Gbit/s†
2004 | | | | | 512 Mibit/chip | 3.2 GiB/s |
2008 | | | 160 GB | 250/100 MB/s [3] | | |
2012 | | | 800 GB | 500/460 MB/s [4] | | |
2014 | 6 TB | 220 MB/s [1] | 2 TB | 2.8/1.9 GB/s [5] | 8 Gibit/chip | 25.6 GiB/s | 100 Gbit/s
2015 | 8 TB | 195 MB/s [2] | | | | |
Increase | ×1860 | ×24 | ×12.5 | ×11.2 | ×512 | ×98 | ×1000

[1] http://www.storagereview.com/seagate_enterprise_capacity_6tb_35_sas_hdd_review_v4
[2] http://www.storagereview.com/seagate_archive_hdd_review_8tb
[3] http://www.storagereview.com/intel_x25-m_ssd_review
[4] http://www.storagereview.com/intel_ssd_dc_s3700_series_enterprise_ssd_review
[5] http://www.tomshardware.com/reviews/intel-ssd-dc-p3700-nvme,3858.html

HDD: Seagate ST15150 (1994)†, Seagate 373453 (2004)†, Seagate ST6000NM0034 (2014), Seagate ST8000AS0012 [SMR] (2015)
SSD: X25-M (2008), Intel SSD DC S3700 (2012), Intel SSD DC P3700 (2014)
DRAM: Fast DRAM (1993)†, DDR2-400 (2004), DDR4-3200 (2014)
Ethernet: Fast Ethernet IEEE 802.3u (1995)†, 10 GbitE IEEE 802.3ae (2003)†, 100 GbitE IEEE 802.3bj (2014)


Data Integrity and File System Snapshots

[Merkle tree over the file system namespace: the root (/) has contents A, B and stores hash(h_A, h_B); directory /A has contents α, β and stores hash(h_α, h_β) =: h_A; the empty directory /B stores hash(∅) =: h_B; the leaves /A/α and /A/β store hash(data_α) =: h_α and hash(data_β) =: h_β]

∙ A hash tree with a cryptographic hash function provides a secure identifier for sub trees
∙ It is easy to sign a small hash value (data authenticity)
∙ Efficient calculation of changes (fast replication)
∙ Bonus: versioning and data de-duplication
∙ Full potential together with content-addressable storage
∙ Self-verifying data chunks, trivial to distribute and cache
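A minimal sketch of the hash-tree construction for the pictured namespace; a toy 64-bit FNV-1a hash stands in for the cryptographic hash (e.g. SHA-1 as in Venti) that a real system would use.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy hash standing in for a cryptographic hash function */
static uint64_t h_bytes(const void *data, size_t len) {
    const unsigned char *p = data;
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

int main(void) {
    /* Leaves: hash of the file contents */
    const char *data_alpha = "contents of /A/alpha";
    const char *data_beta  = "contents of /A/beta";
    uint64_t h_alpha = h_bytes(data_alpha, strlen(data_alpha));
    uint64_t h_beta  = h_bytes(data_beta,  strlen(data_beta));

    /* Directory /A: hash over the hashes of its entries */
    uint64_t pair_a[2] = { h_alpha, h_beta };
    uint64_t h_a = h_bytes(pair_a, sizeof(pair_a));

    /* Empty directory /B: hash over nothing */
    uint64_t h_b = h_bytes("", 0);

    /* Root: hash over the hashes of /A and /B. Signing this one value
       authenticates the entire tree; changing any file changes the root. */
    uint64_t pair_root[2] = { h_a, h_b };
    uint64_t h_root = h_bytes(pair_root, sizeof(pair_root));

    printf("root hash: %016llx\n", (unsigned long long)h_root);
    return 0;
}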

System Integration

System-aided

∙ Kernel-level file system: fast, but intrusive
∙ Virtual NFS server: easy deployment, but limited by NFS semantics

Unprivileged options (mostly)

∙ Library (e.g. libhdfs, libXrd, ...): tuned API, but not transparent
∙ Interposition systems:
  ∙ Pre-loaded libraries
  ∙ ptrace based
  ∙ FUSE ("file system in user space")

[Diagram: the application's open() goes through libc into the OS kernel as a system call; with FUSE, the kernel forwards it as an upcall to a user-space file system process]
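As an illustration of the pre-loaded-library option, a minimal Linux/glibc sketch that interposes open(). It only logs the call here, but the same hook is where an interposition system could redirect paths to a remote-storage client library.

/* Build: gcc -shared -fPIC interpose.c -o interpose.so -ldl
   Use:   LD_PRELOAD=./interpose.so ls /some/path              */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>

int open(const char *path, int flags, ...) {
    static int (*real_open)(const char *, int, ...) = NULL;
    if (!real_open)
        real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

    mode_t mode = 0;
    if (flags & O_CREAT) {               /* open() takes a mode only with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }

    fprintf(stderr, "[interpose] open(%s)\n", path);
    /* A real interposition layer would rewrite 'path' or hand the call
       to a remote-storage client library here. */
    return real_open(path, flags, mode);
}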

Bibliography I

Survey Articles

Satyanarayanan, M. (1990). A survey of distributed file systems. Annual Review of Computer Science, 4(1):73–104.

Guan, P., Kuhl, M., Li, Z., and Liu, X. (2000). A survey of distributed file systems. University of California, San Diego.

Thanh, T. D., Mohan, S., Choi, E., Kim, S., and Kim, P. (2008). A taxonomy and survey on distributed file systems. In Proc. Int. Conf. on Networked Computing and Advanced Information Management (NCM'08), pages 144–149.

Peddemors, A., Kuun, C., Spoor, R., Dekkers, P., and den Besten, C. (2010). Survey of technologies for wide area distributed storage. Technical report, SURFnet.

Depardon, B., Séguin, C., and Mahec, G. L. (2013). Analysis of six distributed file systems. Technical Report hal-00789086, Université de Picardie Jules Verne.

Bibliography II

Donvito, G., Marzulli, G., and Diacono, D. (2014). Testing of several distributed file systems (HDFS, Ceph and GlusterFS) for supporting the HEP experiment analysis. Journal of Physics: Conference Series, 513.

File Systems

Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., and Lyon, B. (1985). Design and implementation of the Sun Network Filesystem. In Proc. of the Summer USENIX Conference, pages 119–130.

Morris, J. H., Satyanarayanan, M., Conner, M. H., Howard, J. H., Rosenthal, D. S. H., and Smith, F. D. (1986). Andrew: A distributed personal computing environment. Communications of the ACM, 29(3):184–201.

Hartman, J. H. and Ousterhout, J. K. (1995). The Zebra striped network file system. ACM Transactions on Computer Systems, 13(3):274–310.

Bibliography III

Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., and Zhao, B. (2000). OceanStore: An architecture for global-scale persistent storage. ACM SIGPLAN Notices, 35(11):190–201.

Quinlan, S. and Dorward, S. (2002). Venti: A new approach to archival storage. In Proc. of the 1st USENIX Conf. on File and Storage Technologies (FAST'02), pages 89–102.

Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. ACM SIGOPS Operating Systems Review, 37(5):29–43.

Schwan, P. (2003). Lustre: Building a file system for 1,000-node clusters. In Proc. of the 2003 Linux Symposium, pages 380–386.

Dorigo, A., Elmer, P., Furano, F., and Hanushevsky, A. (2005). XROOTD: A highly scalable architecture for data access. WSEAS Transactions on Computers, 4(4):348–353.

Bibliography IV

Weil, S. A. (2007). Ceph: reliable, scalable, and high-performance distributed storage. PhD thesis, University of California Santa Cruz.
