Globalfs: a Strongly Consistent Multi-Site File System

GlobalFS: A Strongly Consistent Multi-Site File System Leandro Pacheco Raluca Halalai Valerio Schiavoni University of Lugano University of Neuchatelˆ University of Neuchatelˆ Fernando Pedone Etienne Riviere` Pascal Felber University of Lugano University of Neuchatelˆ University of Neuchatelˆ Abstract consistency, availability, and tolerance to partitions. Our goal is to ensure strongly consistent file system operations This paper introduces GlobalFS, a POSIX-compliant despite node failures, at the price of possibly reduced geographically distributed file system. GlobalFS builds availability in the event of a network partition. Weak on two fundamental building blocks, an atomic multicast consistency is suitable for domain-specific applications group communication abstraction and multiple instances of where programmers can anticipate and provide resolution a single-site data store. We define four execution modes and methods for conflicts, or work with last-writer-wins show how all file system operations can be implemented resolution methods. Our rationale is that for general-purpose with these modes while ensuring strong consistency and services such as a file system, strong consistency is more tolerating failures. We describe the GlobalFS prototype in appropriate as it is both more intuitive for the users and detail and report on an extensive performance assessment. does not require human intervention in case of conflicts. We have deployed GlobalFS across all EC2 regions and Strong consistency requires ordering commands across show that the system scales geographically, providing replicas, which needs coordination among nodes at performance comparable to other state-of-the-art distributed geographically distributed sites (i.e., regions). Designing file systems for local commands and allowing for strongly strongly consistent distributed systems that provide good consistent operations over the whole system. The code of performance requires careful tradeoffs. The original GlobalFS is available as open source. approach we explore in this work is to trade the performance 1 Introduction of global operations, spanning multiple regions, for the scalability of intra-region operations. We capture this Cloud infrastructures, composed of multiple interconnected compromise with the notion of geographical scalability. datacenters, have become an essential part of modern com- Geographical scalability is motivated by geo-distributed puting systems. They provide an efficient and cost-effective applications that wish to exploit locality without compro- solution to hosting web-accessible services, storing and mising consistency or reducing the scope of operations to a processing data, or performing compute-intensive tasks. single region. This trend is becoming increasingly more im- Large companies like Amazon or Google do not only portant with the wide range of applications that are deployed use such architectures for their own needs, but they also over multiple data centers spanning several regions, e.g., on rent them to external clients in a variety of flavors, e.g., Amazon EC2. Yet, achieving geographical scalability is infrastructure (IaaS), platform (PaaS), software (SaaS), or notoriously difficult. For example, among the few existing data (DaaS) as a service. Such global infrastructures rely file systems with support for geographical distribution, Calv- on geographically distributed datacenters for fault-tolerance, inFS [55] totally orders requests. As a consequence, perfor- scalability, and performance reasons. mance decreases with the number of regions in the system, We focus in this work on the design of a geographically even for operations that access objects in a single region. distributed file system, accessible via a POSIX-compliant This paper introduces GlobalFS, a file system that API. Most previous designs for geographically distributed achieves geographical scalability by exploiting two abstrac- file systems [25, 35] have provided weak consistency tions. First, it relies on data stores located in geographically guarantees (e.g., eventual consistency [15]) to work around distributed datacenters. Files are replicated and stored as the limitations formalized by the CAP theorem [21], which immutable blocks in the data stores, which are organized states that distributed applications can fully support at as distributed hash tables (DHTs). Second, GlobalFS uses most two of the following three properties simultaneously: an atomic multicast abstraction to maintain mutable file metadata and orchestrate multi-site operations. Atomic Applications multicast provides strong order guarantees by partially Client interface (FUSE) ordering operations. Metadata management Data store GlobalFS notably differs from other distributed file Atomic multicast systems by defining a flexible partition model in which files Network and folders can be placed according to access patterns (e.g., Figure 1: Overall architecture of GlobalFS. in the same region as their most frequent users), as well as four execution modes corresponding to the operations time when the system becomes synchronous is called the that can be performed in the file system: (1) single-partition global stabilization time (GST) and is unknown to the operations, (2) multi-partition uncoordinated operations, processes. Before the GST, there are no bounds on the time (3) multi-partition coordinated operations, and (4) read-only it takes for messages to be transmitted and actions to be operations. While single-partition and read-only operations executed. After the GST, such bounds exist but are unknown. can be implemented efficiently by accessing a single region, In practice, “forever” means long enough for the atomic the other two operations require interactions across multiple multicast protocol to make progress (i.e., deliver messages). regions. By leveraging atomic multicast and distinguishing GlobalFS ensures sequential consistency for update between these four modes of execution, GlobalFS can operations and causal consistency for reads. A system is provide low latency for single-region commands while sequentially consistent if there is a way to reorder the client allowing for consistent operations across regions. GlobalFS commands in a sequence that (i) respects the semantics of can therefore exploit geographical locality in new ways to the commands as defined in their sequential specification, combine performance and consistency, and hence propose and (ii) respects the ordering of commands as defined by original contributions in the well-studied design space of each client [5]. A system is causally consistent if the result distributed file systems. of read operations respect the causal ordering of events as We have implemented a complete prototype of GlobalFS defined by the “happens-before” relation [27]. and deployed it on Amazon’s EC2 platform with nodes spread all over the world, across all nine available regions. 3 System architecture We have conducted an in-depth study of its performance. Re- sults show that GlobalFS outperforms other geographically This section presents the overall architecture of GlobalFS distributed file systems that offer comparable guarantees and how the file system can be partitioned and replicated and delivers good performance for single-site commands. across datacenters. The code of GlobalFS is freely available as open source.1 The rest of this paper is organized as follows. Sections2 3.1 Components and3 introduce GlobalFS’s system model and architecture, The architecture of GlobalFS consists of four components: respectively. Section4 presents the protocol design. the client interface, the data store, metadata management, Section5 describes the implementation of our prototype. and atomic multicast (see Figure1). Section6 discusses results of experimental evaluation. The client interface provides a file system API supporting Section7 reviews related work and Section8 concludes. a subset of POSIX 1-2001 [1]. GlobalFS implements file system operations sufficient to manipulate files and 2 System model and definitions directories. Some file system calls change the structure of the file system tree (i.e., the files and directories within We assume a distributed system composed of interconnected each directory). Each file descriptor seen by a client when processes that communicate by message passing. There is opening a file is mapped to a local file descriptor at each an unbounded set of client processes and a bounded set of GlobalFS server. We support file-specific operations: server processes. Processes may fail by crashing, but do not mknod, unlink, open, read, write, truncate, sym- experience arbitrary behavior (i.e., no Byzantine failures). link, readlink; directory-specific operations: mkdir, Client and server processes are grouped within datacen- rmdir, opendir, readdir; and general purpose opera- ters that are geographically distributed over different regions. tions: stat, chmod, chown, rename, and utime. We Processes in the same region experience low-latency support symbolic links, but not hard links. communication, while messages exchanged between Like most contemporary distributed file systems processes located in different regions are subject to larger (e.g., [20, 49, 9, 47]), GlobalFS decouples metadata from latencies. Links are quasi-reliable: if both the sender and data storage. Metadata in GlobalFS is handled by the meta- the receiver are non-faulty, then every message sent is data management layer. Each file has an associated inode eventually

Globalfs: a Strongly Consistent Multi-Site File System

Elastic Storage for Linux on System Z IBM Research & Development Germany

Base Performance for Beegfs with Data Protection

Spring 2014 • Vol

BESIII Physical Analysis on Hadoop Platform

Efficient Implementation of Data Objects in the OSD+-Based Fusion

Membrane: Operating System Support for Restartable File Systems Swaminathan Sundararaman, Sriram Subramanian, Abhishek Rajimwale, Andrea C

Developments in GFS2

Data Integration and Analysis 10 Distributed Data Storage

Add-Ons for Red Hat Enterprise Linux

Big Data Storage Workload Characterization, Modeling and Synthetic Generation

A Decentralized Cloud Storage Network Framework

Storage Systems and Input/Output 2018 Pre-Workshop Document