Building Hierarchical Storage Services in the Cloud

by

Kunal Lillaney

A dissertation submitted to The Johns Hopkins University in conformity with the

requirements for the degree of Doctor of Philosophy.

Baltimore, Maryland

April, 2019

Kunal Lillaney 2019

All rights reserved

Abstract

Object storage has emerged as a low-cost and scalable alternative solution in the cloud for storing unstructured data. However, performance limitations often compel users to employ supplementary storage services for their varied workloads.

The result is growing storage sprawl, unacceptably low performance, and an increase in associated storage costs.

We combine the assets of multiple cloud services on offer to develop NDStore, a scalable multi-hierarchical data storage deployment for open-science data in the cloud. It utilizes a scalable object store as a base tier and an in-memory cluster as a low-latency caching tier to support a variety of workloads. All programming interfaces to this system are RESTful web-services. However, many applications that rely on richer interfaces and semantics are unable to benefit from object storage.

Users either transfer data between a file system and object storage or use inefficient file connectors over object stores.

One promising solution to this problem is providing dual access, the ability to transparently read and write the same data through both file system interfaces and object storage APIs. We discuss features which we believe are essential or desirable in a dual-access object storage file system—OSFS. Further, we design and implement Agni, an efficient dual-access OSFS, utilizing only standard object storage APIs and capabilities. Generic object storage's lack of support for partial writes introduces a performance penalty for some workloads. Agni overcomes this shortcoming by implementing a multi-tier data structure which temporally indexes partial writes from the file system, persists them to a log, and merges them asynchronously for eventually consistent access through the object interface. Our experiments demonstrate that Agni improves partial write performance by two to seven times with minimal indexing overhead, and for some representative workloads it can improve performance by 20%-60% compared to S3FS, a popular OSFS, or the prevalent approach of manually copying data between different storage systems.

Advisor: Dr. Randal Burns

Readers: Dr. David Pease, Dr. Yair Amir

Acknowledgments

I am truly grateful to my advisor, Randal Burns, for all his advice and support throughout my years at Johns Hopkins. I am extremely fortunate to have Randal as an advisor and a mentor. His role as a mentor has been instrumental in shaping my professional and personal growth. He was understanding, extremely patient, and provided me with the requisite independence to chart my own path. This thesis would not be possible without him.

I am also profoundly thankful to David Pease and Vasily Tarasov, my collaborators at IBM Research–Almaden. They played the role of secondary advisors and provided constant guidance during the latter half of my research. In addition, I would like to thank Yair Amir, who played a very important role in shaping my views towards research in particular and life in general. Interacting with Yair has enabled me to look at the larger picture, which is essential to succeed.

I am immensely appreciative of my Hopkins Storage Systems Lab (HSSL) mates—Disa Mhembere, Da Zheng, Stephen Hamilton, James Browne, Kalin Kanov—for all the constructive research discussions; my collaborators at JHU-APL—William Gray Roncal, Dean Kleissas, Brock Wester; and my Hopkins friends—Harisanker Menon, Purnima Rajan, Panchpakesan Shyamshankar, Aditya Rao, Naveen Natrajan, Raghvendra S. V.—for all their help.

Finally, I thank my family for their constant love and support. Without them, this endeavor would not have been possible. I am grateful to my parents—Anil and Meena, sister—Namrata, brother-in-law—Rahul, and uncle—Sanjay for providing me with a strong support ecosystem and being there when I needed them the most.

This work has been supported by grants from the National Institutes of Health (1R01NS092474, 1U01NS090449), the National Science Foundation (IOS-1707298, ACI-1649880, OCE-1633124, IIS-1447639), and IBM Research.

Dedication

You have been a paragon, a father-figure, a constant source of inspiration, and much more to me. You sacrificed your education so others wouldn't need to. I owe you an unrequitable debt. You are not physically present here today to witness me at the apotheosis of my PhD, but your ātman is. Grandfather, please consider this thesis as my gurudakshinā to you.

Contents

Abstract
Acknowledgments
List of Tables
List of Figures

1 Introduction
  1.1 Our contributions
  1.2 Thesis organization

2 NDStore: Deploying scientific web-services in the cloud
  2.1 Motivation
  2.2 Background
  2.3 Data design
  2.4 Architecture
    2.4.1 Base tier
    2.4.2 Caching tier
    2.4.3 Buffer for random writes
  2.5 Microservice processing
  2.6 Experiments
    2.6.1 Read throughput
    2.6.2 Write throughput
    2.6.3 Lambda throughput
    2.6.4 Storage cost analysis
  2.7 Experience

3 Designing dual-access file systems over object storage
  3.1 Use cases
  3.2 Design requirements
  3.3 Existing systems
    3.3.1 File systems paired with object storage
    3.3.2 Object storage file systems
    3.3.3 Miscellaneous approaches
  3.4 Design trade-offs
    3.4.1 File-to-object mapping
    3.4.2 Object naming policy
    3.4.3 Metadata
    3.4.4 Data access
    3.4.5 Caching
  3.5 Evaluation
    3.5.1 Streaming reads
    3.5.2 Streaming writes
    3.5.3 Random writes
    3.5.4 Metadata
  3.6 Discussion

4 Agni's stand-alone design
  4.1 Overview
  4.2 Data structure
  4.3 Processing logic
  4.4 Discussion
  4.5 Implementation
  4.6 Evaluation
    4.6.1 Testbed
    4.6.2 Partial writes
    4.6.3 Indexing and caching overhead
    4.6.4 Random read performance
    4.6.5 Sequential write performance
    4.6.6 Scaling up merge performance
    4.6.7 Lambda latency
    4.6.8 rsync performance
    4.6.9 ffmpeg and bowtie performance

5 Agni's distributed design
  5.1 Overview
  5.2 Modifications in the data structure
  5.3 Processing logic
  5.4 Evaluation
    5.4.1 Read and write throughput performance

6 Policies for a Coherent Namespace
  6.1 Overview
  6.2 Mode I
  6.3 Mode II
  6.4 Mode III
  6.5 Evaluation

7 Conclusion

Bibliography

Vita

List of Tables

2.1 Storage cost analysis of NDStore
3.1 Comparison of existing object storage based file systems
6.1 Operating modes in Agni

List of Figures

1.1 Cost comparison between object storage and block storage
2.1 Resolution hierarchy of NDStore
2.2 NDStore architecture
2.3 Example of sparse neuroscience annotations
2.4 NDStore data layout
2.5 Read request operation in NDStore
2.6 Order of writes in NDBlaze
2.7 Different phases of the parallel data ingest service
2.8 Read throughput of NDStore
2.9 Write throughput of NDStore
2.10 Accelerated write throughput through NDBlaze
2.11 Lambda cuboid generation throughput
3.1 Architecture of ObjectFS
3.2 File structure in ObjectFS
3.3 Streaming read performance for ObjectFS
3.4 Streaming write performance for ObjectFS
3.5 Random write performance for ObjectFS
3.6 Rename latency for ObjectFS
4.1 Design overview of Agni
4.2 Data layout of Agni
4.3 Summary of master index in Agni
4.4 Processing logic in Agni
4.5 Different modules of Agni
4.6 Partial write performance for Agni
4.7 Partial write throughput to cache for Agni
4.8 Random read performance
4.9 Sequential write performance
4.10 Scaleup merge performance of Agni
4.11 Speedup merge performance of Agni
4.12 Object-to-file visibility lag of Agni
4.13 rsync run time over DropBox files
4.14 ffmpeg run time over media files
4.15 Bowtie run time over gene sequencing files
5.1 Design overview of distributed Agni
5.2 Data layout of Agni in distributed mode
5.3 Processing logic in Agni
5.4 Different modules of distributed Agni
5.5 Distributed Agni performance
6.1 GET performance in Urial
6.2 PUT performance in Urial

Chapter 1

Introduction

Some estimates predict that by 2021 more data will reside in public clouds than in traditional data centers [1], and by 2020 the world will produce 44 zettabytes of unstructured data each year [2]. Public clouds offer economies of scale, reliability, availability, and scaling on demand [3]. Users and enterprises are actively migrating their data and associated workflows to the cloud to realize these benefits. Commercial clouds do offer many storage services, including object storage, archival cold storage, memory clusters, relational databases, SSDs, and key-value stores. Figure 1.1 illustrates the cost comparison between object storage and block storage across different popular commercial cloud providers. However, there is no single service that satisfies all user requirements of scalability, durability, and low cost. Relational databases on disk drives have difficulty providing the high I/O bandwidth required by some workloads and are difficult to scale. Key-value stores, such as DynamoDB [4], are designed to hold millions of key-value pairs and scale well. But key-value stores that integrate solid-state storage and disk drives result in high operating costs for petabytes of data. Moreover, the availability and reliability of data are critical, and any data loss is unacceptable.

Figure 1.1: Cost comparison between object storage and block storage across different cloud providers. Block storage is 2 to 9 times more expensive than object storage just for storing data.

Object storage has gained significant traction in recent years owing to this explosion in the amount of unstructured data. It is predicted that this year (2019) more than 30% of the storage capacity in data centers will be provided by object storage [5], and that by 2023 the object storage market will grow to $6 billion [6]. Users and enterprises increasingly look to object storage as an economical and scalable solution for storing this unstructured data [7,8]. In addition, object stores can offer simplified management, ease of use, and better data durability than traditional file systems [9,10]. These factors have led to the emergence of a rich ecosystem of object stores, both in the cloud and on-premises [11–16]. They are increasingly being used for tasks that have traditionally required file systems, such as data analytics, media streaming, and static web content serving, as well as data archiving [17,18].

Object stores are characterized by RESTful access, flat namespaces, immutable data, relaxed consistency, and rich user-defined metadata. Most object stores support simple data access operations: GET, PUT, and DELETE.

Object store and file system interfaces have a number of fundamental namespace and data access differences. File systems are generally characterized by their support for POSIX, which includes standard data access interfaces, a defined set of metadata, multi-user permissions, and a directory structure with a hierarchical namespace. Limited object storage semantics make them unsuitable for some applications, primarily in the realm of data analytics. For example, object stores do not support partial writes, so any update must rewrite the entire object. As a result, existing workflows that rely on partial writes need to be redesigned for object storage in order to avoid the performance penalty of rewrites [19].
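To make the cost of a partial update concrete, the sketch below (our illustration, not code from the thesis) performs a small byte-range update against S3 with boto3; the bucket and key names are hypothetical.

```python
import boto3

# Why partial writes are costly on a generic object store: updating a small byte
# range still requires fetching and rewriting the whole object.
s3 = boto3.client("s3")
BUCKET, KEY = "example-bucket", "data/volume.bin"  # hypothetical names

def write_range(offset, new_bytes):
    body = bytearray(s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read())  # full GET
    body[offset:offset + len(new_bytes)] = new_bytes                        # small edit
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=bytes(body))                 # full PUT

write_range(4096, b"\xff" * 16)  # a 16-byte update moves the entire object twice
```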

Moreover, they are too slow for some latency sensitive workloads such as visualization.

Object storage systems also do not provide a hierarchical namespace. Many applications, such as genomic workflows [20,21], depend on a namespace hierarchy and file pointers, neither of which is supported by object stores. Given the flat namespace in object stores, some space organization operations such as creation and deletion of directories could help analysis applications manage data. In a similar vein, renaming an object requires an entire object rewrite, making data management operations cumbersome for large objects [22].
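Renaming has the same flavor: S3 exposes no rename primitive, so a rename is typically emulated as a copy followed by a delete, as in the hedged sketch below (bucket and key names are hypothetical).

```python
import boto3

# "Renaming" an object in S3: copy to the new key, then delete the old key.
# The copy rewrites the full object server-side, which is expensive for large objects.
s3 = boto3.client("s3")

def rename_object(bucket, old_key, new_key):
    s3.copy_object(Bucket=bucket,
                   CopySource={"Bucket": bucket, "Key": old_key},
                   Key=new_key)
    s3.delete_object(Bucket=bucket, Key=old_key)

rename_object("example-bucket", "movies/clip.mpg", "archive/clip.mpg")
```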

Users and infrastructure providers have adopted different approaches to overcoming the disparity between files and objects, especially for recently emerging use cases. In the first approach, distributed file systems are paired with object storage, and data is periodically copied over to a file system from object storage for data analytics [23–25]. This approach offers the full spectrum of file interfaces but is unwieldy for large datasets and can lead to over-provisioning, cost inflation, reduced backup efficiency, additional time for data transfer, and poor quality of service (QoS) due to storage sprawl–duplication of data across different media and interfaces [26]. An alternate approach, which we term object storage file system—OSFS, overlays file interfaces on top of existing object stores [27–32]. This approach gives a single location for data and provides ease of administration and use. Given its benefits, we focus on OSFS in this thesis.

Based on our exhaustive survey we believe that existing OSFSs fall short of simultaneously providing two functionalities that are essential in modern multi-application infrastructures: (i) dual access to data and (ii) an efficient file system interface. We define dual access, within the context of file systems and object storage, as the user's ability to directly address data through a native object or file interface without going through an intermediary layer or having to look up the file-to-object mapping. The holy grail of dual access is being intuitive, where the same data can be accessed through either interface by the same name.

Based on input from end users, we have identified that dual access is a growing requirement for many tasks which are being migrated to object storage. An example of a workflow that takes advantage of dual access is one in which an MPEG movie file is accessed by a transcoder² using fread, and then by a Content Distribution Network (CDN) service using GET. As shown in this example and further illustrated in Section 3.1, data is likely to be accessed through both file and object interfaces during its lifetime.

² A transcoder typically converts a media file from one format to another. This is considered standard industry practice to support a varied set of tasks such as supporting different screen resolutions and network bandwidths while streaming.

An ideal solution would be a new system, built from the ground up to support both file system and object interfaces. However, we can imagine that such a new system would be fraught with numerous hurdles, including (i) vendor lock-in, (ii) user entrenchment with existing systems, and (iii) potential dilution of desirable properties such as cost or scalability. In addition, a solution is only as useful as it is likely to be adopted by users and enterprises, and new solutions often face significant difficulties with acceptance. The current set of both cloud-integrated file systems and object storage file systems does not efficiently support such a use case. In the case of cloud-integrated file systems, the user has to wait for the data transfer to complete in each direction before it can be accessed using the appropriate interface. Existing object storage file systems offer low performance, with inefficient metadata operations and a full object read-modify-write cycle during partial updates.

1.1 Our contributions

In this thesis, we present two cloud-based multi-hierarchical storage systems that overcome the existing limitations of object storage.

NDStore

First, we develop and deploy a hierarchical storage system on the cloud, termed NDStore, that replaces a traditional NoSQL architecture. The system uses an object store as a base tier for its scalability and low cost and a memory cluster as a caching tier for low-latency I/O. We do not attempt to superimpose any aspects of the existing NoSQL architecture on the cloud. Instead, we reinvent it to exploit cloud services to their full potential. Neither object stores nor memory caches can be used in isolation because of their respective drawbacks, but they complement each other, each negating the other's shortcomings.

In addition, a memory-based fast-write buffer, called NDBlaze, accelerates random writes to the object store. We decompose certain aspects of our system using micro-services for scalability and modularity and to avoid a single point of failure. Micro-services include serverless compute, distributed queues, and key-value stores, used for ingest, data manipulation, and analysis. The use of these services aids in our quest to keep cost low while extracting the maximum performance possible for different workloads.

Agni

Next, we design and implement Agni, an efficient dual-access OSFS, based on the design principles inherited from NDStore. Agni provides eventual consistency through both file and object interfaces while utilizing only commonly implemented object APIs and capabilities for operation. The effect of an operation through either interface is eventually reflected through the other interface. It implements a multi-tier data structure that provides high performance for partial writes by accepting them in memory, persisting them to a log, and aggregating writes when merging logs. Logged updates are indexed temporally, which allows the system to update existing files even while flushing data to the log. Agni merges logged updates asynchronously and presents a single object to the user. This overcomes the issue with existing systems that either provide dual access or efficient partial writes, but not both. We also implement dual-access APIs that, among other functions, allow users to synchronize files with objects on demand. In addition, we develop a set of namespace policies which ensure a coherent namespace across file and object interfaces. In support of this, we implement an optional component, Urial, which wraps around existing object interfaces to provide intuitive dual access and prevents an occurrence of namespace incoherence.

We validate the performance of Agni by comparing it to S3FS, a popular OSFS, and the prevalent approach of manually copying data between different storage systems. Agni performs two to seven times faster for partial writes when compared with S3FS. It does so by logging writes and executing an asynchronous merge, limiting the immediate network traffic to the size of the write and reducing eventual network traffic by aggregating multiple writes. For reads, Agni performs comparably to S3FS, with no drop in performance even though files are fragmented across multiple logs. Depending on configuration, Agni merges objects at rates from 400 MB/s to 800 MB/s. Updates through the object interface are reflected in the file system within 350 ms. For some representative workloads, Agni can improve performance by 20%-60% compared to existing approaches.

In a distributed mode, Agni reaches a peak of ≈6.9TB/s for reads and ≈6.8TB/s for writes on a 10-node cluster. Using Urial introduces a manageable 1%-20% overhead based on the object request and number of threads.

1.2 Thesis organization

The rest of the thesis is organized as follows:


∗ Chapter 2 describes NDStore, a scalable multi-hierarchical data storage deployment for spatial analysis of neuroscience data on the cloud.

∗ Chapter 3 presents the different use cases and design requisites for a dual-access storage system. We explore the various design choices that exist, with their quantified trade-offs.

∗ Chapter 4 provides insights into the stand-alone version of Agni, its architecture, and the multi-tier write-aggregating data structures, and details the various internal operations. Further, we validate the performance of our data structure by comparing it to existing systems.

∗ In Chapter 5, we extend our data structure across multiple Agni clients. We describe the specific modifications that enable us to operate the system in a distributed environment and verify its performance.

∗ Chapter 6 illustrates the different namespace policies, including Urial, which are used to ensure namespace coherence across both interfaces. In addition, we also quantify the overhead of using Urial.

∗ Chapter 7 concludes the thesis.

Chapter 2

NDStore: Deploying scientific web-services in the cloud

This chapter describes NDStore, a scalable multi-hierarchical data storage deployment for spatial analysis of neuroscience data on the cloud. The system design is inspired by the requirement to maintain high I/O throughput for workloads that build neural connectivity maps of the brain from peta-scale imaging data using computer vision algorithms.

In Section 2.1, we describe our previous system and the need to develop a new hierarchical storage system in the cloud. Next, we present our data design in Section 2.3, and the two-tier storage architecture and its operation in Section 2.4. Further, in Section 2.5, we illustrate how we integrate our system with different cloud microservices. Finally, we provide a performance evaluation in Section 2.6 that shows that our design provides good performance for a variety of workloads by combining the assets of multiple cloud services.


2.1 Motivation

In 2013, we developed a scalable cluster called the Open Connectome Project [33] as a response to the scalability crisis faced by the neuroscience community. The system design was based on the principles of NoSQL scale-out and data-intensive computing architecture.

This project has currently grown to 80 unique datasets that total more than 200 TB across different imaging modalities. It was deployed on the Data-Scope storage cluster [34] at Johns Hopkins University with MySQL and Cassandra as storage back-ends. We have recently migrated all Open Connectome Project data to a new cloud system, called NeuroData Store or NDStore, and continue to provide storage and analytics services to the neuroscience community.

Neuroscience has varied workloads that differ in scale, I/O bandwidth, and latency requirements. One workload runs parallel computer vision algorithms at scale on high-performance compute clusters. This workload needs to scale, requires high I/O bandwidth, is not latency sensitive, and performs large reads and writes [35]. One exemplar detects 19 million synapses using approximately 14,000 core hours in a 4 trillion pixel image volume [36]. Other workloads related to annotation of imaging data require low I/O bandwidth but are latency sensitive for small writes. Visualization presents a different workload that requires moderate I/O bandwidth and is very latency sensitive with small reads. Visualization platforms such as BigDataViewer [37] and NeuroGlancer [38] generate such workloads.

Our previous NoSQL architecture reached its scaling limit as neuroscience data evolved from terabytes to petabytes. In 2015, the Intelligence Advanced Research Projects Activity (IARPA) announced the Machine Intelligence from Cortical Networks (MICrONS) program [39] to reverse-engineer the algorithms of the brain and revolutionize machine learning. This program will image multiple mouse brains, producing petabytes of data [40].

Relational databases on disk drives have difficulty providing the high IOPS required by this workload and are difficult to scale. Key-value stores, such as Cassandra, are designed to hold millions of key-value pairs and scale well. But key-value stores that integrate solid-state storage and disk drives have high operating costs for petabytes of data. Other considerations are the availability and reliability of data, which are critical. Any data loss is catastrophic, and inaccessible datasets can severely impede scientific discovery.

Unfortunately, the overall experience of the scientific community with respect to the cloud has been mixed. Some open science projects have explored the cloud for a wide spectrum of applications, ranging from understanding Dark Matter [41], creating Montages [8], and tracing DNA [42] to utilizing unused cycles on commercial clouds [43].

Mostly, the negatives have largely outweighed the positives, with performance playing a major role [44]. Another demerit has been poor cost-effectiveness [22]. Several factors have led to this negative experience. Most of these explorations were conducted between 2008 and 2011, when there were far fewer service providers with more basic services. Today, commercial services such as Amazon Web Services (AWS) [45], Microsoft Cloud [46], Google Cloud Platform [47], and IBM Cloud [48] have evolved and offer a mature set of services at competitive prices.

Also, many of these explorations tried to port their existing High Performance Computing (HPC) architectures to the cloud, rather than redesigning the software stack around cloud services.

For example, Thakar [44] used SQL Server to store data and managed SQL Server instances manually. Since then, many managed database services have emerged that would be easier and cheaper.

2.2 Background

We draw our inspiration from hierarchical storage management systems, which we adapt to a cloud environment. Many of the challenges we face parallel those that led to the development of early hierarchical storage systems [49]. These systems were designed to use hard drives in conjunction with other storage media, such as tapes. The multiple levels of storage are managed by the system so that the user sees a single storage address space. They have been widely used for serving videos [50], in file systems [51], and in data archives [52]. The ADSTAR Distributed Storage Manager (ADSM) [52] is one such system used for backup and archive.

This system operated on multiple platforms and provided an illusion of infinite storage to the user. Another example is Coda [53], a file system that ensured resiliency against server and network failures in large distributed systems. Coda descended from the Andrew File System (AFS) and allowed users to cache data locally to ensure constant data availability. Sprite [51] used client-side caching for network file systems to increase I/O performance and reduce network traffic.

NDStore has a different usage pattern when compared with memory caching systems such as RAMCloud [54] and Memcached [55]. These memory caching systems were designed to act as caches over read-only data for applications in which data consistency is not an issue. They cannot ensure that the data will remain consistent when there are updates in the cache. We are not the first to propose the idea of hierarchical storage in the cloud. The idea of using caches over an object store has been used for a network edge cache [22]. However, the motivation behind that design was to alleviate the cost of object storage.

There have been numerous projects in the open science community which have attempted tera- or peta-scale analysis in the cloud. CARMEN [56] is a platform developed on AWS for data sharing and analysis for neuroscientists. The project moves computation close to the data and ensures faster analysis. It also uses scalable compute in the cloud for scaling up its workflows. CARMEN was designed to run data workflows which were not necessarily time sensitive or frequently updated. This system is incompatible with emerging workloads in neuroscience, such as visualization and manual annotation tools, which require low-latency I/O for practical use. Other projects [57] have implemented scientific pipelines on the Azure platform. However, none of their workflows are latency sensitive, and they utilize a single storage tier plus compute.

2.3 Data design

The basic storage structure in our database is a dense multi-dimensional spatial array that is partitioned into rectangular sub-regions [33]. We call these subregions "cuboids" and they are similar to chunks in ArrayStore [58]. Figure 2.4 depicts a sample cuboid structure. Each cuboid is assigned an index using a Morton-order space filling curve. Space filling curves organize data recursively, so that any power-of-two aligned subregion is wholly continuous in the index [59]. They also minimize the number of discontiguous regions needed to retrieve a convex shape in a spatial database [60]. Morton indexes are easy to calculate, and cube addresses are non-decreasing in each dimension so that the index works on sub-spaces [61].

Figure 2.1: The resolution hierarchy dictates the dimensions of the cuboids at each scale, from 128x128x16 at the highest resolution to 64x64x64 at the lowest resolution [33].

We utilized these indexes for these reasons in the Open Connectome Project [33]. We continue to use Morton indexes for our hierarchical data-store because their properties are applicable in the new architecture as well. Although data is stored as cuboids, we do not restrict the services to cuboid aligned regions. Users can read or write arbitrary subregions of data comprising one or more cuboids. We store a multi-resolution hierarchy for each image dataset as depicted in Figure 2.1. Using this hierarchy, visualization and analysis can choose the appropriate scale on which to operate.
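To make the cuboid indexing concrete, the sketch below (our illustration, not code from NDStore) interleaves the bits of a cuboid's (x, y, z) grid coordinates to produce its Morton index; the coordinate width `bits` is an assumed parameter.

```python
def morton_index_3d(x, y, z, bits=21):
    """Interleave the bits of (x, y, z) into a single Morton (Z-order) index.

    Bit i of each coordinate lands at position 3*i (+0 for x, +1 for y, +2 for z),
    so any power-of-two aligned subregion maps to a contiguous index range.
    """
    index = 0
    for i in range(bits):
        index |= ((x >> i) & 1) << (3 * i)
        index |= ((y >> i) & 1) << (3 * i + 1)
        index |= ((z >> i) & 1) << (3 * i + 2)
    return index

# Cuboid (2, 1, 0) in cuboid-grid coordinates:
print(morton_index_3d(2, 1, 0))  # -> 10
```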

We use a larger variant of cuboids to store the data on an object store. Per our experience, AWS S3 prefers data access to be in chunks of 16 MB. Cuboid sizes are generally smaller than this, about 256 KB. We fuse multiple cuboids into a single large cuboid, denoted a super-cuboid, which is typically 4 times as large in each dimension and has 64 times as much data in three dimensions. Figure 2.4 illustrates the difference between cuboids and supercuboids. Data is always accessed from the object store as supercuboids, ensuring an optimized I/O size to S3. When loading data into the cache, we divide a supercuboid into 64 individual cuboids. Smaller cuboid sizes are better for low-latency memory access, visualization, and 2-D projections. The concept of cuboids is akin to pages in virtual memory, and that of supercuboids is similar to block sizes on file systems.

Figure 2.2: Architecture of NDStore in the cloud. A load balancer routes requests to web servers, which sit in front of the cache tier (with its cache manager), the base tier, and the fast write buffer.

2.4 Architecture

The hierarchical architecture of NDStore and the interaction between the different storage tiers are presented in Figure 2.2. The load balancer receives read and write requests as web-service calls from any number of sources: visualization tools, a compute cluster running HPC workflows, or an individual ingesting data. Requests are redirected by the load balancer based on the nature of the request. Read requests for immutable datasets are directed to the hierarchical storage system, fast writes to mutable datasets are directed to a memory-based fast write buffer (see Section 2.4.3), and ingest requests are forwarded to micro-services (see Section 2.5).

A redirecting load balancer allows for flexibility in deployment. Multiple web-servers can be deployed on demand, and the caching tier can span multiple nodes. This allows the system to scale out for a surge in demand during massive HPC runs and scale back when load decreases. We allow user access to both tiers of the storage hierarchy depending on their use case. Non-recurring, latency-independent requests at scale for HPC workflows will be read directly from the base tier. Recurring, latency-sensitive requests for visualization will be read from the caching tier. NDBlaze, a fast write buffer, is deployed as a separate instance under the load balancer for buffering random writes. All requests to data in any tier have to go through the application layer running on the web-server nodes. The application layer contains meta-data, including volumetric bounds, access control, and logic for reading and writing to unaligned regions in storage.

2.4.1 Base tier

The base tier for peta-scale neuroscience data has to be scalable, reliable, and low-cost for long-term storage. I/O latency is not critical because of higher-level caches. We choose S3 as the base tier in our hierarchical storage architecture. S3 is a scalable object storage service with web interfaces. It is durable, available, and secure, ensuring that data is protected against disk failures. Moreover, it is a low-cost option at $0.023 per GB per month in 2018¹. We did consider other storage services offered by AWS, Glacier and EBS, for this tier. Glacier is a data archive and cold storage service. It has a much lower cost when compared to S3 at $0.004 per GB per month. But it is not a viable option because data retrievals can take up to 24 hours. EBS is a persistent block storage volume with low latency and the capability to provision I/O. But storing all our data on EBS is 5 times more expensive at a cost of $0.10 per GB per month (see Section 2.6.4). Moreover, there is a size limit of 16 TB on each volume.

¹ The prices are for the AWS us-east-1 region as of August 2018.

Figure 2.3: An example 5000x5000 voxel section of a neuroscience annotation dataset [62] depicting sparsity.

We prepend a hash to all the object keys to optimize I/O access to S3. S3, unlike a file system, has a flat structure with no hierarchy. S3 splits its storage partitions based on the key-space, in which keys are sorted lexicographically. By prepending a hash to the key for each supercuboid, we ensure that requests for adjacent keys are routed to different partitions. The hash is generated using the supercuboid metadata and imparts sufficient randomness to the key. This allows I/O to be processed and delivered in parallel. Without randomization, I/O would be restricted to a partition, each limited to 100 requests per second.
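A minimal sketch of such key randomization is shown below; the key layout, delimiter, and hash length are our assumptions, not NDStore's actual format.

```python
import hashlib

def supercuboid_key(dataset, resolution, morton_index):
    """Build an S3 key with a short hash prefix (hypothetical layout, for illustration).

    The prefix spreads adjacent supercuboids across S3 key-space partitions, while the
    remainder of the key keeps the object self-describing.
    """
    base = f"{dataset}&{resolution}&{morton_index}"
    prefix = hashlib.md5(base.encode()).hexdigest()[:8]
    return f"{prefix}&{base}"

print(supercuboid_key("kasthuri11", 0, 1048576))
```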


Sparsity of data

Many neuroscience datasets have large regions of space that are either empty or zero-valued, and we use this property to implement storage optimizations. An example of a sparse annotation dataset is shown in Figure 2.3, which depicts a 5000x5000 voxel section of a neuroscience annotation dataset [62]. We choose not to store any supercuboids for blank or zero data. This is helpful in reducing our storage footprint and drives down storage costs. Any absent data is dynamically materialized on access, if the request is within the volumetric bounds of the dataset.

Indexing base contents

To deal with sparsity, we require a way to generate lists of supercuboids that contain data. This is required for data management tasks, such as building scaling levels, data migrations, and deleting sub-sections or entire datasets. Each of these tasks requires a different supercuboid list at a different granularity level. S3 does allow a LIST operation over the bucket, but this is cost and performance prohibitive for millions of objects and is not recommended. We could also query for each object, but every access to S3 incurs a minor cost, and this can be expensive for millions of accesses to non-existing objects.

We construct and maintain an index of the supercuboids in DynamoDB, which is a scalable NoSQL key-value store. It is updated at the time of supercuboid insertion into the S3 bucket. So for every supercuboid that exists in the base tier, there is a corresponding entry in DynamoDB. We include the dataset meta-data and supercuboid Morton index in the S3 object key to make them self-describing. This feature allows us to generate S3 object keys without any external lookup. We use the index only for bulk data operations.

For individual supercuboid reads, we query S3 directly without checking DynamoDB first. Although there is a cost associated with a GET operation for S3, it is small compared to the cost of a DynamoDB access. Moreover, S3 accesses for object misses have tolerable performance.

2.4.2 Caching tier

Figure 2.4: Data layout across different storage tiers.

A caching tier for peta-scale neuroscience workloads needs to have low latency for random I/O and should be scalable to gigabytes of data. This tier need not be low cost per GB of data because it only holds a small fraction of the total data. Also, it can be ephemeral because we commit data to the non-volatile base tier eventually. We choose Redis [63], which is an open-source in-memory data store, as our caching tier. Redis is a good choice among the currently available commercial-grade systems for multiple reasons. First, it is entirely in-memory, which offers us low-latency I/O access for multiple readers and writers. Second, it has a cluster mode of operation, which allows us to scale out our caching tier and not be confined by the physical memory limitations of a single node. Third, Redis supports sorted sets and can perform set operations, such as union, difference, and intersection, which we use for managing cache indexes.

Cuboids are stored in Redis as Blosc-compressed [64] strings. Compressing cuboids reduces their size and, thus, their memory footprint. With reduced size per cuboid, we can fit many more in the caching tier. We adopt a different approach for empty cuboids when compared to empty supercuboids. Empty cuboids are stored as empty strings in Redis. This allows us to identify that a supercuboid was fetched from the base tier but was empty. In this way, we can avoid future S3 and DynamoDB accesses while consuming very little memory.

Distributed locking

We implement a distributed Readers-Writer lock on top of Redis services for data consistency in the caching tier. We write back data to the cache, and any writes to a cuboid need to be finished to memory before they can be read. The lock allows concurrent read accesses and exclusive write access. The Readers-Writer lock extends the native spin lock in Redis, using Redis channels and built-in atomic operations to implement wait queues and distribution. As a Redis service, multiple processes running on different web-server nodes share the lock to realize distributed cache consistency.
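The sketch below outlines one way such a readers-writer lock can be built on Redis atomic operations with the redis-py client; it is a simplified illustration (no TTLs, wait-queue channels, or starvation handling) rather than NDStore's implementation, and the key names are hypothetical.

```python
import time
import uuid
import redis

# Lua scripts make the check-and-acquire steps atomic on the Redis server.
ACQUIRE_READ = """
if redis.call('exists', KEYS[1]) == 0 then
    return redis.call('incr', KEYS[2])
end
return 0
"""

ACQUIRE_WRITE = """
if redis.call('exists', KEYS[1]) == 0 and tonumber(redis.call('get', KEYS[2]) or '0') == 0 then
    redis.call('set', KEYS[1], ARGV[1])
    return 1
end
return 0
"""

class RWLock:
    def __init__(self, client, name):
        self.r = client
        self.wkey = f"{name}:writer"   # holds the writer's token when write-locked
        self.rkey = f"{name}:readers"  # count of active readers

    def acquire_read(self):
        while not self.r.eval(ACQUIRE_READ, 2, self.wkey, self.rkey):
            time.sleep(0.01)  # spin until no writer holds the lock

    def release_read(self):
        self.r.decr(self.rkey)

    def acquire_write(self):
        token = str(uuid.uuid4())
        while not self.r.eval(ACQUIRE_WRITE, 2, self.wkey, self.rkey, token):
            time.sleep(0.01)  # spin until there are no readers and no writer

    def release_write(self):
        self.r.delete(self.wkey)

lock = RWLock(redis.Redis(), "cache:supercuboid:42")
lock.acquire_read()
# ... read cuboids from the cache ...
lock.release_read()
```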


Indexing cache contents

We build a cache index using sorted sets in Redis that record the contents of the cache. It is used to determine what data are missing on read requests and during cache eviction. This cache index store is different from the supercuboid index in DynamoDB that describes all supercuboids in the base tier. Each index entry consists of a project meta-data string followed by a cuboid identifier. With this format, we use the string prefix delete operation in Redis to evict multiple contiguous keys or all keys from a given dataset in a single operation. Redis supports sets and sorted sets, which are unordered collections of non-repeating strings that provide basic set operations, such as union and intersection, in O(log(n)) time given n elements in the set. We use the intersection operation to determine which supercuboids are missing. Similarly, we use the union operation to add supercuboid indexes when cuboids are inserted or updated. We choose sorted sets over sets because in sorted sets each element can be associated with a score; in our case, this is the access time. We use the score to determine the least recently accessed elements in the cache by executing a rank operation on the sorted set to select pages for eviction.
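A minimal sketch of this kind of cache index, using a Redis sorted set scored by access time, is shown below; the key names are illustrative, not NDStore's actual naming.

```python
import time
import redis

r = redis.Redis()
INDEX = "cache_index:dataset1"  # hypothetical index key for one dataset

def touch(supercuboid_ids):
    """Record (or refresh) supercuboids in the index with the current access time."""
    now = time.time()
    r.zadd(INDEX, {scid: now for scid in supercuboid_ids})

def missing(requested_ids):
    """Return the requested supercuboids that are not present in the cache."""
    return [scid for scid in requested_ids if r.zscore(INDEX, scid) is None]

def lru_candidates(count):
    """Return the least recently accessed supercuboids (lowest scores first)."""
    return r.zrange(INDEX, 0, count - 1)

touch(["sc:1048576", "sc:1048577"])
print(missing(["sc:1048576", "sc:9999999"]))  # -> ['sc:9999999']
```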

Cache manager

The contents of the in-memory cache are managed based on loading and evicting supercuboids, rather than individual cuboids. This decision keeps the software simple and reclaims more space with fewer evictions. If a cuboid is evicted from the cache, then all other cuboids within its supercuboid are also evicted. The logic here is that spatial regions tend to be accessed together (for reads) and that when you evict a single cuboid, you are going to have to read the entire supercuboid when you take a miss for that evicted cuboid.


Small random write accesses motivate caching cuboids at a finer granularity than the S3 supercuboid. A prevalent workload pattern in neuroscience has computer vision pipelines that read large contiguous regions of space in an image dataset. They then detect features or structures in that data that are written out as annotations to a co-registered spatial database. There are substantial performance benefits from reading and writing smaller cuboids (256 KB) to memory rather than larger supercuboids (16 MB).

Although Redis provides a built-in LRU eviction policy, we implement a custom cache manager to enforce dependencies among all cuboids in a supercuboid. The Redis manager operates on individual keys and would evict individual cuboids. Evicting individual keys results in read-modify-write when a cuboid is dirty and other cuboids in the supercuboid have already been evicted. Our policy of evicting and loading all cuboids in a supercuboid together avoids read-modify-write to S3. Our cache manager daemon runs in the background and periodically queries Redis for its memory usage. When the Redis memory usage exceeds an upper bound, we perform cache eviction in four steps. First, we determine the least recently used supercuboids from the cache index stored as sorted sets. Second, we lock the cache to ensure that it remains consistent and no more data is added. Third, we call a delete operation on all the cuboids associated with those supercuboids. Fourth, the cache is unlocked to resume read and write operations. We continue to loop over these four steps until the memory usage of the cache drops below our lower bound. Because this process competes with other I/O, we ensure that a cache eviction does not starve other I/O operations.
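The eviction loop might look roughly like the sketch below; the watermarks, key naming, and batch size are assumptions, and the real daemon also coordinates with the distributed lock and the supercuboid-to-cuboid mapping.

```python
import redis

r = redis.Redis()
INDEX = "cache_index:dataset1"           # sorted set scored by access time (illustrative)
UPPER_BOUND = 80 * 2**30                 # assumed 80 GB upper watermark
LOWER_BOUND = 60 * 2**30                 # assumed 60 GB lower watermark

def cuboid_keys(supercuboid_id):
    # Hypothetical naming: 64 cuboid keys per supercuboid.
    return [f"{supercuboid_id}:cuboid:{i}" for i in range(64)]

def evict_until_below_lower_bound():
    while r.info("memory")["used_memory"] > LOWER_BOUND:
        victims = r.zrange(INDEX, 0, 9)            # 10 least recently used supercuboids
        if not victims:
            break
        for scid in victims:
            r.delete(*cuboid_keys(scid.decode()))  # drop all cuboids of the supercuboid
        r.zrem(INDEX, *victims)                    # and their index entries

if r.info("memory")["used_memory"] > UPPER_BOUND:
    evict_until_below_lower_bound()
```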

Cache operation


Figure 2.5: Operation of the hierarchical storage model for a read request.

We manage the multiple tiers of storage for the user and present a single transparent layer for access during a read request. Figure 2.5 depicts the operation of the cache for a read request in 12 steps. (i) The cuboids for that region are identified using Morton indexes and are mapped to their respective supercuboid indexes. (ii) We acquire a read lock on the cache for the identified supercuboid indexes and (iii) use the cache index store to identify the missing supercuboids. (iv) The missing supercuboids are fetched from the base tier and broken into cuboids. (v) We acquire a write lock over the cache for the missing supercuboid indexes to ensure there is no inconsistency. (vi) Cuboids are inserted in the caching tier and (vii) the respective supercuboid indexes are added to the cache index store. (viii) At this point, we update the access times on the cache indexes for those supercuboids that were requested but not fetched. This ensures that we can correctly evict the least recently used supercuboids. (ix) The write lock is released. (x) We read all the requested cuboids from the caching tier and (xi) release the read lock. (xii) The cuboids are organized into the volume requested and returned to the user.

Similarly, when a sub-region of data is written to the system, the multiple tiers of storage are transparent to the user. We identify the respective cuboids using their Morton indexes. We acquire a writer lock over the cache for the respective cuboids. These cuboids are merged with any cuboids present in the caching tier and then written back to the cache. The cuboid indexes are mapped to their respective supercuboid indexes and updated in the cache index store. For cuboid indexes already present in the cache index store, the access times are updated. Finally, we release the write lock and return a success to the user.

The system also allows a direct I/O mode to read and write supercuboids to the base tier. This mode is used when ingesting data through web services and performing sequential reads. A direct write does not overwrite the object present in S3; rather, it is merged with the existing object through a read-modify-write process. In this mode, the cache index lookup, readers-writer lock, and cache index store are all bypassed.

2.4.3 Buffer for random writes

We develop NDBlaze, a memory-based fast write buffer, to accelerate bursts of random writes to spatial data. NDBlaze improves upon the performance of writing to the in-memory caching tier by storing written data to memory directly without de-serialization. It then asynchronously de-serializes data, merges multiple writes to the same spatial region, aligns the written data to supercuboids, and writes supercuboids to the base tier. NDBlaze inherits the principle of asynchronous merging from amortized-write data structures, such as the log-structured merge tree [65]. Neuroscience workloads for machine and manual annotation of data generate many small random writes, e.g., writing small patches of sparse data as seen in Figure 2.3. These workloads degrade I/O performance and decrease utilization of processors that have to wait for I/O. NDBlaze improves user-perceived write throughput many-fold. It is implemented with Redis and Spark.

Figure 2.6: The different levels in NDBlaze. Writes arrive in random order at the first level. They are sorted and merged at the second level to match the layout on disk. In the third level, they are written out to disk, avoiding read-modify-write.

NDBlaze minimizes write latency and maximizes the peak throughput of write bursts. NDBlaze sits between the load balancer and the base tier. It is independent of the caching tier and utilizes its own Redis instance. This ensures that in case of a burst of random writes we do not overflow our cache and cause disruption to latency-sensitive reads. We also choose not to write data from NDBlaze to the caching tier so that the asynchronous merging and write-back process reduces memory pressure. We make two key modifications for write optimization. First, we do not break the data blobs into cuboids before inserting them into memory. Instead, we place written data into memory directly, saving time on de-serialization of data. Corresponding timestamps for each write are baked into the key of the blob to ensure correct ordering. We do build a secondary index, using sets in Redis, to maintain a mapping between data blobs and Morton indexes. Second, there is no cache locking for writes in NDBlaze; it is unnecessary. We order the writes based on their timestamps, and the secondary index determines which writes need to be merged into each supercuboid based on Morton index. We use Spark to merge these writes. NDBlaze does allow consistent reads over the buffered data, albeit at slower rates than the caching tier, because multiple writes may need to be merged to serve a read request.
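A minimal sketch of this write-buffering scheme is shown below; the key format and index naming are our assumptions, and the asynchronous Spark merge is only described in a comment.

```python
import time
import redis

# NDBlaze-style buffering (illustrative only): the raw blob is stored under a
# timestamped key, and a per-supercuboid set records which blobs touch it.
r = redis.Redis()

def buffer_write(dataset, supercuboid_morton, blob_bytes):
    ts = time.time_ns()                                   # bake the timestamp into the key
    blob_key = f"{dataset}:blaze:{supercuboid_morton}:{ts}"
    r.set(blob_key, blob_bytes)                           # no de-serialization on the write path
    r.sadd(f"{dataset}:blaze-index:{supercuboid_morton}", blob_key)
    return blob_key

# An asynchronous merger would later read each index set, sort blob keys by timestamp,
# merge the blobs into one supercuboid, and write it back to S3.
buffer_write("kasthuri11", 1048576, b"\x00" * 1024)
```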

2.5 Microservice processing

Several micro-services offered by AWS are used to overcome workload bottlenecks. We utilize AWS Lambda, which is an event-driven serverless computing service. It runs workflows in parallel without provisioning servers and scales applications based on the number of triggers that data requests generate. Lambda's pay-as-you-go model offers a cost benefit as well, because you pay for the number of triggered events at a millisecond granularity. We also use the AWS Simple Queue Service (SQS), a reliable and scalable message queuing service, to coordinate workflows across multiple servers and services in the cloud. SQS ensures that enqueued workloads are processed reliably.

Data ingest

Modern microscopes can generate several terabytes of data in an hour [66], and the storage buffers located at data collection points are not managed storage and are not large enough to store data over long periods of time. It is essential to move this data to a remote, reliable data store quickly. It is a challenge to deposit this data in the cloud, because we are limited by network and I/O speeds. We use AWS Lambda to help overcome these challenges. Figure 2.7 depicts a sample data ingest workflow, divided into three phases: tile collection, cuboid generation, and tile cleanup.

Figure 2.7: Different phases of the parallel data ingest service.

The tile collection phase transfers microscope data (image tiles) from the point of collection to the cloud. Initially, the data collection site populates the tile upload queue with a manifest of all tile names to be uploaded. Then, multiple processes at the data collection site use this task queue to select and upload tiles to an S3 bucket in parallel. The task queue ensures that after a client failure the upload process can be resumed. Every upload of a tile to the tile bucket triggers a Lambda job. This Lambda job updates the received tile in a tile index DynamoDB table and removes the task from the upload queue. We maintain the tile index table to ensure that all tiles for a supercuboid are uploaded before ingest. When a Lambda job confirms that the bucket holds all tiles necessary for the supercuboid, it inserts a supercuboid generation task in the supercuboid queue and triggers another Lambda job for ingest.

The supercuboid generation phase converts the transferred tiles into supercuboids. The Lambda job reads all the relevant tiles from the tile bucket, packs them into a supercuboid, inserts the packed data into the supercuboid bucket, and updates the DynamoDB supercuboid index table. Then, this Lambda job inserts a cleanup job in the cleanup queue, triggers a cleanup Lambda job, and removes the supercuboid generation task from the cuboid queue.
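A hypothetical sketch of the supercuboid-generation Lambda handler is shown below; the bucket, table, and queue names and the event shape are assumptions, and tile packing is reduced to a placeholder.

```python
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
dynamodb = boto3.client("dynamodb")

def handler(event, context):
    task = event["task"]                       # assumed message shape
    supercuboid_key = task["supercuboid_key"]

    # 1. Read the relevant tiles and pack them into one supercuboid blob.
    tiles = [s3.get_object(Bucket="tile-bucket", Key=k)["Body"].read()
             for k in task["tile_keys"]]
    supercuboid = b"".join(tiles)              # placeholder for real packing/compression

    # 2. Insert the packed data into the supercuboid bucket and index it.
    s3.put_object(Bucket="supercuboid-bucket", Key=supercuboid_key, Body=supercuboid)
    dynamodb.put_item(TableName="supercuboid-index",
                      Item={"supercuboid_key": {"S": supercuboid_key}})

    # 3. Enqueue cleanup and remove this task from the supercuboid queue.
    sqs.send_message(QueueUrl=task["cleanup_queue_url"], MessageBody=supercuboid_key)
    sqs.delete_message(QueueUrl=task["cuboid_queue_url"],
                       ReceiptHandle=task["receipt_handle"])
```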

The tile cleanup phase deletes tiles from S3 for ingested data to reclaim storage. The Lambda cleanup job removes all the tiles from the tile bucket, removes the tile indexes from the tile index table, and removes the cleanup task from the cleanup task queue.

The multiple, staggered phases of ingest ensure reliability and maximize parallelism with low resource consumption and fine-grained control. We prefer three phases to one monolithic Lambda job. Memory consumption in phases 1 and 3 is low, whereas phase 2 uses lots of memory. Different Lambda job settings are employed for each phase in order to fine-tune memory consumption and drive down costs. We choose to dequeue tasks in each phase only when we have inserted another task in the next phase. This ensures that tasks are not lost in case of Lambda job failures. Also, each phase has different amounts or granularities of parallelism. Tile upload performs a Lambda for each tile, whereas phases 2 and 3 perform a Lambda for each supercuboid. In this way, tiles can be uploaded in parallel, independently of their order in the supercuboid. This also gives a user the desired control over each phase, and more Lambda jobs can be allocated for a particular phase. A user may prioritize tile collection and delay the other phases to reduce storage at the microscope. Similarly, one can delay tile cleanup if supercuboid generation is a priority.

Building scaling levels

We use DynamoDB indexing and Lambdas to generate scaling levels over the stored data. For large neuroscience data, it is customary to store data in a resolution hierarchy, with each level downscaling images by a factor of 2 in all dimensions for isotropic data or just the X and Y dimensions for anisotropic data, e.g., serial section samples. This practice allows map-style visualization tools to zoom in quickly, loading the minimum amount of data for the screen resolution [38]. Every four (anisotropic) or eight (isotropic) supercuboids at any given level create a single supercuboid at a lower level. Using the DynamoDB supercuboid index, we build a manifest of all supercuboids to be generated. For each entry in the manifest, we insert a task in an SQS propagate queue and create a Lambda job to build that cube. The SQS queue allows us to detect and rerun failed Lambdas. Lambda jobs fetch four or eight supercuboids, generate data at the lower scaling level, and insert it into S3.

2.6 Experiments

The principal performance measure for NDStore is I/O throughput. We conduct throughput experiments against several different deployments: one, two, and four web-server nodes with a single cache tier node running Redis. (We found that Redis provides higher throughput for a single node than in any cluster.) These experiments provide a view of the scale-up capability of NDStore. All experiments are conducted with supercuboids of 16 MB. Each thread reads a different subregion in the volume so that we do not read the same region twice and try to have all S3 reads go to disk. Cuboids and supercuboids are compressed for I/O across both tiers. Neuroscience imaging data in general has high entropy and tends to compress between 10-20% with the Blosc compression used in these experiments.

A load balancer is deployed in front of our web-server nodes. We use compute-optimized instances for the web-servers, general-purpose instances for the Spark cluster, and memory-optimized instances for the Redis deployments. Each web-server node is a single c4.4xlarge instance with 16 vCPUs and 30 GB of memory. The cache tier is a single r4.4xlarge with 16 vCPUs and 100 GB of memory. NDBlaze is a single r4.4xlarge instance with 16 vCPUs and 100 GB of memory. The Spark cluster consists of a single master and four slave m4.4xlarge instances, each with 16 vCPUs and 64 GB of memory. We run a total of 16 Spark workers, each with a single executor: 4 vCPUs and 16 GB of memory. We set the PySpark serializer to Kryo, and the serialization buffer is allocated 1 GB. All instances have a default single root drive of 8 GB on EBS. The Lambda functions are deployed with a memory of 128 MB, the minimum possible, and a maximum possible running time of 100 seconds. AWS limits concurrent Lambda job executions to 1000 per region, but this limit can be increased on request. We use the default limit of 1000 concurrent job executions for these experiments.

2.6.1 Read throughput

NDStore achieves read throughput of 600 MB/sec for cached data from Redis and 450 MB/sec from S3. Figure 2.8 shows the read throughput directly from S3 compared to that from a cold and a hot Redis cache as a function of the number of threads. In the case of a hot cache, all of the requested data is already present in the caching tier. For cold cache experiments, the data has to be fetched from the base tier. The cold cache has the slowest performance owing to additional processing. The data are read from S3, decompressed, broken up into cuboids, re-compressed, and inserted into Redis. The cold cache peak throughput is about 225 MB/sec with 128 threads and is CPU bound. We see increases in throughput until we reach the vCPU count of the web servers, 16 per server, at which point performance flattens.

Figure 2.8: Read throughput with each thread reading 16 MB of data. The numbers 1, 2, 4 denote the number of web-servers used during benchmarks.

These experiments show that it is better to read directly from the base tier for non-recurring sequential reads. We achieve good I/O parallelism; the S3 object store was designed for this workload. Serving recurring reads from the cache achieves maximum performance. NDStore supports high read throughput in the cloud for a variety of workloads, and knowledge of the data access pattern is important for readers to use the right web service, direct I/O to S3 or cached I/O.

Figure 2.9: Write throughput with each thread writing 16 MB of data. The numbers 1, 2, 4 denote the number of web-servers used during benchmarks.


Figure 2.10: Accelerated write throughput to NDBlaze using a single Redis node deployment.

2.6.2 Write throughput

Using NDBlaze, we can accept bursts of random writes that exceed the write throughput of S3. We perform two sets of experiments for write throughput. First, we test the write throughput to both tiers of our hierarchical storage system. Figure 2.9 shows the write throughput directly to S3 compared with that to a Redis cache as a function of the number of threads. A peak write throughput of about 400 MB/sec can be achieved for the caching tier with 512 threads. We achieve 300 MB/sec directly to S3. Second, we test the write throughput for NDBlaze, our fast write buffer. Figure 2.10 represents the write throughput for NDBlaze as a function of data blob size. NDBlaze achieves a peak burst throughput of about 30 GB/sec for 512 MB requests with 128 threads. NDBlaze is an order of magnitude faster for write bursts when compared with the caching tier because of its write optimizations and avoidance of serialization. The time-integrated performance of NDBlaze (and of the caching tier) will eventually decrease to S3 performance, because all data must eventually make it to disk.

2.6.3 Lambda throughput

The cuboid generation phase of data ingest jobs achieves a write throughput of 1GB/sec, which exceeds the data generation rates of state-of-the-art high-throughput microscopes.

Cuboid generation includes I/O to and from S3 and Lambda processing. It does not include data transfer from the microscope to S3. We choose not to measure end-to-end performance for data ingest, because network performance from neuroscience labs into the cloud tends to be quite slow and variable. Figure 2.11 shows the Lambda write throughput for the cuboid generation phase as a function of the number of Lambda jobs. In this experiment, each Lambda job reads 64 512x512 PNG tiles of 256 KB each from S3. Tiles are combined to form a supercuboid of 16 MB, which is written back to S3. The write throughput scales well for about 1000 concurrent Lambda jobs and then tapers off slightly. By default, AWS limits an account to 1000 concurrent Lambda jobs, queuing additional requests, but we still see throughput increases beyond 1000 jobs, with a peak throughput of 1 GB/sec at 4096 concurrent Lambdas.

Having outstanding Lambda requests ensures that queues are full in the presence of skew.
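As a rough illustration of a single cuboid-generation job, the sketch below shows how a Lambda handler might fetch tiles from S3, assemble a supercuboid, and write it back using boto3. The event fields, bucket and key names, and the use of NumPy and Pillow are assumptions for the example, not the production ingest code.

    import io
    import boto3
    import numpy as np
    from PIL import Image

    s3 = boto3.client("s3")

    def handler(event, context):
        """Assemble 64 512x512 PNG tiles (256 KB each) into a 16 MB supercuboid."""
        bucket = event["bucket"]            # illustrative event fields
        tile_keys = event["tile_keys"]      # 64 object keys, one per tile
        out_key = event["supercuboid_key"]

        supercuboid = np.zeros((64, 512, 512), dtype=np.uint8)
        for z, key in enumerate(tile_keys):
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            supercuboid[z] = np.asarray(Image.open(io.BytesIO(body)))

        # Write the assembled 16 MB supercuboid back to S3 (compression omitted).
        s3.put_object(Bucket=bucket, Key=out_key, Body=supercuboid.tobytes())
        return {"written": out_key}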

The low cost of Lambda, $0.000000208 per 100 ms for 128 MB of memory, ensures that we ingest data at a fraction of the storage cost.

Figure 2.11: Lambda memory throughput for the cuboid generation phase.

2.6.4 Storage cost analysis

The hierarchical storage model implemented in NDStore is 5 to 11 times cheaper per MB/sec of I/O when compared with EBS. We do a sample analysis on storing 1 TB of data based on AWS us-east-1 region prices (as of August 2018), which is where our production system is deployed.

A monthly storage cost for 1 TB of EBS is $125 at a price of $0.125 per GB per month. The corresponding cost for S3 is $23, at a price of $0.023 per GB per month, ≈ 5 times lower than EBS. For EBS, we provision 20,000 IOPS, the maximum possible per volume, so that we attain a maximum raw throughput of 320 MB/sec. This costs $1300 per month at a price of $0.065 per provisioned IOPS per month. The read access cost for S3 GETs is $4 a month, assuming ten million reads, at a price of $0.0004 per 1,000 requests. The write access cost for S3 PUTs is $50 a month, assuming the same number of writes as reads, at a price of $0.005 per 1,000 requests. For the S3 approach, we incur additional costs on DynamoDB to maintain supercuboid indexes; our DynamoDB cost comes to about $25.60 per month for a provisioned capacity of 50 read and 50 write units. Our hierarchical storage system costs are about 13 times lower, at $102.60 per month, compared to EBS at $1425 per month. Our system, assuming a single-node deployment with a co-located cache, can achieve a maximum read throughput of about 125 MB/sec for cold cache reads and a maximum write throughput of about 250 MB/sec to cache. Our read I/O cost is $0.82 per MB/sec and our write I/O cost is $0.41 per MB/sec; the comparative read and write I/O cost for EBS is $4.45 per MB/sec.

Table 2.1: Cost analysis of EBS vs. our model for 1 TB of data. Costs for read and write are indicated by R and W respectively.

                               EBS            S3
    Storage cost per 1 TB      $125           $23
    Operating cost             $1300          $79.60
    Total cost                 $1425          $102.60
    Cost of 1 MB/sec I/O       $4.45 (R,W)    $0.82 (R), $0.41 (W)
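The arithmetic behind Table 2.1 can be reproduced in a few lines. The sketch below recomputes the monthly totals and the cost per MB/sec of throughput from the prices quoted above (treating 1 TB as 1,000 GB, as the quoted $125 and $23 figures imply).

    # Recompute the Table 2.1 figures from the quoted prices (August 2018, us-east-1).
    GB_PER_TB = 1000

    # EBS: storage plus 20,000 provisioned IOPS at $0.065 per IOPS-month.
    ebs_storage = 0.125 * GB_PER_TB           # $125
    ebs_operating = 0.065 * 20000             # $1300
    ebs_total = ebs_storage + ebs_operating   # $1425
    ebs_cost_per_mbps = ebs_total / 320       # ~$4.45 per MB/sec (320 MB/sec raw)

    # S3: storage, GET and PUT requests, plus DynamoDB for supercuboid indexes.
    s3_storage = 0.023 * GB_PER_TB            # $23
    s3_gets = 10_000_000 / 1000 * 0.0004      # $4 for ten million reads
    s3_puts = 10_000_000 / 1000 * 0.005       # $50 for ten million writes
    dynamodb = 25.60
    s3_total = s3_storage + s3_gets + s3_puts + dynamodb   # $102.60
    read_cost_per_mbps = s3_total / 125       # ~$0.82 per MB/sec (cold cache reads)
    write_cost_per_mbps = s3_total / 250      # ~$0.41 per MB/sec (writes to cache)

    print(ebs_total, ebs_cost_per_mbps, s3_total, read_cost_per_mbps, write_cost_per_mbps)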

Our analysis omits compute costs on EC2. These are similar for both deployments and the amount spent on EC2 varies widely depending upon the chosen instance type. We also omit costs for NDBlaze.

2.7 Experience

We have developed a hierarchical storage system in the cloud, NDStore, to economically provide high I/O throughput for different neuroscience workloads. Furthermore, we utilize multiple cloud micro-services to build new services and redesign current workflows.

This system is currently deployed for our project called NeuroData (http://neurodata.io) and the Open Connectome Project (http://openconnecto.me). All software described in this chapter is open-source and available for collaborative development and reuse at our GitHub repositories (NDStore: https://github.com/neurodata/ndstore, NDBlaze: https://github.com/neurodata/ndblaze, Readers-Writer Lock: https://github.com/kunallillaney/blaze-lock). Limitations on local cluster deployments and the cost of a cloud deployment had been two major factors limiting our ability to process petabytes of neuroscience data. The deployment of NDStore overcomes these shortcomings and has transformed our ability to support large-scale neuroscience.

Chapter 3

Designing dual-access file systems over object storage

In this Chapter, we present a class of applications that could potentially benefit from a dual-access storage system in Section 3.1. We then identify the requisites for designing such a system so that it has a high probability of adoption in Section 3.2. We conduct an exhaustive survey of existing dual-access storage systems in Section 3.3, summarized in Table 3.1, and highlight that none of them fulfills the desired feature set. Further, we explore the existing design trade-offs in Section 3.4 and examine the corresponding performance impacts in Section 3.5. In the following text we refer to an implementation of an abstract file system working on top of an object store as ObjectFS.


3.1 Use cases

Based on our interactions with real customers, we present two motivating use cases for a dual-access storage system. These use cases help us understand and describe the essential design requirements for such a system. We do not limit the applicability of dual-access storage to just the applications discussed below; on the contrary, we believe that these examples represent a growing class of applications which can greatly benefit from dual access.

Media transcoding, editing and analytics

The vast majority of media-based workloads are characterized by the generation of massive numbers of media files, some form of data processing on those files, and finally movement to an archive for cold storage or dissemination to end users. Data processing can be in the form of video editing, trimming, inserting cinematic effects, overlaying different audio channels, and transcoding into different media formats. Most of the software involved in the processing stage relies on file interfaces. At the same time, cost effectiveness, near-limitless capacity, and rich embedded metadata drive the media and entertainment industry toward the large-scale adoption of object storage [67]. Although some cloud providers do provide transcoding software with object store connectors packaged into proprietary solutions, these only support a small fraction of media processing tasks [68–70]. Moreover, data sovereignty is a serious concern for media producers, who prefer to run their own solutions and rely on cloud providers only for storage [67].


A simple example workflow entails the generation of raw video footage at film studios, which is moved directly into object storage using object interfaces. That raw footage is then edited and transcoded into different media formats and resolutions to support compatibility with different platforms. When processing is complete, both the original raw videos and the different transcoded formats are moved back into object storage. Data intended for user consumption is delivered directly from object storage using CDNs.

Gene alignment

Object storage can play an important role in alleviating some of the data management issues around processing for human genomic analysis. Currently, scientists are struggling with massive data sets, petabytes and growing, in the domain of human genomics [71].

The challenge here is to effectively store large amounts of data, conduct genome analysis at scale and eventually share this data among the research community. Object storage is perfectly suited to solve these issues and to serve as the central data repository for all genomic data. However, most of the current genome analysis pipelines rely on file system capabilities for analysis [20, 21, 72–74]. A typical example of a gene sequencing pipeline involves generation of human genome sequences at laboratories. This data is then moved to a central data archive for durability and dissemination. The data is downloaded over the network to a local file system or network-attached device for gene sequence alignment.

Different researchers run their own versions of sequence aligners over the unaligned data with tools such as Bowtie2 [73]. Sequence-aligned data is re-uploaded to the central archive (based, e.g., on object storage) and shared with other researchers for analysis. Scientists re-download the data and analyze it using tools such as SAMtools [75] and BCFtools [76].


Common traits

Use cases in other domains such as neuroscience [77–79], geoinformatics [80, 81], machine learning [82], industrial process control [83], and computer vision [84,85] can also potentially utilize a dual-access storage system. These use cases are united by discernible traits: (i) migration of workloads to cloud environments, (ii) a desire to utilize scalable cloud storage services, (iii) a continued dependence on file-system-based applications, and (iv) a resulting need to access data through both interfaces.

3.2 Design requirements

We distill the design requirements for a system which would be effective for the discussed use cases. These are based on our analysis of the use cases, interactions with domain experts, and prior experience in developing storage systems. We present them in descending order of priority.

Object based

The system should employ object storage to store its data. This allows the system to realize the benefits of scalable, low-cost object stores. While it is technically easier to provide object interfaces over file systems, that design choice does not inherit the desirable properties of object stores. An object storage-based system can also benefit from the improved performance that comes from relaxing consistency guarantees. Furthermore, users and enterprises are already entrenched in object storage, and thus an object-based system would require minimal changes to either the vendor or the storage technology.


Intuitive dual access

To address the storage sprawl issue, the system should provide intuitive dual access to the data. We stress being intuitive because, from our perspective, this is a vital property for high usability and wide adoption. We define object interface intuitiveness as the user's ability to directly address data through a GET/PUT interface without going through an intermediary layer or having to look up the object mapping. The most intuitive approach to dual access is to adopt a 1⇒1 mapping. For example, the content of a file should be fully stored in and accessible through the object with the same name. We do not consider a system which splits a file into multiple objects with either a 1⇒N or N⇒N mapping as one that provides intuitive dual access. In these systems, the user has to go through an intermediary layer or first fetch the object layout before accessing the object itself, which requires application or infrastructure changes.

Generic and backend-agnostic

The system should be deployable over standard APIs, which is important for high flexibility, avoiding vendor lock-in, and the ability to use object stores that are already deployed in customer environments. We assume a limited, yet representative, set of operations:

(i) PUT(name, data) adds a named object;

(ii) GET(name, range) fetches a specific range from the object or the entire object¹;

(iii) DEL(name) deletes an object by name;

(iv) MULTI-PART DOWNLOAD(name) retrieves an object in parallel²;

(v) MULTI-PART UPLOAD(name, data) adds a named object in parallel.

¹A ranged GET is also referred to as a partial GET.
²Large objects can be broken into chunks and downloaded/uploaded in parallel. A chunk is referred to as a PART.

Based on our experience, these are a common subset of the object interfaces available across all popular object stores.
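To make the assumed interface concrete, the following is a minimal Python wrapper over these five operations. It uses boto3 against an S3-compatible backend purely for illustration; the class and method names are ours, and equivalent calls exist in the SDKs of other object stores.

    import boto3

    class ObjectStore:
        """Minimal wrapper over the five generic object operations assumed above."""

        def __init__(self, bucket):
            self.s3 = boto3.client("s3")
            self.bucket = bucket

        def put(self, name, data):
            self.s3.put_object(Bucket=self.bucket, Key=name, Body=data)

        def get(self, name, byte_range=None):
            kwargs = {"Bucket": self.bucket, "Key": name}
            if byte_range is not None:                     # ranged (partial) GET
                start, end = byte_range
                kwargs["Range"] = f"bytes={start}-{end}"
            return self.s3.get_object(**kwargs)["Body"].read()

        def delete(self, name):
            self.s3.delete_object(Bucket=self.bucket, Key=name)

        def multipart_download(self, name, filename):
            # boto3's managed transfer splits large objects into parallel parts.
            self.s3.download_file(self.bucket, name, filename)

        def multipart_upload(self, name, filename):
            self.s3.upload_file(filename, self.bucket, name)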

Efficient file interface

In-place updates and metadata operations form an integral part of numerous application workloads [77,86]. For example, media workloads routinely perform in-place updates during video editing, and gene sequencing workloads may create, delete, or rename multiple files during the course of a single workflow. Such workloads cannot benefit from a system where specific file interfaces are either not supported or are too slow. Given the disparities between object and file interfaces, providing both dual access and good performance is challenging. In fact, the goal of dual access is often in direct conflict with that of achieving good performance. Existing object storage file systems support either intuitive dual access or good performance for specific file interfaces, but not both. For instance, a simple 1⇒1 mapping that maintains a file in a single object offers the most intuitive dual access from the user's perspective. Many existing dual-access systems implement this mapping [29–31,87–90]. However, the 1⇒1 mapping can drastically reduce the performance of partial file writes and file renames. This is because even a small modification to a file or a rename of the file requires a GET and a PUT of the entire object. These systems either do not support partial writes or incur significant performance overhead. Other systems sacrifice intuitive dual access and adopt 1⇒N, N⇒1, or N⇒N object-to-file mapping policies [32,91–93].


Coherent namespaces

Data is accessed through both interfaces and the system should support a uniform namespace to enable intuitive access. Object storage features a flat namespace with key-value style semantics. A system could use object names with embedded path components to mimic a directory structure for the user's convenience, considering '/' in the object names as a delimiter. But this can lead to various incoherencies in the file system namespace during creation and deletion of objects. For example, assume that there exists a directory named /A/ in the file system namespace; it would lead to an inconsistency if the user were to PUT an object named A in the object namespace. Given the semantics, it would be naive to directly map the object interfaces PUT and DEL to the file interfaces creat and remove. These inconsistencies only get compounded when we start to add support for directories, hard links, and symlinks. Moreover, this could lead to data either becoming unreachable or being accidentally overwritten, which is unacceptable. The current set of systems rely on the user's cognizance to prevent any inconsistency, and do not take this issue into consideration.

Eventual data access consistency

The system should ideally ensure that data access is consistent across both the object and file interfaces. But a closer examination of the use cases indicates that though data is accessed through both interfaces, this access is rarely concurrent. In our view, this property should be exploited for performance improvement when designing the system. We further believe that eventual consistency is a suitable model, especially given that object storage users and applications already expect it.


Distributed

Object storage is inherently scalable and only a distributed file system can fully utilize this characteristic. The massive datasets of the presented use cases also indicate that a solution would need to scale out beyond the compute power of a single node. One approach here is to implement a full-fledged distributed file system which performs byte-range locking. However, on closer examination of the use cases, the datasets would need distributed access, but not necessarily to the same data. A potential file system would probably be able to work with weak cache consistency, since individual nodes would likely work on distinct parts of the data. This belief is further reinforced by the absence of any distributed object storage file systems. Some cloud providers even offer NFS over object storage [94,95]. We believe that a distributed file system with NFSv3-like semantics, but not weighed down by the scalability limitations of NFS, would provide a good balance between consistency and performance.

Unified access control

The differences between file systems and object storage extend to access control methods. Access control in object storage is more detailed than the Access Control Lists (ACLs) available on standard file systems, and the two tend to be incongruent. Current OSFSs do not provide unified access control between files and objects; e.g., the permissions set for an object are not reflected in its file equivalent (or vice-versa). However, enterprises desire access control to be unified across both interfaces. Moreover, there is a vast diversity in the nature of access control supported by different object storage vendors, which presents a dilemma for a backend-agnostic system. For example, some object stores support object-level permissions [11–13] while others only offer permissions at the container level [14, 96]. One system can even limit access to objects based on the IP address of the user [97].

3.3 Existing systems

Table 3.1: Comparison of existing object storage based file systems along six dimensions: dual access, genericness, file-system interface support and efficiency, namespace coherence, distributed access, and access control. The systems compared are Agni; S3FS [29], Blobfuse [88], Gcsfuse [89]; Goofys [30], SVFS [90], Riofs [87]; YAS3FS [31]; S3QL [91]; OFS [92], Cumulus [93], BlueSky [32]; CephFS [99]; ProxyFS [101]; OIO-FS [100]; MarFS [28], SCFS [98]; and Elastifile [24], CloudFS [25], FSX [23]. File-interface support ranges from no support for partial writes, renames, and metadata updates, through incomplete support, to full support with optimal file and metadata updates.
a Currently supports a single object store, but uses only generic object interfaces. We consider it to be generic since, in theory, it could be ported to other object stores.
b Currently supports NFSv3 semantics. Consistent cache coherency can be achieved with the use of distributed locking mechanisms, which are currently absent. This is distinct from running NFS or equivalent services over existing object storage.
c This is being actively discussed and future versions are expected to support it.


We surveyed existing systems and classified them into two broad categories, described in Section 3.3.1 and Section 3.3.2, based on the approaches they adopt. We limit our comparison to file systems that have adequate information in the public domain. The results of the survey are summarized in Table 3.1, illustrating that no existing system meets all of the requirements.

3.3.1 File systems paired with object storage

In this approach, distributed file systems are paired with object storage. These systems are essentially hierarchical storage management systems, except that they are based on block (NVMe or HDD) and object storage instead of HDD and tape [50, 102]. A distributed file system over block-based storage is deployed in front of object storage, and data is either transparently or manually migrated between them. This approach offers the full spectrum of file interfaces but tends to become cumbersome for massive datasets and leads to storage sprawl with the constant transfer of data between different storage systems. These systems are indicated in Table 3.1.

3.3.2 Object storage file systems

These systems, acronymed OSFSs, operate over existing object storage systems, not as an integral part of them. We limit comparisons to object storage-based file systems, which are not the same as file systems designed to use object-based storage devices (OSDs), such as Lustre [103]. Similarly, a system such as SwiftOnFile [104] (formerly called Gluster-Swift [105]) does not qualify because it is an object store deployed over a distributed file system and has several limitations in terms of dual access and genericness. We further subdivide this approach into three types, based on their support for dual access and efficient file interfaces: inefficient file system interface, no dual access, and best of both worlds.

Inefficient file system interface

These systems offer intuitive dual access and convenience but only support simple or inefficient file interfaces. They are more object-store-like, with rudimentary file-system-like properties. NFS gateways over object storage are typically such OSFSs deployed on a single node and mounted over NFS [95]. Most systems that support dual access do not store metadata. Rather, they reconstruct it from objects on file system mount or store metadata selectively within the user-defined metadata³ associated with each object. These systems are designated in Table 3.1.

No dual access

These systems offer the full spectrum of efficient file interfaces but do not support intuitive dual access. They are essentially distributed file systems with little or no object-like features. Most of these systems adopt intricate mapping schemes and operate an independent metadata service. They are efficient for partial writes since they treat objects as blocks in a log-structured file system. Read performance is less affected by data fragmentation than in disk-based file systems, because object storage performs well for parallel fetches to non-contiguous data. These systems do not reconcile the fragments back into a single object for dual access. A prominent example in this category is CephFS, which is not generic and does not support dual access. It only operates over RADOS, its own object store, and though it does store data in RADOS, this data cannot be accessed via object interfaces. RADOS natively supports partial writes and does not require a log. We do not compare against proprietary systems that do not describe their internals, such as Cloudian HyperFile [106] and MagFS [107]. To our knowledge, these systems are POSIX compliant but do not support dual access because they adopt a 1⇒N mapping and are intended to be used in Network Attached Storage (NAS) environments. These systems are depicted in Table 3.1.

³Each object has system and user-defined metadata. The user assigns metadata, as key-value pairs, to each object. Object metadata can be fetched from the object store independently.

Best of both worlds

These systems offer both intuitive dual access and a full spectrum of efficient file interfaces. They have the perfect blend of file system and object store features, and are specified in Table 3.1. ProxyFS and OIO-FS stand out for providing limited dual access with efficient partial updates and supporting most file interfaces. We discuss them in further detail to better understand their limitations.

ProxyFS is not generic and operates only on OpenStack Swift. It only offers a limited set of object interfaces in the dual-access mode, and updates made via object interfaces are not reflected via the file interface. Moreover, it has no support for distributed access or unified access control. It adopts an N⇒N object-to-file mapping and relies on Swift's ability to present a scattered object as a single entity via the object interface. Swift does not natively support partial writes, so ProxyFS implements a circular log buffer. Swift containers⁴ need to be pre-configured for dual-access operation and this feature cannot be enabled retroactively. ProxyFS is also the only other existing system which seriously considers the issue of maintaining a coherent namespace and is attempting to tackle it. However, the namespace discussions appear to be limited in scope and indicate that only a limited set of interfaces will be supported.

⁴Containers are analogous to buckets in S3.

OIO-FS is a proprietary OSFS that operates only on Open-IO. Open-IO natively supports partial writes, allowing OIO-FS to adopt a 1⇒1 object-to-file mapping and support efficient file updates without using a log. However, OIO-FS supports dual access only for reads, thus limiting its utility. There is also no support for unified access control or distributed access.

3.3.3 Miscellaneous approaches

Beyond file systems, efforts in data analytics connect high-level file system interfaces (read() and write()) directly to object storage APIs [108,109]. These are designed for and used with analytics frameworks such as Spark [110] and MapReduce [111]. Other systems provide metadata services to track created objects for analytics workflows [112, 113]; with weak consistency, object storage cannot list all created objects immediately. There have been other efforts to change object placement within object storage for better performance in analytics [114,115].

We conclude that none of the existing systems fulfills the desired feature set discussed in Section 3.2. This prompted us to design and develop Agni to overcome these limitations. Agni is included in Table 3.1 for comparison.


3.4 Design trade-offs

Object and file-system interfaces have a number of fundamental namespace and data access differences. For instance, in object stores users cannot create directories and subdirectories of objects, and cannot operate on directories as a whole, e.g., rename them. (Though buckets of objects are supported by many object stores, they cannot be nested and their number is often limited, e.g., to around 100 per account in AWS S3 [11].) Data access differences include the inability of object storage to perform an in-place partial update of the data, a common operation in many file system workloads.

In light of these disparities, providing dual access is challenging. There are many design considerations that expose trade-offs in the quality and performance of dual access. In fact, the goal of dual access is often in direct conflict with that of achieving high performance.

In the following text we refer to an implementation of an abstract file system working on top of an object store as ObjectFS.

3.4.1 File-to-object mapping

A fundamental question in the design of ObjectFS is how to map files to objects.

1⇒1 mapping

1⇒1 mapping represents a whole file with a single object in the object store. This mapping allows simple and intuitive dual access to the data from the user's perspective. However, 1⇒1 mapping can drastically reduce the performance of file writes because a small modification to a file requires a GET and a PUT of the complete object.


1⇒N mapping

1⇒N mapping splits an individual file into multiple objects, each storing a segment of the file. The segments can be of a fixed size, as in a traditional block-based file system, or of variable size, as in an extent-based file system. Splitting a file into multiple objects enables faster in-file updates by only writing smaller-sized objects which map to the updated parts of the file. However, accessing the data from the object interface in 1⇒N mapping is no longer intuitive and requires additional metadata in object-based user applications.

N⇒1 mapping

N⇒1 mapping packs multiple files into a single object. This can improve performance when a subset of small-sized files tends to be accessed together. Accessing data through the object interface is even more complicated with N⇒1 mapping than with 1⇒N mapping.

Hybrid mapping

Hybrid mapping varies the mapping within the same file system. For example, ObjectFS could create new objects for each incoming write (as extents) and then reassemble them into complete objects in the background. This hybrid mapping trades consistency of object and file system views of the data for performance.
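To make the mapping alternatives concrete, the sketch below shows which objects a write at a given file offset would touch under a 1⇒1 mapping and under a fixed-size 1⇒N mapping. The chunk size and the naming scheme (a .partN suffix) are illustrative assumptions, not part of any system discussed here.

    CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB segments for the 1-to-N mapping

    def one_to_one_target(file_name, offset, length):
        # The whole file lives in one object: any partial write implies a
        # read-modify-write (GET plus PUT) of that entire object.
        return [file_name]

    def one_to_n_targets(file_name, offset, length):
        # The file is split into fixed-size segments; only the objects that
        # overlap the written byte range need to be rewritten.
        first = offset // CHUNK_SIZE
        last = (offset + length - 1) // CHUNK_SIZE
        return [f"{file_name}.part{i}" for i in range(first, last + 1)]

    # A 4 MB write at offset 6 MB rewrites the whole object under 1-to-1 mapping,
    # but only segments 1 and 2 under the fixed-size 1-to-N mapping.
    print(one_to_one_target("video.mp4", 6 * 1024 * 1024, 4 * 1024 * 1024))
    print(one_to_n_targets("video.mp4", 6 * 1024 * 1024, 4 * 1024 * 1024))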

3.4.2 Object naming policy

Although the naming of objects is tightly coupled to the file-to-object mapping, we discuss naming separately to isolate and demonstrate the relevant difficulties. For simplicity, we assume a 1⇒1 mapping.


File-name

The file-name policy names an object identically to the corresponding file. Such a policy allows intuitive dual access to the data, but with a substantial caveat: two files with identical names but in different directories cause a conflict in the flat object namespace. Thus, this policy is applicable only in limited scenarios, e.g., when a read-only file system is deployed on pre-populated object storage to perform analytics.

File-path

The file-path policy creates an object named after the file's complete path. This policy is both convenient for dual access and avoids conflicts in the object storage namespace. However, a rename of a file requires a GET and a PUT, making metadata operations slow. A directory rename requires a GET-PUT sequence for every file in the directory, and performance scales down as the total size of all files in a directory grows. Another limitation of the file-path policy is its inability to support hard links (different files referring to the same data).

Inode-number

The inode-number policy names the object using the file system inode number as the object name. The assumption here is that ObjectFS, similar to a traditional UNIX file system, maintains a mapping of file paths to inode numbers. File paths are translated to inode numbers using a lookup procedure. The inode-number policy hinders dual access, because the inode number needs to be looked up. Similarly, files created through the object store will need to reference file system metadata for a name. The inode-number policy performs renames quickly as no objects need to be moved: only the mapping is updated.


User-defined

User-defined policies allow the user to drive the naming scheme. In one potential implementation, the inode in ObjectFS records the name of the corresponding object. When a new file is created, ObjectFS executes a user-defined naming policy to derive the object name. The naming policy can take as input such file system information as the file name and path, owner, inode number, and more. A corresponding naming policy is required to generate full file paths based on the properties of any objects created directly in the object store. User-defined policies are more flexible than inode-number but need to be carefully designed to be convenient and avoid naming conflicts.
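The naming policies above can be viewed as interchangeable callbacks that map an inode to an object name. The sketch below is a hypothetical rendering; the function names and the inode dictionary are illustrative and not part of ObjectFS itself.

    import hashlib

    def file_path_policy(inode):
        # Object named after the full path: intuitive, but renames require a copy.
        return inode["path"].lstrip("/")

    def inode_number_policy(inode):
        # Object named by inode number: renames are metadata-only, but object
        # users must consult file system metadata to resolve a path to a name.
        return str(inode["number"])

    def user_defined_policy(inode):
        # Example user-supplied scheme: owner prefix plus a short path digest.
        digest = hashlib.sha1(inode["path"].encode()).hexdigest()[:8]
        return f"{inode['owner']}-{digest}"

    inode = {"path": "/videos/raw/clip01.mp4", "number": 4711, "owner": "media"}
    for policy in (file_path_policy, inode_number_policy, user_defined_policy):
        print(policy.__name__, "->", policy(inode))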

3.4.3 Metadata

ObjectFS could potentially use several different locations to persistently store its metadata.

In-object

In-object placement stores file metadata (e.g., owner, permissions, timestamps) in the object store itself. One option is to store metadata in the same object as the data, but this requires cloud-native applications to deal with metadata during GETs and PUTs, which compromises dual access. Another option maintains separate metadata objects: one per file or one per group of files. In this case, dual access is not directly hindered, but "confusing" metadata objects are visible in the results of a LIST request. Furthermore, object stores typically exhibit high latency, which would slow metadata operations (e.g., accessing or updating atime or uid). This typically leads to poor overall performance.


In-object-meta

In-object-meta relies on the fact that the majority of object storage implementations can store user-defined metadata in association with an object. Access to user-defined metadata is independent of access to the object data, and has comparatively lower latency. The concept is similar to extended attributes in file systems. This approach offers dual metadata access in addition to dual data access: object-based applications can request file-system metadata through the object interface. However, it relies on a richer object API that is not generic.

Independent

The independent approach stores the file system metadata in a storage solution separate from the object store. A key-value store with high scalability and low access latency is one feasible configuration. In this case, metadata operations like inode lookup or stat() do not require slow accesses to the object storage. A downside is the higher system complexity and the need to maintain an additional storage system for metadata. The independent approach precludes accessing metadata through the object interface, although ObjectFS could asynchronously write metadata from the metadata store to corresponding objects in one of the other formats.

3.4.4 Data access

The optimal design for an object-based file system depends heavily on how users plan to access the data.

File-system access only

File-system access only implies that all data is ingested, manipulated, and retrieved through a file-system interface. In this case, object storage acts merely as a provider of raw bytes to a file system. Such usage of object storage is justified because of its low cost and high scalability.

Complete dual access

Complete dual access provides the ability to ingest and process data through both object and file-system interfaces. For example, a newly created file appears in the object storage as a fresh object and, vice versa, a newly created object manifests itself as a file in the file system. Complete dual access allows users to freely select, switch between, and combine data processing applications.

Partial dual access

Partial dual access is provided by systems that integrate in their design a limited ability to access the data through both interfaces. In one setup, the data may be ingested solely through the object interface (e.g., images from IoT devices), but consumed using the file-system interface (e.g., POSIX-based pattern recognition software). Furthermore, the file system might display only the objects that existed before the file system creation, but not the ones that were added later.

3.4.5 Caching

Caching plays an integral role in file system performance. For ObjectFS, both read and write caches are important because the underlying object storage has high latencies and operates efficiently only when transferring large objects. We limit our discussion to two fundamental design options:


Local

A local cache has independent instances on every node where the file system is mounted. Each cache instance buffers data read or written by the local node. RAM or a local SSD can be used for cache space. For a local cache, ObjectFS needs to maintain cache consistency between nodes using, e.g., lock tokens [116]. Since object-based accesses do not go through the file system cache, cloud-native applications could see outdated versions of the data until caches are synced.

Unified

A unified cache is a distributed and shared tier between file system clients. Data cached by one client can quickly be fetched from the cache by other clients. Redis [63] and Memcached [55] are systems suitable to implement a unified cache. Caching nodes can be collocated with file system mount nodes or deployed in a separate cluster. A unified cache may re-export an object interface so that object-based applications access the same data consistently and realize the benefits of the cache.
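As a small illustration of the unified-cache option, the sketch below caches object blocks in a shared Redis instance keyed by object name and block number. The key format, block size, cache endpoint, and the store object (assumed to expose a ranged get, as in the earlier interface sketch) are assumptions for the example.

    import redis

    BLOCK_SIZE = 4 * 1024 * 1024                           # assumed cache block size
    cache = redis.Redis(host="cache.internal", port=6379)  # shared cache endpoint

    def read_block(store, object_name, block_no):
        """Read one block, preferring the shared cache over the object store."""
        key = f"{object_name}:{block_no}"
        data = cache.get(key)
        if data is None:
            start = block_no * BLOCK_SIZE
            data = store.get(object_name, byte_range=(start, start + BLOCK_SIZE - 1))
            cache.set(key, data)  # populate the shared cache for other clients
        return data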

3.5 Evaluation

To illustrate the design choices, we developed an ObjectFS prototype using FUSE [117]. Our implementation is simple and modular to facilitate experimentation with various ObjectFS configurations. Figure 3.1 depicts ObjectFS's high-level architecture. ObjectFS supports all major POSIX operations, and we are able to successfully boot an OS directly from an object store using ObjectFS.

ObjectFS's user-space daemon is responsible for the main logic of the file system: it performs file lookups, reads, writes, etc. It uses an independent metadata service that is abstracted as a key-value store. Figure 3.2 presents an overview of the file structure in ObjectFS. For these evaluations, we use Redis, an in-memory key-value store, because of its low latency, distributed design, support of transactions, and ability to persistently store in-memory data. The object library communicates with object storage using a common subset of object operations. Many object stores support multi-part upload and download of large objects. We utilize these features to improve performance in our design evaluation. ObjectFS can be configured to run with or without a cache. By default, data is fetched into the cache on a file open and is flushed back to the object store on a file close. We use Redis to cache data in our design evaluations.

Figure 3.1: An overview of ObjectFS and the data flow across different components.

Figure 3.2: An overview of the file structure in ObjectFS. The data is stored in object storage and metadata is stored in a key-value store.

Testbed

Our evaluation quantitatively demonstrates some key design trade-offs presented so far. We chose four basic workloads—streaming reads, streaming writes, random writes, and renames—and measured performance on various ObjectFS designs. We used Amazon Web Services (AWS) as a testbed [45]. An ObjectFS client was mounted on a t2.2xlarge compute instance with 8 vCPUs, 32 GB of RAM, and AWS moderate network bandwidth. ObjectFS's metadata server was deployed on the same instance. We used AWS S3 object storage with the default standard class of storage as a backend. The S3 buckets had default settings with object logging enabled and object versioning and transfer acceleration disabled.


Figure 3.3: Streaming read performance on native S3 and ObjectFS. mp represents MULTI-PART download with the corresponding number of threads used.

3.5.1 Streaming reads

Read experiments demonstrate that ObjectFS tracks the performance of the underlying object store for sequential workloads. We perform sequential reads, in 4 MB record sizes, on files stored as objects in 1⇒1 mapping. We measure the I/O throughput of S3 and ObjectFS when varying the file size from 64 MB to 1 GB and using S3's MULTI-PART download with 2, 4, and 8 threads. MULTI-PART downloads divide the object access into parts, parallelized over multiple threads. Figure 3.3 shows that MULTI-PART downloads mitigate the performance overhead of ObjectFS and that ObjectFS realizes a large fraction of S3's potential bandwidth. The small remaining overhead comes from metadata operations and caching overhead.

3.5.2 Streaming writes

Figure 3.4: Streaming write performance on native S3 and ObjectFS. mp represents MULTI-PART upload with the corresponding number of threads used.

Write experiments demonstrate that caching is critical to realizing performance in ObjectFS. The experiment performs sequential writes, in 4 MB record sizes, on files stored as objects in 1⇒1 mapping. We measure the I/O throughput of S3 and ObjectFS when varying the file size from 64 MB to 1 GB and using S3's MULTI-PART upload with 2, 4, and 8 threads. ObjectFS implements a write-back cache: a write is stored locally in a Redis memory store and written back to the object store when the object is closed.

Figure 3.4 shows that caching enables reasonable write throughput when compared with native S3 bandwidth, and that multi-part uploads reduce overhead and increase throughput. Multi-part uploads overlap data transfer with metadata operations in multiple threads. Without a write-back cache, each write results in a read-modify-write in the object store, reducing throughput to less than 2 MB/sec. With caching, we aggregate writes in the cache and issue many transfers in parallel. Increasing multi-part uploads beyond 8 threads shows no further performance improvement. We theorize that the physical footprint of an object is limited to a few storage servers and that more threads result in smaller messages to the same set of servers.

3.5.3 Random writes

Figure 3.5: Random write performance for different file object mapping designs without caching (left) and with write-back caching (right).

Evaluating the performance of random writes demonstrates the performance trade-offs among the different file-to-object mappings. We perform writes of 4 MB to random file offsets aligned to 4 MB using different mapping schemes: 1⇒1, 1⇒N (1 MB chunks), and 1⇒N (4 MB chunks). Figure 3.5 (left) shows throughput for write-through workloads without caching. In this scenario, 1⇒1 is much worse than 1⇒N, 1.7 MB/s versus 15 MB/s.

With 1⇒1 mapping, each 4 MB write performs a partial write or read-modify-write against the underlying object, whereas 1⇒N mappings write entire object(s). More importantly, all data rates are remarkably low without caching.

With write caching, Figure 3.5 (right), ObjectFS defers individual writes and avoids read-modify-writes to realize an order-of-magnitude performance improvement. This experiment uses 8 I/O threads, doing multi-part upload for 1⇒1 and parallel transfers to multiple objects for 1⇒N. The 1⇒N mappings are slightly slower than 1⇒1 due to the overhead of RESTful calls to more objects. Caching raises the random-write throughput of ObjectFS close to the sequential performance of S3.

3.5.4 Metadata

We also examine the performance associated with different file naming conventions that affect dual access. We perform two experiments. The first, Figure 3.6 (left), renames files of different sizes. With full-path naming, a rename results in an S3 server-side copy of the object; the performance of rename operations thus scales with the file size, taking 2 seconds for a 64 MB file and 30 seconds for a 1 GB file. When naming by inode number, rename is fast (less than 0.005 seconds) and does not depend on file size; in this case renames in ObjectFS are metadata-only operations. The second experiment, Figure 3.6 (right), renames directories with a varying number of files, each 1 MB in size. With full-path naming, latency increases with the number of files. With inode naming, latency is consistent and low.


Figure 3.6: Rename latency for different object naming designs. Experiments look at renaming a single large file (left) and many 1 MB files (right).

3.6 Discussion

Dual access to data through object and file system APIs is feasible with a judicious choice of design options. Based on our evaluation, we propose a specific design that preserves object APIs and incurs only minor overheads when accessing data through the ObjectFS file system.

• A 1⇒1 file-to-object mapping allows the object store APIs to access data without assembling data from multiple objects. It is also the only file-to-object mapping that can support intuitive dual access without modifying the object storage.


• Write-back caching is a critical technology for deploying object-based file systems. Caching aggregates multiple writes in memory, converting many synchronous writes into fewer, larger asynchronous writes. Without caching, object file systems have low throughput and high latency. This would limit the applications that could adopt object-based file systems to those that perform synchronous writes infrequently.

• However, ObjectFS aggregates writes in memory, which has the advantage of low latency but has other critical limitations. Memory is volatile, and any writes aggregated there could potentially be lost in case of a system failure. Moreover, memory has space limitations, thereby limiting the efficacy of write aggregation itself because data would have to be frequently written out to object storage.

• We adopt an indirect naming scheme that names objects by file system inode number. This choice enables the system to perform metadata operations without copying, but adds complexity to object access, which must resolve the file system name to an inode number.

Chapter 4

Agni’s stand-alone design

ObjectFS was a prototype and had some limitations in terms of design. One fundamental shortcoming was the reliance on aggregating writes in memory to perform efficient partial writes over immutable objects in object storage. We tackle this deficiency by developing a multi-tier write-aggregating data structure which temporally indexes partial writes from the file system, persists them to a log, and merges them asynchronously. Moreover, we implement this data structure within ObjectFS and term the new system Agni.

In this Chapter, we present an overview of our approach in Section 4.1 and describe the main components of the data structure in Section 4.2 for a stand-alone design of Agni. Section 4.3 explains how Agni's internals work for different I/O operations in a stand-alone design. We discuss some of our design choices in Section 4.4 and end with an in-depth evaluation of our data structure in Section 4.6.



Figure 4.1: Design overview of Agni with different storage tiers and dual data access interfaces.

4.1 Overview

Agni commits all writes to a log that resides in object storage in order to avoid read-modify-write overhead for partial writes. The system maps a single file to a single object, but does so across three distinct storage tiers: (i) cache, (ii) log, and (iii) base. Applications read data from and write data to the top-most tier—the cache. Agni fetches data into the cache on demand. Asynchronously, the flush process writes data out from the cache to the intermediary tier—the log. Periodically, a merge process asynchronously reconciles data from the log with the object in the bottom-most tier—the base. The resting state of the system is when all data is located in the base. At times, parts of file system data can reside in all three tiers, but the flush and merge processes eventually return the system to its resting state. We call this approach eventual 1⇒1 mapping, and the associated delay the file-to-object visibility lag. Figure 4.1 depicts the three tiers and the relationship between them and applications. The object and file system interfaces see consistent data when there is no dirty data in the cache or the log.

Figure 4.2: Data layout across storage tiers. File A has data in all three tiers, file B has no data in cache, and file C has no data in log.

The log provides a transient staging ground for partial writes. We defer the read-modify-write until the file is merged into the base tier. To ensure a faster and simpler merge, Agni maintains a separate log for each file.

Updates made in the object interface are reflected to the file system with the help of object notifications¹. This time delay is referred to as the object-to-file visibility lag.

4.2 Data structure

Cache

The cache provides low-latency access for frequently accessed data. Cache data is held in memory and is volatile. Every file is represented in the cache as a collection of cache blocks, and the cache block size matches the file system block size. Reads from a file result in a range of blocks being fetched into cache from the object store. File data is fetched into the cache from both the base and the log. The location of file data is determined by the fragment map. For example, in Figure 4.2 file A has all three blocks in cache, file B is not cached but has multiple logged writes, and only the second block of file C is cached.

¹Most object storage can be configured to notify applications for specific object events [118–120].

Log

The log captures cache flushes into persistent object storage. During a cache flush, all uncommitted blocks of a file are amalgamated into a single object and written to the file-log. A file-log consists of multiple log-objects, each containing one or more (potentially non-contiguous) file blocks. The same blocks may be written to the file-log multiple times across different log-objects. For example, in Figure 4.2, file B consists of two log-objects, B1 and B2. Blocks #1 and #4 in B2 supersede the previous write. When log-objects have been merged, they are deleted (file C in Figure 4.2). Log-objects are created periodically in an asynchronous manner or when an application calls the sync() system call.

Base

The base is the ultimate location for data. Agni consolidates file data across the different tiers into a single object, designated the base-object, to maintain a 1⇒1 mapping. Agni implements eventual consistency, so there is a period of time when the cache or the file-log has not yet been merged with the base-object. During this period, data accessed by an object interface is outdated with respect to the file interface.


All or part of the data in the base-object might be superseded by data in the cache and file-log. In Figure 4.2 the base-object of file A has been superseded in its entirety. File B is partially updated: the updated blocks #1, #3, and #4 for file B are now located in log-objects B1 and B2, while block #2 in the base-object of file B is still relevant. Agni also provides applications an interface to force a synchronous merge, termed coalesce(). On completion, updates from the file interface are reflected in the object interface.

Master index

Agni stores metadata in a generic key-value format outside of object storage, not within object attributes. Key-value stores provide low-latency access, scalability, and persistence of data [63, 121, 122]. Agni is key-value store agnostic and can be deployed on key-value databases or NoSQL systems. Using key-value stores, metadata operations such as inode lookup or stat() do not require slow accesses to object storage.

Agni stores all keys of all types in a single namespace called the master index (Figure 4.3). Each file has an inode key, indexed by inode number, that provides single-access lookup for inode metadata such as access time, modification time, group id, and user id. A lookup key combines the parent directory inode number and a file name and is used to look up inode numbers by name. The children key points to a list of files in a directory and exists only when the inode is a directory. Figure 4.3 shows a directory /animals/ with inode 5, a lookup key 1 → animals (1 is the root directory's inode number), and a children key with the value < 10, dog >. This indicates that the directory contains a file /animals/dog which is represented with inode 10, lookup key 5 → dog, and no children key because it is not a directory.
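For concreteness, the master-index entries for the /animals/dog example could be laid out in a key-value store roughly as follows. The key prefixes and JSON values are a hypothetical rendering of the scheme, not Agni's actual encoding.

    import json
    import redis

    kv = redis.Redis()  # the master index lives in a generic key-value store

    # Directory /animals/ with inode 5, containing the file dog with inode 10.
    kv.set("inode:5", json.dumps({"type": "dir", "uid": 1000, "gid": 1000}))
    kv.set("lookup:1/animals", 5)          # parent inode 1 (root) + name -> inode
    kv.set("children:5", json.dumps([[10, "dog"]]))

    kv.set("inode:10", json.dumps({"type": "file", "uid": 1000, "gid": 1000}))
    kv.set("lookup:5/dog", 10)             # no children key: inode 10 is a file

    # Block pointer for block #3 of inode 10 references that block's fragment map.
    kv.set("blockptr:10#3", "fragmap:10#3")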


Block pointers

Agni maintains a collection of block pointers, one pointer for each block of a file, and each block pointer references the fragment map for a specific block. Both block pointers and fragment maps are stored in the same namespace as file metadata, the master index. The key for each block pointer is built from the inode and block number of the file, for quick key construction and search. In Figure 4.3, <10, #3> refers to the fragment map of block #3 for the file with inode number 10.

Fragment map

The fragment map describes the location and versions of a block across the storage tiers.

Figure 4.3: Master index summary for a single directory /animals/ with inode 5, and a file /animals/dog with inode number 10. The index contains directory metadata, file metadata, block pointers, a fragment map per block, and three auxiliary data structures—DirtySet, CleanSet, and MergeList.


It is a time-ordered set of values for each block that consists of (i) at most one cache entry, (ii) zero to multiple log entries, and (iii) at least one base entry. The format of each entry varies depending upon which tier (cache, log, or base) it records. Each entry has a timestamp that orders it by creation and modification time. Entries are updated to reflect the current location of a block as it transitions from one tier to another. Figure 4.3 depicts the fragment map for block #3 of inode 10 (key <10, #3> in the master index). Below we describe the format of each entry type:

Cache entry: <TUpdate, DirtyFlag>

TUpdate = Time when this cache block was updated
DirtyFlag = Dirty data indicator

A cache entry records the existence of a block in cache and its dirty state—whether it has been written. Entries in the cache can be located by inode and block number, which are implicit in the key.

Log entry: <TFlush, LogObjectName>

TFlush = Time when the log-object flush was initiated
LogObjectName = TFlush_BlockNumberList@BaseName

Agni names every log-object in the following way: TFlush_BlockNumberList@BaseName, where:

TFlush = Time when the log-object was flushed
BlockNumberList = List of block numbers in the log-object
BaseName = The name of the base-object associated with this file-log

Each fragment map can have multiple log entries. In Figure 4.3, T2_#2#3@base10 and T1_#3#4@base10 represent two log-objects that update block #3. T2_#2#3@base10 indicates that the flush was initiated at time T2, consists of blocks #2 and #3, and belongs to inode 10 in object "base10". The name "base10" is illustrative in this example, as base-objects can have any user-specified name in Agni. Log-object names are self-describing so that fragment maps can be located during recovery. The order of the blocks in the BlockNumberList indexes the offset of that block in the log-object. For example, in T1_#3#4@base10 the blocks are ordered as #3, #4.
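The self-describing log-object names can be generated and parsed with a few lines of code. The sketch below follows the TFlush_BlockNumberList@BaseName convention described above; using a millisecond timestamp for TFlush is an assumption for the example.

    import time

    def log_object_name(base_name, block_numbers, t_flush=None):
        """Build a self-describing name such as '2_#2#3@base10' for a log-object."""
        t_flush = int(time.time() * 1000) if t_flush is None else t_flush
        blocks = "".join(f"#{b}" for b in block_numbers)
        return f"{t_flush}_{blocks}@{base_name}"

    def parse_log_object_name(name):
        """Recover (t_flush, block numbers, base name), e.g., during recovery."""
        t_flush, rest = name.split("_", 1)
        blocks, base_name = rest.split("@", 1)
        block_numbers = [int(b) for b in blocks.split("#") if b]
        return int(t_flush), block_numbers, base_name

    name = log_object_name("base10", [2, 3], t_flush=2)
    print(name)                          # 2_#2#3@base10
    print(parse_log_object_name(name))   # (2, [2, 3], 'base10')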

Base entry: <TMerge, BaseName>

TMerge = Time when the file-log was merged with the base-object
BaseName = Name of the base-object

Each file is eventually stored as a single base-object with a unique name. By convention, we use either the inode number or the entire file path. Using file paths for object names causes poor performance for file rename, because the object must be re-uploaded under a different name. Some object stores support server-side copy, which avoids network I/O. For pre-populated object storage, Agni uses the existing object names to make data available without copying.

Auxiliary data structures

Agni maintains three auxiliary data structures for each file to assist with flush, merge and

cache management. (i) DirtySet stores dirty blocks which have not yet been persisted to

log. (ii) CleanSet stores clean blocks that have either been fetched into cache from the log

or base tier, or persisted to the log. During cache eviction clean blocks can be discarded.

(iii) MergeList is a list of unmerged log-objects that ensures the correct sequence of updates

are applied to the base-object. By using this list, we can merge without querying the

fragment map. The merge list can be reconstructed from the file-log during recovery.
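The per-file bookkeeping can be pictured as the following sketch; the field names are illustrative, and Agni keeps these structures in the master index rather than in Python objects.

```python
from dataclasses import dataclass, field

@dataclass
class FileAuxState:
    """Illustrative per-file auxiliary state mirroring the three structures described above."""
    dirty_set: set = field(default_factory=set)     # dirty cached blocks not yet persisted to the log
    clean_set: set = field(default_factory=set)     # clean cached blocks, safe to discard on eviction
    merge_list: list = field(default_factory=list)  # unmerged log-object names, in flush order
```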

4.3 Processing logic

Here we describe how Agni performs read, write, flush, and merge, and manages the cache.

We also detail how we notify the file system of changes from the object interface.

(a) Initial state where blocks #2 and #3 of file A are clean in cache and log-object A1 is located in the file-log.

(b) Intermediate state where block #1 is loaded into cache, #2 updated and flushed as A2 and finally merged with base-object A.


(c) Final state where log-objects A1 and A2 are merged with base-object A and the system assumes the resting state again.

Figure 4.4: Processing logic on file A. The tables represent the fragment maps and auxiliary structures.


Read

On a read request, Agni determines the block(s) to access based on offset and utilizes the block pointer list to fetch the corresponding fragment map. For each block, if the most recent entry in the fragment map is a cache entry then data is read from cache and returned to the user. If the data is not in cache, it is loaded from the log and base tiers. In Figure

4.4a, a read of block #3 in file A returns block #3 from cache and a read call to block

#1 results in an object load.

The VFS lookup() and create() calls do not access the fragment map. lookup()

requires two metadata accesses: Agni first looks up the file inode number using the lookup

key, then fetches metadata using the inode number. During create() three new keys are

inserted in the master index: an inode key, lookup key and children key (for a directory).

Load

Load downloads blocks from the object store into cache on a cache read miss. Agni uses

the fragment map to locate a block in the file-log or base-object. In either case, load uses

a ranged GET to read the block from object storage and populate cache. A cache entry is

created if needed, TUpdate is set and DirtyF lag turned off. After index updates, data is

written to the cache and the block identifier is inserted into CleanSet. In Figure 4.4a, a

read call on block #1 executes a load from A’s base-object at T0. Figure 4.4b shows the

load, marks the cache entry clean, and updates the fragment map and CleanSet.
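The load path can be approximated with a ranged GET as sketched below, assuming boto3 against S3; the bucket name, offset handling, and function name are illustrative rather than Agni's actual code.

```python
import boto3

s3 = boto3.client("s3")
BLOCK_SIZE = 4 * 1024 * 1024  # Agni's default 4 MB block size

def load_block(bucket, object_name, block_no, offset_in_object=None):
    """Fetch one block with a ranged GET. For a base-object the offset is block_no * BLOCK_SIZE;
    for a log-object it is the block's position in the BlockNumberList times BLOCK_SIZE."""
    start = block_no * BLOCK_SIZE if offset_in_object is None else offset_in_object
    end = start + BLOCK_SIZE - 1
    resp = s3.get_object(Bucket=bucket, Key=object_name, Range=f"bytes={start}-{end}")
    return resp["Body"].read()
```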


Write

On write(), Agni determines the block being written based on the offset, and uses the block pointer to fetch the fragment map. (i) If no cache entry is present in the fragment map, a cache entry is created with TUpdate of current time and is marked as dirty. The block identifier is inserted in DirtySet. All write I/O presented to Agni is block-aligned; small writes are converted to blocks in the virtual file system that does read-modify-write on blocks. (ii) If a cache entry exists but is marked clean, this indicates a clean cache block. The entry is marked dirty and TUpdate is updated. The block identifier is moved from CleanSet to DirtySet. (iii) If a cache entry exists and is marked dirty, TUpdate is updated and no further action is required.

In Figure 4.4a, a write to block #2 of file A triggers case (iii) and T3 is updated to

T5 as shown in Figure 4.4b. There is no communication with object storage during write.
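The three write cases reduce to a small amount of bookkeeping, sketched below using the illustrative entry layout and the FileAuxState sketch from earlier; this is not Agni's actual code.

```python
import time

def handle_write(fragment_map, aux, block_no):
    """Apply write cases (i)-(iii) to a block's cache entry before the data lands in cache."""
    now = time.time()
    entry = fragment_map.get("cache")
    if entry is None:                      # (i) no cache entry: create a dirty one
        fragment_map["cache"] = {"t_update": now, "dirty": True}
        aux.dirty_set.add(block_no)
    elif not entry["dirty"]:               # (ii) clean cached block: mark it dirty
        entry["t_update"], entry["dirty"] = now, True
        aux.clean_set.discard(block_no)
        aux.dirty_set.add(block_no)
    else:                                  # (iii) already dirty: refresh the timestamp only
        entry["t_update"] = now
```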

Flushing the cache

Flush persists dirty blocks by uploading them from the cache to the log. Flush is triggered:

(i) periodically on an opened file, (ii) when a file is closed by all applications, (iii) under memory pressure, (iv) when an application calls sync(), or (v) on coalesce(). Cases (i) and (ii) occur asynchronously and depend on user workload.

Flush identifies the dirty blocks for a file using the DirtySet and uses MULTI-PART

UPLOAD or PUT (depending on the size of the data) to upload the dirty blocks into a log-

object. After the file-log is uploaded, the log-object name is inserted into the MergeList

for eventual merging, and a log entry is added to the fragment maps of the flushed dirty

blocks with time TFlush. TFlush is set to the maximum of TUpdate among all flushed cache

blocks. This ensures that subsequent updates to the cache are preserved, including updates that are concurrent with the flush process. Agni retains recently flushed blocks in cache, anticipating their reuse, except when reclaiming memory space.

In Figure 4.4b, Agni flushes the only dirty block of file A—block #2—to log-object

A2 named T5_#2@A. The log-object name is inserted into the MergeList, log entry T5_#2@A is then inserted into the fragment map for block #2, and the identifier is moved from the DirtySet to the CleanSet.
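A simplified flush sketch, reusing the hypothetical helpers introduced above; it uses a single PUT, whereas the real system switches to MULTI-PART UPLOAD for larger flushes.

```python
def flush_file(s3, bucket, base_name, cache, fragment_maps, aux):
    """Upload all dirty blocks of one file as a single log-object and update the indexes."""
    dirty = sorted(aux.dirty_set)
    if not dirty:
        return
    t_flush = max(fragment_maps[b]["cache"]["t_update"] for b in dirty)  # max TUpdate of flushed blocks
    name = build_log_object_name(int(t_flush), dirty, base_name)
    s3.put_object(Bucket=bucket, Key=name, Body=b"".join(cache[b] for b in dirty))
    aux.merge_list.append(name)
    for b in dirty:
        fragment_maps[b].setdefault("log", []).append({"t_flush": t_flush, "log_object": name})
        fragment_maps[b]["cache"]["dirty"] = False
        aux.clean_set.add(b)
    aux.dirty_set.clear()
```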

Merging the base and log

Merge integrates log-objects into the base-object, making the 1⇒1 mapping eventually consistent. Merge happens: (i) periodically, (ii) after the number of log-objects exceeds a user configured threshold, or (iii) when the user calls coalesce() on a file.

Merge identifies the log-objects in the file-log using the MergeList. A ranged GET reads the latest blocks from a log-object and the base-object, then uses MULTI-PART UPLOAD to create a new base-object. This can be carried out by multiple threads on the same object.

The order of the blocks in the file-log is determined by TFlush in the log-object name. For blocks that have been overwritten multiple times, only the latest version is uploaded. Each file merge consumes network traffic equal to twice the file size as every block has to be downloaded once and uploaded once.

On completion, Agni updates the fragment map. Agni defines TMerge as the maximum of all TFlush of the merged log-objects. It removes fragment entries for blocks in which

TUpdate or TFlush is less than TMerge. The remaining fragment entries were updated after the merge process started. The timestamp of each base entry in each fragment map is set


to TMerge. As a last step, Agni removes the log-objects from the MergeList and deletes the log-objects.
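The core of the merge, deciding which object currently holds the newest copy of each block, can be sketched as follows, assuming the MergeList is ordered by TFlush and reusing the hypothetical name parser from above. The example in the next paragraph follows exactly this rule.

```python
def latest_source_per_block(num_blocks, base_name, merge_list):
    """Map every block to the object holding its newest version: the base-object by default,
    overridden by each later log-object that contains the block."""
    sources = {b: base_name for b in range(num_blocks)}
    for log_name in merge_list:                      # oldest first; later entries overwrite earlier ones
        _, blocks, _ = parse_log_object_name(log_name)
        for b in blocks:
            sources[b] = log_name
    return sources
```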

In Figure 4.4b, merge begins after A2 has been flushed to log. The merge identifies logs A1 and A2 for file A. It reads block #1 from the base-object A, block #3 from A1, and

ignores block #2 in A1 which was overwritten in A2. Two worker threads GET the blocks

from the respective log-objects and MULTI-PART UPLOAD them to a new base-object A. In

Figure 4.4c, after the merge, base-object entries are updated to T5 and log entries for A1

and A2 are removed from fragment maps. Log-object names are removed from MergeList

and the file-log is deleted. The system ends up in the resting state with no dirty data in the

log or cache (Figure 4.4c).

Cache and Log management

The Agni management daemon prevents memory overflow and limits space amplification

in the log. If cache usage exceeds a user-defined limit the daemon evicts clean blocks from

cache using the CleanSet, and updates the fragment maps. If this is insufficient, it then

flushes dirty blocks using the DirtySet and evicts them. In Figure 4.4b, blocks #1 and #3

would be evicted first, followed by a flush and evict of block #2. The daemon also monitors

the log, garbage collecting log-objects that have no active blocks in the MergeList. In

Figure 4.2, B1 would be reclaimed because of B2.

Object notification

The Agni notification processor ensures that object store operations are reflected in the file

system. These include: (i) creating a new object, (ii) deleting an existing object, and

(iii) updating an existing object. On create, Agni inserts the metadata for the object into the master index, creates the block pointers based on the object size, and initializes the fragment map with the base entry. The file metadata is derived from the object name and any user-defined object metadata. For example, if an object with name /images/world.png is created with a PUT, then a file called world.png is created in directory /images/. For

delete, Agni identifies the inode number using the file path, file name, and the lookup key,

deletes the metadata for the file from the master index, uses the fragment map to identify

referenced log-objects, deletes the block pointers along with the fragment maps, and finally

deletes the file-log and the base-object. It also signals an asynchronous worker to remove

cache blocks for that file.

We distinguish between create and update based on the presence of file metadata.

The object creation time (TCreate) orders the PUT data with respect to previous and subse-

quent writes. During an update, Agni iterates over the block pointer list and checks entries

in the fragment map for TUpdate and TFlush less than TCreate. Any such values are re-

moved, and corresponding logs and cache blocks deleted. Data written after TCreate is not

overwritten.
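The update-notification rule can be sketched as below, using the same illustrative fragment-map layout; the timestamps and field names are assumptions, not Agni's actual schema.

```python
def apply_object_update(fragment_maps, t_create):
    """Drop cache and log entries older than the object PUT (TCreate); newer writes are preserved
    and the freshly written object becomes the base version for the stale blocks."""
    for frag in fragment_maps.values():
        cache = frag.get("cache")
        if cache and cache["t_update"] < t_create:
            del frag["cache"]                         # the cached copy predates the PUT
        frag["log"] = [e for e in frag.get("log", []) if e["t_flush"] >= t_create]
        frag["base"] = {"t_merge": t_create}          # the new object now backs this block
```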

4.4 Discussion

In this Section, we discuss the reasoning behind some of Agni’s key design decisions and

their implications.


Per-file log

We choose to maintain a log per file to make merge fast and scalable. Logging data from multiple files into a single log-object makes the eventual merge more complicated and leads to space and write amplification as data from multiple files are intertwined, i.e. either multiple files would need to be merged together or unmerged objects would have to be rewritten as new logs. With a log per file, each file merge is independent and trivially parallelizable, and storage can be reclaimed on a per-file basis. Our approach does have disadvantages. For example, it can create many smaller log-objects because multiple small writes to different files are not aggregated into a larger log update to a single object.

Consistency model

Agni implements relaxed consistency guarantees for higher performance and reduced time to reflect updates between the object and file system interfaces. This is in the spirit of object storage’s weak consistency. At times an object cannot be read after a create, or can still be read after a delete, unlike POSIX. Agni does not implement locking between the two interfaces, and file system and object operations can race and produce inconsistent data and metadata results, including missing files, missing data, and data availability after delete.

Failure recovery

On client failures, Agni requires an fsck-like process to recover the file system. Transient I/O and network failures, such as a failed flush or merge, are not fatal and can be retried without recovery. Agni uses a key-value store to persist the metadata in the master index. After a failure, all volatile cache blocks are assumed lost, but the fragment map may still contain

cache entries. The fsck-like recovery deletes cache entries from the fragment map.

Agni does not use or rely on key-value transactions. This means that, e.g., a failure between the insertion of an inode key and a lookup key leaves an unreachable inode key. There are also cases when metadata is successfully updated but the object creation fails, resulting in a dangling file pointer. These cases are also handled by fsck.

Agni can recover from flush and merge failures even while running. A flush process is only marked completed after the DirtySet is updated. If a flush fails prior to this update, its actions can be repeated with no change in the eventual state of data. Similarly, if a merge process fails prior to the removal of log-objects from the MergeList, it can be repeated

without change in the eventual state of data.

Storage and network overheads

Agni's eventual merge approach requires at most additional space equivalent to the sum of all the partial writes. The additional data transmitted, compared to a read-modify-write approach, is the sum of all the partial writes, since in both approaches there is one read-modify-write. Space amplification for Agni can be calculated as $O(Size_{File} + \sum_{i=1}^{n} Size_{PartialWrite_i})$. The merge is always $O(Read(Size_{File}) + Write(Size_{File}))$ irrespective of $n$ or $Size_{PartialWrite}$. The user-perceived write is $O(Write(Size_{PartialWrite}))$.

Metadata

Agni was designed to support large directories. It requires a single master index lookup, with the children key, to access all files in a given directory. Adding a file to a directory

requires an update to the children key and to the lookup key. Similarly, removing a file from a directory involves an update to the children key and removal of that file's metadata.

To estimate the memory footprint of Agni's master index, we assume that a key and a value are 512 B each. For fragment maps, we assume that each entry is 512 B. For a single file, the basic metadata stored is 2 KB (3 KB for a directory) and the auxiliary data structures add another 3 KB. The size of the block pointers and the associated fragment maps depends on block sizes and file sizes. Each fragment map may vary from a minimum of 512 B to 5 KB, assuming 10 entries on average. Assuming a block size of 4 MB (the default in Agni) and a file size of 1280 MB (the file size used in our evaluation), the master index for a file could range from 325 KB to 1765 KB.
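A quick back-of-the-envelope check of this range, assuming 512 B per block pointer and per minimal fragment map, consistent with the figures above:

```python
KB = 1024
blocks = 1280 // 4                                  # 1280 MB file, 4 MB blocks -> 320 blocks
basic = 2 * KB + 3 * KB                             # file metadata + auxiliary structures
low = basic + blocks * 512 + blocks * 512           # 512 B block pointer + 512 B fragment map per block
high = basic + blocks * 512 + blocks * 5 * KB       # fragment maps grow to ~5 KB with ~10 entries
print(low // KB, high // KB)                        # -> 325 1765 (KB), matching the stated range
```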

Temporal Indexing

We use system time to index our order of updates enabling us to order them correctly irrespective of the interface they originate from. However, this has limitations and works best for a single machine. If we wish to order these updates across multiple machines, then we are going to encounter issues of clock drift. Our solution works for the set of discussed use-cases because we do not anticipate concurrent updates to the same data from across multiple machines. Typically, data is accessed one interface at a time from a single machine.

4.5 Implementation

We developed Agni using FUSE [117], as did all of the major object storage file systems [29–31,87–92,99,101]. Our implementation is simple and modular to facilitate various configurations. Figure 4.5 shows the software components of Agni and their relationship.

Figure 4.5: The different modules of Agni and the data flow among them.

Our system supports multiple file systems concurrently, one per bucket. The user-space daemon implements the main logic of the file system: lookups, reads, writes, flushes, and merges. For this thesis, we chose Redis [63] for an in-memory key-value store, because of its low latency, distributed design, support for transactions, and ability to persist in-memory data. Agni supports both Redis and the Linux in-memory file system (tmpfs) [123] as caches.

Our experiments use tmpfs.

Agni currently runs on AWS S3 [11], OpenStack Swift [96], and Google Cloud

Storage [13]. We experiment on AWS. It can potentially support multiple others via

S3Proxy [124] (See Note b attached to Table 3.1). The object notification processor uses the serverless compute service [125]. In our implementation, the notification processor runs on a different host than the file system and accesses Agni's key-value store over the network; it could have been deployed on the same host. Our implementation is open-source, consists of about 4,000 lines of Python code, and is available for collaborative development and reuse on GitHub (https://github.com/objectfs/objectfs).

4.6 Evaluation

We evaluate the performance of Agni on I/O benchmarks and application workloads, comparing it to S3FS as a file system and to the S3 object store for raw read and write throughput.

Of the 14 other file systems in Table 3.1, only 7 completely support dual access. Among those, only four support partial writes—S3FS, Blobfuse, Gcsfuse, YAS3FS. We choose to compare Agni to S3FS because of its popularity and maturity [29]. It uses a 1⇒1 mapping and complete file paths as object names for dual access. S3FS stores all file system metadata in local memory and caches data locally on tmpfs.

We configure S3FS and Agni similarly. We use 4 worker threads for MULTI-PART

operations and set the PART size to 16 MB. On read() or write(), S3FS fetches into the cache

data equal to the number of workers times the PART size—64 MB in our experiments.

FUSE only allows a maximum of 128 KB I/O requests irrespective of the block

size of the file system [117]. For the partial write benchmarks, we mount both file systems

in two different configurations: the smallest request size, 4 KB, and the largest request size, 128 KB.

4.6.1 Testbed

We deploy all systems on Amazon Web Services (AWS) for our benchmarks. Agni and S3FS

are mounted on an m5.4xlarge instance with 16 vCPUs, 64 GB of RAM, and AWS high

network bandwidth. For both, we use a 36 GB in-memory file system as a cache. For Agni, we allocate 4 GB to the Redis server. The cache and Redis are on the same node as the file system, and Redis persists metadata to local disk every 60 seconds. AWS S3 is configured with the default standard class of storage. We use FIO [126] to perform all file system benchmarks, except for the rsync experiment.

4.6.2 Partial writes

We designed Agni to be efficient for workloads that contain partial writes. We first evaluate

Agni's performance for this workload against S3FS. We configured FIO [126] so that it performs random writes of 4 MB I/O size to 25 files of 1280 MB size each. We choose a file size of 1280 MB so that there are at least two 16 MB PARTs per thread. We vary the percent of partial writes as a fraction of the file size and evaluate the effect on an application's write throughput.

Figure 4.6a shows that Agni minimizes write I/O by logging and achieves a high fraction (70%–78%) of the maximum throughput. (The S3 data shows the maximum throughput when transferring objects in parallel without a file system.) Agni performs

2 to 7 times faster than S3FS, varying with the percent of data updated. S3FS performs a read-modify-write and uploads complete files on every I/O. In contrast, Agni only uploads data equivalent to the size of the update. Figure 4.6b shows the received and transmitted network traffic. S3FS always performs network I/O to the whole file. Agni writes to the log in proportion to the changed data. Although Agni flushes data to the log from the cache asynchronously, for this benchmark, we mimic S3FS's synchronous operation and flush data on every write in order to provide a fair comparison.


(a) Partial write throughput.

(b) Network data received and transmitted.


(c) Breakdown of time.

Figure 4.6: Partial write performance as a function of the percent of data written. S3 is the upload bandwidth to S3.

4.6.3 Indexing and caching overhead

Maintaining metadata, cache and data indexes adds a small amount of overhead to writes.

Figure 4.6c breaks down the fraction of time for each major action during I/O for a single file in the partial writes experiment. Indexing incurs a 3.6% overhead for a 1280 MB file and this result scales for larger files (not shown). The percentage overhead is stable for all I/O large enough to saturate the network (not shown). Cache management consumes another

12% of the write time.

To expose Agni's maximum indexing and caching overhead, we also conduct an experiment in which the file remains fully cached (Figure 4.7). In total, this reduces the application-to-memory throughput to 480 MB/sec from the raw throughput of 620


Figure 4.7: Partial write throughput to cache as a function of the percent of data written.

MB/sec from FUSE to tmpfs realized by S3FS. We expect that much of this overhead comes from inefficiencies in the Python implementation of Agni and that an optimized C++ implementation (like S3FS) would close much of this gap.

4.6.4 Random read performance

Agni's read throughput of 217 MB/sec is about the same as that of S3FS at 202 MB/s. This benchmark performs random reads of 4 MB on S3, S3FS and Agni. Figure 4.8 shows the I/O throughput when reading increasing percentages of files of size 1280 MB. For Agni, 100% of the data for each file is evenly spread across the file-log in 10 log-objects. The data for S3 and S3FS are located in a single object. There is a small degradation in performance from

S3 to Agni. This degradation occurs from Agni's additional metadata lookup and loading of data into cache. The fragmentation of data across different log-objects has no negative effect on I/O performance. The accesses to different log-objects use parallel I/O resources in multiple threads.

Figure 4.8: Read throughput performance as a function of percent of file data read. S3 is the download bandwidth from S3.

4.6.5 Sequential write performance

Agni realizes a sequential write performance of 78 MB/s when writing 25 files of 1280 MB each in their entirety. This is about 6 MB/s slower than the performance of 84 MB/s realized by S3FS, shown in Figure 4.9. This reduction comes from staging data to logs before sending to the base.


Figure 4.9: Sequential write throughput performance for a single large file.

4.6.6 Scaling up merge performance

Agni scales merge linearly to many threads and many files. Scaling is limited only by the available network bandwidth and number of cores. The merge rate is important from the user’s perspective because it determines time delay before updates are accessible through the object interface. For this benchmark, the file system is pre-populated with 64 files of

1.6 GB each. The worst case scenario, in which 100% of the data for a file is fragmented in the file-log, is evaluated. We look at two configurations: (i) “scaleup” in which each thread merges a separate file, and (ii) “speedup” in which multiple threads merge the same file.

In Figure 4.10a, Agni merges a 1.6 GB file every 4 seconds with 32 threads in the scaleup configuration, reaching a throughput of ≈800 MB/s. The file merge rate scales


(a) Throughput and file merge rate while varying the number of log-objects and worker threads.

(b) CDF of file-to-object visibility lag for a file with 100 log-objects and a varying number of worker threads.

Figure 4.10: Scaleup merge performance in Agni for 64 files of 1.6 GB. Each thread merges a distinct file where 100% of the data is located in the file-log.


(a) Throughput and file merge rate while varying the number of log-objects and worker threads.

(b) CDF of file-to-object visibility lag for a file with 100 log-objects and a varying number of worker threads.

Figure 4.11: Speedup merge performance for 64 files of 1.6 GB. All worker threads shared a single file. 100% of the data is located in the file-log.

ideally until 16 threads: it doubles as the number of threads doubles. From 16 to 32, limited network bandwidth reduces gains. The merge rate is independent of the number of log-objects.

In Figure 4.11a, we parallelize the upload of a single file. Agni merges a 1.6 GB file every 7 seconds with 32 threads. The merge rate scales ideally to 8 threads. It does not double beyond 8 threads and peaks at ≈450 MB/s for 32 threads. We suspect network limitations. AWS S3 is more efficient writing to multiple objects than to a single object. During each merge Agni performs a total of ≈206 GB of network I/O independent of the percentage

of fragmented data, number of threads, or number of log-objects, because the system needs

to read and then write the full contents of each file.

These experiments demonstrate different usage scenarios. The scaleup configura-

tion maximizes throughput across multiple files for throughput-oriented workloads. The

speedup configuration minimizes the time to merge a single file. This will also minimize the

consistency lag for file updates to be reflected in the object interface.

To characterize the file-to-object visibility lag, we present data from the previous

experiments that show when merge completions happen. Figures 4.10b and 4.11b present a

cumulative distribution of merge completion times for scaleup and speedup configurations.

Merging files one at a time in parallel makes updates visible more quickly. Agni can merge 80% of the 1.6 GB files within 12 seconds using 32 threads. Having each thread work on its own file defers the completion of any file. It takes 120 seconds for 80% of updates to

be visible. The long tail for merge times in Figures 4.10b and 4.11b arises from variance in

object store performance. Variance is also visible in Figures 4.10a and 4.11a as instability

in the merge rate.

4.6.7 Lambda latency

Figure 4.12: CDF of object-to-file visibility lag for different object notifications.

We turn to update latency from the object interface to the file system. This experiment looks at the delay between the completion of an object write and object notification in Agni. For the benchmark, we create, delete and update 1,000 objects of size 1,280 MB each using the object interface. For update and delete, we make updates to existing files that are 100% fragmented across log-objects—10 log-objects per file. This is the worst case.

Figure 4.12 shows the visibility delay for the three object interface operations in Agni. Most created objects have a visibility delay under 60ms. A majority of deleted objects have a delay under 110ms. The additional delay for delete comes from the time to delete the file-log

and the base-object. The update function has a longer delay of about 400ms (for all objects) owing to a large number of fragment map updates.

4.6.8 rsync performance

Figure 4.13: rsync run time over DropBox files.

To evaluate the performance of Agni on applications, we take performance measurements on an rsync workload using traces collected from cloud storage services, such as DropBox and Google Drive [86, 127]. The concept is that rsync on a file system is used to reconcile changes to the storage system, taking advantage of partial write performance. The major consumers of the data through the object interface are Web services and Web pages.

Agni and S3FS are both populated with 10,000 randomly selected files (from the

DropBox dataset) whose sizes range from 120 KB to 903 MB. The total dataset is ≈10 GB.

From 1 to 10 KB of data in each file is randomly updated in a second copy of these files stored

on a local file system. We then run rsync to update the files on object storage through the

file system interfaces, based on the changes to the local folder. This workload was described

as important to DropBox by Xiao [86]. Figure 4.13 shows that Agni performs about 3 times

faster than S3FS, because it avoids read-modify-write. Performance gains arise on the larger files. Smaller files read less data, and files smaller than one block read no data.

4.6.9 ffmpeg and bowtie performance

Figure 4.14: ffmpeg run time over media files.

Figure 4.15: Bowtie run time over Gene sequencing files.

We validate the performance of Agni for two different applications by comparing it to S3FS, a popular OSFS, and to a script which imitates manually copying data. We prioritized our comparison to these approaches because they satisfy two principal criteria: dual access and being backend-agnostic.


We simulate the media use case by running ffmpeg [128], a transcoding application, to transcode 14 MPEG files (total 32 GB) to MOV files. Media files were downloaded from

Internet Archive under the Sci-Fi movies section [129]. Gene sequence alignment is simulated using bowtie2 [73], a gene sequence aligner, to align a single large genome file of 8 GB. A human genome dataset [130] was downloaded from the Sequence Read Archive hosted at

National Center for Biotechnology Information [131]. The manual data transfer workload attempts to mimic the case for both file systems paired with object storage and a manual user performing the actions. Figures 4.14 and 4.15 depict the results. Agni performs better because it overlaps I/O and computation. It can upload data to object storage while parts of it are still being processed. Other systems have to wait for the entire file to be processed before their upload can begin. The merge time denotes the file-to-object visibility lag after the data is uploaded. Agni performs 40%-60% faster for ffmpeg and 20%-40% faster for bowtie2 when compared to both of the existing approaches.

Chapter 5

Agni’s distributed design

In the previous Chapter, we looked in great detail at the stand-alone design of

Agni and evaluated in depth the different aspects of our data structure. In this Chapter, we modify our multi-tier data structure for a distributed deployment of Agni. Further, we use a combination of object events and a scalable cloud-based publish-subscribe messaging service to ensure weak cache coherency across all clients. We briefly overview our approach in Section 5.1, and illustrate the modifications to the components of the data structure in Section 5.2. Finally, we present the internal workings of different I/O operations in Section 5.3 and end with a brief evaluation of the system in Section 5.4. In the distributed mode, the cache tier is local to each client of Agni while the base and the log tiers are shared across all clients.



Figure 5.1: Design overview of Agni in distributed mode with different storage tiers and dual data access interfaces.

5.1 Overview

The data structure remains more or less the same in distributed mode, with three distinct storage tiers: (i) cache, (ii) log, and (iii) base. However, there are minor modifications to accommodate multiple clients. Applications read data from and write data to the top-most tier—the cache. The cache is local to each client and any updates in the cache are only accessible to applications running on that client. Asynchronously, the flush process writes data out from each cache to the intermediary tier—the log. The log is shared among all clients and is the top-most tier where updates to existing files are visible to all clients.

Whenever a new object is flushed out to the log, an object notification is generated and all the clients are notified through the publish-subscribe messaging service. Periodically, a merge process asynchronously reconciles data from the log with the object in the bottom-most tier—the base. Another object notification is generated whenever an object is merged



Figure 5.2: Data layout across different storage tiers and clients. File A has data in all three tiers but not all cache blocks are populated across all clients. Client 1 has blocks #1 and #2 while Client 2 has blocks #2 and #3.

in the base. This notification is also transmitted to all the clients through the same publish-subscribe messaging service. Figure 5.1 depicts the two clients, the three tiers and the relationship between them and applications. Applications see consistent data when there is no dirty data in another client's cache.

5.2 Modifications in the data structure

Data structures are broadly unmodified in the distributed mode. However, we do modify some aspects of the design to ensure that they can operate in a distributed environment.

Most of the optimizations are designed to prevent frequent high-latency requests to the metadata server. In this section, we illustrate only these differences to avoid unnecessary repetition.


Cache

The cache in the distributed mode is similar to the stand-alone mode in terms of role, organization and functionality. However, there are some differences in terms of cache coherency. In Agni, the cache is local to each client and there is weak access consistency across the different clients. Any updates to the local cache of a specific client are not visible to others until the data is flushed out to the log. For example, in Figure 5.2, file A has blocks #1, #2 in cache on Client 1, and has blocks #2, #3 for the same file in cache on Client 2. Any updates to block #2 on Client 1 would not be visible to Client 2.

Log and Base

The log and base in the distributed mode are common across all the clients. The log acts as the first tier where any changes to a file are visible to all the clients. Both tiers are identical in organization, role and functionality to their counterparts in the stand-alone version with one exception. In the distributed mode, flushes to the log and base generate object notifications which are processed by the notification processor. The notification processor uses a publish-subscribe mechanism to broadcast these updates to all the clients. For example, in Figure 5.2, updates to blocks #1, #2 for file A would be propagated from Client 1 to Client 2 when log-object A1 is flushed to the log. When log-objects A1, A2, and A3 are merged and base-object A is created, another notification is created for Clients 1 and 2. This notification updates the fragment maps first and finally the log-objects are deleted.

Master index and Block pointers

In the distributed mode, Agni continues to store the metadata in a distinct metadata server.


The inode key, lookup key and the children key are stored in the master index as before.

However, for certain metadata operations such as updates of the file size and block pointers, a copy of the inode is cached locally on the client for the duration the file is open. The inode metadata is read from the metadata server on fopen(). Any updates to inode metadata are reflected locally and only written back to the metadata server upon fclose(). This provides good performance by avoiding network latency for frequently updated metadata.

We choose to locally cache the inode size and block pointers for now since we expect them to be frequently updated. However, updates to other inode metadata such as uid and gid are reflected immediately because we expect them to be less frequent. Some metadata inconsistencies might arise when the same file is updated across multiple clients.

Fragment map and Auxiliary data structures

The fragment map continues to be similarly organized and the entries retain the same format.

The same auxiliary data structures are also maintained per inode. We modify the location where these data structures are stored to optimize for the distributed mode of operation. If a file is open on a client, then a copy of the block pointers for that file is stored locally. The fragment maps associated with these block pointers are the same as before with a cache, log and base entry. However, the fragment maps stored on the metadata server do not have a cache entry associated with them and only store entries for log and base. The reason for this is that the cache tier is only accessible to a single client and propagating the cache entry to the metadata server serves little purpose. We extend this optimization to the DirtySet and CleanSet as well. These are also stored locally on each client and not

propagated back to the metadata server.

We do maintain the log and the base entries in both locations, the clients and the metadata server. We could store them only in the metadata server because there is only one log and one base irrespective of the number of clients, but storing the log and base entries locally ensures that reads to uncached data are relatively efficient. Similarly, the MergeList is not stored locally on the client; rather, it is only stored in the metadata server. We choose this optimization because the MergeList is only utilized during the merge, which may not occur from that specific client. If a specific client does call coalesce(), then we fetch the MergeList from the metadata server.

5.3 Processing logic

Here we describe how Agni performs read, write, flush, and merge, and manages the cache in a distributed mode. We also detail how we notify the different clients of changes from the object interface.

Open

On a file open request, (i) Agni starts to keep track of any messages from the publish-subscribe messaging service for that file. (ii) Next, it fetches the inode metadata of the file from the metadata server. (iii) Further, it fetches the block pointers and associated fragment maps. (iv) Lastly, it initializes a local CleanSet and DirtySet. This ensures that Agni on that client has a complete view of the data for that file across the different tiers. Any updates received for that file in the middle of this process are buffered and applied afterwards.
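The open sequence could look like the sketch below; the metadata-server and pub/sub interfaces are hypothetical placeholders for Agni's Redis and messaging clients.

```python
def open_file(metadata_server, pubsub, path):
    """Distributed open: subscribe first so no broadcast is missed, then pull the file's metadata."""
    pubsub.subscribe(path)                                    # (i) track broadcasts for this file
    inode = metadata_server.get_inode(path)                   # (ii) inode metadata
    block_ptrs = metadata_server.get_block_pointers(inode)    # (iii) block pointers ...
    frags = {b: metadata_server.get_fragment_map(inode, b)    # ... and their fragment maps
             for b in block_ptrs}
    aux = FileAuxState()                                      # (iv) fresh local CleanSet and DirtySet
    return inode, block_ptrs, frags, aux
```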


(a) Initial state where blocks #2 and #3 of file A are clean in the cache of both Clients 1 and 2. The log-object A1 is located in the file-log.

Read

On a read request, Agni determines the block(s) to access based on offset and utilizes the

local block pointer list to fetch the corresponding fragment map. For each block, if the

most recent entry in the local fragment map is a cache entry then data is read from cache

and returned to the user. If the data is not in local cache, it is loaded from the log and

base tiers. If the same block(s) are accessed across multiple clients, then the file is loaded

multiple times. In Figure 5.3a, on Client 1, a read of block #3 in file A returns block #3 from cache and a read call to block #1 results in an object load.

Load

Load downloads blocks from the object store into cache on a cache read miss. Agni uses the

local fragment map to locate a block in the file-log or base-object. In either case, load uses a


(b) Block #1 is loaded into cache, #2 is updated on Client 1, and blocks #2, #3 are updated on Client 2.

(c) Block #2 is flushed from Client 1 as A2 and blocks #2, #3 are flushed from Client 2 as A3.


(d) Log-objects A1, A2, and A3 are merged with base-object A.

(e) Final state where all the log-objects are merged and Agni assumes the resting state again.

Figure 5.3: Processing logic on file A. The tables represent the fragment maps and auxiliary structures.

ranged GET to read the block from object storage and populate cache. A cache entry is only created locally if needed, TUpdate is set and DirtyFlag turned off. After index updates, data is written to the cache and the block identifier is inserted into the local CleanSet. In

Figure 5.3a, a read call on block #1 from Client 1 executes a load from A's base-object at T0. Figure 5.3b shows the load, marks the cache entry clean, and updates the fragment map and CleanSet only on Client 1. Since the cache is local to only a single client, the fragment map and CleanSet on Client 2 remain unchanged. A subsequent read (not depicted in the Figure) of block #1 on Client 2 will also result in an object load.

Write

On write(), Agni determines the block being written based on the offset, and uses the block pointer to fetch the fragment map. (i) If no local cache entry is present in the local fragment map, a local cache entry is created with TUpdate of current time and is marked

as dirty. The block identifier is inserted in the local DirtySet. All write I/O presented to

Agni is block-aligned; small writes are converted to blocks in the virtual file system that

does read-modify-write on blocks. (ii) If a local cache entry exists but is marked clean, this

indicates a clean cache block. The entry is marked dirty and TUpdate is updated. The block

identifier is moved from the local CleanSet to local DirtySet. (iii) If a local cache entry exists and is marked dirty, TUpdate is updated and no further action is required. In all the above cases, it is possible that there exists another cache entry marked dirty on another client for the same block even if there is no local cache entry, or the local cache entry is marked clean or dirty. However, we only consider the local cache entry in this scenario for performance reasons and thus only offer weak cache coherence.


In Figure 5.3a, a write to block #2 of file A in Client 1 triggers case (iii) and T2 is updated to T4 as shown in Figure 5.3b. Similarly, a write to blocks #2, #3 of file A in Client

2 also triggers case (iii). T2 is updated to T5 and T2 to T6 for blocks #2, #3 respectively.

There is no communication with object storage or between different clients during write.

Flushing the cache

Flush persists dirty blocks by uploading them from the cache to the log and happens individually at each client. There is no explicit synchronization of a flush across multiple clients. Flush is triggered at a client: (i) periodically on an opened file, (ii) when a file is closed by all applications on that client, (iii) under memory pressure, (iv) when an application calls sync(), or (v) on coalesce(). Cases (i) and (ii) occur asynchronously and depend on user workload.

Flush identifies the local dirty blocks for a file using the local DirtySet and uses

MULTI-PART UPLOAD or PUT (depending on the size of the data) to upload the dirty blocks into a log-object. After the file-log is uploaded, a log entry is added to the local fragment maps of the flushed dirty blocks with time TFlush. We set the TFlush to the maximum of

TUpdate among all flushed cache blocks with the same logic as before. We do not explicitly add the log-object name to the MergeList located at the metadata server or the fragment maps of other clients where the file might be open. Rather, we rely on object notifications to be generated from the log-object creation. This notification is processed and the creation is broadcasted to all clients and the metadata server using a scalable publish-subscribe service.

This permits us to not store a list of clients where a specific file is open. Clients only process messages that concern them and update the relevant fragment maps accordingly. The metadata server processes all messages since it is concerned with all files. Upon receiving the message for a log-object being written to the log, the metadata server updates its fragment map. Additionally, the log-object is inserted into the MergeList located at the metadata server for eventual merging. Agni retains recently flushed blocks in cache only on the local client, anticipating their reuse, except when reclaiming memory space. Clients where the file is open also verify whether the cache blocks they hold have been overwritten. If the log-object message contains a block which has been cached locally then it is loaded again.

We choose not to pre-fetch other cache-blocks because we assume that if the block was not in cache then it is probably not being accessed on that specific client.
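Client-side handling of a flush broadcast might look like the following sketch, reusing the earlier hypothetical helpers; the message format, bucket name, and reload policy are illustrative assumptions.

```python
def on_log_object_message(log_name, open_files, bucket="agni-data"):
    """Record the new log entry and reload any locally cached blocks that the log-object overwrote."""
    t_flush, blocks, base_name = parse_log_object_name(log_name)
    state = open_files.get(base_name)
    if state is None:
        return                                           # file not open on this client; ignore
    frags, aux, cache = state
    for pos, b in enumerate(blocks):
        frags[b].setdefault("log", []).append({"t_flush": t_flush, "log_object": log_name})
        if b in cache:                                   # local copy is now stale: re-fetch from the log
            cache[b] = load_block(bucket, log_name, b, offset_in_object=pos * BLOCK_SIZE)
```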

In Figure 5.3c, Agni on Client 1 flushes the only dirty block of file A—block #2—to log-object A2 named T4_#2@A. Log entry T4_#2@A is then inserted into the local fragment map for block #2, and the identifier is moved from the DirtySet to the CleanSet locally. The log-object write generates a notification which is processed, and a message is broadcasted to all the clients. Client 2 also has file A open and processes the message by adding T4_#2@A to its local fragment map. The metadata server also updates the MergeList with the log-object name T4_#2@A. Similarly, Agni on Client 2 flushes the two dirty blocks of file A—blocks #2, #3—to log-object A3 named T6_#2_#3@A. The messages from this log-object write are processed by Client 1, which updates its local fragment map by adding T6_#2_#3@A for blocks #2 and #3. The metadata server updates the MergeList with the log-object name T6_#2_#3@A. In Figure 5.3d, Client 1 determines that it has an outdated copy of blocks #2 and #3. Agni on Client 1 loads those blocks into cache from the log-object T6_#2_#3@A using the updated local fragment map.


Merging the base and log

Merge integrates log-objects into the base-object, making the 1⇒1 mapping eventually consistent. Merge happens: (i) periodically, (ii) after the number of log-objects exceeds a user configured threshold, or (iii) when the user calls coalesce() on a file.

In the distributed mode, the metadata server is responsible for the merge in cases (i) and (ii). The client is responsible for case (iii). Merge proceeds the same as before. The only difference is on completion, with respect to the update of the local and remote fragment maps. The fragment entries are not updated, and neither are the log-objects removed, immediately after the merge process is completed. We use the same mechanism as the flush to broadcast the update using the scalable publish-subscribe messaging service. The clients and the metadata server update the fragment map only after they receive this message. This ensures that everyone is aware of the updated state before the log-objects are deleted. Moreover, it also ensures that any client opening the file before the broadcast is complete can still access data. Once the messages are received, all the clients update their respective fragment maps. The metadata server in addition also deletes the log-objects and updates the MergeList.

In Figure 5.3d, merge begins after A2 and A3 have been flushed to the log. The merge identifies logs A1, A2, and A3 for file A. It reads block #1 from the base-object A, and blocks #2, #3 from A3. It ignores block #2 in A1 and A2, which was overwritten in A3. It also ignores block #3 in A1, which was overwritten in A3. Two worker threads GET the

blocks from the respective log-objects and MULTI-PART UPLOAD them to a new base-object

A. Updates to the fragment map and cleanup of the auxiliary data structures also happen in a similar fashion as before. The write of the new base-object generates a notification which

is broadcasted to all the clients as messages. In Figure 5.3d, after the merge, base-object entries are updated to T6 and log entries for A1, A2, and A3 are removed from the individual fragment maps. The metadata server also removes the log-object names from the MergeList and deletes the file-log. The system ends up in the resting state with no

dirty data in the log or cache (Figure 5.3e).

Close

On a file close request, (i) Agni checks if there are any dirty cache blocks that need to be

flushed. (ii) Next, it flushes the dirty blocks and conveys the file close to the metadata

server. (iii) Further, it removes the local copies of the fragment maps, CleanSet, DirtySet

and empties the cache. (iv) Finally, it switches to ignoring any updates for that file received

over the publish-subscribe messaging service. In a way, a file close performs the actions of a

file open in the reverse order.

Delete

A file delete is performed differently in the distributed mode to handle the case where a file is open across multiple clients. On a file delete request, Agni performs the same functions

as a file close. In addition, it also conveys the file delete request to the metadata server.

The metadata server checks if the file is still open on another client. If the file is not open

anywhere else, then we remove the inode metadata, log-objects and base-object. However,

if it is indeed open, then it is marked for deletion but the metadata is not removed from the

metadata server and is only accessible to that specific client. No other client can access

the file. We also do not delete any of the log-objects or the base. The metadata, log-objects

and base are only removed when the file is closed across all clients. This ensures close-to-POSIX semantics: a file is still accessible across multiple clients even if it is deleted at one. In this case, any data written to the file after the deletion is lost.
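The metadata-server side of this policy can be summarized in a short sketch; the server API names are hypothetical.

```python
def delete_file(metadata_server, inode):
    """Defer physical removal while the file is open anywhere, mirroring the close-to-POSIX behaviour."""
    if metadata_server.open_count(inode) > 0:
        metadata_server.mark_deleted(inode)        # hide from other clients; keep data for the openers
    else:
        metadata_server.remove_inode(inode)        # drop metadata, block pointers, fragment maps
        metadata_server.delete_log_objects(inode)  # reclaim the file-log ...
        metadata_server.delete_base_object(inode)  # ... and the base-object
```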

Cache and Log management

The Agni management daemon in the distributed mode also functions similarly, except that there is one located on each client responsible for local cache management. If cache usage on the client exceeds a user-defined limit, the daemon evicts clean blocks from cache using the CleanSet, and updates the fragment maps. If this is insufficient, it then flushes dirty blocks using the DirtySet and evicts them. In Figure 5.3c, blocks #1 and #3 would be evicted first on Client 1, followed by a flush and evict of block #2. Similarly, on Client 2, blocks #2 and #3 would be flushed and then evicted because there are no clean blocks. There is also

a central daemon that monitors the log, garbage collecting log-objects that have no active

blocks in the MergeList. In Figure 5.3c, A1 and A2 would be reclaimed because of A3.

Object notification

In the distributed mode, the Agni notification processor handles the create exactly the same

as before. However, it handles the deletion and update of an object differently. For delete, Agni performs the same steps as before with slight differences. It first deletes the metadata for the file

from the metadata server, then broadcasts the deletion of the file over the publish-subscribe

service to all clients. Once the metadata server receives the message, it then deletes the

fragment map, block-pointers, log-objects, the file-log and the base-object. Similarly, when

individual clients receive the delete message, they remove their local fragment maps and

cache blocks.

We continue to distinguish between create and update based on the presence of file metadata in the distributed mode as well. The object creation time (TCreate) orders the

PUT data with respect to previous and subsequent writes. However, during an update, we do not rely on the notification processor to update the fragment maps by itself. Rather, it first updates the fragment maps on the metadata server and then uses the publish-subscribe messaging service to broadcast the relevant information to all clients. Each client checks the entries in its fragment map for TUpdate and TFlush less than TCreate and removes them, along with the corresponding local cache blocks. Once the metadata server also receives this message, it proceeds to remove the corresponding log-objects. Data written after TCreate is not overwritten.

5.4 Evaluation

Figure 5.4 depicts the different components of Agni in a distributed design and the interaction between them. All data requests and metadata requests interact directly with object storage and key-value storage respectively. Notifications generated by the object storage are processed by the notification processor and broadcasted to all the clients and the key-value storage using the publish-subscribe service.

Testbed

We evaluate Agni's performance in a distributed configuration. The benchmarking environment is the same as before (AWS) to limit variance in our benchmarks. Agni clients are mounted on ten m5.4xlarge instances. Each instance has 16 vCPUs, 64 GB of RAM, and AWS high network bandwidth. We use a 36 GB in-memory file system on



Figure 5.4: The different modules of Agni and the data flow among them.

each instance as a cache. AWS actively restricts the number of instances of this type to ten and we are unable to scale our benchmark beyond this number. The metadata server is deployed on an r5a.2xlarge instance with 8 vCPUs, 64 GB of RAM, and AWS high network bandwidth. We allocated about 30 GB of RAM to the Redis server and it persists metadata to local disk every 60 seconds. AWS S3 is configured with the default standard class of storage. AWS SNS [132] is used as the publish-subscribe messaging service and AWS Lambda [125] is utilized as the object notification processor. We use IOR [133] to perform all distributed file system benchmarks.

5.4.1 Read and write throughput performance

Agni can linearly scale read and write throughput to individual files. Performance is only



Figure 5.5: Read and write throughput performance as a function of the number of threads. We keep the number of nodes constant as we increase the number of threads.

limited by the available network bandwidth on each instance. In Figure 5.5, we configure

IOR to perform random reads and writes of 4 MB I/O size to individual files of 1600 MB each. We choose a file size of 1600 MB and the same instance type as earlier to ensure limited variance. Read and write throughput performance scales linearly in Agni until 32 threads. Read throughput peaks at ≈6.9 GB/s and write throughput peaks at ≈6.8 GB/s for

80 threads. It does not scale beyond 80 threads because we congest the network bandwidth

on each instance at ≈700 MB/s. An increase in threads beyond this point does not result in any increase in I/O throughput (not shown).

Chapter 6

Policies for a Coherent Namespace

The namespace semantics supported by object storage differ significantly from those of file systems. In this chapter, we explore the different policies we adopt to ensure a coherent namespace across both interfaces.

6.1 Overview

We support three different operating modes in Agni. These modes are depicted in Table 6.1.

Each mode represents a set of different trade-offs to support a diversity of potential user requirements and choices of object storage. We have currently implemented the namespace coherence aspects of these modes and plan to implement unified access control in the future. However, in the following discussion, we briefly include the nature of access control each mode is expected to support.

The two lower numbered modes (Sections 6.2 and 6.3) offer completely intuitive dual access and the best object interface performance, but they only support basic object interfaces.


Moreover, they have a higher possibility of namespace incoherence, which will not lead to loss of data but will require user intervention to resolve. The access control in these modes will be dependent on the support offered natively by the object store. The highest numbered mode (Section 6.4) offers no namespace incoherence, and file-like operations such as the creation and deletion of directories, symlinks, and hardlinks via the object interface.

The access control will be uniform and independent of the object store. All these features are possible because of the use of a special library which wraps around the existing object interface libraries. We term this library Urial.1 But this mode has decreased object interface performance and limited intuitiveness.

Agni must be set to a specific mode at the time of initialization and it continues to operate in that mode until it is either deleted or migrated to a different mode. The modes are not interoperable since they adopt different object naming policies, but a deployment can be migrated to another mode using a special migration process.

1 Agni is typically depicted riding on a wild sheep called Urial. The name of our library is inspired by this animal. Just as the deity rides on his mount, our system uses this special library to enable additional functionality.

Table 6.1: Operating modes in Agni

        File interface                                   Object interface
Mode    Directory Operations  Symlink  Hardlink          Directory Operations  Symlink  Hardlink  Urial  Optimum renames  Independent access control
I       ✔                     ✔        ✔                 ✘                     ✘        ✘         ✘      ✘                ✘
II      ✔                     ✔        ✔                 ✔                     ✘        ✘         ✘      ✘                ✘
III     ✔                     ✔        ✔                 ✔                     ✔        ✔         ✔      ✔                ✔


Incoherence directory

Allowing for dual access can lead to multiple namespace inconsistencies, primarily in the file system. There are multiple ways to resolve this issue. One way could be to treat the object as invalid and remove it using the notification processor. However, this could potentially lead to data loss, which is not desirable. Agni adopts a pragmatic approach and treats this in different ways based on the capability of the file system. In the lower numbered modes, I and II, a /.unresolved/ directory is created at the location where the inconsistency was created. This acts as an indication for the user to intervene and fix the incoherence. Attaching the unresolved prefix ensures that existing data in the correct location is not overwritten.

In addition, a symlink is also created in the /.unresolved/ directory located at the root of the file system. This provides the user with a single location for all namespace inconsistencies in Agni.
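As a concrete illustration of this placement policy (not Agni's actual code), a path-only sketch might look as follows; the function and argument names are hypothetical.

```python
import posixpath

UNRESOLVED = ".unresolved"

def unresolved_location(conflict_dir: str, remainder: str) -> str:
    """Return the file system path under which an incoherent entry is surfaced.

    conflict_dir is the deepest prefix that is still a valid directory and
    remainder is the portion of the name that conflicts with existing data.
    For example, a conflict at the root yields "/.unresolved/<remainder>",
    while a conflict inside A/ yields "A/.unresolved/<remainder>". A symlink
    to this location is additionally placed under /.unresolved/ at the root,
    giving the user a single place to find all incoherences.
    """
    return posixpath.join(conflict_dir, UNRESOLVED, remainder)

# Example: a conflict at the root for the name "A/B/C"
print(unresolved_location("/", "A/B/C"))   # -> /.unresolved/A/B/C
```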

6.2 Mode I

Summary

We designed this mode for those use-cases which have no requirement for renames and can deal with some degree of namespace incoherence. It supports native object interfaces but has no support for directory creation, deletion, symlinks or hardlinks through the object interface. It offers the best object interface performance because it does not rely on Urial for any object interface execution. Mode I also offers complete dual intuitiveness because it adopts a file-path object naming policy and there is a direct 1⇒1 file-to-object mapping.

These policies also cause it to have expensive file and object renames and the highest possibility of a namespace incoherence.


PUT object without a trailing slash

Object storage, unlike a file system, does not distinguish between a create and an update. We treat this as a valid operation, since it equates to either the creation of a new file or an update to an existing file. If no object exists, then one is created; if an object already exists, then it is overwritten. Any missing directories in the file path are populated to ensure that the data is reachable through the file system. There are cases when this could lead to an incoherence, such as when a prefix of the path name is already an existing file. The file system metadata is populated via the notification processor after the object has been created. Any incoherences are detected during this stage and marked as .unresolved; a minimal sketch of this processing follows the example below.

We further describe this with an example where the user calls PUT with an object named A/B/C.

1. An object with the name A/B/C is created at TCreation. A corresponding event for

object creation is generated by the object store and transmitted to the notification

processor for further actions.

2. If object A/B/C does not exist then this is classified as a CREATE.

(a) Inode metadata, such as size and uid, is determined from the object metadata.

(b) The block pointers are populated with a single Base entry in the fragment map

with TCreation.

3. Else, if object A/B/C exists then this is classified as an UPDATE.

(a) Inode metadata is determined from the new object.

(b) The fragment maps are updated with a single Base entry with TCreation.


(c) Any other entry with TEntry less than TCreation is removed.

(d) Any corresponding log objects and cache blocks are removed. MergeList is also

updated.

4. The notification processor checks if directories A/ and A/B/ exist. Agni, unlike ProxyFS, does not differentiate between directories created via either interface.

(a) If both the directories exist, then metadata for C is populated.

(b) If one or more directories do not exist, then they are populated.

5. If either A or A/B exists as an individual object, this translates to an individual file A or A/B in the file system. But the user request indicates that A/ and A/B/ are directories, leading to an incoherence. For example, if A existed then the notification processor would create .unresolved/A/B/C at /. If A/B had existed then the notification processor would create .unresolved/B/C at A/.
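The minimal sketch below condenses the steps above into notification-processor form; `index` and its methods are hypothetical stand-ins for Agni's master-index calls, not its real API.

```python
import posixpath

def on_put_without_slash(index, key: str, t_creation: float) -> None:
    """Sketch of Mode I notification handling for a PUT of an object like A/B/C."""
    if not index.exists(key):                              # step 2: CREATE
        index.create_inode(key, from_object_metadata=True)
        index.set_fragments(key, [("Base", t_creation)])
    else:                                                  # step 3: UPDATE
        index.update_inode(key, from_object_metadata=True)
        index.set_fragments(key, [("Base", t_creation)])
        index.drop_entries_older_than(key, t_creation)     # step 3(c)
        index.drop_log_objects_and_cache(key)              # step 3(d)

    # Steps 4-5: make the file reachable, or surface an incoherence.
    conflict = index.first_file_prefix(key)   # e.g. "A" or "A/B", else None
    if conflict is None:
        index.populate_missing_directories(posixpath.dirname(key))
    else:
        index.mark_unresolved(conflict, key)  # placed under /.unresolved/
```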

PUT object with a trailing slash

Object storage considers an object with a trailing slash as a valid operation and allows its creation. However, from a file system's perspective this is a directory, which makes the data unreachable through the file interface. This mode does not support the creation of directories through either interface. Thus, we assume this case to be a namespace incoherence and an inadvertent mistake by the user. The object is renamed with a /.unresolved/ prefix and it can be resolved by user intervention. There are three possible actions that the user could take in this case: (i) rename the object, (ii) move it to a different location, or (iii) remove it altogether.


We illustrate this further with an example where the user wishes to insert an object named A/B/D/ through the object interface.

1. An object with name A/B/D/ is created. Any data existing in object A/B/D/ is

overwritten. This is treated as a file creation and the same set of actions are performed

as mentioned above.

2. In the file system the file is named A/B/.unresolved/D/. If there is a file named A/B/D or a directory named A/B/D/, the new name does not overwrite it.

GET object without a trailing slash

We treat this as a valid read operation and do not perform any additional steps. If the object exists, it is returned back to the user, else an error is reported back to the user. An object might not exist because either there is no corresponding file in the file system or the file exists but the base object for it has not been created yet.

GET object with a trailing slash

We do not support any directory operations in this mode. In most cases the request returns an error because no such object exists. There is one case where an object might exist: if it was inadvertently created and the incoherence has not been resolved yet. Here, some piece of data will be returned by the request. The above mentioned cases occur independently of the existence of a directory of that name in the file system.

DEL object without a trailing slash

We treat this as a valid operation because it equates to the deletion of a file. If an object exists then it is deleted from the object storage, else an error is reported to the user. The notification processor removes all the inode metadata, block pointers, and auxiliary data structures from the master index. It also deletes any cache blocks or log objects that might exist.

DEL object with a trailing slash

We are presented with a dilemma for this operation because we do not support any directory operations through the object interface. In the case when there is no incoherence, an error is reported back to the user since no such object should exist. However, an object can exist if the user created it inadvertently through the object interface and it has not yet been resolved by user intervention. We choose to treat this case as any other file deletion and follow the same set of steps discussed above.

6.3 Mode II

Summary

We designed this mode for those use-cases which can operate in Mode I but also want the ability to organize their data through the object interface. We support the creation and deletion of directories via the object interface in addition to the interfaces supported by Mode I. PUT and DEL of objects with trailing slashes are thus handled differently compared to Mode I. Objects with trailing-slash names and zero content length are treated as directories, not files.


PUT object with trailing slash

We equate this operation to a directory creation if the content length of the object is zero.

The content length being zero indicates that the object is just a placeholder and contains no data. However, if the object has a non-zero content length, we assume that there is data and treat it as an incoherence to ensure that the data is reachable in the file system. These checks are carried out by the notification processor after the object has been created. In addition, it also checks if any other incoherence might have been caused, such as missing directories; a condensed sketch of this handling follows the example below.

We demonstrate this further with an example below where the user calls PUT with an object named A/B/C/.

1. An object called A/B/C/ is created at TCreation. A corresponding event for this is generated and sent to the notification processor.

2. If the content length is zero, then the corresponding metadata, auxiliary data structures, and block pointers are populated in the master index.

3. If the directories A/ or A/B/ do not exist, they are also populated.

4. Else, it is treated as an incoherence. An incoherence can also occur if an object named

A/B/D also exists. In both cases, it is placed in the /.unresolved/ directory.
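The condensed sketch referenced above reduces this decision to a content-length check; the helpers on `index` are hypothetical stand-ins for Agni's master-index calls.

```python
def on_put_with_slash_mode2(index, key: str, content_length: int,
                            t_creation: float) -> None:
    """Sketch of Mode II notification handling for a PUT of an object like A/B/C/."""
    path = key.rstrip("/")
    if content_length == 0 and not index.conflicts_with_existing_file(path):
        # A zero-length object with a trailing slash is a directory placeholder:
        # populate its metadata and any missing parent directories.
        index.create_directory(path, t_creation)
        index.populate_missing_directories(path)
    else:
        # Non-zero content, or a name that collides with existing data, is an
        # incoherence and is surfaced in the /.unresolved/ directory.
        index.mark_unresolved_at_root(key)
```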

DEL object with trailing slash

We associate this operation with a directory deletion if the content length of the deleted object is zero. A content length of zero indicates that this was a placeholder object for a directory and did not contain any data. In this case, we not only delete the object but also all objects which are nested as sub-files and sub-directories under this directory path. This operation is the same as executing rm -rf in a file system. However, the scenario becomes interesting if the deleted object had a non-zero content length. This case can only occur if there was an incoherence to begin with, because this mode does not support non-zero length objects with trailing slashes. We surmise that the user inadvertently caused an incoherence and is now attempting to rectify it. We also validate whether the directory path is valid or not. There is a possibility that the directory creation caused an incoherence which has not been rectified yet. In that case we only delete the object, not the sub-files and sub-directories. If the object itself does not exist, then an error is reported back to the user.

We elucidate this further with an example below where the user issues DEL on an object named A/B/D/; a condensed sketch of this decision process follows the example.

1. If the content length is greater than zero, the object is treated the same as in Mode I.

2. Else, if the content length is zero, we check if this is an /.unresolved/ directory. If the

directory is found here then we remove the directory entry assuming this was an error.

3. Finally, we verify if the directory path is valid or not:

(a) If the path is invalid then the request is ignored. For example, the path is invalid if A or A/B exists as a file, because this invalidates the existence of A/ and A/B/.

(b) Else, if the path is valid, we remove all directory and file objects and any unresolved inconsistencies encompassed under this file path. The same steps are followed for log objects, cache blocks, and fragment maps as in Mode I.
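The sketch referenced above captures this decision process; all helpers on `index` and `object_store` are hypothetical stand-ins rather than Agni's actual interfaces.

```python
def on_delete_with_slash_mode2(index, object_store, key: str,
                               content_length: int) -> None:
    """Sketch of Mode II handling when an object such as A/B/D/ is deleted."""
    path = key.rstrip("/")

    if content_length > 0:
        # Could only have existed as an unresolved incoherence: treat it as an
        # ordinary file deletion, exactly as in Mode I.
        index.remove_all_metadata(path)
        return

    if index.is_unresolved(path):
        # The directory itself was an unrectified incoherence: remove only the
        # directory entry, not any nested entries.
        index.remove_unresolved_entry(path)
        return

    if not index.is_valid_directory_path(path):
        return  # e.g. A or A/B exists as a file; the request is ignored

    # Valid directory: remove everything nested under it, like `rm -rf`,
    # including log objects, cache blocks and fragment maps as in Mode I.
    for child in index.list_subtree(path):
        object_store.delete(child)
        index.remove_all_metadata(child)
    index.remove_directory(path)
```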


6.4 Mode III

Summary

We designed this mode for those use-cases which have a requirement for optimal renames and cannot tolerate any namespace incoherence. It supports additional object interfaces compared to earlier modes such as the ability to create or delete symlinks and hardlinks.

Mode III does so by using Urial and adopting an inode object naming policy, while continuing to use a 1⇒1 file-to-object mapping. These policies and Urial allow for optimal renames of objects and directories and also remove any possibility of a namespace incoherence. But Urial needs to access the master index frequently to check the file-path-to-inode mapping or to determine whether an operation will lead to an incoherence. This can cause a drop in object interface performance. We quantify this overhead in Section 6.5.

PUT object without trailing slash

We equate this operation to the creation of a new file or an update of an existing file. This is in essence similar to the operation in Modes I and II. However, there are major differences in how these operations are handled. Urial first checks if the creation of this object will lead to an incoherence or not. If it does, then an error is reported back to the user and the object is never created. This incoherence can occur in the case of file path names which would overwrite existing file names. Once it is determined that there is no incoherence, Urial checks if this file exists or not. If it does not, then Urial fetches a unique inode number from the master index. A new object is created with that unique inode number. The metadata population for this object is then handed over to the notification processor. If the file does exist, then Urial fetches the inode number of the existing file and overwrites the object. This also causes the removal of existing cache blocks and log objects as described in Mode I. A minimal sketch of this path follows the example below.

We illustrate this below with an example of a PUT on an object A/B/C.

1. First, we validate the file path A/B/C in the master index using the special library. Here, we check for the existence of A or A/B, which would potentially cause an incoherence. If an incoherence exists, then an error is returned back to the user. We also check if A/B/C itself exists.

2. Next, if the path is valid and A/B/C does not exist then this is identified as a CREATE

(a) We fetch a unique inode number IN+1 from the master index such that the last inode number used is IN. The created object is named IN+1.

(b) The notification processor populates the metadata to map A/B/C to IN+1. Any

missing directories such as A/ or A/B/ are also populated.

3. Else, if A/B/C does exist, then this is treated as an UPDATE.

(a) We fetch the specific inode number that maps to A/B/C and overwrite the

corresponding object.

(b) The notification processor updates the file metadata such as size and uid. Similar to Mode I, we update block pointers and fragment maps, and delete log objects and cache blocks.
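The sketch referenced above outlines this client-side path through Urial; the `index` and `object_store` helpers and the S3-style put call are hypothetical stand-ins rather than Urial's real interfaces.

```python
def urial_put(index, object_store, bucket: str, path: str, data: bytes) -> None:
    """Sketch of a Mode III PUT through Urial for a file path such as A/B/C."""
    # Step 1: validate the path; an existing file A or A/B would make A/ or
    # A/B/ impossible, so the request is rejected before any object is written.
    if index.path_conflicts(path):
        raise ValueError(f"{path} would create a namespace incoherence")

    inode = index.lookup_inode(path)
    if inode is None:
        # Step 2: CREATE - reserve the next inode number and name the object
        # by it; the notification processor later maps path -> inode and fills
        # in any missing parent directories.
        inode = index.allocate_inode()
        object_store.put(bucket, key=str(inode), body=data)
    else:
        # Step 3: UPDATE - overwrite the object the existing inode points to;
        # cache blocks and log objects are then invalidated as in Mode I.
        object_store.put(bucket, key=str(inode), body=data)
```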

PUT object with trailing slash

We identify this operation as a directory creation, similar to Mode II. Objects with content length zero are assumed to be valid directory creations. However, there is a difference in its implementation: no objects are created in Mode III. Urial fetches a unique inode number and then performs the same steps as Mode I to populate the directory metadata.

We describe this with an example of a PUT on an object A/B/D/.

1. If the content length is zero, this is treated as a valid object creation operation and

Urial checks if A/B/D/ exists in the namespace. It also checks if this is indeed a valid

directory path and no files such as A/B or A exist.

(a) If A/B/D/ exists, then an error is reported back to the user that states the

directory already exists.

(b) Else, if it is not a valid path, then an error is reported back to the user that states

the path is not valid.

(c) Else, if A/B/D/ does not exist, the master index is populated. No object is

created in this operation. Any missing directories such as A/B/ and A/ are also

populated.

GET object without trailing slash

We treat this as a valid read operation but need to perform additional steps compared to

Mode I to read this object. This mode adopts an inode object naming policy and Urial needs to fetch the corresponding inode number for a given file path before it can read from the object store. If the file exists then the corresponding object is read, else an error is reported back to the user.


GET object with trailing slash

Reading a directory is not very useful from a file system or an object storage perspective.

We interpret this as an ls of the directory. Urial checks if the directory exists or not. If it does, then a list of sub-files and sub-directories, if any, is returned. An error is reported back to the user if no such directory exists.

DEL object without trailing slash

We equate this operation to the deletion of a file. The implementation of this functionality is similar to that of Mode I. Urial does not check the existence of the object and lets the object storage report an error if the object does not exist. The notification processor removes the file metadata from the master index and cleans up the cache blocks and log objects.

DEL object with trailing slash

We treat this operation as a directory deletion, the same as Mode II. The difference in Mode III is that we do not need to check the object content length since no object exists. Instead, Urial checks if the directory path is indeed valid and removes the file metadata from the master index. It also triggers the notification processor to delete all sub-files, sub-directories, associated metadata, cache blocks and object logs. The implementation of this function is similar to that of Mode II. If the directory does not exist then an error is reported back to the user.

Symlinks and Hardlinks through object interfaces

Object storage does not support the concept of symlinks or hardlinks. Mode III is able to support them in all aspects using Urial. When fetching data through a GET, symlinks or hardlinks are not treated any differently than a normal GET. The only exception to this is if the symlink points to a non-existent object, in which case an error is reported back to the user. The deletion of symlinks and hardlinks is treated with the same logic. In the case of a symlink, a delete only results in the deletion of the inode metadata associated with that symlink, not the file itself. Similarly, when deleting a file, it is checked whether there is a hardlink to the file or not. If a hardlink exists then the contents of the file are not deleted. We also support additional interfaces which are not traditional object interfaces but are supported by Urial. These interfaces allow us to create a symlink or a hardlink via the object interface.

These interfaces are termed PUT-SLINK and PUT-HLINK, to insert a symlink and a hardlink respectively. Urial updates the master index for them and, unlike a file system, does not create any additional objects in the case of hardlinks.
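The sketch below illustrates the deletion logic for links in this mode; the entry structure, link-count bookkeeping, and helper names are hypothetical, not Urial's actual API.

```python
def urial_delete(index, object_store, bucket: str, path: str) -> None:
    """Sketch of a Mode III delete when the target may be a symlink or hardlink."""
    entry = index.lookup(path)
    if entry is None:
        raise FileNotFoundError(path)

    if entry.is_symlink:
        # Deleting a symlink removes only the link's own metadata;
        # the file it points to is untouched.
        index.remove_entry(path)
        return

    index.remove_entry(path)
    if index.link_count(entry.inode) > 0:
        # Another hardlink still references the inode, so the contents stay.
        return
    object_store.delete(bucket, key=str(entry.inode))
    index.remove_inode(entry.inode)
```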

6.5 Evaluation

Urial experiences some degradation in object request rate. The request rate is important

from the user’s perspective because it determines the file system mode a user might want to

adopt for their application. For this benchmark, we issue 10,000 object interface requests for

zero content-length objects. This allows us to accurately isolate the overhead from Urial by

removing potential bottlenecks from the network. We explore the performance of two main

requests: (i) GETs and (ii) PUTs. For each type of request we compare the performance of the direct object interface with that of Urial for different file path lengths. We benchmark two specific file path lengths: (1) length 1, where all the files are located in the root directory, and (2) length 10, where all the files are located in 10 nested directories. The file path


lengths are varied to determine the effect of file path length on Urial's performance. We choose to stop at directory depth 10 because files rarely have paths longer than this.
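For reference, a micro-benchmark of this shape can be reproduced with a few lines of boto3 as sketched below; the bucket name, key layout, and thread counts are placeholders, and this sketch exercises only the direct object interface (the Urial variant would route the same requests through its library).

```python
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = "agni-benchmark"      # placeholder bucket name
NUM_REQUESTS = 10_000
DEPTH = 10                     # nesting depth of the file path (1 or 10)

s3 = boto3.client("s3")

def put_zero_length(i: int) -> float:
    """Issue one zero-content-length PUT and return its latency in seconds."""
    key = "/".join(["d"] * DEPTH + [f"file-{i}"])    # e.g. d/d/.../file-42
    start = time.perf_counter()
    s3.put_object(Bucket=BUCKET, Key=key, Body=b"")
    return time.perf_counter() - start

def run(threads: int) -> None:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        latencies = list(pool.map(put_zero_length, range(NUM_REQUESTS)))
    elapsed = time.perf_counter() - start
    median = sorted(latencies)[len(latencies) // 2]
    print(f"{threads:3d} threads: {NUM_REQUESTS / elapsed:6.0f} PUTs/s, "
          f"median latency {median * 1000:.1f} ms")

if __name__ == "__main__":
    for threads in (1, 2, 4, 8, 16, 32, 64):
        run(threads)
```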

In Figure 6.1a, the direct interface reaches a peak of ≈396 GETs/s for 64 threads. Urial suffers a degradation in request rate and reaches a peak of ≈390/s and ≈398/s for file

paths of length 1 and 10 respectively. In general, Urial degrades the GET performance by

about 1%-20% depending on configuration, file path and number of threads. The request

rate suffers some degradation between differing file path lengths because a longer file path requires more requests to the master index. The request rate scales ideally until 32 threads

and is independent of the object size. It does not double beyond 32 threads and peaks at 64 threads. The specific AWS instance has 32 cores and we are limited by the number of

simultaneous threads we can launch.


(a) Rate of GET requests with varying number of worker threads. (b) CDF of GET request latency with 64 threads.

Figure 6.1: GET performance in Urial compared to the native interface. Each workload generates 10,000 requests.

In Figure 6.2a, the direct interface reaches a peak of ≈393 PUTs/s for 64 threads. Urial suffers a degradation in request rate and reaches a peak of ≈372/s and ≈373/s for

file paths of length 1 and 10 respectively. In general, Urial degrades the PUT performance

by about 1%-10% depending on configuration and number of threads. The request rate is

independent of object size and the file path because Urial only makes a single request to

fetch a unique inode number for every PUT request. The request rate scales ideally until 32

threads, but it does not double beyond 32 threads and peaks at 64 threads. The limited number of cores is again the reason for this.

To characterize the request latency experienced by the user, we present data from the previous experiments that depict the latency of individual requests. Figures 6.1b and 6.2b present a cumulative distribution of request latencies for GET and PUT requests. We prune the requests beyond 0.25 seconds and 0.30 seconds in Figures 6.1b and 6.2b respectively to depict the latency difference between modes and to remove any straggler requests. In Figure 6.1b,

GET requests through the direct object interface have smaller latency, by about 0.01-0.05 seconds, compared to Urial. There is a visible difference between requests with differing file paths, which arises from the additional requests to the master index for longer paths. In Figure 6.2b, PUT requests through the direct object interface have smaller latency, by about 0.02 seconds, compared to Urial. There is no visible difference between requests with differing file paths because they make the same number of requests to the master index. The long tail in request latency in Figures 6.1b and 6.2b arises from the variance in object store performance. Variance is also visible in Figures 6.1a and 6.2a for the same reason.


(a) Rate of PUT requests with varying number of worker threads. (b) CDF of PUT request latency with 64 threads.

Figure 6.2: PUT performance in Urial compared to the native interface. Each workload generates 10,000 requests.

Chapter 7

Conclusion

In this thesis, we have presented two hierarchical storage systems in the cloud that overcome the existing limitations of object storage. We chose the AWS platform to develop and deploy both our systems. This choice is motivated by our prior experience with the platform. However, our design principles and data structures are not limited to this platform and could be implemented with other cloud providers.

First, we develop NDStore by complementing object storage with other low-latency cloud storage services such as memory clusters. This design (i) supports latency sensitive workloads, and (ii) can sustain high data I/O performance in the presence of random write bursts by aggregating writes in memory. NDStore has been adopted by the Johns Hopkins University Applied Physics Laboratory (JHU-APL) as the Block Object Storage Service (bossDB) [134].

Next, we design Agni, an efficient dual-access object storage file system based on the principles used in NDStore. We implement a write-aggregating multi-tier data structure in Agni such that (i) it offers intuitive dual access, yet (ii) efficiently supports the full spectrum of file interfaces without moving data between disparate systems, and (iii) is based on an existing set of well established systems and interfaces. Further, we develop Urial, which supports Agni and prevents namespace incoherence arising out of discrepancies between file and object semantics in dual-access systems.

Agni requires further development, and lacks some attributes, such as unified access control, that are essential for its deployment in a production environment. As cloud object storage matures and user requirements grow, the specifications for dual-access storage systems will also evolve. Many enterprises are looking to move towards multi-cloud or hybrid-cloud setups, and this will pose new challenges. In the near future, dual-access systems will also need to accommodate applications that require strong consistency during data access across both interfaces.

Bibliography

[1] D. Reinsel, J. Gantz, and J. Rydning, “The Digitization of the World From

Edge to Core,” International Data Corporation, Tech. Rep. S44413318, Nov 2018.

[Online]. Available: https://www.seagate.com/files/www-content/our-story/trends/

files/idc-seagate-dataage-whitepaper.pdf

[2] V. Turner, J. F. Gantz, D. Reinsel, and S. Minton, “The Digital Universe of Oppor-

tunities: Rich Data and the Increasing Value of the Internet of Things,” IDC Analyze

the Future, p. 5, 2014.

[3] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski,

G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, “A View of Cloud

Computing,” Commun. ACM, vol. 53, no. 4, pp. 50–58, Apr. 2010. [Online]. Available:

http://doi.acm.org/10.1145/1721654.1721672

[4] “AWS DynamoDB,” https://aws.amazon.com/dynamodb/. [Online]. Available: https:

//aws.amazon.com/dynamodb/

[5] A. Chandrasekaran, R. Bala, and G. Landers, “Critical Capabilities for Object Stor-

age,” Gartner, 2017.


[6] S. Chavan, T. Banerji, and R. Nair, “Cloud Object Storage Market Research Report

- Global Forecast 2023,” Wise Guy Reports, Tech. Rep. WGR3496538, October 2018.

[7] A. Cockroft, C. Hicks, and G. Orzell, “Lessons Netflix learned from the AWS outage,”

Netflix Techblog, 2011.

[8] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, “The Cost

of Doing Science on the Cloud: The Montage Example,” in Proceedings of

the 2008 ACM/IEEE Conference on Supercomputing, ser. SC ’08. Piscataway,

NJ, USA: IEEE Press, 2008, pp. 50:1–50:12. [Online]. Available: http:

//dl.acm.org/citation.cfm?id=1413370.1413421

[9] E. Ottem, “Don’t Get Left Behind! Top Reasons You Need Object Storage,”

Western Digital Blog, Jun 2017. [Online]. Available: https://blog.westerndigital.com/

reasons-need-object-storage/

[10] A. Alkalay, “Object storage benefits, myths and options,” IBM Blog, Feb

2017. [Online]. Available: https://www.ibm.com/blogs/cloud-computing/2017/02/

01/object-storage-benefits-myths-and-options/

[11] “Amazon Simple Storage Service (S3),” https://aws.amazon.com/s3/. [Online].

Available: https://aws.amazon.com/s3/

[12] “Azure Blob Storage,” https://azure.microsoft.com/en-us/services/storage/blobs/.

[Online]. Available: https://azure.microsoft.com/en-us/services/storage/blobs/

[13] “Google Cloud Storage,” https://cloud.google.com/storage/. [Online]. Available:

https://cloud.google.com/storage/


[14] “IBM Cloud Object Storage,” https://www.ibm.com/cloud/object-storage. [Online].

Available: https://www.ibm.com/cloud/object-storage

[15] “Minio,” https://www.minio.io/. [Online]. Available: https://www.minio.io/

[16] “Caringo Swarm,” https://www.caringo.com/products/swarm/. [Online]. Available:

https://www.caringo.com/products/swarm/

[17] A. Potnis, “Worldwide File-Based Storage Forecast, 2018–2022: Storage by Deploy-

ment Location,” International Data Corporation, Tech. Rep. US44457018, Dec 2018.

[18] E. Yuen, “Unlocking the Power of Analytics with an Always-on Data Lake,” The

Enterprise Strategy Group, Tech. Rep., Nov 2017.

[19] K. Vahi, M. Rynge, G. Juve, R. Mayani, and E. Deelman, “Rethinking Data Manage-

ment for Scientific Workflows,” in 2013 IEEE International Conference on

Big Data, Oct 2013, pp. 27–35.

[20] A. O’Driscoll, J. Daugelaite, and R. D. Sleator, “‘Big data’, Hadoop and cloud com-

puting in genomics,” Journal of Biomedical Informatics, vol. 46, no. 5, pp. 774–781,

2013.

[21] B. Langmead, K. D. Hansen, and J. T. Leek, “Cloud-scale RNA-sequencing differential

expression analysis with Myrna,” Genome biology, vol. 11, no. 8, p. R83, 2010.

[22] M. R. Palankar, A. Iamnitchi, M. Ripeanu, and S. Garfinkel, “ for Science

Grids: A Viable Solution?” in Proceedings of the 2008 International Workshop on


Data-aware , ser. DADC ’08. New York, NY, USA: ACM,

2008, pp. 55–64. [Online]. Available: http://doi.acm.org/10.1145/1383519.1383526

[23] “AWS FSx,” https://aws.amazon.com/fsx/. [Online]. Available: https://aws.amazon.

com/fsx/

[24] “Elastifile Cloud File System,” https://www.elastifile.com/product/. [Online].

Available: https://www.elastifile.com/product/

[25] “Panzure CloudFS,” https://panzura.com/technology/panzura-cloudfs/. [Online].

Available: https://panzura.com/technology/panzura-cloudfs/

[26] H. Smith, Storage: Cost-Effective Strategies, Implementation, and Man-

agement. CRC Press, 2016.

[27] O. Ozeri, E. Ofer, and R. Kat, “Keeping Deep Learning GPUs Well Fed Using

Object Storage,” in Proceedings of the 11th ACM International Systems and Storage

Conference, ser. SYSTOR ’18. New York, NY, USA: ACM, 2018, pp. 128–128.

[Online]. Available: http://doi.acm.org/10.1145/3211890.3211910

[28] J. Inman, W. Vining, G. Ransom, and G. Grider, “MarFS, a Near-POSIX Interface

to Cloud Objects,” ; Login, vol. 42, no. LA-UR–16-28720; LA-UR–16-28952, 2017.

[29] R. Rizun, “S3Fuse,” https://github.com/s3fs-fuse/s3fs-fuse. [Online]. Available:

https://github.com/s3fs-fuse/s3fs-fuse

[30] K.-H. Cheung, “Goofys,” https://github.com/kahing/goofys. [Online]. Available:

https://github.com/kahing/goofys


[31] D. Poccia, “YAS3FS,” https://github.com/danilop/yas3fs. [Online]. Available:

https://github.com/danilop/yas3fs

[32] M. Vrable, S. Savage, and G. M. Voelker, “BlueSky: A Cloud-backed File System for

the Enterprise,” in Proceedings of the 10th USENIX Conference on File and Storage

Technologies, ser. FAST’12. Berkeley, CA, USA: USENIX Association, 2012, pp.

19–19. [Online]. Available: http://dl.acm.org/citation.cfm?id=2208461.2208480

[33] R. Burns, K. Lillaney, D. R. Berger, L. Grosenick, K. Deisseroth, R. C. Reid,

W. G. Roncal, P. Manavalan, D. D. Bock, N. Kasthuri, M. Kazhdan, S. J. Smith,

D. Kleissas, E. Perlman, K. Chung, N. C. Weiler, J. Lichtman, A. S. Szalay, J. T.

Vogelstein, and R. J. Vogelstein, “The Open Connectome Project Data Cluster:

Scalable Analysis and Vision for High-throughput Neuroscience,” in Proceedings of

the 25th International Conference on Scientific and Statistical Database Management,

ser. SSDBM. New York, NY, USA: ACM, 2013, pp. 27:1–27:11. [Online]. Available:

http://doi.acm.org/10.1145/2484838.2484870

[34] A. S. Szalay, K. Church, C. Meneveau, A. Terzis, and S. Zeger, “MRI: The Devel-

opment of Data-Scope—a multi-petabyte generic data analysis environment for sci-

ence,” Available at https://wiki.pha.jhu.edu/escience_wiki/images/7/7f/DataScope.

pdf, 2012.

[35] W. R. G. Roncal, D. M. Kleissas, J. T. Vogelstein, P. Manavalan, K. Lillaney,

M. Pekala, R. Burns, R. J. Vogelstein, C. E. Priebe, M. A. Chevillet et al.,

“An automated images-to-graphs framework for high resolution connectomics,”


Frontiers in neuroinformatics, vol. 9, p. 20, 2015. [Online]. Available: https:

//www.frontiersin.org/article/10.3389/fninf.2015.00020

[36] D. M. Kleissas, W. Gray Roncal, P. Manavalan, J. T. Vogelstein, D. D.

Bock, R. Burns, and R. J. Vogelstein, “Large-Scale Synapse Detection Using

CAJAL3D,” Neuroinformatics, 2013. [Online]. Available: http://www.frontiersin.org/

neuroinformatics/10.3389/conf.fninf.2013.09.00037/full

[37] T. Pietzsch, S. Saalfeld, S. Preibisch, and P. Tomancak, “BigDataViewer:

visualization and processing for large image data sets,” Nature Methods, vol. 12,

no. 6, pp. 481–483, 2015. [Online]. Available: https://doi.org/10.1038/nmeth.3392

[38] “NeuroGlancer,” https://github.com/google/neuroglancer. [Online]. Available: https:

//github.com/google/neuroglancer

[39] “Machine Intelligence from Cortical Networks (MICrONS) ,” https://www.iarpa.

gov/index.php/research-programs/microns/microns-baa, 2014. [Online]. Available:

https://www.iarpa.gov/index.php/research-programs/microns/microns-baa

[40] A. Abbott et al., “Solving the brain,” Nature, vol. 499, no. 7458, pp. 272–274, 2013.

[41] K. R. Jackson, K. Muriki, L. Ramakrishnan, K. J. Runge, and R. C. Thomas,

“Performance and Cost Analysis of the Supernova Factory on the Amazon AWS

Cloud,” Sci. Program., vol. 19, no. 2-3, pp. 107–119, Apr 2011. [Online]. Available:

http://dx.doi.org/10.1155/2011/498542

[42] M. C. Schatz, B. Langmead, and S. L. Salzberg, “Cloud computing and the DNA


data race,” Nature Biotechnology, vol. 28, no. 7, p. 691, 2010. [Online]. Available:

https://doi.org/10.1038/nbt0710-691

[43] J. L. Hellerstein, K. J. J. Kohlhoff, and D. E. Konerding, “Science in the Cloud:

Accelerating Discovery in the 21st Century,” IEEE Internet Computing, vol. 16, no. 4,

pp. 64–68, Jul. 2012. [Online]. Available: http://dx.doi.org/10.1109/MIC.2012.87

[44] A. Thakar, A. Szalay, K. Church, and A. Terzis, “Large Science Databases - Are

Cloud Services Ready for Them?” Sci. Program., vol. 19, no. 2-3, pp. 147–159, Apr

2011. [Online]. Available: http://dx.doi.org/10.1155/2011/591536

[45] “Amazon Web Services,” https://aws.amazon.com/. [Online]. Available: https:

//aws.amazon.com/

[46] “Microsoft Azure,” https://cloud.microsoft.com/en-us/. [Online]. Available: https:

//cloud.microsoft.com/en-us/

[47] “Google Cloud Platform,” https://cloud.google.com/pricing. [Online]. Available:

https://cloud.google.com/pricing

[48] “IBM Cloud,” https://www.ibm.com/cloud-computing/. [Online]. Available: https:

//www.ibm.com/cloud-computing/

[49] C. Johnson, “IBM 3850: Mass Storage System,” in Proceedings of the

May 19-22, 1975, National Computer Conference and Exposition, ser. AFIPS

’75. New York, NY, USA: ACM, 1975, pp. 509–514. [Online]. Available:

http://doi.acm.org/10.1145/1499949.1500051


[50] D. W. Brubeck and L. A. Rowe, “Hierarchical Storage Management in a Distributed

VOD System,” IEEE MultiMedia, vol. 3, no. 3, pp. 37–47, Sep. 1996. [Online].

Available: https://doi.org/10.1109/93.556538

[51] M. N. Nelson, B. B. Welch, and J. K. Ousterhout, “Caching in the Sprite Network

File System,” ACM Trans. Comput. Syst., vol. 6, no. 1, pp. 134–154, Feb. 1988.

[Online]. Available: http://doi.acm.org/10.1145/35037.42183

[52] L. F. Cabrera, R. Rees, S. Steiner, W. Hineman, and M. Penner, “ADSM: A

Multi-platform, Scalable, Backup and Archive Mass Storage System,” in Proceedings

of the 40th IEEE Computer Society International Conference, ser. COMPCON ’95.

Washington, DC, USA: IEEE Computer Society, 1995, pp. 420–. [Online]. Available:

http://dl.acm.org/citation.cfm?id=527213.793539

[53] M. Satyanarayanan, J. J. Kistler, P. Kumar, M. E. Okasaki, E. H. Siegel, and

D. C. Steere, “Coda: A Highly Available File System for a Distributed Workstation

Environment,” IEEE Trans. Comput., vol. 39, no. 4, pp. 447–459, Apr. 1990. [Online].

Available: http://dx.doi.org/10.1109/12.54838

[54] J. Ousterhout, A. Gopalan, A. Gupta, A. Kejriwal, C. Lee, B. Montazeri, D. Ongaro,

S. J. Park, H. Qin, M. Rosenblum, S. Rumble, R. Stutsman, and S. Yang, “The

RAMCloud Storage System,” ACM Trans. Comput. Syst., vol. 33, no. 3, pp. 7:1–7:55,

Aug. 2015. [Online]. Available: http://doi.acm.org/10.1145/2806887

[55] B. Fitzpatrick, “Distributed Caching with Memcached,” Linux J., vol. 2004, no.


124, pp. 5–, 8 2004. [Online]. Available: http://dl.acm.org/citation.cfm?id=1012889.

1012894

[56] P. Watson, P. Lord, F. Gibson, P. Periorellis, and G. Pitsilis, “Cloud Computing for

e-Science with CARMEN,” in 2nd Iberian Grid Infrastructure Conference Proceedings.

Citeseer, 2008, pp. 3–14.

[57] J. Li, M. Humphrey, C. Van Ingen, D. Agarwal, K. Jackson, and Y. Ryu, “eScience

in the Cloud: A MODIS Satellite Data Reprojection and Reduction Pipeline in the

Windows Azure Platform,” in Parallel & Distributed Processing (IPDPS), 2010 IEEE

International Symposium on. IEEE, April 2010, pp. 1–10.

[58] E. Soroush, M. Balazinska, and D. Wang, “ArrayStore: A Storage Manager

for Complex Parallel Array Processing,” in Proceedings of the 2011 ACM

SIGMOD International Conference on Management of Data, ser. SIGMOD

’11. New York, NY, USA: ACM, 2011, pp. 253–264. [Online]. Available:

http://doi.acm.org/10.1145/1989323.1989351

[59] E. Perlman, R. Burns, Y. Li, and C. Meneveau, “Data Exploration of Turbulence

Simulations Using a Database Cluster,” in Proceedings of the 2007 ACM/IEEE

Conference on Supercomputing, ser. SC ’07. New York, NY, USA: ACM, 2007, pp.

23:1–23:11. [Online]. Available: http://doi.acm.org/10.1145/1362622.1362654

[60] B. Moon, H. v. Jagadish, C. Faloutsos, and J. H. Saltz, “Analysis of the

Clustering Properties of the Hilbert Space-Filling Curve,” IEEE Trans. on Knowl.


and Data Eng., vol. 13, no. 1, pp. 124–141, Jan. 2001. [Online]. Available:

https://doi.org/10.1109/69.908985

[61] K. Kanov, R. Burns, G. Eyink, C. Meneveau, and A. Szalay, “Data-intensive Spatial

Filtering in Large Numerical Simulation Datasets,” in Proceedings of the International

Conference on High Performance Computing, Networking, Storage and Analysis, ser.

SC ’12. Los Alamitos, CA, USA: IEEE Computer Society Press, 2012, pp. 60:1–60:9.

[Online]. Available: http://dl.acm.org/citation.cfm?id=2388996.2389078

[62] N. Kasthuri, K. J. Hayworth, D. R. Berger, R. L. Schalek, J. A. Conchello,

S. Knowles-Barley, D. Lee, A. Vázquez-Reina, V. Kaynig, T. R. Jones et al.,

“Saturated Reconstruction of a Volume of Neocortex,” Cell, vol. 162, no. 3, pp.

648–661, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/

pii/S0092867415008247

[63] S. Sanfilippo and P. Noordhuis, “Redis,” http://redis.io. [Online]. Available:

http://redis.io

[64] V. Haenel, “Bloscpack: a compressed lightweight serialization format for numerical

data,” arXiv preprint arXiv:1404.6383, 2014.

[65] P. O’Neil, E. Cheng, D. Gawlick, and E. O’Neil, “The Log-structured Merge-tree

(LSM-tree),” Acta Inf., vol. 33, no. 4, pp. 351–385, Jun. 1996. [Online]. Available:

http://dx.doi.org/10.1007/s002360050048

[66] J. W. Lichtman, H. Pfister, and N. Shavit, “The big data challenges of connectomics,”

Nature neuroscience, vol. 17, no. 11, pp. 1448–1454, 2014.


[67] T. M. Coughlin, “Achieving A Faster, More Scalable Media Active Archive,”

Coughlin Associates, Tech. Rep., July 2017. [Online]. Available: https://cloudian.

com/wp-content/uploads/2017/08/Coughlin-Associates-Media-Active-Archive.pdf

[68] “AWS Elastic Transcoder,” https://aws.amazon.com/elastictranscoder/. [Online].

Available: https://aws.amazon.com/elastictranscoder/

[69] “Microsoft Azure Media Services,” https://azure.microsoft.com/en-us/services/

media-services/. [Online]. Available: https://azure.microsoft.com/en-us/services/

media-services/

[70] “Google Cloud Media and Entertainment Solutions,” https://cloud.google.com/

solutions/media-entertainment/. [Online]. Available: https://cloud.google.com/

solutions/media-entertainment/

[71] V. Marx, “Biology: The big challenges of big data,” vol. 498, pp. 255–260, 2013.

[Online]. Available: https://doi.org/10.1038/498255a

[72] M. C. Schatz, “CloudBurst: highly sensitive read mapping with MapReduce,”

Bioinformatics, vol. 25, no. 11, pp. 1363–1369, 04 2009. [Online]. Available:

https://dx.doi.org/10.1093/bioinformatics/btp236

[73] R. Charles, V. Antonescu, C. Wilks, and B. Langmead, “Scaling read aligners

to hundreds of threads on general-purpose processors,” Bioinformatics, 07 2018.

[Online]. Available: https://dx.doi.org/10.1093/bioinformatics/bty648

[74] K. Lim, G. Park, M. Choi, Y. Won, D. Kim, and H. Kim, “Workload Characteristics

of DNA Sequence Analysis: From Storage Systems’ Perspective,” in Proceedings of


the 6th Workshop on Rapid Simulation and Performance Evaluation: Methods and

Tools, ser. RAPIDO ’14. New York, NY, USA: ACM, 2014, pp. 4:1–4:7. [Online].

Available: http://doi.acm.org/10.1145/2555486.2555490

[75] . G. P. D. P. Subgroup, A. Wysoker, B. Handsaker, G. Marth, G. Abecasis, H. Li,

J. Ruan, N. Homer, R. Durbin, and T. Fennell, “The Sequence Alignment/Map

format and SAMtools,” Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 06 2009.

[Online]. Available: https://dx.doi.org/10.1093/bioinformatics/btp352

[76] C. Tyler-Smith, P. Danecek, R. Durbin, V. Narasimhan, Y. Xue, and A. Scally,

“BCFtools/RoH: a hidden Markov model approach for detecting autozygosity from

next-generation sequencing data,” Bioinformatics, vol. 32, no. 11, pp. 1749–1751, 01

2016. [Online]. Available: https://dx.doi.org/10.1093/bioinformatics/btw044

[77] G. Kiar, K. J. Gorgolewski, D. Kleissas, W. G. Roncal, B. Litt, B. Wandell, R. A.

Poldrack, M. Wiener, R. J. Vogelstein, R. Burns et al., “Science In the Cloud (SIC):

A use case in MRI Connectomics,” Giga Science, vol. 6, no. 5, pp. 1–10, 2017.

[78] K. J. Gorgolewski, F. Alfaro-Almagro, T. Auer, P. Bellec, M. Capotă, M. M.

Chakravarty, N. W. Churchill, A. L. Cohen, R. C. Craddock, G. A. Devenyi et al.,

“BIDS apps: Improving ease of use, accessibility, and reproducibility of neuroimaging

data analysis methods,” PLoS computational biology, vol. 13, no. 3, p. e1005209, 2017.

[79] K. Lillaney, D. Kleissas, A. Eusman, E. Perlman, W. Gray Roncal, J. T. Vogelstein,

and R. Burns, “Building NDStore Through Hierarchical Storage Management and


Microservice Processing,” in 2018 IEEE 14th International Conference on e-Science

(e-Science), Oct 2018, pp. 223–233.

[80] Y. Shao, L. Di, Y. Bai, B. Guo, and J. Gong, “Geoprocessing on the Amazon cloud

computing platform — AWS,” in 2012 First International Conference on Agro- Geoin-

formatics (Agro-Geoinformatics), Aug 2012, pp. 1–6.

[81] R. Sugumaran, J. Burnett, and A. Blinkmann, “Big 3D Spatial Data Processing

Using Cloud Computing Environment,” in Proceedings of the 1st ACM SIGSPATIAL

International Workshop on Analytics for Big Geospatial Data, ser. BigSpatial

’12. New York, NY, USA: ACM, 2012, pp. 20–22. [Online]. Available:

http://doi.acm.org/10.1145/2447481.2447484

[82] A. Kaplunovich and Y. Yesha, “Cloud Big Data Decision Support System for Machine

Learning on AWS: Analytics of Analytics,” in 2017 IEEE International Conference on

Big Data (Big Data), Dec 2017, pp. 3508–3516.

[83] E. Goldin, D. Feldman, G. Georgoulas, M. Castano, and G. Nikolakopoulos, “Cloud

computing for big data analytics in the Process Control Industry,” in 2017 25th

Mediterranean Conference on Control and Automation (MED), July 2017, pp. 1373–

1378.

[84] D. A. Rodríguez-Silva, L. Adkinson-Orellana, F. J. González-Castaño, I. Armiño-

Franco, and D. González-Martínez, “Video Surveillance Based on Cloud Storage,” in

2012 IEEE Fifth International Conference on Cloud Computing, June 2012, pp. 991–

992.


[85] A. R. Elias, N. Golubovic, C. Krintz, and R. Wolski, “Where’s the Bear? - Automating

Wildlife Image Processing Using IoT and Edge Cloud Systems,” in 2017 IEEE/ACM

Second International Conference on Internet-of-Things Design and Implementation

(IoTDI), April 2017, pp. 247–258.

[86] H. Xiao, Z. Li, E. Zhai, T. Xu, Y. Li, Y. Liu, Q. Zhang, and Y. Liu,

“Towards Web-based Delta Synchronization for Cloud Storage Services,” in 16th

USENIX Conference on File and Storage Technologies (FAST 18). Oakland,

CA: USENIX Association, 2018, pp. 155–168. [Online]. Available: https:

//www.usenix.org/conference/fast18/presentation/xiao

[87] P. Jonkins, “RioFS,” https://github.com/skoobe/riofs. [Online]. Available: https:

//github.com/skoobe/riofs

[88] “BlobFuse,” https://github.com/Azure/azure-storage-fuse. [Online]. Available: https:

//github.com/Azure/azure-storage-fuse

[89] “GCSFuse,” https://github.com/GoogleCloudPlatform/gcsfuse. [Online]. Available:

https://github.com/GoogleCloudPlatform/gcsfuse

[90] “SVFS,” https://github.com/ovh/svfs. [Online]. Available: https://github.com/ovh/

svfs

[91] “S3QL,” https://bitbucket.org/nikratio/s3ql/. [Online]. Available: https://bitbucket.

org/nikratio/s3ql/

[92] “ObjectiveFS,” https://objectivefs.com/. [Online]. Available: https://objectivefs.com/


[93] M. Vrable, S. Savage, and G. M. Voelker, “Cumulus: Filesystem Backup to the

Cloud,” Trans. Storage, vol. 5, no. 4, pp. 14:1–14:28, Dec. 2009. [Online]. Available:

http://doi.acm.org/10.1145/1629080.1629084

[94] “Scality RING,” https://www.scality.com/products/ring/. [Online]. Available: https:

//www.scality.com/products/ring/

[95] “AWS Storage Gateway for Files,” https://aws.amazon.com/storagegateway/file/.

[Online]. Available: https://aws.amazon.com/storagegateway/file/

[96] “OpenStack Swift,” https://docs.openstack.org/swift/latest/. [Online]. Available:

https://docs.openstack.org/swift/latest/

[97] “AWS Documentation: Bucket Policy Examples,”

https://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucket-policies.html.

[Online]. Available: https://docs.aws.amazon.com/AmazonS3/latest/dev/

example-bucket-policies.html

[98] A. Bessani, R. Mendes, T. Oliveira, N. Neves, M. Correia, M. Pasin, and

P. Verissimo, “SCFS: A Shared Cloud-backed File System,” in 2014 USENIX Annual

Technical Conference (USENIX ATC 14). Philadelphia, PA: USENIX Association,

2014, pp. 169–180. [Online]. Available: https://www.usenix.org/conference/atc14/

technical-sessions/presentation/bessani

[99] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn,

“Ceph: A Scalable, High-performance Distributed File System,” in Proceedings of

the 7th Symposium on Operating Systems Design and Implementation, ser. OSDI ’06.


Berkeley, CA, USA: USENIX Association, 2006, pp. 307–320. [Online]. Available:

http://dl.acm.org/citation.cfm?id=1298455.1298485

[100] “OpenIO FS,” https://docs.openio.io/18.04/source/arch-design/fs_overview.html.

[Online]. Available: https://docs.openio.io/18.04/source/arch-design/fs_overview.

html

[101] “ProxyFS,” https://github.com/swiftstack/ProxyFS. [Online]. Available: https:

//github.com/swiftstack/ProxyFS

[102] M. Kaczmarski, T. Jiang, and D. A. Pease, “Beyond backup toward storage

management,” IBM Systems Journal, vol. 42, no. 2, pp. 322–337, 2003. [Online].

Available: https://ieeexplore.ieee.org/abstract/document/5386857

[103] P. Schwan et al., “Lustre: Building a file system for 1000-node clusters,” in Proceedings

of the 2003 Linux symposium, vol. 2003, 2003, pp. 380–386.

[104] “SwiftOnFile,” https://github.com/openstack/SwiftOnFile. [Online]. Available: https:

//github.com/openstack/swiftonfile

[105] “Gluster-Swift,” https://docs.gluster.org/en/latest/Administrator%20Guide/

Object%20Storage/. [Online]. Available: https://docs.gluster.org/en/latest/

Administrator%20Guide/Object%20Storage/

[106] “Cloudian Hyperfile,” https://cloudian.com/products/hyperfile-nas-controller/. [On-

line]. Available: https://cloudian.com/products/hyperfile-nas-controller/

[107] “Maginatics MagFS,” http://downloads.maginatics.com/


MaginaticsMagFSTechnicalWhitepaper.pdf. [Online]. Available: http:

//downloads.maginatics.com/MaginaticsMagFSTechnicalWhitepaper.pdf

[108] G. Vernik, M. Factor, E. K. Kolodner, P. Michiardi, E. Ofer, and F. Pace, “Stocator:

Providing High Performance and Fault Tolerance for Apache Spark Over Object Stor-

age,” in 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid

Computing (CCGRID), May 2018, pp. 462–471.

[109] H. Li, “Alluxio: A Virtual Distributed File System,” Ph.D. dissertation, University

of California, Berkeley, 2018. [Online]. Available: https://www2.eecs.berkeley.edu/

Pubs/TechRpts/2018/EECS-2018-29.pdf

[110] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen,

S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica,

“Apache Spark: A Unified Engine for Big Data Processing,” Commun. ACM, vol. 59,

no. 11, pp. 56–65, Oct. 2016. [Online]. Available: http://doi.acm.org/10.1145/2934664

[111] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large

Clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008. [Online]. Available:

http://doi.acm.org/10.1145/1327452.1327492

[112] J. Barr. Consistent View for Elastic MapReduce’s File System. AWS News Blog.

[Online]. Available: https://aws.amazon.com/blogs/aws/emr-consistent-file-system/

[113] D. C. Weeks. S3mper: Consistency in the Cloud. The Net-

flix Tech Blog. [Online]. Available: https://medium.com/netflix-techblog/

s3mper-consistency-in-the-cloud-b6a1076aa4f8


[114] L. Rupprecht, R. Zhang, B. Owen, P. Pietzuch, and D. Hildebrand, “SwiftAnalytics:

Optimizing Object Storage for Big Data Analytics,” in Cloud Engineering (IC2E),

2017 IEEE International Conference on. IEEE, 2017, pp. 245–251.

[115] L. Rupprecht, R. Zhang, and D. Hildebrand, “Big data analytics on object stores: A

performance study,” red, vol. 30, p. 35, 2014.

[116] F. Schmuck and R. Haskin, “GPFS: A Shared-disk File System for Large Computing

Clusters,” in Proceedings of the 1st USENIX Conference on File and Storage

Technologies, ser. FAST’02. Berkeley, CA, USA: USENIX Association, 2002, pp.

16–16. [Online]. Available: http://dl.acm.org/citation.cfm?id=1973333.1973349

[117] B. K. R. Vangoor, V. Tarasov, and E. Zadok, “To FUSE or Not to FUSE:

Performance of User-Space File Systems,” in 15th USENIX Conference on File

and Storage Technologies (FAST 17). Santa Clara, CA: USENIX Association,

2017, pp. 59–72. [Online]. Available: https://www.usenix.org/conference/fast17/

technical-sessions/presentation/vangoor

[118] “AWS Event Notifications,” https://docs.aws.amazon.com/AmazonS3/latest/dev/

NotificationHowTo.html. [Online]. Available: https://docs.aws.amazon.com/

AmazonS3/latest/dev/NotificationHowTo.html

[119] “Google Change Notifications,” https://cloud.google.com/storage/docs/

object-change-notification. [Online]. Available: https://cloud.google.com/storage/

docs/object-change-notification

[120] “Blob Storage Events,” https://docs.microsoft.com/en-us/azure/storage/blobs/


storage-blob-event-overview. [Online]. Available: https://docs.microsoft.com/en-us/

azure/storage/blobs/storage-blob-event-overview

[121] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin,

S. Sivasubramanian, P. Vosshall, and W. Vogels, “Dynamo: Amazon’s Highly

Available Key-value Store,” in Proceedings of Twenty-first ACM SIGOPS Symposium

on Operating Systems Principles, ser. SOSP ’07. New York, NY, USA: ACM, 2007,

pp. 205–220. [Online]. Available: http://doi.acm.org/10.1145/1294261.1294281

[122] A. Lakshman and P. Malik, “Cassandra: A Decentralized Structured Storage

System,” SIGOPS Oper. Syst. Rev., vol. 44, no. 2, pp. 35–40, Apr. 2010. [Online].

Available: http://doi.acm.org/10.1145/1773912.1773922

[123] P. Snyder, “tmpfs: A virtual memory file system,” in Proceedings of the autumn 1990

EUUG Conference, 1990, pp. 241–248.

[124] A. Gaul, “S3Proxy,” https://github.com/gaul/s3proxy. [Online]. Available: https:

//github.com/gaul/s3proxy

[125] “AWS Lambda,” https://aws.amazon.com/lambda/. [Online]. Available: https:

//aws.amazon.com/lambda/

[126] J. Axboe, “FIO - flexible I/O benchmark,” https://linux.die.net/man/1/fio. [Online].

Available: https://linux.die.net/man/1/fio

[127] Z. Li, C. Jin, T. Xu, C. Wilson, Y. Liu, L. Cheng, Y. Liu, Y. Dai, and

Z.-L. Zhang, “Towards Network-level Efficiency for Cloud Storage Services,” in


Proceedings of the 2014 Conference on Internet Measurement Conference, ser.

IMC ’14. New York, NY, USA: ACM, 2014, pp. 115–128. [Online]. Available:

http://doi.acm.org/10.1145/2663716.2663747

[128] S. Tomar, “Converting Video Formats with FFmpeg,” Linux J., vol. 2006, no. 146,

pp. 10–, Jun. 2006. [Online]. Available: http://dl.acm.org/citation.cfm?id=1134782.

1134792

[129] I. Archive, https://archive.org/search.php?query=subject%3A%22Sci-Fi%22, ac-

cessed: 2019-01-29.

[130] M. Ferrer, S. J. C. Gosline, M. Stathis, X. Zhang, X. Guo, R. Guha, D. A. Ry-

man, M. R. Wallace, L. Kasch-Semenza, H. Hao, R. Ingersoll, D. Mohr, C. Thomas,

S. Verma, J. Guinney, and J. O. Blakeley, “Pharmacological and genomic profiling

of neurofibromatosis type 1 plexiform neurofibroma-derived schwann cells,” Sci Data,

vol. 5, p. 180106, 06 2018.

[131] S. Bionetworks, https://www.ncbi.nlm.nih.gov/sra/?term=SRR6311433, accessed:

2019-01-29.

[132] “AWS SNS,” https://aws.amazon.com/sns/. [Online]. Available: https://aws.amazon.

com/sns/

[133] H. Shan and J. Shalf, “Using IOR to analyze the I/O performance for HPC platforms,”

2007.

[134] D. Kleissas, R. Hider, D. Pryor, T. Gion, P. Manavalan, J. Matelsky, A. Baden,

K. Lillaney, R. Burns, D. D’Angelo et al., “The block object storage service (bossDB):


A cloud-native approach for petascale neuroscience discovery,” bioRxiv, p. 217745,

2017.

Vita

Kunal Lillaney received his B.Engg. degree in Computer Engineering from Mumbai University in 2011 and his MSE degree in Computer Science from Johns Hopkins University in 2013. He enrolled in the Computer Science Ph.D. program at Johns Hopkins University in 2013. He served as the Secretary of the Upsilon Pi Epsilon (UPE) JHU Chapter between 2015 and 2017, and won the UPE Executive Council Award in 2016. His research focuses on enabling data analysis in the cloud by building hierarchical storage services over object storage. His research papers have been published at multiple conferences including HotStorage, IEEE e-Science and SSDBM.
