COMP520-12C Final Report

NomadFS A block migrating distributed file system

Samuel Weston

This report is in partial fulfilment of the requirements for the degree of Bachelor of Computing and Mathematical Sciences with Honours (BCMS(Hons)) at The University of Waikato.

©2012 Samuel Weston

Abstract

A distributed file system is a file system that is spread across multiple machines. This report describes the block-based distributed file system NomadFS. NomadFS is designed for small scale distributed settings, such as those that exist in computer laboratories and cluster computers. It implements features, such as caching and block migration, which are aimed at improving the performance of shared data in such a setting.

This report includes a discussion of the design and implementation of NomadFS, including relevant background. It also includes performance measurements, such as scalability.

Acknowledgements

I would like to thank all the friendly members of the WAND network research group. This especially includes my supervisor Tony McGregor who has provided me with a massive amount of help over the year. Thanks!

On a personal level I have enjoyed developing NomadFS and have learnt a great deal as a consequence of this development. This learning includes improving my programming ability, both in user space and kernel space (initially NomadFS was planned to be developed as a kernel space file system). I have also learnt a large amount about file systems and operating systems in general.

nomad /ˈnəʊmæd/ noun member of tribe roaming from place to place for pasture;

Contents

1 Introduction

2 Background
   2.1 A file system overview
      2.1.1 System calls
   2.2 Distributed systems
      2.2.1 Communication
      2.2.2 Synchronisation and Consistency
      2.2.3 Fault Tolerance
      2.2.4 Performance
      2.2.5 Scalability
      2.2.6 Transparency
   2.3 The Linux Virtual File System
   2.4 Filesystem in Userspace
   2.5 Summary

3 Goals

4 File System Survey
   4.1 Network File System (NFS)
   4.2 Gluster File System (GlusterFS)
   4.3 Google File System
   4.4 Zebra and RAID
   4.5 Summary

5 Design
   5.1 Overview
   5.2 Block interface
      5.2.1 Block-based approach
      5.2.2 Identification and locality
   5.3 File system structure
      5.3.1 Communication API
   5.4 Performance and Reliability
      5.4.1 Cache
      5.4.2 Synchronisation
      5.4.3 Block Mobility and Migration
      5.4.4 Block Allocation
      5.4.5 Prefetching
      5.4.6 Scalability
   5.5 Summary

6 Implementation
   6.1 Clients and Block Servers
      6.1.1 Client
      6.1.2 Block Server
      6.1.3 Locality and Client start up
   6.2 Communication
      6.2.1 Transport Protocol
      6.2.2 Messages
      6.2.3 Common Client and Server Communication
      6.2.4 Client Network Queue
      6.2.5 Overlapped IO
      6.2.6 Block server specific communication
   6.3 Synchronisation
      6.3.1 Distributed Synchronisation
      6.3.2 Internal Synchronisation
   6.4 Cache
      6.4.1 Cache coherency
   6.5 Block migration
   6.6 Aggressive Prefetching
   6.7 Issues and Challenges
   6.8 Summary

7 Evaluation
   7.1 Test Environment
   7.2 Migration
   7.3 Scalability
   7.4 Effect of block size on performance
   7.5 NFS Comparison
   7.6 IOZone
   7.7 Summary

8 Conclusions and Future Work
   8.1 Summary
   8.2 Conclusion
   8.3 Future Work
      8.3.1 Potential Extensions
   8.4 Final Words

Bibliography

A Performance analysis scripts

B NomadFS current quirks

C IOZone results

D Configuration file format for NomadFS

E NomadFS source code listing

List of Figures

2.1 A file system
2.2 File system layout on block abstraction (not to scale)
2.3 Inode structure including indirection blocks
2.4 A distributed file system
2.5 VFS flow example. A user space write system call passes through the VFS and reaches the required file system write function. Adapted from Fig. 13.2 [11].

5.1 High Level Architecture
5.2 Client to server link
5.3 Block and inode identifier
5.4 Message passing
5.5 File based cache invalidation

6.1 Client Architecture
6.2 Message layout in NomadFS (Data Block not to scale)
6.3 Network queueing
6.4 Overlapped IO (Adapted from Figure 2.4 [12])
6.5 Synchronisation
6.6 Buffer Cache (Adapted from Fig. 5-20 [20])
6.7 Migration flow

7.1 Test Environment
7.2 Migration Performance
7.3 Scalability on file smaller than cache
7.4 Scalability on file larger than cache
7.5 Effect of block size on performance
7.6 IOZone Write
7.7 IOZone Random Write
7.8 IOZone Read
7.9 IOZone Random Read

Acronyms

API Application Programming Interface.

FUSE Filesystem in Userspace.

LFS Log-Structured File System.

NFS Network File System.

RAID Redundant Array of Independent Disks.

VFS Virtual File System.

Chapter 1

Introduction

Multiple computer systems such as cluster computers and computer laboratories generally have a large amount of aggregate storage, due to each machine having its own ‘small’ hard disk drive. As opposed to making use of the combined storage and performance capabilities of these ‘small’ disks, a common approach to shared data in these systems is to use a single centralised storage system. A distributed file system which can take advantage of these storage and performance capabilities would help to improve the usefulness of shared data in small scale distributed settings.

This report covers the design and implementation of NomadFS, a new, primarily block-based distributed file system for the Linux environment. NomadFS is aimed at meeting the needs of smaller scale distributed environments. From a user’s standpoint, performance is important. Because of this NomadFS has built-in functionality which allows maximal usage of the machine’s local disk. This includes preferring the local disk when creating data and allowing data to migrate to the disks of machines which use it the most. So that goals such as migration could be implemented and tested, common distributed file system functionalities such as fault tolerance through replication were not deemed a priority in this research.

When approaching this problem there were a number of options available on how to implement such a file system. Firstly a decision was needed on whether the underlying architecture would operate on blocks or files. A block-based approach refers to the ability for the file system to operate directly on top of a block device, while a file-based approach means that the file system relies on some form of underlying file architecture. For reasons that are explained in Chapter 5, a block-based approach, with some file based elements, was chosen for NomadFS.

Chapter 2 contains background file system information. This includes a background to file systems, block-based file systems and distributed systems. An understanding of these topics is required to fully understand this project. Chapter 3 contains the set of goals which NomadFS aimed to meet.

Distributed file systems are not a new topic in computer science; it is therefore necessary that some related implementations are surveyed. This file system survey can be found in Chapter 4.

The design and implementation of NomadFS are central to this project and are covered in Chapters 5 and 6. Chapter 5 overviews the design of NomadFS, and why these design decisions were made. Chapter 6 covers the implementation, and covers the specifics of how the various design elements were implemented in NomadFS. Chapter 7 contains a performance oriented evaluation of NomadFS in its current state. Chapter 8 rounds off the report with conclusions and potential future work.

Chapter 2

Background

This chapter covers the background to this project. This includes an overview of file systems, and in particular block-based file systems, for the unfamiliar reader. Distributed systems, distributed file systems, and some of the issues they encounter are then covered. The chapter ends by covering the Linux Virtual File System (VFS) and Filesystem in Userspace (FUSE) in some depth. An understanding of these topics, especially the latter ones, is important in the context of this project.

2.1 A file system overview

A file system is software that provides a means for users to store their data in a persistent manner. From the user’s point of view this is generally seen as directories and files.


Figure 2.1: A file system

In file system terminology disks, or raw devices, are divided into equal sized segments called blocks. File systems are then built on top of this block-based storage abstraction, which is typically provided by a block device driver that interfaces with a piece of hardware such as a hard disk drive (HDD). For data to remain persistent, the file system must lay the data out on this series of blocks in an organised manner. Most Unix file systems do this by making use of superblocks, inodes, bitmap areas, and data blocks. These are shown in Figure 2.2 and described in the following paragraphs.

[ Superblock | Inode Bitmap | Data Bitmap | Inode Table | Data Blocks ]

Figure 2.2: File system layout on block abstraction (not to scale)

Superblock A superblock is present at the beginning of the disk at a fixed location. It provides important information about the file system that follows it. This includes such information as what type of file system it is and how large the different areas of it are.

Bitmap Bitmap blocks follow the superblock and show which inodes and data blocks are currently allocated. A single bit is high if that particular inode entry or data block is allocated, and low if it is not. Naively using a bitmap means that a file's data can potentially be sparse on the raw device.


Figure 2.3: Inode structure including indirection blocks

Inode Following the bitmaps is the inode table. An inode is a data structure which contains information relating to a file. This includes metadata such as when the file was created and which user owns it. Most importantly, the inode records where the file's data is located through the use of block pointers. Because an inode is small and fixed in size, and an individual file can potentially require many block pointers, block pointer indirection is required, as shown in Figure 2.3. The number of indirect blocks that the file system supports determines the maximum file size. It is worth noting that the file an inode represents can be a directory. In this case the file's data contains the file names of the children and the identifier of the inode associated with each child (shown in Figure 2.3). The root directory “/” is located at a fixed point in the inode table, and provides the starting point for directory traversal to any file or directory in the file system.

The MINIX file system [20] follows this layout very closely, but obviously, such a method of laying out data on a block-level abstraction is not the only way that a file system can be organised. For example, the Linux extended file system (Ext FS) divides the blocks into “Block Groups”, each of which contains a superblock, bitmaps and data blocks [2]. This helps to keep the data from an individual file in sequential order so as to speed up accesses on the underlying device.
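To make the layout above concrete, the following is a minimal sketch in C of what an on-disk inode along these lines might look like. The field names, the number of direct pointers and the use of 64 bit block pointers are illustrative assumptions, not the layout of any particular file system.

/* A minimal sketch of an on-disk inode in the style described above.
 * Field names and sizes are illustrative assumptions. */
#include <stdint.h>

#define NDIRECT 10   /* number of direct block pointers (assumed) */

struct disk_inode {
    uint16_t mode;               /* file type and permission bits */
    uint16_t uid;                /* owning user */
    uint64_t size;               /* file size in bytes */
    uint64_t direct[NDIRECT];    /* pointers to the first NDIRECT data blocks */
    uint64_t indirect;           /* block holding further block pointers */
    uint64_t double_indirect;    /* block of pointers to indirect blocks */
};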

2.1.1 System calls

For a file system to be useful, it must be accessible by the user. POSIX [8] defines a set of system calls which provide a consistent means of accessing file system information. Almost all modern file systems including Ext conform to this POSIX standard. So that one can gain an understanding of what functionalities a file system must cater for, the following is a list of the most important POSIX file system calls. Even a simple file system must be able to handle most of these.

List of important system calls

open (file's path): returns a file descriptor which describes the given file. This can be used in subsequent system calls.
close (file descriptor): closes the open file.
seek (file descriptor): sets the offset which reads or writes should operate from.
read (file descriptor, buffer to read into, number of bytes to read): returns a number of bytes from the described file.
write (file descriptor, buffer to write from, number of bytes to write): writes a number of bytes to the described file.
truncate (file descriptor, new file size): sets the size of the file.
mkdir (path of directory): creates a directory.
rmdir (path of directory): removes an empty directory.
mknod (path of file): creates a file.
unlink (path to file): removes a directory entry (e.g. a file).
rename (path to old file, path to new file): renames a file.

Example system calls

The following is a description of how a block-based file system handles the open and read system calls from the previous list.

When a user space program wishes to read a file, it must firstly open it using the open system call. The system call results in a software interrupt and eventually the file system open function is run. The file system then traverses the directory tree until the file's inode identifier is located. Using the inode's identifier as an offset into the inode table, the file system can then read the inode from the block device. The system call finally returns a file descriptor, which can be used to describe this file in any further calls on this file.

To read data from the file the program then calls the read system call on the file descriptor with the number of required bytes and a buffer to read the bytes into. As with the open call, this read passes through the kernel and reaches the appropriate file system function. The file system program then reads the appropriate inode and calculates which data block is required. Using the offset into the file's data that is required, the file system can calculate which data block pointer is needed. Assuming no indirection, this pointer can then be read from the inode and then the actual block data can be read and returned to the calling program. If indirection is required (i.e. if the file's block pointers overflow onto indirect blocks), then the file system must traverse this block indirection to locate the required block.

In practice there are a number of complications in this process that relate to caching and error handling. For simplicity's sake these are not mentioned here.
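The block lookup at the heart of the read path can be summarised in a few lines. The sketch below assumes 4096-byte blocks, the hypothetical struct disk_inode from the earlier sketch, a single level of indirection and an assumed read_block() helper that loads a block into memory; it is illustrative only.

#include <stdint.h>

#define BLOCK_SIZE 4096
#define PTRS_PER_BLOCK (BLOCK_SIZE / sizeof(uint64_t))

/* Assumed helper: reads block 'blkno' from the device into 'buf'. */
void read_block(uint64_t blkno, void *buf);

/* Returns the pointer to the data block holding byte 'offset' of the file. */
uint64_t block_for_offset(const struct disk_inode *ino, uint64_t offset)
{
    uint64_t index = offset / BLOCK_SIZE;      /* which block of the file */

    if (index < NDIRECT)
        return ino->direct[index];             /* pointer held in the inode */

    /* otherwise the pointer lives in the single indirect block */
    uint64_t ptrs[PTRS_PER_BLOCK];
    read_block(ino->indirect, ptrs);
    return ptrs[index - NDIRECT];
}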

2.2 Distributed systems

Distributed systems are computer systems which cover multiple computers connected by a communication network [19]. Because they have access to the combined computational capacity of these machines, distributed systems normally have potential for greater performance and capacity. They do, however, have a number of implications which result in them often becoming more complex than single computer systems.

Distributed file systems are a sub-topic of distributed systems and aim to present a single usable file system to multiple client machines while making use of the storage capacity of multiple server machines. A single machine may sometimes act as both a client and a server. Such a file system has a number of advantages over single machine file systems, including potential for more storage and performance due to the use of multiple storage devices. As well as these, distributed file systems also have the potential for improved reliability and an improved environment for sharing of data. This sharing of data can be particularly useful in situations that involve parallel processing [5].


Figure 2.4: A distributed file system

The important issues and implications that relate to distributed systems [10] are listed and summarised below, all of which must be handled appropriately by any distributed system. As they are of more interest in the context of this report, the points which are particularly related to distributed file systems are covered in more depth.

2.2.1 Communication

Since a distributed system exists across multiple machines there is a requirement for communication between nodes over a network. This is often achieved with message based communication, where machines communicate with a consistent messaging interface. There are, however, other paradigms including Remote Procedure Calls (RPC).

2.2.2 Synchronisation and Consistency

Synchronisation refers to the coordination of separate processes as they act on shared data. Consistency in a distributed system refers to the ability to keep data consistent between varying machines, even after changes occur. Both are related to one another and come into play when different programs use shared data structures. Because a distributed file system is inherently a single shared data structure between clients, these points are also of particular interest to them.

In a distributed file system, it is essential that the system can safely handle or avoid synchronisation issues such as concurrent updates. A concurrent update can potentially result in data loss, and can be avoided using techniques such as mutual exclusion around shared data. In a distributed file system, one technique for providing synchronisation of files is to only allow a single writer or any number of readers access to a file.

Consistency is often related to replication and caching because of the opportunity for data to exist in multiple locations. Having the ability to duplicate data might lead to consistency issues, but is often required so that maximum performance can be attained. To handle consistency correctly, a distributed file system will generally implement systems that allow functionality such as cache coherency.

2.2.3 Fault Tolerance

With an increased number of machines, distributed systems have an increased chance of possible failure at any individual point. Because of this, the requirement for safe handling of failure in individual machines is often seen as an important aspect of distributed systems such as distributed file systems. Fault tolerance is often achieved with some level of redundancy, such as storing every piece of data on two separate machines. Although not a distributed file system, Redundant Array of Independent Disks (RAID) 2 provides an example of how this can be achieved, where a single parity disk results in the ability for any single disk to be recovered in the event of a disk failure [16].

2.2.4 Performance

Since a distributed system operates over multiple machines, it appears possible at first sight for it to perform better than a single machine. Actually achieving this performance in a distributed system is not always straightforward. Performance in file systems, including distributed file systems, is often measured by the speed at which reads and writes on files can occur. The challenge of achieving good performance is particularly related to some of the earlier points, such as synchronisation and consistency, where achieving these can often result in reduced performance. Achieving acceptable performance while also maintaining these elements can be an important challenge in any distributed system and involves designing the system so that it “balances” these features based on the project's goals.

2.2.5 Scalability

Depending on the goals of the distributed system, it is often desirable that the system can continue to operate appropriately even when the number of machines increases. The extent to which the system is scalable should be set in the goals of any distributed system.

2.2.6 Transparency

Transparency refers to the system hiding the underlying implementation from the calling program. In a distributed system this often refers to hiding the network and the presence of multiple machines so that the system “appears” as a single machine or system. In Linux file systems this is guaranteed by the Virtual File System (VFS), which is described in Section 2.3 below.

2.3 The Linux Virtual File System

The Linux kernel supports many varying file systems, all of which have much in common. Linux achieves consistency in this functionality by providing a layer between user space file system calls and the actual kernel file system programs. This abstraction layer, or common interface, is known as the VFS (Virtual File System), an object oriented set of functions and data structures which kernel space file systems must implement and make use of [11] [1]. This means that the kernel need not know of the underlying file system architecture. Therefore, any file system can potentially be implemented in the Linux environment. All file system operations pass through the VFS (see Figure 2.5 below), so it is important to have an understanding of what role it performs.


Figure 2.5: VFS flow example. A user space write system call passes through the VFS and reaches the required file system write function. Adapted from Fig. 13.2 [11].

2.4 Filesystem in Userspace

Filesystem in Userspace (FUSE) is a Linux kernel file system module that allows a user space program to indirectly resolve file system calls. In short, the FUSE kernel module catches file system calls from the VFS and proceeds to forward them to a user space file system program to resolve. There are a number of positives and negatives to file system development in this manner.

Kernel space programming is generally more difficult and time consuming than user space programming. This is not only because of less documentation of kernel space code but also due to the more time consuming development cycle, the available API functionality and the increased consequences of mistakes. Because of this, FUSE allows a faster and easier environment in which to develop file systems. This does, however, come at the cost of an increase in overhead because of the need for additional memory movement and context switches [21]. Such overhead becomes less important when dealing with networked file systems and therefore distributed file systems because of the longer wait times on network operations [3] [17].

2.5 Summary

File systems provide a means for data to be stored and represented to end users. Distributed file systems extend this notion so as to include multiple machines but must also account for additional complexities. In the Linux context, the VFS provides a means of abstracting away the file system internals, while FUSE provides a means of developing user space file systems.

Chapter 3

Goals

Chapter 2 covered the background of file systems and distributed file systems. Using this information it is possible to list the goals which NomadFS aimed to meet. NomadFS is aimed at providing improved performance over centralised storage solutions in a smaller scale distributed setting. Because read operations are seen as more important, the goals are based on the use case that a single writer or ‘n’ readers will act on an individual file at a given time (i.e. many readers but only a single writer can act on a file at any given time).

• Performs well for the use case. A method of achieving this, in a distributed scenario, is to maximise the usage of the local disk. This is because the local disk is faster to access than a remote one and is further expanded upon in Chapter 5. Maximising the usage of the local disk can be achieved through the preferred use of the local server (if it exists) when creating new data and through migration of data to the local server.

• Allow for multiple clients to concurrently act on a file. As mentioned in Section 2.2 this means that the system must appropriately handle syn- chronisation and consistency issues. This includes, for example, handling of concurrent updates and consistency of client caches. How these are handled comes back to the use case of many readers but only a single writer.

• File system implementation should be completely transparent (i.e. the calling program doesn't know it is any different from any other file system). This is implicitly provided by the VFS, but is worth mentioning.

• Consistency should be guaranteed. This implies that the file system forces clients to always display the latest modifications to file data and is needed if users are to share data appropriately.

• Scales well over the number of machines that exist in laboratory or cluster scenarios (i.e. hundreds). This is related to the performance goal, but is separated here so that it is clear that the system should continue to perform well when machines are added.

As the number of machines in a system increases so too does the chance of failure at an individual point. Because of this, distributed file systems often focus on fault tolerance mechanisms such as data replication. So that functionalities such as data migration could be explored, implemented and tested, fault tolerance issues were not covered in this research.

Because of these goals, a block-based approach to implementation was chosen for NomadFS. The reasons for this are further expanded in Chapter 5, and included a natural ability for data from a single file to be split across multiple machines. This splitting of file data is important when dealing with data migration.

Together, achievement of these goals results in a distributed file system which provides a well performing shared storage medium for data in a small scale distributed setting.

Chapter 4

File System Survey

Distributed file systems are not a new topic in computer science. Because of this, it is important that some other distributed file system implementations are examined. This chapter examines four distributed file system examples that illustrate important aspects of distributed file systems and are also related to the goals of NomadFS.

4.1 Network File System (NFS)

Implementation Because it allows for only a single server, NFS [13] [18] is not strictly a distributed file system. It is, however, worth mentioning because it is probably the most well known and most widely used distributed file system and is often used as a comparison in network file system benchmarks. NFS works on a file level, exporting the contents of an underlying local file system, such as Ext4, over a network connection from the NFS server. Multiple NFS clients can then concurrently connect and interact with this exported file system.

NFS also offers an interesting caching system, with both the client and server maintaining a cache. The server cache performs the same function as that of a standard file system cache, aimed at reducing the number of disk accesses. The client side cache, however, aims to reduce network load. Consistency is achieved for directory and attribute information by caching for a predetermined length of time. File data, on the other hand, is checked for validity only on a file open.

Comparison of goals with NomadFS As it provides a simple method of distributing and sharing a single file system across multiple clients, NFS is often what is used in laboratory and cluster scenarios. A central NFS server can, however, cause bottlenecks and proves to be a single point of file system failure. NomadFS aims to distribute data across the clients, making use of both the aggregate storage potential and distributed computational capacity of the client disks.

4.2 Gluster File System (GlusterFS)

Implementation Gluster File System (GlusterFS) [9] is a distributed file system which exports local file systems from a set of servers to multiple clients. Each of these is known as a volume. Clients then access these files using either TCP/IP, InfiniBand or SDP. GlusterFS supports a number of distributed file system mechanisms such as file based replication, striping and failover. It also has mechanisms to avoid coherency problems and scales up to several petabytes. The client file system runs through the FUSE API.

Comparison of goals with NomadFS It is useful to examine GlusterFS as it has similar performance goals to NomadFS, is also commonly used in smaller scale distributed settings and makes use of the FUSE API (see Chapter 6). GlusterFS also operates in a manner which does not require any centralised metadata storage location, a feature which NomadFS replicates. GlusterFS operates on a file level and requires the use of an underlying local file system on each server. Because of this it does not allow for any block-level migration of data to the local server, an aspect of NomadFS which aims to improve performance by maximising the use of the local disk.

GlusterFS was one of the motivators for NomadFS. A previous student at Waikato University attempted to use it in a small scale distributed scenario. Despite significant attempts to tune GlusterFS, its performance was not satisfactory, leading to the idea that GlusterFS is too complicated, and a simpler approach may provide better performance.

4.3 Google File System

Implementation The Google File System (GoogleFS) is a proprietary, widely deployed distributed file system which provides reliability and performance on a large scale [4]. GoogleFS was designed to meet Google's large scale storage and file IO requirements. It does this by making a number of assumptions. These include that file reads will occur as large streamed or small random reads. Also assumed is that writes will generally be large sequential appends. It also attempts to provide high sustained network bandwidth as opposed to low latency in operations. Finally, the file system is assumed to be run on inexpensive commodity components and must therefore provide a reasonable amount of fault tolerance and replication of data.

GoogleFS operates on chunks of file data and consists of GFS clients, GFS chunkservers and a GFS master. A GFS master node provides a mapping from file name and chunk index to a chunk location for GFS clients. Once a client has a mapping it can request the file chunk from a GFS chunkserver. These chunkservers store the chunks as individual files in an underlying file system. Because of the previous assumptions the Google File System was designed with 64 MB chunk sizes, much larger than a standard file system block (generally 4 kB).

Comparison of goals with NomadFS GoogleFS provides a relevant example of a distributed file system as it is a well known production system that distributes file data across machines, making use of the storage and performance capabilities of individual ‘small’ disks. Because chunks can be seen as large blocks, the underlying architecture is block-based, naturally allowing for splitting of file data across multiple machines. GoogleFS, however, achieves its goals on a much larger scale than the use case for NomadFS. Because of this it can safely operate with centralised chunk organisation. In a smaller setting such as NomadFS's use case, a decentralised structure is preferred so that an individual point of failure can be avoided.

4.4 Zebra and RAID

Implementation The Zebra Striped Network File System is a network or distributed file system which attempts to improve data throughput and availability by basing its design on the ideas that are present in RAID and Log-Structured File Systems (LFSs) [6] [7]. In the case of RAID, Zebra is similar in that data is striped across multiple networked devices. This means that server load is distributed among multiple server machines whenever a client attempts to perform an operation on a file. This is particularly important in the case of slower write operations which are required to be written to the physical disk. Zebra also uses a parity scheme like many RAID schemes such as RAID 2. This parity scheme means that a single server acts as a parity server (an XOR combination of the other servers). Subsequently, in the event of an individual server failure, all data can be restored using the data present on the other servers.

LFSs, such as Zebra, perform write operations as a sequential log on the storage. This is based on the observation that, with increased memory, write operations to physical media will outnumber read operations. Hartman [7] shows that applying these techniques can provide performance benefits. In the case of large files a 4 to 5 times speed up was seen in comparison to NFS and Sprite (Zebra's predecessor). Smaller files saw a speed up of between 20% and 300%.

Comparison of goals with NomadFS The Zebra file system is of interest to this project and NomadFS as it is an existing implementation of a block-based distributed file system which can split data across multiple server machines. It also shows the potential benefits of making use of the data storage on multiple machines. Striping of data, as exists in Zebra, is also of interest to NomadFS, as it provides a possible piece of future work that could potentially aid performance. Zebra was, however, developed in 1994 and is not publicly available. This age means that its design is not targeted at the larger, faster disks and improved network bandwidth that exist in modern computing.

4.5 Summary

Many distributed file systems have been developed for varying purposes over a number of years. Examining past implementations can aid in the development of new distributed file systems such as NomadFS. NomadFS however has a different combination of features to other distributed file systems and revisits these architectures in the current context of disk and network performance.

Chapter 5

Design

Chapter 2 introduced and provided background on file systems and, in particular, distributed file systems. This chapter presents the design of NomadFS, making use of the goals that were presented in Chapter 3 so as to aid design decisions.

This chapter flows from high to low level design. It begins with an architecture overview, then moves on to an overview of the block interface and the motivations behind it. The structure of the file system is then covered, with performance and reliability aspects covered last. Chapter 6 builds on this chapter to provide a discussion of how these design elements were implemented.

5.1 Overview


Figure 5.1: High Level Architecture

From a high level point of view NomadFS consists of clients and servers. The client is a program which provides the file system call functions to the end user. The server has access to the file system data. Clients communicate with servers over a network such as a LAN. Collectively the servers form the distributed file system's data, with the client connecting, collating the data and exporting it back to local end users in the file system format.

As can be seen in Figure 5.1, clients and servers can potentially exist on the same machine. Communication in these cases is identical to that which occurs over the machine to machine network, but occurs over the internal local link. This local link is dedicated and does not experience variable network conditions like the shared network, and is therefore generally faster. In terms of performance it is therefore of paramount priority that the usage of this local link is maximised.

5.2 Block interface

The primary focus of the server is to serve blocks over the network to the clients from a block device. Because of this the report may refer to the servers as block servers. Communication is performed over a common network API or block interface, where clients request blocks over the API and servers respond to these requests with the appropriate data block.


Figure 5.2: Client to server link

5.2.1 Block-based approach

A block-based approach to development for NomadFS was chosen for a number of reasons. Firstly, such an approach is of interest so as to provide an indication as to whether it is a viable option for a distributed file system. It was, however, believed to be viable for a number of reasons.

A block-based approach allows for a natural way of dividing data from a single file across multiple machines or devices. What this means is that any individual block from a file can potentially exist at any location in the entire system. Achieving this functionality in a file system which deals with files, rather than blocks, would be more difficult and would rely on entire file based migration, or an extension to the file semantics (because a file is inherently a single chunk of data). Having the ability to easily divide data across multiple devices also has the benefit of allowing for block mobility.

Because a block-based file system is designed to directly interact with a raw device, such an approach to a distributed file system has no reliance on an underlying host file system. What this means is that, instead of being forced to serve data that exists in a local file system, as is the case in existing file systems such as NFS, a block-based distributed file system can directly use raw devices, such as individual disk partitions. Whether this is actually a performance benefit remains to be seen however.

Finally, a block-based approach allows for simplicity in a number of key areas. This mostly includes the block servers, which as mentioned have the primary task of serving blocks, a simple task in user space Linux programming.

5.2.2 Identification and locality

For blocks to be accessible in a distributed file system, there needs to be a method of identifying which machine contains the block. Each block and inode has a 64 bit identifier which includes the server where the block or inode is located, and the index into the block device at which the block or inode is located. This layout can be seen in Figure 5.3. Because of the block-based nature of NomadFS, the block pointers that contain these identifiers can be modified, allowing for mobility of individual blocks. Such an identifier structure also allows a large number of blocks to exist in the system (up to 2^48 blocks on each of 2^16 servers). A block's identifier is stored directly in the inode or one of the inode's indirect blocks. An inode's identifier is stored in the parent directory's data blocks, along with the file's name.

[ Server (16 bits) | Block / inode offset (48 bits) ]

Figure 5.3: Block and inode identifier

Clients can therefore identify which server to request a block or inode from.
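As a concrete illustration, such an identifier can be packed and unpacked with simple shifts and masks. This is a sketch based on the 16/48 bit split described above, not code taken from NomadFS.

/* Packing and unpacking the 64-bit identifier described above: the top
 * 16 bits name the server, the low 48 bits the block or inode offset on
 * that server's device. Illustrative sketch only. */
#include <stdint.h>

#define OFFSET_BITS 48
#define OFFSET_MASK ((1ULL << OFFSET_BITS) - 1)

static inline uint64_t make_id(uint16_t server, uint64_t offset)
{
    return ((uint64_t)server << OFFSET_BITS) | (offset & OFFSET_MASK);
}

static inline uint16_t id_server(uint64_t id)
{
    return (uint16_t)(id >> OFFSET_BITS);
}

static inline uint64_t id_offset(uint64_t id)
{
    return id & OFFSET_MASK;
}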

5.3 File system structure

A distributed file system such as NomadFS involves a number of complexities which do not exist in a local file system, such as Linux's extended file system. This stems from the need for suitable communication over a network, and the need for appropriately handling multiple machines operating on shared data. Communication not only needs to be reliable, but also has to perform in a suitable manner if the system is to achieve its performance goals. Reliable communication leads on to the need for appropriate synchronisation of the shared data structures, something that is often achieved with locking or mutual exclusion. These points are further expanded upon in this, and the following sections.

5.3.1 Communication API

The client and server communicate using a common network API, where communication is performed using messages. A messaging approach was chosen because it fits well with the block-based approach, where a single message can be used to hold the data from a single block. A message contains a request or a response. Requests include such operations as a read request and are responded to by the block server with a read response. While the request in this case is small and only includes the required block identifier, the response is slightly larger than the block size.


Figure 5.4: Message passing

So as to maximise the effective use of the local link, the client creates inodes and allocates blocks on the local server (if it exists) by preference.

Extending Communication

Some operations such as inode and bitmap operations are performed on partial blocks. The block interface is not ideal for performing these operations because of both synchronisation and performance issues. The synchronisation issues stem from the need for locking of the entire block or part of the block. Locking the entire block is wasteful because it would stop other operations occurring elsewhere on the block for the duration of the lock. Locking of partial blocks would work but would add a large amount of complexity to the system. The performance issues of partial block operations are the result of wasted network traffic. Using the block interface would mean that entire blocks of data are transferred when only part of the block is actually required.

In NomadFS's case, the solution to these issues is to implement these partial block operations on the server. This required the communication API to be extended so as to include a number of additional operations such as allocating blocks and freeing inodes. Having the ability to extend the API also means that other non-block operations can be performed server side. This includes such functions as leasing and reference counting, which are discussed in Section 5.4.

Extending the API so that some file system operations occur on the server does however have some drawbacks. Most importantly it adds complexity to the server, which now needs to have the ability to perform low level bitmap and inode operations. Overall though, it strikes an appropriate balance, improving performance while still mostly holding to a block-based nature. It also means that synchronisation of these partial block operations can be provided completely server-side (see Section 5.4.2).

Example Messages

To illustrate how this communication API operates, some of the common operations are listed below, each of which is sent in a single message, and has an appropriate response message from the server.

• Data block read

• Data block write

• Inode create

• Inode free

• Allocate block

• Free block

Actual implementation details of this communication are shown in Section 6.2, including aspects such as how reliability and performance of communication were implemented.
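As a rough illustration of the interface, the request types above could be represented as a tagged message structure along the following lines. The enum values and struct layout here are assumptions for illustration; the actual NomadFS message format is described in Section 6.2.

/* Illustrative sketch of the request/response messages listed above. */
#include <stdint.h>

#define BLOCK_SIZE 4096

enum nomad_op {
    OP_BLOCK_READ,
    OP_BLOCK_WRITE,
    OP_INODE_CREATE,
    OP_INODE_FREE,
    OP_BLOCK_ALLOC,
    OP_BLOCK_FREE,
};

struct nomad_msg {
    uint32_t type;              /* one of enum nomad_op, or its response */
    uint64_t id;                /* block or inode identifier */
    uint8_t  data[BLOCK_SIZE];  /* present only for block reads and writes */
};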

5.4 Performance and Reliability

This section describes the additional features of NomadFS that are particularly related to performance and reliability. This includes such aspects as how the cache operates, how synchronisation was achieved and why block migration was implemented.

5.4.1 Cache

For a file system to perform in a suitable manner, it will have to make use of a memory cache. This is so that data can be fetched from memory, as opposed to the slower disk subsystem, and is particularly important in network file systems such as NomadFS [6]. Generally, in a standard single client-device scenario, cache coherency does not prove to be an issue because the file system can assume that cached data has not changed on the block device. In a distributed file system however, because there are multiple clients potentially modifying the shared files, cached data can become out of date. It is therefore necessary that such a system has a mechanism for invalidation of this cached data.

Cache coherency through invalidation

In a file system such as NomadFS cache invalidation could occur in either a block oriented or file oriented manner. While a block approach would suit the block-based nature of NomadFS, blocks are small, and invalidating at block granularity would add a large amount of overhead due to the potential for large numbers of cached blocks. On the other hand, a file based approach has the problem that even when only a small part of a file changes, the entire file's cached data would need to be invalidated. It does, however, mean that invalidations do not require very much network traffic and state storage.

NomadFS achieves cache coherency between clients using a method of file based cache invalidation (see Figure 5.5). This means that when a file is modified, all clients that have data from this file cached are notified of the modification. Upon receiving this notification, the client can invalidate the appropriate cached inodes and data blocks. The next time a system call requests data from this file, the client will have no cached data from this file and will fetch the data directly from the appropriate server. For this to work the server which holds the file's inode needs to track which clients have some of the file's data cached. The client also needs to keep a record of which blocks belong to which inode. This file based cache invalidation was appropriate as it provides an acceptable balance between unnecessary cache data loss and potential invalidation overhead.


Figure 5.5: File based cache invalidation

5.4.2 Synchronisation

Any program which has multiple threads of execution potentially modifying shared data must make sure to synchronise actions so as to avoid concurrent update errors. In the case of NomadFS, from a high level viewpoint, these threads of execution are individual clients, which can potentially access and modify the shared file data simultaneously. Synchronisation is often achieved using write locking of data so that only a single thread, or in this case client, can write to that data at a single point in time. Synchronisation in NomadFS was achieved in a number of steps.

As mentioned in Section 5.3.1, partial block operations, such as inode and bitmap operations, occur on the server. Because of this, synchronisation of these operations is already achieved, and merely involves internal server synchronisation.

Partial block operations are not the only operations that require synchronisation. Anything that the client does will require some scheme that will provide consistent, non-colliding data access and modification. There are a number of options available that could provide this functionality, including implementation of block locking or leasing, or even implementation of file based leasing. In NomadFS an inode level leasing method was chosen. This was implemented in a manner which allows for explicitly one writer or ‘n’ readers to operate on an individual file. Unix file system semantics state that any number of readers and writers can act on a file at any given time, that a read always returns the most recently written data, and that writes are always indivisible even if there are other writers. Because of its fixed number of readers and writers, NomadFS does cause a weakening of these semantics. Taking into account the use case, this was deemed acceptable.

When a client begins writing to a file it requests a lease for this file's inode from the server which holds the inode. Assuming no other client holds the lease, the server can then respond with a success message. While holding this lease the client can then proceed to perform its write operation and release the lease when finished. This means that multiple writers can have a file open at any given time, however only one write operation is allowed at a time. Also, because of the possibility of changes to block pointers within the period of a write operation, reads are not allowed during such a time. Leasing is also advantageous over any form of locking as it safely handles client failures (if a client fails, the lease will expire on timeout).

This high level description of synchronisation ignores the internal synchronisation requirements of shared data between a client or server process's threads. These are covered in Chapter 6.
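The write path under this scheme reduces to an acquire/write/release pattern. The sketch below is illustrative only; the lease_acquire, lease_release and do_write_blocks helpers are assumed stand-ins for the real client routines and network messages.

#include <stdint.h>

/* Assumed helpers standing in for the real lease and write messages. */
int lease_acquire(uint16_t server, uint64_t inode_id);   /* 0 on success */
void lease_release(uint16_t server, uint64_t inode_id);
int do_write_blocks(uint64_t inode_id, const void *buf, uint64_t len, uint64_t off);

int leased_write(uint16_t inode_server, uint64_t inode_id,
                 const void *buf, uint64_t len, uint64_t off)
{
    if (lease_acquire(inode_server, inode_id) != 0)
        return -1;                       /* another client holds the lease */

    int ret = do_write_blocks(inode_id, buf, len, off);

    lease_release(inode_server, inode_id);
    return ret;
}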

5.4.3 Block Mobility and Migration

One method for improving performance is to implement the ability for data to be dynamically migrated between machines, so as to reduce latency and improve throughput in repeat transactions. Such functionality is particularly important when dealing with files that are unable to be cached for the entire duration they are used at a particular client, for example when the file is larger than available memory or is repeatedly accessed over a long period of time.

As mentioned in Section 5.2.2, because of NomadFS's block-based nature, individual blocks are potentially mobile. Because of this NomadFS implements block migration of data blocks on read. This means that while a file is read by a client, the blocks that exist on remote servers are migrated to the local server. Subsequent reads can then be fetched from the faster local server.

5.4.4 Block Allocation

In an early revision of NomadFS it was noted that write operations performed poorly. After investigation this was found to be the result not of the write messages but of the allocation of blocks on the zone bitmap. The solution to this issue was to implement a block pooling scheme. Instead of requesting a single block from the server, a client will request an entire block of block pointers. These blocks are allocated in one step on the server and returned in one message, reducing the effect of latency on larger numbers of block allocations. Subsequent client block allocations can be taken from this pool of blocks.
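A minimal sketch of such a pooling scheme is shown below. The pool size and the request_block_batch() helper, which stands in for the batched allocation message to the server, are assumptions for illustration.

#include <stdint.h>
#include <stddef.h>

#define POOL_SIZE 128

struct block_pool {
    uint64_t ids[POOL_SIZE];   /* identifiers handed back by the server */
    size_t   count;            /* how many are still unused */
};

/* Assumed helper: asks a server to allocate up to n blocks in one message. */
size_t request_block_batch(uint16_t server, uint64_t *out, size_t n);

uint64_t pool_alloc_block(struct block_pool *pool, uint16_t local_server)
{
    if (pool->count == 0)      /* refill with a single round trip */
        pool->count = request_block_batch(local_server, pool->ids, POOL_SIZE);
    if (pool->count == 0)
        return 0;              /* allocation failed */
    return pool->ids[--pool->count];
}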

5.4.5 Prefetching

It is common for file systems to fetch data into the cache before it is actually required. This is aimed at improving performance in the common case that a file is read sequentially. The prefetched data can be fetched straight from the faster cache as opposed to from the device.

Whereas a standard file system might prefetch a small number of blocks, NomadFS implements an aggressive prefetching strategy where an entire file can, potentially, be fetched into the cache. Because network actions are considered to be the slowest area of the file system, such an approach to prefetching should provide reasonable performance related benefits, especially when network latency is an issue. This can however come at the cost of keeping the network busy when this occurs.

5.4.6 Scalability

There is no server-to-server communication, and communication between server and client only occurs if specifically required. This means that the only overhead the file system experiences from the addition of clients or servers is that from holding the client-server connections. Subsequently the file system's scalability is mostly dependent on the scalability of these connections. Scalability is further discussed in Section 7.3.

5.5 Summary

NomadFS is a primarily block-based distributed file system but, based on implementation experience, implements some metadata operations at the block server. This includes the partial block operations associated with inodes and bitmaps. This hybrid approach maintains the benefits of a block-based approach, such as the ability to easily split data from a single file across multiple servers, while also improving network usage and simplifying synchronisation.

Chapter 6

Implementation

Chapter 5 described the design of NomadFS. This chapter presents how these design elements were implemented.

NomadFS is a program which consists of approximately five thousand lines of C code, excluding tests. It was developed in user space Linux, using the FUSE file system kernel module. FUSE was chosen as the development platform as it provides an easier and faster file system development environment than the kernel. As mentioned in Section 2.4, FUSE does however add additional overhead to file system calls. In NomadFS's case, this was deemed acceptable due to transfer time being dominated by slow disk and network transactions.

Because of a lack of appropriate source code, for reasons such as the requirement for 64 bit identifiers, NomadFS was written completely from scratch. This includes the entirety of the block-based operations.

6.1 Clients and Block Servers

6.1.1 Client

[ File system calls from FUSE kernel module → FUSE endpoints → File system functions → Block abstraction → Messages to servers over the network ]

Figure 6.1: Client Architecture

So as to provide a clear separation of the various layers of client code, NomadFS’s client program was split into three distinct layers (see Figure 6.1 above). The three layers are covered as follows.

FUSE endpoints

The FUSE endpoints layer is a small abstraction layer which is called by the FUSE kernel module and passes the appropriate data on to the NomadFS file system operations.1

Each system call endpoint function is added to a struct fuse_operations structure so that the FUSE kernel module can identify which function in the client program should execute in response to the various system calls (see Section 2.1.1 for a list of important calls). For example, adding the entry .write = nomadfs_write to the structure results in FUSE calling the nomadfs_write function when a write system call is called on the file system.

Because NomadFS makes use of the higher level FUSE API, all system calls receive a string containing the absolute path of the respective file that the system call is associated with. This is as opposed to the inode structure that standard kernel space file systems receive from the VFS. The problem with this, given that NomadFS is block-based, is that the client must perform a complete traversal of the directory structure whenever a file is operated on. This however is not a particularly large problem, assuming that inode and block data is cached appropriately in the lower layers. There is potential here for improved performance through the use of caching of open files.

1 Source can be found in client/client.c, see Appendix E

File system operations

The core of NomadFS is a standard Unix type file system that implements all the standard block-based file system functions such as locating raw inodes and performing indirect block traversal. This is extended by distributed functionality such as a networked block abstraction, concurrency, coherency and block mobility.2

Implementation of these file system operations provided a significant challenge and consumed most of the time associated with NomadFS's current implementation. Since the client is designed to do ‘most’ of the work, this is not a surprising observation. An example of a challenge in the implementation was the file indirect blocks. These blocks allow a file to have more blocks associated with it than can fit inside the inode, and work by placing block pointers inside other blocks. If one block of indirect pointers is not enough then another level of indirection is used. A block-based file system needs to implement appropriate routines which can handle locating a block at an index, shrinking a file (freeing blocks and their indirect blocks) and extending a file (allocating blocks and their indirect blocks). Older file systems such as the MINIX file system have a limited number of indirect blocks (two in the MINIX case). Because of the potential for very large files, NomadFS currently implements a maximum of four layers of indirection, allowing for a maximum file size of approximately 255 TB with 4 kB blocks. This was written in a manner which means that this can be increased, allowing for potentially much larger file sizes.

Another important point in the file system operations is the need to avoid unnecessary computational complexity. For example, allocating blocks in the bitmap by searching for the first zero bit from the start of the bitmap is O(n). If however the file system keeps track of the last allocated or freed block, the allocation process is average O(1) and worst case O(n). Points such as this become increasingly important with larger files.

Because free block operations should never fail, they were implemented in a manner which requires no response from the server. This results in a speed up in operations such as file deletions and truncations because the client does not need to wait for responses from the server. However, when deleting or truncating a large file (i.e. greater than 1 GB), it is possible for the client to get sufficiently far in front of the server's block freeing process to cause time outs on subsequent operations. Because of this it was necessary for the client to synchronise its own freeing progress with that of the server.

2 The code which performs these operations is mainly contained within the single source file common/bitmap.c
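The allocation hint mentioned above can be illustrated with a small in-memory sketch. The on-disk bitmap in NomadFS is of course spread over bitmap blocks, so the structure and function names here are assumptions for illustration.

#include <stdint.h>
#include <stddef.h>

struct bitmap {
    uint8_t *bits;    /* one bit per block: 1 = allocated */
    size_t   nbits;
    size_t   hint;    /* position to start searching from */
};

static int test_bit(const struct bitmap *bm, size_t i)
{
    return (bm->bits[i / 8] >> (i % 8)) & 1;
}

static void set_bit(struct bitmap *bm, size_t i)
{
    bm->bits[i / 8] |= (uint8_t)(1u << (i % 8));
}

/* Returns the index of a newly allocated block, or (size_t)-1 if full.
 * Starting at the hint gives average O(1), worst case O(n) behaviour. */
size_t bitmap_alloc(struct bitmap *bm)
{
    for (size_t n = 0; n < bm->nbits; n++) {
        size_t i = (bm->hint + n) % bm->nbits;
        if (!test_bit(bm, i)) {
            set_bit(bm, i);
            bm->hint = (i + 1) % bm->nbits;   /* next search starts here */
            return i;
        }
    }
    return (size_t)-1;
}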

Block abstraction

The main purpose of the block abstraction layer is to transform block requests into network messages which are sent to the appropriate network address. Because of this, the block abstraction layer is where all of the client communication code is located.3 This layer also handles block and inode caching, which is covered in Section 6.4. An example of an operation that this layer provides is the inode operation, which has the following function definition:

int network_inode_operation(struct nomadfs_data *w_data, uint64_t index,
                            int operation, int options,
                            struct nomadfs_inode *inode);

6.1.2 Block Server

The block server is a much simpler program than the client. This is because its primary focus is to serve a device's blocks over the network, a simple task for a user space Linux program because everything, including raw block devices, is represented as a file.4 As seen in Section 5.3.1, the communication API was extended so that inode and bitmap operations occur on the block server. Because of the layered approach taken in the client, the server simply reuses the client file operation code to perform these operations. In a similar fashion to GlusterFS [9], NomadFS's block server also implements block caching so as to reduce the number of slower disk reads.
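Because Linux exposes raw block devices as files, serving a block amounts to little more than a read at the right offset. A minimal sketch, with hypothetical names and a fixed 4kB block size (the actual server code is in server/blockserver.c):

#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <stdint.h>

#define BLOCK_SIZE 4096

/* Open the backing store; works for regular files and raw devices alike. */
static int open_device(const char *path)
{
    return open(path, O_RDWR);
}

/* Read block 'index' from the device into 'buf' (BLOCK_SIZE bytes).
 * Returns 0 on success, -1 on error. */
static int read_block(int dev_fd, uint64_t index, void *buf)
{
    off_t offset = (off_t)index * BLOCK_SIZE;
    ssize_t n = pread(dev_fd, buf, BLOCK_SIZE, offset);
    return n == BLOCK_SIZE ? 0 : -1;
}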

6.1.3 Locality and Client start up

The current implementation of NomadFS uses a static configuration file to provide the clients with server information. When a client starts up, it reads the configuration file and creates connections to the servers specified there.

3Found in common/w_network.c
4Found in server/blockserver.c

It then requests the superblocks from these servers. The superblock information is sufficient to find the root inode (the directory "/") on server zero. Because all sub-directories and sub-files stem from this single directory, it is then possible to traverse and access the entire structure of the file system.

6.2 Communication

As explained in Chapter 5, NomadFS requires a method of reliable communication between client and server. This was achieved using messaging over the transport protocol TCP. Because network communication is critical to the performance of a networked file system, it was important that communication was carefully planned and executed.

6.2.1 Transport Protocol

When deciding on a transport protocol for communication, the main options were TCP and UDP. Although UDP is often used in real-time applications and operates in datagrams, which would suit messages, it is unreliable. Reliability is required so that messages are guaranteed to reach their destination. If NomadFS were to use UDP, it would need to implement its own reliability and fragmentation mechanisms. Such an implementation was not a focus of this project and would be unlikely to perform as well as TCP's inbuilt reliability and fragmentation, so TCP was chosen as the transport protocol. A limitation of TCP, however, is that it is stream based, meaning that messages do not fit naturally into its structure. This is not particularly problematic and was resolved by requiring that every message begin with a fixed-size header, from which the length of the remainder of the message can be calculated.
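The framing approach can be sketched as follows: read exactly the fixed-size header, derive the remaining length from it, then read exactly that many bytes. The length calculation here is a placeholder (a single hypothetical flag bit and an assumed 4096-byte data block); the real rule is defined by the NomadFS header described in Section 6.2.2.

#include <sys/socket.h>
#include <sys/types.h>
#include <stdint.h>
#include <stddef.h>

#define HEADER_SIZE 12
#define DATA_BLOCK  4096   /* assumed block size for this sketch */

/* Read exactly 'len' bytes from a stream socket, looping over short reads. */
static int read_full(int sock, void *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = recv(sock, (char *)buf + done, len - done, 0);
        if (n <= 0)
            return -1;              /* connection closed or error */
        done += (size_t)n;
    }
    return 0;
}

/* Hypothetical length rule: assume one flag bit marks data-carrying messages. */
static size_t body_length(const uint8_t header[HEADER_SIZE])
{
    return (header[1] & 0x01) ? DATA_BLOCK : 0;
}

/* Receive one message: the fixed-size header first, then the remainder. */
static int read_message(int sock, uint8_t header[HEADER_SIZE], void *body)
{
    if (read_full(sock, header, HEADER_SIZE) < 0)
        return -1;
    size_t len = body_length(header);
    return len ? read_full(sock, body, len) : 0;
}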

6.2.2 Messages

Type/Flags (4 bytes) | Identifier (8 bytes) | Data Block (4096 bytes)

Figure 6.2: Message layout in NomadFS (Data Block not to scale)

NomadFS messages vary depending on their purpose, but generally consist of a 12 byte header and an optional data block.

The header always contains 4 bytes of message type and flag information. The implementation currently uses only the first two of these bytes: the first byte indicates the type of the message and the second contains flags, which carry information such as whether to request an inode lease or whether an error occurred when processing the previous request. The remaining header bytes are intended to allow further extension of the communication API, for example additional 'piggybacking' of messages.
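Interpreting the description above, the header could be represented as in the sketch below. The field names and the use of the two spare bytes are assumptions for illustration; the authoritative layout is the one in the NomadFS source.

#include <stdint.h>

/* 12-byte message header: 4 bytes of type/flag information followed by an
 * 8-byte identifier (for example a block or inode number). */
struct nomadfs_msg_header {
    uint8_t  type;        /* message type */
    uint8_t  flags;       /* e.g. lease request, error on previous request */
    uint8_t  reserved[2]; /* currently unused, kept for future extensions */
    uint64_t identifier;  /* block/inode identifier, network byte order */
} __attribute__((packed));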

6.2.3 Common Client and Server Communication

The client and server share a common communication API, and to maximise code reuse the two programs share many common network functions, such as the function that reads a message from a socket. Because NomadFS is designed to run between machines over a network, it is important that varying host byte orders are accounted for. This is achieved through serialisation, where data is transferred using a common network byte order.
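For example, 64-bit identifiers can be converted to and from a common big-endian wire representation before being copied into a message buffer. A sketch using the glibc conversion helpers (the function names here are illustrative):

#include <endian.h>   /* htobe64 / be64toh (glibc) */
#include <string.h>
#include <stdint.h>

/* Write a 64-bit identifier into a message buffer in network byte order. */
static void put_identifier(uint8_t *buf, uint64_t id)
{
    uint64_t wire = htobe64(id);
    memcpy(buf, &wire, sizeof(wire));
}

/* Read a 64-bit identifier back into host byte order. */
static uint64_t get_identifier(const uint8_t *buf)
{
    uint64_t wire;
    memcpy(&wire, buf, sizeof(wire));
    return be64toh(wire);
}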

Nagle’s algorithm

Nagle's algorithm is a mechanism in modern TCP implementations which attempts to combine small packets into single larger ones to avoid unnecessary bandwidth consumption [14]. This, however, comes at the cost of increased latency, as the TCP send buffer waits for more data before transmitting. Because of this, and the fact that NomadFS often transmits small messages such as read requests, Nagle's algorithm was disabled to avoid unnecessary latency.
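Disabling Nagle's algorithm is a one-line socket option applied to each connected socket; a minimal example:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle's algorithm so small request messages are sent immediately. */
static int disable_nagle(int sock)
{
    int one = 1;
    return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}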

6.2.4 Client Network Queue

Because a file system must be able to handle multiple concurrent system calls, it is important that the client's network messaging can be performed asynchronously. Generally the best approach to asynchronous networking is to implement a network queue. The design of the network queue in NomadFS is shown in Figure 6.3.

Figure 6.3: Network queueing (1: file system call; 2: queue request; 3: request thread sends the request message over the network; 4: wait for response; 5: response message received over the network; 6: receive/process thread dequeues the matching request; 7: release waiter; 8: process response; 9: respond to the file system call)
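A minimal sketch of the queueing pattern in Figure 6.3, with hypothetical structure names: the calling thread queues and sends its request, then blocks on a condition variable, while the receive thread matches the incoming response to the outstanding request and wakes the waiter.

#include <pthread.h>
#include <stdbool.h>

struct request {
    /* ... message buffers, identifier, etc. ... */
    bool            done;
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    struct request *next;   /* list of outstanding requests */
};

/* Called by the file system thread after queueing and sending the request. */
static void wait_for_response(struct request *req)
{
    pthread_mutex_lock(&req->lock);
    while (!req->done)
        pthread_cond_wait(&req->cond, &req->lock);
    pthread_mutex_unlock(&req->lock);
}

/* Called by the receive thread once the matching response has been read
 * from the socket and copied into the request. */
static void release_waiter(struct request *req)
{
    pthread_mutex_lock(&req->lock);
    req->done = true;
    pthread_cond_signal(&req->cond);
    pthread_mutex_unlock(&req->lock);
}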

6.2.5 Overlapped IO

Figure 6.4: Overlapped IO (a system call spanning multiple blocks is split by the client into individual block requests and individual block responses between application, client and server; adapted from Figure 2.4 [12])

While synchronously sending a read or write block request and waiting for the response is adequate for single block system calls, network latency makes this approach unsatisfactory when a system call requests more than a single block. Because of this, NomadFS implements overlapped IO, as shown in Figure 6.4, so as to improve performance. This functionality is made possible by the network queue implementation described in Section 6.2.4.

6.2.6 Block server specific communication

Because the current block server implementation runs in a single thread, a message queueing system is not required. A single-threaded approach was chosen for its simplicity and could potentially be extended to multiple threads to improve performance. The nature of the communication API means that there is very little server specific communication code: the server simply reuses the networking code that the client uses, after using select5 over the open TCP sockets.
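A single-threaded server can multiplex its client sockets with select before handing each ready socket to the shared message-reading code. A minimal sketch with a hypothetical handle_message helper:

#include <sys/select.h>

/* Hypothetical helper that reads and processes one request on a socket. */
static void handle_message(int sock);

/* One pass of the server loop: wait until any client socket is readable,
 * then service the ready sockets in turn. */
static void serve_once(const int *socks, int count)
{
    fd_set readfds;
    int maxfd = -1;

    FD_ZERO(&readfds);
    for (int i = 0; i < count; i++) {
        FD_SET(socks[i], &readfds);
        if (socks[i] > maxfd)
            maxfd = socks[i];
    }

    if (select(maxfd + 1, &readfds, NULL, NULL, NULL) <= 0)
        return;

    for (int i = 0; i < count; i++)
        if (FD_ISSET(socks[i], &readfds))
            handle_message(socks[i]);
}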

6.3 Synchronisation

Synchronisation of shared data is achieved on two levels in NomadFS. The first is distributed synchronisation, which is concerned with avoiding collisions between different clients. The second is the internal synchronisation of data structures within the individual client and server programs.

6.3.1 Distributed Synchronisation

Distributed synchronisation of a distributed file system can be complicated to achieve. Section 5.4.2 explained how it is approached in NomadFS, but not how it is implemented. Most importantly, NomadFS's synchronisation is file based, and allows either any number of readers or a single writer to act on a file.

When writing a file, a client holds a lease on that file. The lease expires after a set period of time or when the client notifies the server that it has finished working on the file. When beginning a write, the client must obtain the inode for the file it is writing. NomadFS 'piggybacks' a lease request on top of this inode request, and the server replies with a response message indicating whether the lease attempt was successful.

5Unix system call which allows examination of multiple file descriptors (such as TCP sockets).

When the client has finished the write operation, it notifies the server with a similar message indicating that the server should free the lease. A background thread in the client keeps track of the leases currently held and renews them when required. Because of this, an expired lease generally indicates that the client has closed unexpectedly.

While leasing guarantees that only a single client can write to a file at any given time, a read reference count is needed so that a writer cannot operate while there are active readers. In a similar manner to the lease messages, clients indicate when they are beginning and ending a read operation by messaging the server that holds the file's inode. The server keeps a count of the clients currently reading any given file and will not issue write leases while this count is greater than zero. To safely handle the failure of clients, this read reference count is decremented periodically by a background thread. Figure 6.5 shows the cases in which a client will be denied read or write access.

Figure 6.5: Synchronisation (the server tracks a read reference count and a write-lease flag per file; a request for a write lease is denied while the file has readers or is already write leased, and a request for read access is denied while the file is write leased, otherwise the read reference count is incremented)
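The server-side decision illustrated in Figure 6.5 reduces to a small check over per-file state. A sketch with assumed field names, omitting the locking and lease-expiry bookkeeping:

#include <stdbool.h>
#include <stdint.h>

/* Per-inode synchronisation state held by the server that owns the inode. */
struct file_sync_state {
    uint32_t read_refs;     /* clients currently reading the file */
    bool     write_leased;  /* a client currently holds the write lease */
};

/* A write lease is granted only if nobody is reading or writing. */
static bool grant_write_lease(struct file_sync_state *s)
{
    if (s->read_refs > 0 || s->write_leased)
        return false;
    s->write_leased = true;
    return true;
}

/* Read access is granted only if no writer holds the lease. */
static bool grant_read_access(struct file_sync_state *s)
{
    if (s->write_leased)
        return false;
    s->read_refs++;
    return true;
}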

6.3.2 Internal Synchronisation

Synchronisation of the internal data structures in the client is achieved entirely using critical regions protected by mutexes. A critical region is a region of code which only allows a single thread access at any given time. Because NomadFS was developed in user space, the POSIX Threads standard could be used to achieve this.
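With POSIX threads this amounts to wrapping each access to shared state in a lock/unlock pair; a trivial example with an illustrative shared counter:

#include <pthread.h>

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static int cached_blocks;   /* example shared state */

static void account_cached_block(void)
{
    pthread_mutex_lock(&cache_lock);    /* enter the critical region */
    cached_blocks++;                    /* only one thread here at a time */
    pthread_mutex_unlock(&cache_lock);  /* leave the critical region */
}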

6.4 Cache

Figure 6.6: Buffer Cache (cached blocks are indexed by a hash table and chained into lists, ordered between front and rear pointers; adapted from Fig. 5-20 [20])

Because the FUSE API does not provide any mechanism for directly accessing the kernel buffer cache, a user space cache was implemented. As this was not a primary focus of the project, the cache was written as a simple buffer cache which works at the block level, in a similar manner to the one in the MINIX operating system [20]. Block identifiers are hashed using a pre-specified number of low order bits. This hash provides the index into an array of pointers to blocks in the cache, and colliding blocks (those with identical low order bits) are linked into a list. An advantage of a custom cache implementation is complete control over how it works, and the same structure could be used for the inode cache, the client block cache and the block server cache.
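A sketch of the lookup path under these assumptions (the names and the table size are illustrative): the low-order bits of the block identifier index a table of chains, and colliding entries are searched linearly.

#include <stdint.h>
#include <stddef.h>

#define HASH_BITS 10
#define HASH_SIZE (1u << HASH_BITS)
#define HASH_MASK (HASH_SIZE - 1)

struct cached_block {
    uint64_t             id;          /* global block identifier */
    unsigned char        data[4096];
    struct cached_block *hash_next;   /* chain of colliding blocks */
};

static struct cached_block *hash_table[HASH_SIZE];

static struct cached_block *cache_lookup(uint64_t id)
{
    struct cached_block *b = hash_table[id & HASH_MASK];
    while (b != NULL && b->id != id)
        b = b->hash_next;
    return b;   /* NULL on a cache miss */
}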

6.4.1 Cache coherency

As explained in Section 5.4.1, cache coherency in NomadFS is achieved using a file-based cache invalidation scheme. This section explains how this invalidation scheme is implemented. Firstly, the client needs a method of keeping track of the cached blocks that belong to a specific file (or inode). Each cached inode has a table of associated blocks, which are added as they are cached. On receiving an invalidation message from the server, the client can then free the inode and all the blocks in the table. The alternative approach would be to step through all the block pointers in the file and invalidate those. This would not be appropriate because of the potential overhead on large files and because of the likelihood of a concurrent update on the indirect blocks during this process.

Reading these indirect blocks would also require a read reference increment, which would not be possible because the invalidation is most likely the result of a write operation that is still in progress. For invalidations to occur, the server which contains the file's inode must keep track of which clients currently have the file's data cached. This is done by keeping a table of clients associated with each inode. A client is added whenever it reads the inode, which works because a client must read the inode in order to gain access to the block pointers. Since TCP does not support any multicast mechanism, these invalidation messages must be sent sequentially to the appropriate clients. When a client drops its connection it is removed from the table.
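A sketch of the client-side bookkeeping described above, with assumed structure names and hypothetical cache primitives: each cached inode carries a table of the block identifiers cached on its behalf, and an invalidation message simply walks this table.

#include <stdint.h>
#include <stddef.h>

#define MAX_CACHED_BLOCKS 1024   /* illustrative limit */

struct cached_inode {
    uint64_t ino;
    uint64_t blocks[MAX_CACHED_BLOCKS];  /* identifiers cached for this file */
    size_t   nblocks;
};

/* Hypothetical cache primitives provided elsewhere in the client. */
void cache_drop_block(uint64_t block_id);
void cache_drop_inode(uint64_t ino);

/* Handle an invalidation message for one file: drop every cached block
 * recorded against the inode, then drop the inode itself. */
static void invalidate_file(struct cached_inode *ci)
{
    for (size_t i = 0; i < ci->nblocks; i++)
        cache_drop_block(ci->blocks[i]);
    ci->nblocks = 0;
    cache_drop_inode(ci->ino);
}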

6.5 Block migration

NomadFS was originally developed so that migrations occurred on every block read. During development, however, it became apparent that migration introduced a noticeable amount of overhead. Because of this, two further migration schemes were added: migration on randomly selected reads, and migration by a dedicated background thread. The performance of each scheme is affected by the overhead of migrating blocks, with the dedicated thread method aimed at migrating the maximum number of blocks without degrading performance. Because of the block-based nature of the file system, block migrations are mostly simple and involve moving the block's data to the local machine and swapping the pointer to this block. The pointer to a block can exist either directly in an inode or in an indirect block. Figure 6.7 illustrates the migration flow.

Figure 6.7: Migration flow (read block; check the owning server; if the block is not local, read the block data, allocate a local block, copy the data, compare-and-set the block pointer and free the remote block; roll back on error)

Although the process of migrating an individual block is straightforward, the problem becomes more complicated when considering the effects of migration on the client caches and their coherency. The solution NomadFS currently employs is to invalidate the file's client caches whenever any of its data is migrated.
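A sketch of the per-block flow in Figure 6.7, with hypothetical helpers standing in for the block and inode layers; the pointer swap is the compare-and-set step, and a failure at any stage rolls the local allocation back.

#include <stdint.h>

/* Hypothetical helpers provided by the block and inode layers. */
int  read_remote_block(uint64_t remote_id, void *buf);
long allocate_local_block(void);
int  write_local_block(uint64_t local_id, const void *buf);
int  compare_and_set_pointer(uint64_t *ptr_location, uint64_t expected,
                             uint64_t new_value);
void free_block(uint64_t id);

/* Migrate one block to the local server. 'ptr_location' is the block
 * pointer (in the inode or an indirect block) that currently holds
 * 'remote_id'. Returns 0 on success. */
static int migrate_block(uint64_t *ptr_location, uint64_t remote_id)
{
    unsigned char buf[4096];

    if (read_remote_block(remote_id, buf) < 0)
        return -1;

    long local_id = allocate_local_block();
    if (local_id < 0)
        return -1;

    if (write_local_block((uint64_t)local_id, buf) < 0 ||
        compare_and_set_pointer(ptr_location, remote_id,
                                (uint64_t)local_id) < 0) {
        free_block((uint64_t)local_id);   /* roll back on error */
        return -1;
    }

    free_block(remote_id);   /* the data now lives locally */
    return 0;
}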

6.6 Aggressive Prefetching

Another feature of NomadFS aimed at improving performance is aggressive prefetching of file data. This is currently implemented as a background thread which reads ahead, up to a maximum of half the cache size, in the file being read, in the hope that these will be the next blocks requested. It is not entirely obvious how such a system should interact with migration, and the prefetcher does not currently cooperate with migration at all in NomadFS.

6.7 Issues and Challenges

The implementation of NomadFS presented a number of significant challenges. These resulted from the eventual size and complexity of the system, which must handle not only low level block-based file system operations but also the higher level complexities of a distributed system. It is also important to note that while NomadFS is functional and stable in the test environment, it still has some incomplete aspects. These are mostly a lack of complete error and edge case handling, aspects which can account for a large proportion of the implementation effort in production file systems.

6.8 Summary

NomadFS has involved a significant amount of development effort and has successfully been implemented to the point where it is in a usable state. This includes functional cache coherency, synchronisation and block migration.

Chapter 7

Evaluation

This chapter presents an evaluation of NomadFS's performance so as to gauge whether the goals from Chapter 3 have been met. Because of time constraints, a complete evaluation of all the file system operations was not possible. Instead, this evaluation focuses on read and write performance as well as the effect of migration on remote reads. Since block size can influence performance, a comparison of performance with varying block sizes was also performed. Another important goal was scalability, so this was tested using the resources available. Tests were performed on NomadFS in a stable, but still incomplete, state. It is important to note that little performance tuning had been performed prior to these tests, meaning that further performance improvement is almost certainly available.

7.1 Test Environment

The test environment consisted of five machines, four of which were directly connected to a central node, all using Gigabit Ethernet. To keep testing consistent, the three identical machines (shown as grayed boxes in Figure 7.1) were used for all IO tests. These machines each had 2GB of memory and a single hard disk drive with a 20 GB partition dedicated to NomadFS. All tests were performed using the Unix data copy utility dd. Unless otherwise stated, tests were performed with a 400MB client cache and 512MB files, so that the influence of the client cache was reduced. The underlying disk subsystem was first tested for its sequential read and write performance; sequential reads and writes both operated at approximately 41 MB/s.

Figure 7.1: Test Environment (four machines connected to a central node over Gigabit Ethernet)

Although not a preferred test environment, it still provided a means of testing the performance of NomadFS in a worst case scenario (fast network with slow disks). The design of NomadFS suits an environment that can make the most gains through the use of local disks (an environment limited by network bandwidth). Note that, because of complications resulting from the underlying kernel buffer cache, a number of the following results will appear higher than one might expect.

7.2 Migration

Migration analysis was performed using two machines, each running a server and a client. To test the performance of migrations, a 512MB file was created on the second machine and read multiple times on the first machine. When reading the file, migrations should result in improved performance on the following reads. Figure 7.2 shows the effect of block migration on read performance. With no migration, read operations on the 512 MB file averaged a throughput of approximately 50 MBytes/s. 'Migrate all' shows the performance when migrating all data blocks to the local server. Although throughput improved after the initial read, to approximately 65 MBytes/s, the initial read showed a large amount of overhead. Because of this overhead, two other migration schemes were explored that spread the migration cost over a number of read operations. The first is the migration of blocks on randomly selected system calls.

Figure 7.2: Migration Performance (average throughput in MB/s over 20 repeated reads of a 512MB file with no client cache, for the No Migration, Migrate All, Migrate Random (1/8) and Migrate Background schemes)

This scheme is aimed at migrating only a fraction of the remote blocks to the local server on each read. In the results shown in Figure 7.2, one out of every eight system calls had its blocks migrated. This resulted in performance which gradually improved as more and more blocks were migrated to the local server. The other migration scheme implemented performs migrations in a background thread. This background thread, like the prefetch thread from Section 6.6, only performs operations when the system is otherwise idle. As a result it migrates the maximum number of blocks without adversely affecting performance, as can be seen in Figure 7.2. Given that blocks should move to the clients which use them the most, the latter two schemes make the most practical sense, as over time blocks will migrate to the servers which make the most use of them. The 'migrate all' scheme would, however, be more appropriate in scenarios where a single machine will perform many reads on the remote file.

7.3 Scalability

Because of its decentralised design, adding clients and servers to NomadFS should not adversely affect performance. This was tested by taking IO measurements of file writes, reads and re-reads with varying numbers of clients and servers.

The results for a file smaller than the client cache are shown in Figure 7.3, and Figure 7.4 shows the same measurements for a file larger than the client cache. The two file sizes were chosen to show scalability for files both smaller and larger than the client cache.

Figure 7.3: Scalability on file smaller than cache (average throughput in MB/s of write, read and re-read of a 256MB file with a 400MB client cache, against the number of client-server pairs)

Figure 7.4: Scalability on file larger than cache (average throughput in MB/s of write, read and re-read of a 512MB file with a 400MB client cache, against the number of client-server pairs)

Figures 7.3 and 7.4 show that performance does not decrease with the addition of new clients and servers. The exception is the three client-server configuration in Figure 7.4, which strangely shows decreased re-read performance even when the test was repeated five times; this particular result requires further investigation. It is also worth noting that the 256MB file test was performed on faster, more modern hardware and achieved much higher cache throughput (approximately 900 MB/s).

Although scalability is acceptable with five machines, an environment with more machines would be needed if further scalability tests were to be performed.

7.4 Effect of block size on performance

NomadFS supports different block sizes, and the underlying block size of the system can potentially affect performance. Because of this, write, read and re-read performance tests were performed with varying block sizes.

Figure 7.5: Effect of block size on performance (average throughput in MB/s of write, read and re-read of a 512MB file with a 400MB client cache, for block sizes from 4k to 128k)

Figure 7.5 shows that increasing the block size can improve file IO performance on large files. There are a number of possible reasons for this improvement, including:

Less indirection Larger blocks mean that indirect blocks can hold more pointers.

Less message passing System calls contain larger requests, so fewer reference count and lease messages are sent.

Less message header overhead Larger blocks mean that individual data messages have a smaller proportion of header information.

Larger sequential operations on the underlying disk The underlying disk subsystem deals with fewer, larger sequential operations, something that favours traditional hard disks.

Increasing the block size does, however, increase the potential for wasted disk storage. Larger block sizes are also less suitable when the file system deals with many small file operations.

7.5 NFS Comparison

Because NFS is commonly used in small scale distributed scenarios, it was worthwhile performing a small comparison between NomadFS and NFS. This test consisted of two machines, one running a server and one a client.

File Size        NFS                           NomadFS (128k block)
                 Write    Read    Re-Read      Write    Read    Re-Read
1024MB           44.8     116     1186         25.1     101     101
2048MB           42.2     38.9    40.0         24.5     51.0    53.7
4096MB           42.9     38.9    42.2         19.9     43.4    44.5

Table 7.1: Effect of file size on performance (MB/s)

The results in Table 7.1 show a couple of interesting points. Firstly, in NFS the kernel buffer cache allows high re-read performance on files that are smaller than the machine memory (2 GB). In terms of raw read performance, NomadFS compares favourably on larger files. The results also show that NomadFS's write performance decreases with file size. This is currently believed to be the result of block allocation and requires further investigation.

7.6 IOZone

The final test performed was the IOZone file system benchmark [15]. This was run with five machines and file sizes ranging up to 512MB. Because of the range of results IOZone produces, only the write, random write, read and random read performance of NomadFS is displayed here.1 These are shown in Figures 7.6, 7.7, 7.8 and 7.9.

1Full results can be viewed in Appendix C

Figure 7.6: IOZone Write (throughput in Kbytes/sec against file size and record size)

Figure 7.7: IOZone Random Write (throughput in Kbytes/sec against file size and record size)

Figure 7.8: IOZone Read (throughput in Kbytes/sec against file size and record size)

Figure 7.9: IOZone Random Read (throughput in Kbytes/sec against file size and record size)

Interestingly, the results show that random writes and reads performed better than their sequential counterparts. The reasons for this are unknown at this stage, but it could be the result of a poorly performing client cache. Also note the drop-off in read performance as the file size becomes greater than the client cache size (400MB). A similar drop can be seen in write performance, possibly showing the effect of overwriting data in the client cache.

7.7 Summary

Although the performance analysis of NomadFS is not complete (other tests were run, but could not be reported due to time constraints), the results so far show some potential in its design. They show that NomadFS is currently in a usable state, with the migration results demonstrating the benefits that migration can provide. Performance tuning of NomadFS has not yet been carried out, which indicates that further performance can be gained from further development. It is also worth noting that none of these tests have shown the true potential of the distributed nature of NomadFS. Whereas a shared storage solution such as NFS is likely to degrade in performance as client machines are added, NomadFS should continue to perform well.

Chapter 8

Conclusions and Future Work

8.1 Summary

This report has described the design and implementation of NomadFS, a primarily block-based distributed file system aimed at improving the performance of shared data in small scale distributed environments. NomadFS allows multiple client machines to concurrently make use of a shared file system that is distributed across multiple server machines. Among other features, NomadFS is able to migrate blocks to the machines which make the most use of them, so as to improve performance.

8.2 Conclusion

NomadFS is a research implementation that was developed to test the feasibility of a block-based approach to a small scale distributed file system. It has shown that this approach is indeed possible and has performed well enough to suggest that it has potential benefits. Data migration, a central design feature of NomadFS, has also been shown to provide performance benefits. The implementation is complete in the sense that it is stable, can be benchmarked and can handle all the required Unix file system calls. As a consequence of the development time frame, however, it is not intended for a production environment. For that to be possible, further work on fault tolerance (for example through replication) and error handling would be required.

8.3 Future Work

There are a number of possible extensions that could be applied to NomadFS so as to improve various aspects of its implementation.

8.3.1 Potential Extensions

Replication Replication refers to the ability for the same data to exist in multiple locations. The ability to replicate data across multiple machines can provide a number of benefits, but comes at the cost of requiring appropriate consistency mechanisms. In relation to migration, replication would allow data to migrate to the local disks of all the machines that make use of it, improving IO performance on all of those machines instead of on individual machines, as is currently the case in NomadFS. As mentioned in Chapter 3, distributed file systems often focus on fault tolerance. Data replication is one technique which can be adopted to handle faults and therefore improve the availability of the system. In NomadFS, the root inode provides a specific example of where replication could provide further benefits. Currently the root inode is located at a fixed position on the first server, which makes the first server a single point of failure in the file system. The ability to replicate this inode over multiple machines would improve this aspect of NomadFS.

Inode migration Although an inode is only a small piece of data, migration of inodes could bring a number of performance benefits. Currently NomadFS must message the server that contains an inode twice on every system call (because of leasing and reference counting), even if the inode is cached. Migrating the inode would mean that these messages could be directed at the local server, reducing the effect of network latency on performance. Inode migration would, however, require a means of tracking the directory entries that point to each inode.

Improved caching File-based cache invalidation is overly conservative in a number of situations, such as when making small modifications to a large file (for example a database file). The ability to invalidate 'chunks' within a file would allow cached data to be kept for a longer period of time, and chunk-based invalidation would also better suit NomadFS's migration functionality.

Another potential improvement to the current caching scheme would be to cache local blocks for less time than remote blocks. This would help to improve the benefits that caching provides, as the slower-to-fetch remote blocks would be more likely to remain in the cache.

Local disk caching Machine memory is generally much smaller than disk space, and the migration results show that moving data to the local disk does provide performance improvements. Because of this, the local disk could potentially be used as an extension of the memory cache and provide performance benefits on files larger than the memory size.

Striping Because of the block-based nature of NomadFS, it would be naturally suited to striping of data across multiple servers. The advantages of this have been shown in Section 4.4.

Distributed Memory Since virtual memory data structures operate in a similar manner to those in block-based Unix file systems, it is conceivable that a distributed system such as NomadFS could be extended to provide a form of distributed virtual memory. This would, however, require significant performance improvements to become viable.

Further Performance Analysis The performance analysis of NomadFS (in Chapter 7) was incomplete in that it did not show the benefits of its distributed nature. Analysing the system under parallel workloads would give a better view of how well it performs in a distributed setting. There are also incomplete aspects of NomadFS, such as the write performance issues, and these tests should be repeated once those aspects are completed or improved.

8.4 Final Words

NomadFS has been successfully implemented to the point where some strengths in its design have been demonstrated. This provides a strong footing for future research in the area of small scale block-based distributed file systems.

Bibliography

[1] Daniel Bovet and Marco Cesati. Understanding The Linux Kernel. O'Reilly & Associates Inc, 2005.

[2] Remy Card, Theodore Ts'o, and Stephen Tweedie. Design and implementation of the second extended filesystem. 90-367-0385-9. http://tldp.org/LDP/khg/HyperNews/get/fs/ext2intro.html, retrieved on 07/05/2012.

[3] Brian Cornell, Peter A. Dinda, and Fabián E. Bustamante. Wayback: a user-level versioning file system for Linux. In Proceedings of the annual conference on USENIX Annual Technical Conference, ATEC '04, pages 27–27, Berkeley, CA, USA, 2004. USENIX Association.

[4] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37(5):29–43, October 2003.

[5] Ibrahim F. Haddad. PVFS: A parallel virtual file system for Linux clusters. Linux J., 2000(80es), November 2000.

[6] John H. Hartman. The Zebra Striped Network File System. PhD thesis, The University of California at Berkeley, 1994.

[7] John H. Hartman and John K. Ousterhout. The Zebra striped network file system. ACM Trans. Comput. Syst., 13(3):274–310, August 1995.

[8] IEEE. The Open Group Base Specifications Issue 7, IEEE Std 1003.1. 2008.

[9] Red Hat Inc. Gluster file system, 2012. http://www.gluster.org.

[10] Ajay D. Kshemkalyani and Mukesh Singhal. Distributed Computing: Principles, Algorithms, and Systems. Cambridge University Press, New York, NY, USA, 1st edition, 2008.

[11] Robert Love. Linux Kernel Development. Addison-Wesley Professional, 3rd edition, 2010.

[12] Anthony J McGregor. Block-Based Distributed File Systems. PhD thesis, The University of Waikato, 1997.

[13] Sun Microsystems. NFS: Network File System Protocol specification. RFC 1094, United States, 1989.

[14] J. C. Mogul and G. Minshall. Rethinking the TCP Nagle algorithm. SIGCOMM Comput. Commun. Rev., 31(1):6–20, January 2001.

[15] W. Norcott. The IOzone filesystem benchmark. http://www.iozone.org.

[16] David A. Patterson, Garth Gibson, and Randy H. Katz. A case for redundant arrays of inexpensive disks (RAID). SIGMOD Rec., 17(3):109–116, June 1988.

[17] Aditya Rajgarhia and Ashish Gehani. Performance and extension of user space file systems. In Proceedings of the 2010 ACM Symposium on Applied Computing, SAC ’10, pages 206–213, New York, NY, USA, 2010. ACM.

[18] S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, Sun Microsystems Inc., C. Beame, Hummingbird Ltd., M. Eisler, D. Noveck, and Network Appliance Inc. Network file system (NFS) version 4 protocol. RFC 3010, United States, 2003.

[19] A. S. Tanenbaum and R. van Renesse. Distributed operating systems. ACM Computing Surveys, 17(4):419–469, December 1985.

[20] Andrew S Tanenbaum and Albert S Woodhull. Operating Systems Design and Implementation (3rd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2005.

[21] Erez Zadok, Rakesh Iyer, Nikolai Joukov, Gopalan Sivathanu, and Charles P. Wright. On incremental file system development. Trans. Storage, 2(2):161–196, May 2006.

Glossary

context switch Process in which the current CPU state is saved, and another state is restored. The original state can be restored at a later time.

file descriptor An identifier for a file in Unix systems.

inode A data structure which stores information about a file.

kernel space Memory space where the kernel operates.

Linux A portable Unix-like operating system.

MINIX A Unix-like operating system with a microkernel architecture.

POSIX Portable Operating System Interface. Standards aimed at maintaining compatibility between operating systems.

RAID An architecture which allows multiple small disks to work together to provide increased availability and performance, while appearing to the layers above as a single larger and faster device.

superblock A data structure which stores information about a file system.

system call A request to the operating system kernel.

Unix A multi-tasking, multi-user operating system which has spawned a number of different variants.

user space Memory space where user programs and applications operate.

Appendix A

Performance analysis scripts

The following are the scripts that were used in the performance analysis tests. Excluding IOZone, all made use of the Unix dd utility.

Disk

#!/bin/bash

# 256M * 60 = ~15GB

if [ $# != "1" ]
then
    echo "bad number of arguments"
    exit
fi

# write performance
dd if=/dev/zero of=/dev/$1 bs=256M count=60

# read performance
dd if=/dev/$1 of=/dev/null bs=256M count=60

Read

#!/bin/bash

if [ $# != "1" ]
then
    echo "bad number of arguments"
    exit
fi

BS=256
BSN=$BS"M"
NAME=$(($1*$BS))M

if [ ! -f /mnt/dfs/$NAME ]
then
    echo "file not found"
    exit
fi

dd if=/mnt/dfs/$NAME of=/dev/null bs=$BSN count=$1 2>&1 | tail -n 1 | awk '{ print $1 "," $6 "," $8 "," }'

Write

#!/bin/bash

if [ $# != "1" ]
then
    echo "bad number of arguments"
    exit
fi

BS=256
BSN=$BS"M"
NAME=$(($1*$BS))M

# remove if exists
if [ -a /mnt/dfs/$NAME ]
then
    echo "already exists"
    exit
fi

dd if=/dev/zero of=/mnt/dfs/$NAME bs=$BSN count=$1 2>&1 | tail -n 1 | awk '{ print $1 "," $6 "," $8 "," }'

IOZone

#!/bin/bash

iozone -a

Appendix B

NomadFS current quirks

The current implementation of NomadFS has a number of limitations. This is mostly because of time constraints and concentration of development on other aspects that were deemed more important.

Improved Migrations Chapter 7 showed that the current migration implementation has a large amount of overhead. This is possibly due to the lack of overlapped IO on migration operations, something which could be implemented.

Directory entries The current implementation only supports a limited number of directory entries within a directory. Only a small change is required to rectify this, but it was not deemed important enough at the time of writing.

Requirement for all servers to be available Because of the client network code, NomadFS currently relies on all the servers being available. Again this would require minimal effort to fix.

Truncate The POSIX standard [8] states that a truncate which extends a file should fill the extra space with zeros. Currently NomadFS does not do this, so the extra data is bogus.

cp When using the cp program to copy a file larger than approximately 100MB, strangely only a truncate system call is issued and not the writes. This means that when using cp the output file has the correct size but contains bogus data. The workaround used in testing was cat with IO redirection (cat input > mountpoint/output). This is most probably a bug in NomadFS, but was not seen as important given the time constraints and the available workaround.

rmdir Currently rmdir recursively removes sub-files, which is not what it is meant to do.

Write performance on large files Large files appear to have poor write performance. This is currently believed to be the result of bitmap allocations.

Further analysis NomadFS has only been analysed minimally. The effect of various aspects of its implementation, such as the consistency control, remains unknown. Modifying such features could provide important enhancements.

Appendix C

IOZone benchmark results

Iozone: Performance Test of File I/O Version $Revision: 3.308 $ Compiled for 64 bit mode. Build: linux

Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins Al Slater, Scott Rhine, Mike Wisner, Ken Goss Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR, Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner, Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root.

Run began: Sat Oct 13 22:32:57 2012

Auto Mode Command line used: iozone -a Output is in Kbytes/sec Time Resolution = 0.000001 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size. random random bkwd record stride KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread 64 4 15873 24448 131663 453587 229470 23728 76459 24492 789967 17195 26079 83760 324665 64 8 27255 40452 116366 407457 224480 39628 134230 40846 818885 31279 45809 96392 287156 64 16 42637 66250 141221 471099 222250 65315 207798 64376 546928 57189 84658 112844 303383 64 32 59469 94826 110339 418250 321554 89633 296023 95535 361867 134702 187611 123828 402569 64 64 78242 117641 128511 538156 435202 111485 408077 107727 474430 85114 124286 134499 488236 128 4 15458 24183 117559 553586 125004 23717 78630 24362 341129 15928 24907 104657 386848 128 8 26633 40113 129936 630273 296973 39432 134291 40973 1424795 28425 42711 106844 396271 128 16 42227 62464 129967 511887 338761 62110 214421 66183 1142750 47794 71271 100390 447464 128 32 64389 90385 133059 546819 361576 87369 330625 94739 809997 83718 117765 127105 476031 128 64 82105 108026 135580 442668 474349 111111 410199 119283 386570 169316 206107 140955 453896 128 128 41371 119045 126477 576161 516318 134190 404025 120056 533771 100805 139273 133324 579270 256 4 15121 24162 135954 612584 154122 23827 79308 24484 303676 15372 24297 96240 485678 256 8 26826 40277 135371 602276 200631 39629 137786 40616 372205 27497 41605 93737 486559 256 16 42646 62409 102900 524593 415566 60248 223985 65472 1855098 45753 64385 113325 505812 256 32 43214 87254 100822 545654 402632 84996 305839 92613 1306564 74118 97747 95208 466275 256 64 83907 111351 103147 526652 399337 106842 359615 116472 744204 116675 141751 132777 474937 256 128 100436 122088 101585 450809 547044 123620 420283 135593 729540 186032 227738 134963 490336 256 256 100709 122143 100907 536654 477472 127738 429530 125118 520524 100080 118569 133619 597916 512 4 14176 24025 128132 607353 172505 23443 77611 24354 297855 14989 23862 126480 482060 512 8 26209 39674 131113 613075 258707 38926 134555 40787 440707 26715 40182 121761 527792 512 16 36758 62060 136540 525210 306967 59300 226150 66564 528962 44133 61302 125644 524056 512 32 63913 84087 112452 559560 456347 83469 313144 89793 1999881 67518 87581 114674 466255 512 64 64475 107925 120897 471476 539459 106333 413231 115525 1475116 97525 118437 105326 591951 512 128 100845 117081 102398 567096 536628 117697 451075 125095 1422357 131587 148884 129485 577776 512 256 74148 111304 124421 550523 573456 115135 518239 117324 890404 199213 229069 136462 603767 512 512 98518 115104 117536 525724 598050 117510 458785 106644 524056 96547 119571 129157 570865 1024 4 14427 23770 125430 575285 175585 23209 79024 24097 308706 14853 23759 105308 556861 1024 8 24960 39696 117310 527261 257804 38591 134736 40733 402836 26237 39646 121399 508954 1024 16 39173 59503 126733 504174 332458 58270 210746 65953 458579 42700 60047 116575 503229 1024 32 57002 82340 104045 552918 370869 82935 306044 94299 523214 65611 84279 106912 498556 1024 64 73616 101077 117163 490864 442515 100322 357651 119113 1976691 89744 108278 110775 518477 1024 128 87363 117580 110487 571762 597452 115521 473389 128754 2110750 114016 131180 127827 582146 1024 256 102780 113151 109846 521372 505182 116206 499774 122051 1238168 133281 151906 119222 527520 1024 512 99881 112975 104032 490136 479145 114505 451303 113714 653445 194194 237977 117256 
511500 1024 1024 100679 118137 104873 493742 530059 113136 505897 115252 485317 85255 115354 131756 497516 2048 4 14513 23577 110858 509854 165347 23121 78320 24095 275824 14521 23534 122231 566663 2048 8 25298 39437 123144 520324 268806 38367 136052 40627 381744 25425 38991 115897 552165 2048 16 40308 58998 122496 557722 312670 57012 206912 65746 436773 40705 59173 120577 525802 2048 32 59816 82703 111407 552912 389569 82884 326119 85808 474403 60324 82683 107067 535004 2048 64 78039 103262 118808 531265 410749 104415 373105 118916 534007 80935 104102 134153 603445 2048 128 90965 115653 114869 488069 462500 115301 414396 122101 2778290 106164 120493 141877 555594 2048 256 91273 112372 124409 585631 578220 114656 456139 115140 2270183 112059 130812 139470 517440 2048 512 91366 113012 114241 486576 461159 107338 469272 108508 1295458 130572 149609 141837 531693 2048 1024 92817 111675 112806 483346 485422 105074 470042 107698 681048 187084 211612 141760 456430

68 2048 2048 89335 115185 129677 487986 501344 110368 505414 108061 482314 89915 111833 140911 535872 4096 4 13293 23469 126517 589850 169131 23026 76836 24153 277153 13272 23208 202140 536337 4096 8 23419 39127 123414 612520 248480 38166 135646 40600 382882 23323 38418 210352 517687 4096 16 37878 58891 123213 576412 338294 57971 224685 62801 440240 37968 57845 214617 581820 4096 32 56728 81652 113818 520132 379331 74462 295375 88752 477725 57385 82693 223083 553371 4096 64 76218 100983 115114 539097 466567 102253 417201 109536 482556 73970 94735 197076 579094 4096 128 91908 113149 123157 551736 463771 110529 441848 125068 533821 92125 112200 233737 589344 4096 256 90687 111258 120474 543154 511933 107658 461107 113284 3016231 90921 114037 236491 535351 4096 512 87407 112750 129003 518609 509776 106048 485652 108733 1927548 102101 117799 226788 512253 4096 1024 88552 117018 121090 514447 499562 107230 476334 107942 1045970 119808 139624 236735 515357 4096 2048 90010 108722 127534 498967 489722 106204 505367 106475 695659 164339 209406 238696 475293 4096 4096 89547 111759 129670 474846 465077 106833 467303 106575 470696 88863 108426 236680 492488 8192 4 12719 23456 121754 581857 169256 22898 76340 24050 279513 12654 23001 324989 556633 8192 8 22391 38859 123692 579922 230683 37872 132952 40289 359675 22306 38057 314823 552169 8192 16 36358 58356 122495 556029 322179 55852 223129 61260 442211 36465 57314 324228 512412 8192 32 54985 81476 126986 552320 395403 74051 328336 84948 514569 55449 81790 348356 575041 8192 64 73643 101274 120591 567387 445023 91929 422489 103728 538091 71366 90830 364703 529647 8192 128 88394 112018 121503 555167 480859 108326 472053 117113 541910 86813 108309 388800 556110 8192 256 87498 113446 122748 509638 488433 108111 449759 111872 576691 87234 109049 382911 574637 8192 512 87426 110842 122418 519242 494357 105575 478755 107975 2531513 87658 110029 363572 503566 8192 1024 87831 110411 122175 481878 459577 106042 469242 108112 1360576 96757 119638 361552 473223 8192 2048 86285 110048 120949 494300 482819 105741 475173 106148 891307 110334 137625 360135 508951 8192 4096 86288 111829 116360 476663 480100 105852 482710 106743 694363 163372 199687 366876 494542 8192 8192 85762 111272 124684 508883 505618 104350 535208 105318 515387 85691 106611 376483 516223 16384 4 12412 23440 121910 624275 168966 22903 76123 24103 286173 12328 22973 453990 591757 16384 8 21795 38831 125628 514183 242790 37871 130748 40529 337807 21844 37982 481414 609686 16384 16 35286 58296 117174 533123 314931 55805 214085 61963 406098 35653 57031 485510 536444 16384 32 52916 81183 121015 553940 396343 74992 329651 81792 497917 50604 75496 462954 584474 16384 64 71550 101572 122257 550055 444867 93151 423077 105045 562542 68389 90976 450186 514314 16384 128 85110 114198 128758 475410 426056 107333 460290 122380 512897 84484 107910 483304 539480 16384 256 84022 111003 121440 559810 481246 108094 456314 114713 552488 82303 106292 479022 529202 16384 512 82789 109614 121116 555898 500000 100255 459721 108883 558019 82352 105416 417501 422398 16384 1024 82766 112122 115485 479302 460963 105571 488637 107021 1696952 85542 109211 460926 497132 16384 2048 83544 111029 122114 494029 461573 96993 440109 105049 1052481 93499 116734 447541 450009 16384 4096 84495 110131 118921 509199 478882 107099 488387 106308 902353 109633 137890 467140 492708 16384 8192 83431 112310 121235 480244 475216 105856 502207 107236 735362 157372 201830 526223 543452 16384 16384 83216 111073 121997 491524 500623 104389 512194 
105256 536075 83388 106036 534691 499618 32768 64 67343 100178 126671 584809 446636 92793 419553 106220 522224 65447 90614 454644 533636 32768 128 79454 112082 123963 563110 447241 110377 412491 122005 538769 78760 107344 494097 596376 32768 256 78820 109542 122756 553569 506601 107499 496297 115177 554881 77878 105030 468756 565550 32768 512 78391 109778 121872 561624 469233 104632 473048 106115 564702 79045 104779 460866 514450 32768 1024 78030 111045 121688 535547 481436 106483 469705 107344 529840 80072 106793 426261 471767 32768 2048 77905 108960 120447 463505 457820 105735 461067 107204 1194385 82304 109896 426311 482472 32768 4096 78250 110181 122097 511151 488113 104917 490039 106624 1075381 88953 117084 458023 477381 32768 8192 77867 108954 121992 513347 492620 105339 489418 106038 895573 105055 136752 475394 505086 32768 16384 78307 109101 118410 474349 466263 105462 491304 106176 697072 155459 199743 445890 470330 65536 64 61815 100508 124251 619023 430209 92639 362463 103746 502630 60206 93256 594473 604012 65536 128 72977 113804 124496 550740 456560 109440 483215 121608 549116 72084 104822 491164 517032 65536 256 72297 114995 124002 524435 458483 106329 449742 112553 513983 71492 104626 575327 539062 65536 512 72195 112451 117946 509936 481090 104568 466063 108367 475291 70967 103522 457640 500201 65536 1024 72395 109138 121554 515030 460486 104586 482773 105655 513503 71536 103639 436496 485164 65536 2048 71918 113046 117997 498100 462312 105688 459414 106528 530295 72715 105788 464638 481054 65536 4096 71988 111166 119391 470132 467359 105039 448333 106818 1179830 75701 107620 458489 480511 65536 8192 71990 111637 121961 470476 456216 105341 481259 106523 1076227 81816 116468 450573 483204 65536 16384 72013 112301 119950 447501 446235 105914 457397 106286 921238 97778 136863 442885 454769 131072 64 62608 102032 124361 540295 409653 92720 400906 100378 494727 59970 93298 555547 539022 131072 128 73514 115463 122447 520003 440409 110737 436688 120917 541052 72000 106601 547566 542925 131072 256 73106 113708 123186 523751 482721 106389 447797 114597 531540 70726 104531 531019 518562 131072 512 72449 111490 123385 539615 460596 105225 452882 107173 451003 69848 103703 501974 465585 131072 1024 72572 111602 117223 483571 461309 105324 428343 106614 505805 70857 103735 449621 444141 131072 2048 72235 111127 121825 481420 448621 104130 441903 107019 498791 71391 103373 477078 477417 131072 4096 72431 111107 121995 490175 480543 105544 513766 106163 517059 72394 106135 484459 484454 131072 8192 72436 111372 121655 452495 443345 103956 474297 104995 1195932 75511 109687 475475 491430 131072 16384 72632 109680 121553 490034 473397 104480 472413 104947 1102455 80986 118194 467196 475477 262144 64 61819 53030 117538 520919 393788 87648 366204 100682 497861 34989 53987 506800 517143 262144 128 72841 68412 112641 521553 448142 102791 391324 114775 490662 70560 39561 520595 507967 262144 256 72332 27457 119284 492242 456372 96249 446505 111655 517170 61559 39654 501338 503374 262144 512 71242 64020 115794 474945 453210 94242 414175 103089 460331 69090 43769 471553 455508 262144 1024 71434 27243 115662 457986 435668 101484 445137 106668 467815 63977 99448 435845 448924 262144 2048 71678 39203 116302 462107 444406 95560 445766 104790 463703 42866 22599 443329 437690 262144 4096 71423 27287 116941 430446 432333 100640 433699 104375 447996 61720 99490 424664 440340 262144 8192 71528 27042 116071 442578 446901 100350 437877 105313 500888 63482 101515 437386 431280 262144 16384 71832 27112 
115994 443431 430172 71450 454806 101116 1165690 36170 109702 484903 483988 524288 64 46987 22304 89192 65550 178697 42772 146114 94379 75663 46648 22381 153267 51617 524288 128 49591 22524 40571 39097 71789 46053 66913 117223 70951 49167 22037 93428 30691 524288 256 49728 22305 20126 39600 46768 29820 60471 117002 60394 42372 21902 35554 36507 524288 512 46535 19257 17266 33821 54815 20326 36686 103623 49398 31211 16337 42776 52344 524288 1024 28915 17654 21405 29194 59740 19381 51652 103468 58380 21263 16954 40050 52923 524288 2048 21782 17720 20799 27720 53783 17229 63842 112942 59567 21539 20033 40226 41301 524288 4096 21962 20167 18738 25604 51788 18968 57330 103052 55544 23242 21372 40397 41320 524288 8192 24352 21598 20104 24220 39251 20522 47266 111118 50411 24693 22283 47210 28965 524288 16384 23824 22091 20062 22929 55609 22062 66646 111399 59021 22369 22754 38762 40312 iozone test complete.

Appendix D

Configuration file format for NomadFS

, , ,, ,, ...

Appendix E

NomadFS source code listing

The source code for NomadFS may be found at: http://www.github.com/samweston/nomadfs/
