Exploration and Integration of File Systems in LlamaOS

A thesis submitted to the

Division of Research and Advanced Studies of the University of Cincinnati

in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

in the School of Electrical and Computing Systems of the College of Engineering and Applied Science

November 12, 2014

by

Kyle Craig BS Computer Engineering, University of Cincinnati, 2013

Thesis Advisor and Committee Chair: Dr. Philip A. Wilsey

Abstract

LlamaOS is a minimalist and customizable operating system for Xen-based virtual machines running on parallel cluster hardware. With the addition of LlamaMPI, a custom implementation of the standard Message Passing Interface (MPI), LlamaOS is able to achieve significant performance gains for fine-grained parallel applications. However, the initial setup of LlamaOS lacked full file system support, which restricted its use to a limited set of applications. Initially, LlamaOS could only support a single read-only file for each instance in the parallel virtual machine.

The original design of LlamaOS was motivated by work in parallel discrete-event simulation. However, the single-file file system of LlamaOS had significant drawbacks. Some simulation models need multiple configuration files to run properly, and to make these models run on LlamaOS the parallel simulator being evaluated had to be changed so that only a single configuration file was needed.

This went against one of the base principles of LlamaOS: outside applications can be run with little to no modification. Another major drawback was that the models could not create and write their own results file, limiting the amount of data that could be gathered from running in LlamaOS. To alleviate these issues, this thesis explores the implementation of a bare-bones Virtual File System (VFS) and, consequently, support for a Second Extended File System (Ext2) driver has been added to LlamaOS.

In order to test the functionality and performance of the new LlamaOS file system, the bonnie benchmark (a common tool used for benchmarking various types of file I/O at a basic system level) was ported to run within LlamaOS. The benchmark showed promising results, and verified that the file system implementation was functional and able to perform file reads, file writes, and file creation. In addition to the bonnie benchmark, a couple of parallel simulation models were also used to verify that the file system implementation achieved its goal of efficiently supporting parallel simulation within LlamaOS with little to no modification to the original simulation kernel.


Acknowledgments

I would like to thank my family for supporting me throughout this endeavour. I would like to thank Andrea Giovine for putting up with the late nights and aloof behavior I exhibited while working. I would like to thank Eric Carver and Josh Hay for being great friends that I was able to go through this experience together with. Finally, I would like to thank Dr. Wilsey for being a great advisor and for just putting up with me in general.

Support for this work was provided in part by the National Science Foundation under grant CNS-0915337.

Contents

1 Introduction
1.1 Motivation
1.2 Thesis Overview
2 Background
2.1 Virtualization
2.2 Xen Hypervisor
2.3 LlamaOS
2.4 File Systems
2.4.1 The Second Extended File System
2.5 WARPED
2.5.1 Discrete Event Simulation
2.5.2 Parallel Discrete Event Simulation
2.5.3 Time Warp Mechanism
3 Related Work
3.1 The CROCOS Kernel
3.2 Time Warp Operating System
4 Overview of Approach
4.1 File System Choice
4.1.1 The FAT File System
4.1.2 The MINIX File System
4.1.3 Final Comparison
4.2 File System Benchmark Research
5 Implementation Details
5.1 File System
5.1.1 Virtual File System
5.1.2 Ext2 Driver
5.2 Benchmark Modifications
6 Performance Results
6.1 File System Benchmarks
6.2 WARPED Tests
7 Conclusion and Future Work
7.1 Summary
7.2 Suggestions for Future Work
7.2.1 LlamaOS
7.2.2 File System Implementation

List of Figures

2.1 Type-1 Hypervisor
2.2 Type-2 Hypervisor
2.3 Comparison of Native and Virtual Linux to LlamaOS [29]
2.4 LlamaNET block diagram [29]
2.5 I-node direct block mapping [19]
2.6 I-node indirect block mapping [19]
2.7 Virtual File System indirection [12]
2.8 Physical structure of the Ext2 file system
2.9 Translation of bitmap to disk blocks [7]
2.10 PDES Illustration [17]
3.1 Structure of the Time Warp Operating System [23]
4.1 Layout of the FAT12/16 File System [5]
4.2 Layout of the MINIX File System [41]
5.1 of the fs open function
6.1 Bar Chart Demonstrating Increase in Speed with File Size
6.2 Comparison of Native and LlamaOS Simulation Times

List of Tables

2.1 Xen para-virtualized x86 interface [8]
4.1 Comparison of FAT, MINIX, and Ext2 file systems
6.1 Test System Specifications
6.2 Averages of bonnie results by file size
6.3 Native bonnie results for test system
6.4 Comparison between LlamaOS and Native for epidemic model

Chapter 1

Introduction

In recent years significant improvements have been made in the field of virtualization [6, 32]. The x86 architecture, which has historically not supported hardware virtualization, has now been extended. Instead of relying on paravirtualization [32] or intricate binary translation from Virtual Machine Monitors (VMMs) such as VMware Workstation [40], virtualization in hardware is now directly supported by all major x86

CPU manufacturers [6].

One field able to take advantage of hardware virtualization is High-Performance Computing (HPC). In general, HPC users do not care about the time multiplexing that hardware virtualization provides. They are more focused on pushing the limits of CPU performance and memory size. Because they are trying to push these limits, HPC tends to require a significant amount of operating system customization.

Most HPC applications provide extremely different workloads than those provided by typical servers or workstations [32]. Virtualization is able to aid with this operating system customization, and now that

VMMs have easier direct hardware access, their full potential can be explored.

Virtualization allows for the quick creation and destruction of multiple operating system instances on a single machine. This significantly decreases the cost of HPC by removing the need to purchase expensive data clusters. Virtual clusters can be created, and each individual virtual machine can run a specialized operating system to better utilize the hardware running underneath [32]. One such attempt at a specialized operating system is the Low-LAtency Minimal Appliance Operating System (or LlamaOS) [29] being developed at the University of Cincinnati. LlamaOS works by launching a virtualized cluster of LlamaOS virtual machines. Communication between nodes within a LlamaOS cluster is achieved by LlamaNET. LlamaNET was developed as a high performance, lightweight communication protocol to decrease the latency between the machines in a cluster and boost I/O and parallel computation performance.

Testing showed that LlamaNET was a massive success, and that it could reduce latency by an order of magnitude over native networking applications for message sizes below 1KB [20]. This success motivated further improvements to LlamaOS and the exploration of different avenues to decrease parallel simulation runtime; however, before further performance improvements could be made, additional modules needed to be added to the core of the LlamaOS solution. The first of these improvements was the addition of a virtual file system [12] and a driver to support the second extended file system (Ext2) [35].

1.1 Motivation

Although LlamaOS has the capability to show significant one-way message passing latency decreases with the use of LlamaNET [29], several of its areas remain underdeveloped. The first of these areas is file system I/O. Until 2013 LlamaOS lacked the capability for applications to have their own specific configuration files. This was largely an issue for the more complex WARPED [30] simulation models that required a configuration file detailing many different running parameters. A quick fix was created to solve this issue that involved the implementation of support for a single-file file system. This was a flat file system that contained no directory structure and was only able to contain a single configuration file. The current limitations of the file system are as follows [20]:

• The file needs to be statically declared and loaded into memory at the start of the instance.

• The file is read-only and new files cannot be created.

• The file system does not contain directories or any hierarchical structure.

• The file needs to be padded to 512 byte segments.

These limitations hinder the ability of LlamaOS to perform in-depth testing of simulation models that require even the most basic file system operations.


To overcome these limitations, this thesis explores the expansion of LlamaOS to incorporate a Virtual File System (VFS) and a second extended file system (Ext2) driver. These changes allow LlamaOS to perform basic yet necessary operations on a single disk formatted with the Ext2 file system. These abilities include, but are not limited to: (i) the ability to read more than one file and to write data back to the file system, and (ii) the ability to read the data from any operating system that has Ext2 file system support. LlamaOS is also now able to run any simulation models that require multiple configuration files.

Simulation models from the WARPED parallel discrete event simulator [30] will be used to demonstrate this new capability.

1.2 Thesis Overview

The remainder of this thesis is organized as follows:

Chapter 2 provides various background information on different software, systems, and concepts that are related to this research. These topics include virtualization (with a specific focus on the Xen hypervisor),

LlamaOS, basic file system concepts (with specific focus on the second extended file system), and the many concepts behind the WARPED parallel discrete event simulator.

Chapter 3 contains information about work related to file system driver support in microkernels and other custom operating system environments. It takes a specific look at the CROCOS read-only implementation of the second extended file system which was used as a backbone for the LlamaOS file system implementation. It also examines other research-related custom operating systems for parallel discrete event simulation with a specific focus on the Time Warp Operating System [23].

Chapter 4 discusses the approach taken to select a file system to implement in LlamaOS. It goes in depth into the merits of the second extended file system over various other candidates such as the MINIX file system and the variants of the FAT file system. This chapter also reviews the approach taken to learn about file system benchmarking, the controversies in the field of file system benchmarking, and how the proper way to benchmark a file system within the LlamaOS environment was determined.

Chapter 5 contains a detailed discussion of the implementation of the second extended file system driver and virtual file system used in LlamaOS. It also discusses how the chosen file system benchmarking tool was ported over to be used in the LlamaOS environment.


Chapter 6 discusses the results of all tests conducted on the latest version of LlamaOS. This includes the results of the file system benchmarking tool and the results of the WARPED simulations being run in the new

LlamaOS environment. It also contains a comparison to native implementations of both.

Chapter 7 contains some concluding remarks and suggestions for future research. It gives a brief summary of all previous chapters before discussing the implications of the test results. It also provides suggestions for future improvements to the LlamaOS file system and other parts of the operating system.

Chapter 2

Background

This section provides an in-depth look at many of the different subjects involved in the implementation and testing of the LlamaOS file system. The subjects discussed in this section start with the basics of the LlamaOS platform, virtualization, and the Xen hypervisor, then move on to an explanation and brief history of the second extended file system, before closing with an overview of parallel discrete event simulation and the WARPED system, which was used to test the latest version of LlamaOS including file system support.

2.1 Virtualization

Virtualization is the act of running multiple operating systems on a single physical machine [32]. This allows hardware to be sectioned off and partitioned appropriately to support the needs of each operating system; however, to achieve resource sharing between guest operating systems, a hypervisor or virtual machine monitor (VMM) needs to be used [32]. Popek and Goldberg [36] argued that a successful VMM requires three characteristics.

The first characteristic is that the VMM needs to provide an “essentially identical” environment for the guest operating systems to run on. This means that each operating system should not be able to tell that it is not running directly on the hardware [36].

The second characteristic is that the VMM needs to be efficient. To achieve a high level of efficiency, the VMM needs to run a majority of the virtual processor’s instructions on the physical processor without software intervention [36].


Figure 2.1: Type-1 Hypervisor

The third and final characteristic is that the VMM must properly control resource allocation between the guest operating systems. Appropriate resource control is attained when a guest operating system cannot access any resources not allocated to it, and when the VMM can regain control of allocated resources as necessary [36].

These three characteristics eventually evolved into what is now known as a Type-1 hypervisor. Later in the paper, the authors discuss a second type of VMM for which they loosen their strict characteristics. They refer to this type of VMM as a Hybrid Virtual Machine (HVM), but it has now come to be known as a Type-2 hypervisor. Most modern VMMs and hypervisors can be classified as one of these two types [36].

Type-1 (native or bare-metal) hypervisors run directly on top of the hardware layer. All operating systems (including the host operating system) run through them. Running directly on the hardware allows the hypervisor to easily interface with system resources to satisfy the three characteristics outlined above. The Xen [8] hypervisor used to run LlamaOS is a good example of a Type-1 hypervisor. Figure 2.1 shows an illustration of a Type-1 hypervisor.

Type-2 hypervisors are typically run on top of a host operating system. They do not have direct access to the hardware and as a result their instruction sets are limited [11]. This occasionally impacts performance


Figure 2.2: Type-2 Hypervisor

when compared to a Type-1 hypervisor [36]. The main use for Type-2 hypervisors is software compatibility of user applications. A good example of a Type-2 hypervisor is VirtualBox [4]. VirtualBox uses software indirection to fix faults in the guest operating systems. When a guest OS sends an instruction to the hardware,

VirtualBox takes the instruction and re-routes it to where it actually needs to go. This allows the guest OS to continue without knowing that it is not directly on the hardware [4]. This differs from a Type-1 hypervisor which attempts to allow as many instructions as possible without relying on software indirection. Due to this software indirection some low-level instructions are unable to be executed within Type-2 hypervisors; however, for operating systems with little to no I/O, both types of hypervisors will run processes with little to no difference from a native operating system [32]. Figure 2.2 contains an illustration of a Type-2 hypervisor.

Although guest operating systems with little to no I/O will function similarly to a native operating system regardless of whether the instruction set is x86 or x64, virtualization does incur some costs [8, 36].

These costs typically occur at the I/O level when a guest OS attempts to communicate with the hardware.

Depending on the type of virtualization being used, there can be a significant increase in latency. The first type of virtualization, full virtualization, has been somewhat discussed in the Type-2 hypervisor description.

Full virtualization is the act of keeping the guest OS completely blind to the fact that it is running in a virtual environment. This is achieved by building a hypervisor that can provide full emulation for the hardware devices used by the guest OS. This causes the worst latency whether the guest OS is running on a Type-1 or Type-2 hypervisor. There have been successful measures used to decrease the latency in I/O when running in a virtual environment. Recent processors contain hardware features to allow for easier virtualization, and software para-virtualization allows for the guest OS to be aware of the fact that it is running in a virtual environment. With para-virtualization the guest OS can communicate with the hypervisor and even request direct access to certain parts of the hardware. The Xen hypervisor uses para-virtualization [32]. Another measure introduced to reduce virtualization latency is pre-virtualization. In pre-virtualization a guest OS has its code semi-automatically annotated to adapt to the specific hypervisor it is running in and remain compatible with the hardware [32].

2.2 Xen Hypervisor

The Xen hypervisor is a Type-1 hypervisor. It is installed directly onto the hardware and can run multiple guest operating systems called domains. The host or primary operating system is named dom0, and all other guest operating systems are referred to as domU [13]. Once running inside dom0 a user is then able to use the Xen toolchain to create and destroy multiple domUs. This is useful for LlamaOS because it allows for the quick execution of a single LlamaOS application before terminating the domU.

Xen also provides various other features that allow it to be a solid hypervisor choice for this project.

Being a Type-1 hypervisor, Xen allows LlamaOS to have direct access to the hardware which ultimately decreases the latency of any I/O functions. This is called PCI-passthrough (or direct connect). Xen is also a para-virtualized hypervisor that allows each domain to communicate directly with it using a hypercall [8].

Hypercalls can be used to request page table updates or for the use of certain Xen specific tools such as the XenStore which allows memory to be shared between multiple domains [14]. This is useful for creating virtual clusters and allowing them to communicate with each other. All communications with the hypervisor are processed as asynchronous events [8]. Table 2.1 overviews the various aspects of the para-virtualized x86 interface of the Xen hypervisor, including memory management, CPU properties, and device I/O.

Although dom0 is the only domain with unchecked access to physical disks, Xen provides various ways for a guest OS to gain access to and use a file system through the use of Virtual Block Devices (VBDs) [8].


Memory Management
  Segmentation: Cannot install fully-privileged segment descriptors and cannot overlap with the top end of the linear address space.
  Paging: Guest OS has direct read access to hardware page tables, but updates are batched and validated by the hypervisor. A domain may be allocated discontiguous machine pages.

CPU
  Protection: Guest OS must run at a lower privilege level than Xen.
  Exceptions: Guest OS must register a descriptor table for exception handlers with Xen. Aside from page faults, the handlers remain the same.
  System Calls: Guest OS may install a 'fast' handler for system calls, allowing direct calls from an application into its guest OS and avoiding indirecting through Xen on every call.
  Interrupts: Hardware interrupts are replaced with a lightweight event system.
  Time: Each guest OS has a timer interface and is aware of both 'real' and 'virtual' time.

Device I/O
  Network, Disk, etc.: Virtual devices are elegant and simple to access. Data is transferred using asynchronous I/O rings. An event mechanism replaces hardware interrupts for notifications.

Table 2.1: Xen para-virtualized x86 interface [8]

If the dom0 is a Linux distribution, the Linux Logical Volume Manager (LVM) can be used to partition a portion of a physical disk which can then be pointed to in the Xen configuration [13]. This causes all domUs to load with access to that specific VBD; however, due to the nature of LlamaOS a different method needed to be used.

Stand-alone files (including disk images) are treated as VBDs and are also able to be loaded into the running memory of any domU through that domU's configuration file. Using this method almost all I/O overhead can be overcome, but the resource usage of each domU is significantly increased. Upon booting up, dom0 is automatically given access to all memory in the system. As each domU is spawned, dom0 decides how much memory can be allocated to it [13]. This number can also be set before starting a domU. This ties the usable file system size to the number of guest operating systems being launched.
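To make this setup concrete, the following is an illustrative xl-style domU configuration sketch; the names and paths are hypothetical placeholders, not the actual configuration used for LlamaOS.

```
# Hypothetical domU configuration; names and paths are placeholders.
name   = "llamaos-app-0"
kernel = "/opt/llamaos/llamaos.gz"      # the LlamaOS appliance image
memory = 512                            # MB granted to this domU by dom0
vcpus  = 1
# A stand-alone Ext2 disk image exposed to the guest as a Virtual Block
# Device; it appears inside the domU as xvda with write access.
disk   = [ 'file:/var/lib/xen/images/llamaos-ext2.img,xvda,w' ]
```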

2.3 LlamaOS

The Low-LAtency Minimal Appliance Operating System (LlamaOS) is designed to utilize virtualization to provide a cheap alternative for high performance computing (HPC). HPC traditionally requires a significant amount of hardware, and it can be quite expensive to create large clusters of specialized platforms [29]. Beowulf clusters [39] are typically used as a low-cost alternative to the typical infrastructure required in HPC.

A Beowulf cluster is created using open source software and cheap hardware. Most Beowulf clusters run standard Unix-based operating systems. This allows the cluster to provide a basic level of support for both parallel and sequential computations, but leaves significant room for improvement. For instance, to fully optimize for HPC the standard Linux operating system would need to be customized to some degree.

These customizations typically require frequent configuration changes that can be difficult to deploy due to the complexity of the Linux operating system [29].

LlamaOS combines the properties of virtual appliances [25], microkernels [27], and exokernels [16] to mitigate the cost of configuring the operating system. The first property, borrowed from virtual appliances, is the ability to have application code directly built into the OS image. This allows application configuration to occur before the OS image has been built instead of after the OS is already running [25]. The second property, borrowed from microkernels, is the ability to run using minimal system services. Microkernels are built to run with a single task and any services not needed to accomplish these tasks are not included [27].

Although LlamaOS is able to accomplish a variety of tasks, it is still able to ensure that it only includes services that are necessary for the task at hand. It does this with the third property. Exokernels are designed to have nearly all system services be optional [16]. Taking this concept, LlamaOS can be configured to run with only the specific set of system services that are needed to accomplish the tasks of the application it has been configured to run [29]. Figure 2.3 [29] illustrates the minimal running conditions of LlamaOS compared to those of Linux.

This minimalism allows a single physical machine to launch several LlamaOS nodes that each have a different role. One node may be customized as a primary computation node while another is configured to focus on external networking, and the resources required of each virtual machine can be significantly reduced because the OS image of each node was configured to only include the system services and third party packages necessary to accomplish its task [20]. With each node working on a small yet separate task, there needs to be a way for them to communicate with one another. LlamaOS utilizes hypervisor calls to accomplish this communication.


Figure 2.3: Comparison of Native and Virtual Linux to LlamaOS [29]

The current implementation of LlamaOS is built to run specifically on the Xen hypervisor. Xen is open-source, well documented, and as discussed in Section 2.2, Xen is a Type-1 hypervisor that contains hypercalls which allow LlamaOS to communicate with it directly. This allows LlamaOS access to shared memory and the XenStore. It also allows for control of core system resources to optimize application latency [29]. Although LlamaOS is currently built for Xen, many of the core concepts behind the project are hypervisor independent, and LlamaOS could theoretically be re-configured to work with a different hypervisor [29].

Even though LlamaOS is as minimal as possible, it still has a number of standard components embedded in it [29]. The first of these components is the hypervisor interface. This is necessary for LlamaOS to interact with other nodes running LlamaOS, as well as to interface with any hardware devices that require communication through the hypervisor, such as network components and I/O devices [29]. Other components include standard C/C++ GNU tools such as partial ports of glibc-2.17.0 and gcc-4.7.1. These are included to provide a modern software engineering environment for future application developers on the platform, and add a layer of abstraction between the LlamaOS source and the virtual applications it can run [29].


Figure 2.4: LlamaNET block diagram [29]

LlamaOS also contains a high performance networking API designed for low-latency communication, known as LlamaNET [29]. LlamaNET is Ethernet based and provides a mechanism in which no intermediary buffers are involved in the message passing algorithms, similar to the concept of remote direct memory access (RDMA) [28, 29]. LlamaNET works by designating an instance of LlamaOS as the driver instance. The driver instance is tasked with taking control of a network interface adapter and creating shared memory for the other LlamaOS instances to use as communication buffers. Once the communication buffers have been established, the driver instance focuses on keeping information flowing between the shared buffers and the hardware. The rest of the LlamaOS instances are designated as application instances. Application instances are tasked with running whatever application logic they are built with. They all communicate with each other directly through the shared memory interface [29]. Figure 2.4 [29] provides an illustration of the LlamaNET system.

In preliminary testing, LlamaNET provided a 70% reduction in network latency over native TCP for message sizes less than 8 kilobytes [29]. LlamaNET is just one example of what the LlamaOS virtual environment makes possible. Further improvements to the core of the operating system, such as the ones implemented and discussed throughout the course of this thesis, are still under development.


2.4 File Systems

A file system is a specialized driver that provides an abstraction layer between the operating system and physical permanent storage such as disk drives. File systems provide tools that allow the operating system to create, modify, store, and retrieve data [19]. There are many different types of file systems to fit the needs of their operating systems. For example, a file system being used within a non-user-facing, high-end mainframe could prioritize throughput by increasing the number of transactions able to be performed on the storage in a shorter amount of time, but be unorganized from the perspective of a user trying to locate a file within the system [19]. However, even with differences in some features, all file systems follow a basic structure. The following terms will be used extensively throughout this paper when discussing file systems [19]:

• Disk: The physical devices which contain all files. Disks are made up of a series of sectors or blocks of equal size, and may contain multiple partitions.

• Partition: Partitions are subsets of the blocks on a disk.

• Block: A block is the smallest unit that can be written to a disk or partition. File systems are comprised of a series of operations being performed on the blocks of a disk. The typical block sizes of most modern disks are 512 and 1024 bytes.

• Volume: A volume is a partition that has been initialized with a specific file system structure. A single disk may contain multiple volumes with differing file systems.

• Superblock: The superblock is a special type of block which contains all the information of a file system stored on a volume. This typically includes information such as how large the volume is, the name of the volume, and the number of free blocks. It may also contain other types of information about the volume.

• Metadata: Metadata refers to information about an object that is not stored within that object. An example of this is file size being stored in an i-node instead of in the file.

• I-node: An i-node is where a file stores all metadata pertaining to itself. An i-node may also point to a series of other i-nodes representing a directory.


Figure 2.5: I-node direct block mapping [19]

• Journaling: Journaling is a method of ensuring that the metadata of a file system remains correct even in the event of a system failure.

There are two fundamental concepts involved in every file system. The first fundamental concept is that of a file. A file is a piece of data that can be given a name when stored so that it can be retrieved later using that same name. It is made up of a collection of bytes which the user wants to store. Each file is assigned an i-node which keeps track of all the metadata involved with that file. This metadata contains the information of how large the file is and the information of how to obtain the blocks which contain the data of the file. It often contains more information, but these are the minimum. Different file systems store different metadata in i-nodes.

The actual bytes of the file are stored in a series of blocks within the volume. Depending on memory usage these blocks may be located beside each other or they could be scattered throughout the volume. The way new blocks are allocated to a file is based on the implementation of the file system. Due to this fact the i-node associated with a file needs to contain a block map which points to the physical address of the data blocks within the volume as illustrated in Figure 2.5 [19].


Figure 2.6: I-node indirect block mapping [19]

Figure 2.5 is an example of a specific case of block mapping known as direct mapping. Direct blocks have their information stored directly in the data map of the i-node; however, due to the size limitations of an i-node, most file system implementations also use a method of indirect block mapping. Once out of direct blocks, an i-node points to an indirect block, which can reference up to one block's worth of additional block numbers where the file's data is stored [19]. Figure 2.6 contains an illustration of an indirect block.
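To make the mapping concrete, the sketch below resolves a logical block index within a file to a physical block number. It assumes a simplified i-node with twelve direct pointers and one singly-indirect pointer, and an in-memory byte vector standing in for the volume; the structure and names are illustrative, not the LlamaOS data structures.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Simplified i-node: twelve direct block pointers plus one singly-indirect
// pointer, loosely modeled on the classic Unix/Ext2 layout described above.
struct Inode {
    uint32_t size;             // file size in bytes (metadata)
    uint32_t direct[12];       // physical block numbers of the first 12 blocks
    uint32_t single_indirect;  // block whose contents are further block numbers
};

// Translate a logical block index within a file to a physical block number.
// 'volume' stands in for the disk: a flat array of fixed-size blocks.
uint32_t block_for(const std::vector<uint8_t>& volume, uint32_t block_size,
                   const Inode& ino, uint32_t logical) {
    if (logical < 12)
        return ino.direct[logical];          // direct mapping (Figure 2.5)

    // Indirect mapping (Figure 2.6): the indirect block is an array of 32-bit
    // block numbers, so it covers block_size / 4 additional logical blocks.
    const uint32_t per_indirect = block_size / sizeof(uint32_t);
    const uint32_t idx = logical - 12;
    if (idx >= per_indirect)
        return 0;                            // would require double indirection
    const uint8_t* indirect = volume.data() + ino.single_indirect * block_size;
    uint32_t entry = 0;
    std::memcpy(&entry, indirect + idx * sizeof(uint32_t), sizeof(entry));
    return entry;
}
```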

The second fundamental concept of file systems is a directory. Directories allow a user to name and organize multiple files. Most file systems implement directories as an i-node which contains a list of other i-nodes. This creates a hierarchical structure that is easy to navigate both forwards and backwards [19]. There are also flat file systems that consist of only a single directory.

File systems also contain a set of common operations. The basic operations included in nearly all file systems are the abilities to mount, unmount, read data, and write data [19]. Mounting is the act of accessing the disk to read all information about the volume from the superblock. The metadata contained in the superblock is then used to initialize other features of the file system. Different types of error checking also occur during this step. The most common type of error checking is a consistency check to make sure that the volume was unmounted properly the last time it was used [19].

Unmounting a volume involves storing all modified metadata and flushing any data still stored in memory to the disk. Once the data has been flushed, a check is performed to make sure that the volume is in proper shape and that all reads and writes to the disk have finished. The superblock is then updated to note that the volume was unmounted properly. At this point no operations should be run on the disk until it has been mounted again [19].


Reading and writing data are both the process of taking a file or directory name and converting it to block locations. This involves tracing metadata through i-nodes until the proper blocks are found. Then depending on the operation the data is either retrieved into memory or written to the device. In the case of a write, the metadata of the file and volume will also need to be updated to maintain consistency.

Initially most operating systems were designed to support a single file system. This was true of both the

Minix operating system [41] and early versions of Linux [12]. Eventually file systems evolved to a point where it was difficult to keep the file system driver within the OS up-to-date. This was when the Virtual File System (VFS) [12] was developed. A VFS acts as an indirection layer between the operating system and the file system drivers. Instead of issuing a file system call directly to that file system's driver, the kernel calls a function within the VFS. The VFS then re-directs the system call to the appropriate file system driver. The file system driver then continues to perform the actual I/O operations. Figure 2.7 [12] illustrates this concept within the Linux kernel.
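A minimal sketch of this indirection follows. The operations table and function names are hypothetical and do not correspond to the real Linux or LlamaOS VFS structures; the point is only that generic calls are forwarded through a table of function pointers registered by the mounted file system's driver.

```cpp
#include <cstddef>
#include <string>

// Hypothetical operations table: each file system driver (Ext2, FAT, ...)
// fills one of these in when its volume is mounted.
struct vfs_ops {
    int  (*open)(const std::string& path);
    long (*read)(int fd, void* buf, std::size_t len);
    long (*write)(int fd, const void* buf, std::size_t len);
    int  (*close)(int fd);
};

// The driver registered at mount time (a single mounted volume for simplicity).
static const vfs_ops* mounted_fs = nullptr;

void vfs_mount(const vfs_ops* ops) { mounted_fs = ops; }

// Generic entry points: the kernel calls these, and the VFS simply
// re-directs each call to the driver of the mounted file system.
int vfs_open(const std::string& path) {
    return mounted_fs ? mounted_fs->open(path) : -1;
}

long vfs_read(int fd, void* buf, std::size_t len) {
    return mounted_fs ? mounted_fs->read(fd, buf, len) : -1;
}
```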

With the VFS firmly in place, developers were able to quickly advance the capabilities of the file systems they were working with. One example of this is the evolution of the Extended File System (ext) to the Second

Extended File System (ext2).

2.4.1 The Second Extended File System

As briefly mentioned at the end of Section 2.4, the Second Extended File System (Ext2) was designed as an upgrade to the Extended File System (Ext). Ext2 features significant improvements over its predecessor.

In addition to a reduced risk of data loss, Ext2 provides a significant set of both standard and advanced features.

The standard features of Ext2 include support for all standard Unix file types (regular files, directories, device special files and symbolic links), the ability to manage file systems on partitions up to 4 terabytes (up from the previous upper-bound of 2 gigabytes), the ability to use long file names (255 characters), and the reservation of blocks for the super user to allow for easy recovery from situations where a user process uses all free space within the file system [12].

The more advanced features included in Ext2 are the addition of file attributes which allow users to modify kernel behavior when acting on a set of specific files, new mount options to choose whether file creation is based on BSD [26] or System V Release 4 [21] semantics and to allow metadata to be written synchronously, the ability to choose the logical block size when creating the file system, fast symbolic links, and multiple ways to continuously track and check the state of the file system to prevent as many data errors as possible [12].

Like most file systems, Ext2 uses blocks as the basic unit of storage, contains a hierarchical directory system, has a superblock to track the metadata of the file system, and keeps track of files and directories using i-nodes. The more specialized portions of its design include block groups to logically split the disk into more manageable sections [35], block and i-node bitmaps (a sequence of bits usually stored as a byte array) to track which blocks and i-nodes have been allocated [35], and symbolic links. This section will discuss these more specialized portions and the overall structure of the Ext2 file system.

Figure 2.7: Virtual File System indirection [12]

Figure 2.8: Physical structure of the Ext2 file system

Block groups, as the name suggests, are groups of blocks. As defined in Section 2.4, blocks are the smallest unit that can be written to a disk or partition. Grouping blocks together helps to reduce fragmentation and minimize the number of disk seeks required when reading large amounts of consecutive data [35]. Each block group contains a block usage bitmap and an i-node usage bitmap that correspond with the blocks in that group. These bitmaps are limited to a single block each. This forces a block group to contain at most eight times the block size in blocks (one bit of the bitmap per block) [35]. The metadata about each block group is stored in a descriptor table which follows immediately after a copy of the file system's superblock. This metadata contains addresses of the different aspects of the block group, such as the location of the block and i-node bitmaps and the start of the data blocks. Figure 2.8 shows the general layout of the Ext2 file system.
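The on-disk bookkeeping described above can be sketched as plain structures. The fields shown are an abbreviated subset of those defined in the Ext2 specification [35] (the 1024-byte superblock offset, the field offsets used, and the 0xEF53 magic value follow that specification); this is an illustrative sketch, not the LlamaOS driver itself.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Abbreviated subset of the Ext2 superblock fields (the real structure is larger).
struct Ext2Superblock {
    uint32_t inodes_count;      // total i-nodes in the file system
    uint32_t blocks_count;      // total blocks in the file system
    uint32_t log_block_size;    // block size = 1024 << log_block_size
    uint32_t blocks_per_group;
    uint32_t inodes_per_group;
    uint16_t magic;             // 0xEF53 for a valid Ext2 volume
};

// Abbreviated block group descriptor: where this group's bookkeeping lives.
struct Ext2GroupDesc {
    uint32_t block_bitmap;      // block number of the block usage bitmap
    uint32_t inode_bitmap;      // block number of the i-node usage bitmap
    uint32_t inode_table;       // first block of the i-node table
    uint16_t free_blocks_count;
    uint16_t free_inodes_count;
};

// The superblock always starts 1024 bytes into the volume; s_log_block_size
// sits at offset 24 within it and s_magic at offset 56.
bool read_superblock(const std::vector<uint8_t>& volume, Ext2Superblock& sb) {
    if (volume.size() < 1024 + 58)
        return false;
    std::memcpy(&sb.log_block_size, volume.data() + 1024 + 24, sizeof(uint32_t));
    std::memcpy(&sb.magic,          volume.data() + 1024 + 56, sizeof(uint16_t));
    return sb.magic == 0xEF53;      // consistency check performed at mount time
}
```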

Having the superblock replicated in each block group gives the file system an added layer of protection in the case of a hardware failure. It also allows all file system information to be easily obtained no matter which block group the disk head is currently looking at.

The method of storing information about the allocation of blocks and i-nodes in bitmaps is also a new concept. Previously, the Ext file system stored all i-nodes and blocks in a linked list structure. This required a massive overhead in seeking through the data blocks to determine which blocks were free. In Ext2 all information regarding whether or not a block or i-node is free is stored in bitmaps. Bitmaps are simply long sequences of bits. When being read, a logical value of 0 means the block is free and a logical value of 1 means the block has already been allocated. Each bitmap occupies a single block and can be read as a character array. Each byte of the array can then be converted to a bit string and parsed to find the first 0.

Figure 2.9 [7] illustrates how the block bitmap is laid out to represent the blocks contained in the block group.

Figure 2.9: Translation of bitmap to disk blocks [7]
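A sketch of that parse is shown below, assuming the bitmap block has already been read into memory and that bit 0 of byte 0 describes the first block of the group, as in Ext2.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scan a block (or i-node) usage bitmap for the first free entry.
// Returns the zero-based index of the first 0 bit, or -1 if every
// entry in the group has already been allocated.
int first_free(const std::vector<uint8_t>& bitmap) {
    for (std::size_t byte = 0; byte < bitmap.size(); ++byte) {
        if (bitmap[byte] == 0xFF)
            continue;                           // all eight entries in this byte are used
        for (int bit = 0; bit < 8; ++bit) {
            if ((bitmap[byte] & (1u << bit)) == 0)
                return static_cast<int>(byte * 8 + bit);
        }
    }
    return -1;                                  // no free block or i-node in this group
}
```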

Symbolic links are a special type of file that contain references to other files or directories. They allow for data to appear duplicated for organizational purposes without actually requiring the data contained in the blocks to be copied. The only space requirement of a symbolic link is the allocation of an i-node. Most applications handle symbolic links in different ways. Some applications will treat the symbolic link as if it is operating directly on the target file. Other applications may require modifications to the symbolic links themselves and choose to manipulate them directly [35]. Either way it is a useful feature that is available with the Ext2 file system.

Further discussion about the Ext2 file system will occur in Section 4.1 where it is compared to the other candidates to be used for the LlamaOS file system. Section 5.1 will also go into detail about the specific implementation of the Ext2 file system used for the LlamaOS file system including which features are currently available and the differences with the Linux Ext2 driver.

2.5 WARPED

WARPED is a parallel discrete event simulator being developed at the University of Cincinnati based on the Time Warp Mechanism [24]. It is used as a case study to evaluate the utility of LlamaOS (and for the purposes of this thesis, the filesystem therein). WARPED is designed to provide a highly configurable environment to allow for the integration of many different Time Warp optimizations [30]. The following subsections discuss the concepts of parallel discrete event simulation and the Time Warp Mechanism to give proper context to the system.


2.5.1 Discrete Event Simulation

In general, simulations attempt to model physical systems to better predict future outcomes by putting the physical system into terms that a computer can understand. The most basic way to model physical systems is to use a set of states and events [17]. Events occur over time and the effects that they have on the system are represented as different states. What makes discrete simulation special is that events occur instantaneously and are fixed to a specific point in time [17]. This differs from continuous simulation in which the state changes occur continuously in time. As the name suggests, a Discrete Event Simulation (DES) is also event driven. An event driven simulation advances its simulated time based on the occurrence of events instead of at specified time steps like a time driven simulation. To properly function, a DES needs to maintain three main elements [17]:

• Event List: Holds timestamped events scheduled to occur in the future.

• (Global) Clock: Indicates the current simulation time.

• State Variables: Define the current state of the system.

The DES uses a simulation engine (SE) to continuously drive the simulation by pulling events from the event list and changing the state variables or adding new events to the event list based on the effect of the current event. The SE performs this task until the event list is empty or a pre-defined end time or terminating condition is reached [17]. Events are pulled from the event list on a smallest-timestamp-first basis.

Occasionally events will have the same timestamp and one of them will be picked arbitrarily. This opens up an avenue for accelerating the speed of the simulation using parallel processing.
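The core loop of such an engine can be sketched as follows; the types and names are illustrative and are not the WARPED interface. Each iteration removes the smallest-timestamp event, advances the clock to that timestamp, and lets the event's handler update state and schedule further events.

```cpp
#include <functional>
#include <queue>
#include <vector>

// Illustrative discrete event simulation engine; not the WARPED API.
struct Event {
    double timestamp;
    std::function<void(double)> handler;   // mutates state, may schedule new events
};

// Order the priority queue so the smallest timestamp is served first.
struct Later {
    bool operator()(const Event& a, const Event& b) const {
        return a.timestamp > b.timestamp;
    }
};

class SimulationEngine {
public:
    void schedule(Event e) { event_list_.push(std::move(e)); }

    // Drive the simulation until the event list is empty or a pre-defined
    // end time is reached.
    void run(double end_time) {
        while (!event_list_.empty() && event_list_.top().timestamp <= end_time) {
            Event e = event_list_.top();
            event_list_.pop();
            clock_ = e.timestamp;          // advance the (global) clock
            e.handler(clock_);             // apply the event's effect on the state
        }
    }

    double now() const { return clock_; }

private:
    std::priority_queue<Event, std::vector<Event>, Later> event_list_;  // event list
    double clock_ = 0.0;                                                // current simulation time
};
```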

2.5.2 Parallel Discrete Event Simulation

Parallel Discrete Event Simulation (PDES) aims to speed up the execution of sequential DES by dividing the global simulation task into a set of communicating logical processes (LPs) [17]. Each LP contains a communication interface (CI) and a SE and operates over an event structure region [17]. These regions are each a set of events that occur during a subspace of the total simulation time. Figure 2.10 [17] illustrates the basic architecture of a PDES using LPs. This illustration contains the following [17]:


Figure 2.10: PDES Illustration [17]

• A set of LPs, labeled LP_1 through LP_P, that execute events.

• A communication system that the LPs use to exchange data and synchronize activities.

• The regions, R_1 through R_P, that each SE_i performs event executions on.

• The simulation engines, SE_1 through SE_P, contained within each LP_i.

• The communication interfaces, CI_1 through CI_P, of each LP_i.

Instead of executing the events through a single queue, PDES allows the system to concurrently execute events. This can be done either synchronously (The simulation uses the global clock to track simulation time and all LPs must process events according to the global clock time. LPs will not begin processing a new event until all other LPs have finished.) or asynchronously (Events that do not affect one another are simulated. LPs process events continuously without regard for the other LPs.). Unfortunately, the decrease in runtime comes with the possibility of causality errors.

Causality errors occur when events are executed out-of-order and the events are causally dependent. An example of this is when an LP executes an event that causes a new event to be inserted into a second LP. This new event may have a smaller timestamp than the event that the second LP is currently executing and cause a causality error. Synchronous and asynchronous execution both handle causality errors differently [18].

The synchronous approach (sometimes referred to as the conservative approach [18]) attempts to completely remove the possibility of causality errors ever occurring. It does this by ensuring that new events are only processed once all events that could possibly affect them have already completed. This is usually accomplished by allowing the LPs to communicate to determine when it is safe. Aside from the fact that synchronous execution is slower due to LPs needing to wait on others to finish processing, it also has the drawback of entering deadlock [18]. Deadlock situations occur when events that are dependent on each other have similar enough timestamps that their LPs decide to block until the other is executed. This requires some method of deadlock avoidance to be used (such as null messages [18]).

Instead of attempting to remove causality errors from the system, the asynchronous approach (sometimes referred to as the optimistic approach [18]) instead chooses to implement a system of detection and recovery to deal with them [18]. Asynchronous systems simply continue executing regardless of any causal dependencies between the events being executed. This means that they need to have a system in place to determine when a causality error occurs, and once an error is detected it needs a system to recover from the error. This recovery is usually done with a rollback to an earlier time to prevent the error from occurring. Due to the fact that LPs have no reason to block themselves in this implementation, deadlock is never an issue [18].

2.5.3 Time Warp Mechanism

The Time Warp Mechanism is a specific technique used to implement an asynchronous (optimistic) PDES. Being asynchronous, LPs in Time Warp continue to compute events on a smallest-timestamp-first basis without observation of the local causality constraints (lccs) [17]. If a straggler event, an event which arrives out-of-order, needs to be processed, the LP will roll back to a previous time to process the event. When a rollback occurs, the most recent saved state of the simulation needs to be loaded to maintain consistency with the timestamp on the new event. This requires each LP to keep a sufficient history containing the following data [17]:

• State Stack (SS): Contains internal state information to allow for restoration of past states.


• Input Queue (IQ): Stores received messages.

• Output Queue (OQ): Stores sent messages.

Each of these elements is time dependent and requires each LP to contain a set of clocks that allow it to keep track of both its own time and the overall system time. They are defined as [17]:

• Local Virtual Time (LVT): Stores the timestamp of the most recent event processed by the LP. If the next event has a timestamp smaller than this value, a rollback is triggered.

• Global Virtual Time (GVT): An approximation of the overall time of the entire simulation. It is approximated by the minimum timestamp of all unprocessed events.

During a rollback the LP will be restored to its previous state and any messages sent after the time that the rollback goes to must be cancelled with anti-messages [17].
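A simplified sketch of the straggler test and the rollback bookkeeping follows; the types and member names are illustrative, and the real WARPED and Time Warp data structures carry considerably more state (fossil collection, GVT computation, and so on are omitted).

```cpp
#include <deque>
#include <vector>

// Simplified Time Warp logical process; illustrative only.
struct Message    { double timestamp; int dest_lp; bool anti; };
struct SavedState { double lvt; std::vector<int> state_vars; };

class LogicalProcess {
public:
    // Called for every arriving event message.
    void receive(const Message& msg) {
        if (msg.timestamp < lvt_)
            rollback(msg.timestamp);       // straggler: event arrived "in the past"
        input_queue_.push_back(msg);       // IQ keeps received messages
    }

private:
    void rollback(double to_time) {
        // Restore the most recent saved state not newer than the straggler.
        while (!state_stack_.empty() && state_stack_.back().lvt > to_time)
            state_stack_.pop_back();
        if (!state_stack_.empty()) {
            state_vars_ = state_stack_.back().state_vars;
            lvt_        = state_stack_.back().lvt;
        }
        // Cancel messages sent after the rollback time with anti-messages.
        while (!output_queue_.empty() && output_queue_.back().timestamp > to_time) {
            Message anti = output_queue_.back();
            anti.anti = true;
            send(anti);                    // hand the anti-message to the CI
            output_queue_.pop_back();
        }
    }

    void send(const Message&) { /* forward to the communication interface */ }

    double lvt_ = 0.0;                     // Local Virtual Time
    std::vector<int> state_vars_;          // current state of the LP
    std::deque<SavedState> state_stack_;   // SS: past states for restoration
    std::deque<Message> input_queue_;      // IQ: received messages
    std::deque<Message> output_queue_;     // OQ: sent messages
};
```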

Chapter 3

Related Work

This section will begin by discussing information related to the creation of custom operating systems and how they choose to implement file system drivers, taking a specific look at CROCOS [15] whose file system and development phases were used to help shape the LlamaOS file system. After discussing other file system implementations this chapter will change focus and look into other projects working with parallel discrete event simulations in virtualized environments, such as the Time Warp Operating System [23].

3.1 The CROCOS Kernel

Developed by Guillaume Duranceau, CROCOS is a small UNIX-like kernel for x86/x86_64 systems [15]. It is being developed in several distinct steps known as phases to allow for easy education and understanding of basic operating system concepts. The current implementation is written in C and only functions as a Linux task manager, with plans to expand into a fully working microkernel [27] that can be booted from a real machine or an emulator.

The completed phases of CROCOS are as follows [15]:

• Phase 1: The Kernel libc - Implementation of traditional libc functions, such as string.h and stdarg.h.

• Phase 2: The Tasks Manager - Implementation of a simple task manager based on fork/exec functions that handles task switching (including relevant context storing), basic process operations, such as wait() and kill(), and has the ability to synchronize all tasks.


• Phase 3: Read-Only File System - Generic implementation of a revision 0 ext2 file system which allows for the reading of directories and regular files on a single device through the use of a virtual file system and ext2 driver.

Future phases, such as adding writes to the file system and creating a boot sequence, are planned, but no development has occurred on the project since 2009.

3.2 Time Warp Operating System

Released in 1987, the Time Warp Operating System (TWOS) was one of the first attempts at creating a system focused on PDES [23]. TWOS was mostly written in C and was developed at NASA's Jet Propulsion Laboratory (JPL). It ran on a Mark III hypercube and on a network of seven Sun workstations even though it was initially designed for the Caltech Mark II hypercube.

Unlike other operating systems of the time, TWOS completely committed itself to optimistic PDES and to process rollback for almost all synchronization [23]. Process rollback is treated as the normal mechanism for process synchronization instead of as a form of exception handling like most other operating systems treated it. The specific implementation of a rollback mechanism that TWOS uses is completely able to control, if not completely undo, errors, infinite loops, I/O, creation and destruction of processes, asynchronous message communication, and termination [23].

Although TWOS is a single-user system and only supports a single process at a time, it is able to distribute each process between multiple machines using a communication protocol to track the overall state of the system [23]. TWOS does not support general time sharing between the independent processes running on it, and it also has a few other limitations. One of these limitations is the inability to dynamically create processes during runtime. It also initially lacked the ability to migrate processes to aid with load management [23], but has since been updated to allow for basic static and dynamic load balancing [37].

Despite these differences from ordinary distributed operating systems, TWOS still retains a general modular decomposition [23]. TWOS contains modules to handle processor scheduling, memory management, process synchronization, message queuing, and commitment protocols that fill the same roles as their counterparts in other operating systems. Each module is just specially developed with a focus on PDES [23].


Figure 3.1 [23] gives a high-level view of the structure of TWOS.

TWOS uses a simple object-oriented programming model that utilizes a global process name-space [23].

Each process is assigned a 20-character name that is used by other processes to send messages at any time.

No channels, pipes, or connections need to be opened for one process to communicate with another. This allows for flexibility during the execution of complex situations by not forcing process communication to be specified in the source code [23].

Figure 3.1: Structure of the Time Warp Operating System [23]

Each process is logically composed of the following four parts [23]:

• Initialization Section: The code segment that is executed once at the initialization of the process. Its main purpose is to initialize the state variables, but it may also send event messages which will not be received until all process initializations are complete.

• Event Message Section: The code segment that is called whenever a set of messages is set to be processed. It usually modifies state variables and sends query and event messages.

• Query Message Section: The code segment that is called when a query message needs to be processed. It cannot modify state variables or send event messages. It is only allowed to send more query messages, and is required to send exactly one reply to its invoking query message.

• Termination Section: The code segment that is called when the simulation is ended. It outputs the final statistics and is only allowed to send query messages.

There are also two significant restrictions placed on process behavior (although neither is explicitly enforced by TWOS) [23]:

• Each process must be rigidly deterministic in its input-output behaviour. This is to ensure that a domino effect does not occur during a rollback, and that the same input messages always generate the same output messages.

• Processes should not use heap storage. Heap storage makes state-saving difficult and slow, which increases the amount of time it takes the system to roll back to a previous state.

The benchmarks performed on the JPL Mark III Hypercube in 1987 showed that TWOS was capable of at least an order of magnitude of speedup in a relatively small and irregular simulation. The COMMO benchmark which was used showed a maximum speedup of 10.66 using 24 processors, with a note that the number of anti-messages increased with the number of processors [23]. In 1989, further research was done to attempt to limit a portion of the system's optimism, but the modest gains that the changes showed were unpredictable at best. It was concluded that although other methods of limiting optimism existed, there appeared to be very little value in doing so [38].

Chapter 4

Overview of Approach

This chapter will start with a comparison of the Minix [41], FAT [5], and Ext2 [35] file systems. It will discuss statistics related to each such as the maximum file size, filename length, and overall partition size.

It will also take a look into the usability of each file system before explaining why the Ext2 file system was chosen to be used as the LlamaOS file system. It will then discuss the research that went into determining which file system benchmarking methods to use. These methods have been heavily researched, and Section 4.2 will aim to summarize this research.

4.1 File System Choice

Before implementation could begin, a file system that was appropriate for LlamaOS needed to be selected.

Because LlamaOS is a light-weight, minimalist operating system, a file system that shared similar principles needed to be chosen. This immediately removed more modern file systems from the equation. Certain features contained in the Ext3 and Ext4 file systems [31], such as journaling, delayed allocations, and larger directories, were not needed in LlamaOS. Instead, their predecessor, the Ext2 file system, was considered. The New Technology File System (NTFS) [1] was cut for similar reasons, and its predecessor, the FAT file system, was considered. The final consideration was the Minix file system. As one of the first file systems to be created for a minimalist operating system, it fit the criterion almost perfectly. Current implementations of all three candidate file systems were analyzed to help determine the ease with which they could be implemented in LlamaOS. The main implementations explored at this point were in the Linux kernel [3].


4.1.1 The FAT File System

As the Ext2 file system was already discussed in Section 2.4.1, the FAT file system will be discussed first.

Developed by Microsoft, the File Allocation Table (FAT) file system was designed to provide a light-weight file system that still maintains solid performance for use in mobile devices and embedded systems [9]. The FAT file system comes in three variants: FAT12, FAT16, and FAT32. Unlike most Unix-based file systems, the FAT file system uses a file allocation table and clusters. A file's directory entry serves a role similar to that of an inode: it stores the metadata related to the file and points to the first cluster of the file or directory. Clusters are similar to blocks: they store the raw data of the file, and the file allocation table entry for each cluster points to the next cluster of the file if it is larger than a single cluster. Each file or directory in a FAT file system therefore has its data contained in a linked list of clusters [5]. The number at the end of each variant name indicates the number of bits per entry in the file allocation table, and the specific variant used usually depends on the number of clusters. In general, floppy disks are formatted with FAT12, volumes less than 512MB in size are formatted with FAT16, and volumes greater than 512MB in size are formatted with FAT32 [5].
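
The cluster-chain walk described above can be sketched in a few lines of C. This is an illustrative reconstruction, not code from the FAT specification or from LlamaOS: FAT16_EOC is the standard end-of-chain marker for FAT16, while read_cluster() is a hypothetical helper that copies one cluster of raw data into a buffer.

    #include <stdint.h>

    #define FAT16_EOC 0xFFF8u   /* FAT16 entries >= 0xFFF8 mark the end of a chain */

    /* Read an entire file by following its cluster chain.  fat[] is an
     * in-memory copy of the file allocation table; first_cluster comes from
     * the file's directory entry. */
    static void read_whole_file(const uint16_t *fat, uint16_t first_cluster,
                                void (*read_cluster)(uint16_t cluster, void *buf),
                                uint8_t *buf, uint32_t cluster_size)
    {
        uint16_t cluster = first_cluster;
        while (cluster < FAT16_EOC) {       /* stop at the end-of-chain marker      */
            read_cluster(cluster, buf);     /* fetch the raw data of this cluster   */
            buf += cluster_size;            /* the next cluster's data follows it   */
            cluster = fat[cluster];         /* FAT entry points to the next cluster */
        }
    }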

There are four basic regions in each FAT file system. They’re laid out in this order [5]:

• Reserved Region: Contains information for booting the operating system and mounting the file system.

• File Allocation Table Region: Stores the table containing all files and directories in the file system.

This region may also contain a backup copy of the main file allocation table.

• Root Directory Region (only for FAT12 and FAT16): Contains all metadata about the root directory.

• Data Region: The remainder of the volume that stores all data contained in the files and directories.

Figure 4.1 [5] shows a possible layout of a FAT12/16 file system. The only major difference between this layout and the layout of a FAT32 file system is that FAT32 stores the root directory within the data region instead of in a dedicated root directory region. A copy of the file allocation table is not always included.

The FAT file systems are natively supported by all major operating systems currently on the market, which allows for easy movement of data from LlamaOS to Linux, Mac OSX, and Windows. The only significant drawbacks to the use of the FAT file systems are the file size limitations, volume size limitations, and difficulty of implementation.

Each variant of the FAT file system has a different limit for the file and volume sizes. FAT12 limits both

file size and volume size to 16MB. FAT16 has different limits based on whether large file support (LFS) is added to the file system. The file and volume size limitation is 2GB without LFS and 4GB with. Finally,

FAT32 is similar to FAT16 in that it has different volume size limitations. Without LFS, the maximum volume size is 32GB. With LFS, the maximum volume size is 2TB. Under both conditions the maximum

file size is 4GB [1].

The assessment of difficulty for the implementation is mostly subjective. It is based on the specification document, the current implementations of the FAT file systems, and the state of LlamaOS itself. In relation to the Ext2 and MINIX file systems, there is no particular algorithm or data structure that would

Figure 4.1: Layout of the FAT12/16 File System [5]

be significantly simpler to implement; however, being a Windows-based file system makes its addition to LlamaOS more difficult, because LlamaOS is built similarly to a Linux microkernel, making both MINIX and Ext2 better choices in terms of ease of implementation.

4.1.2 The MINIX File System

The MINIX file system was developed by Andrew S. Tanenbaum for use inside the MINIX (“mini-Unix”) operating system [41]. The MINIX operating system was developed primarily for educational purposes and eventually served as a basis for many of the ideas behind the Linux kernel. It contained many basic modules that demonstrated the various functions of an operating system. One of these modules was a very simplistic file system known as the MINIX file system.

The MINIX file system contains many of the concepts discussed in Section 2.4. It has the ability to allocate/deallocate space for files, keep track of disk blocks and free space, and protect against unauthorized

file access [41]. The MINIX file system also relies on i-nodes and blocks to navigate the files stored on it.

The first implementation of the MINIX file system was a flat file system. It contained no directories and provided very little functionality; however, the current iteration, MINIX 3, now supports directories [41].

The layout of the MINIX file system is very similar to that of the Ext2 file system. It consists of [41]:

• The Boot Block: Contains information for booting the operating system.

• The Superblock: Contains metadata describing the layout of the file system.

• The I-node bitmap: Contains information on the current usage of the file system's i-nodes.

• The Zone (block) bitmap: Contains information on the current usage of the file system's blocks.

• I-nodes: A series of blocks containing all file metadata.

• Data blocks: The data stored in the files.

Figure 4.2 [41] shows the layout of the MINIX block structure. The main difference that can be noticed between the MINIX structure and the Ext2 structure is the lack of block groups. The entirety of the MINIX block structure is similar to that of a single block group in Ext2.
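
For reference, the superblock that describes this layout can be represented by a small C structure. The field names below follow the classic MINIX on-disk superblock (as used, for example, in the Linux kernel's minix_fs.h); it is shown here only as an illustration of how compact the MINIX metadata is, not as part of the LlamaOS code base.

    #include <stdint.h>

    /* Classic MINIX superblock: enough metadata to locate the two bitmaps,
     * the i-node table, and the data zone. */
    struct minix_super_block {
        uint16_t s_ninodes;        /* number of i-nodes                        */
        uint16_t s_nzones;         /* number of data zones (blocks)            */
        uint16_t s_imap_blocks;    /* blocks occupied by the i-node bitmap     */
        uint16_t s_zmap_blocks;    /* blocks occupied by the zone bitmap       */
        uint16_t s_firstdatazone;  /* first block that holds file data         */
        uint16_t s_log_zone_size;  /* zone size = 1024 << s_log_zone_size      */
        uint32_t s_max_size;       /* maximum file size in bytes               */
        uint16_t s_magic;          /* identifies the MINIX file system version */
        /* ... later revisions append further fields ... */
    };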


The MINIX file system is written in C and, by design, can be modified and experimented with completely separately from the MINIX operating system. Its main implementation runs similarly to a network file server that receives requests in the form of messages and then sends responses containing data. It is meant to run in user space, but simple modifications to the messaging system could allow it to run as part of the operating system [41]. Due to its lack of features and simple block structure, MINIX would be the most light-weight and easiest file system to implement for LlamaOS.

However, the lack of features is cause for concern. For starters, the MINIX file system is the least supported of the three contenders. It has native support in most Unix-based kernels, but requires special third-party software to be accessed in both Mac OSX and Windows. The limitations on file and volume sizes are also the worst of the three, with a maximum size of 64MB for both. It also has a strict 30-character limit on file names, and does not allow for a variable block size [12].

Figure 4.2: Layout of the MINIX File System [41]


                         FAT               MINIX           Ext2
Max Volume Size          16MB, 2GB, 32GB   64MB            4TB
Max File Size            16MB, 2GB, 4GB    64MB            2GB
Max File Name            255 characters    30 characters   255 characters
Ease of Implementation   3                 1               2
Linux Support            Y                 Y               Y
Mac OSX Support          Y                 3rd Party       Y
Windows Support          Y                 3rd Party       3rd Party

Table 4.1: Comparison of FAT, MINIX, and Ext2 file systems

4.1.3 Final Comparison

All three file systems had qualities that could make them the best choice for LlamaOS. The FAT file system had the best support from other operating systems, and it had the most customizable structure with the ability to implement FAT12, FAT16, or FAT32; however, it was also the most difficult to implement.

On the other hand, the MINIX file system was designed for easy modification and experimentation. It also had the simplest overall structure, allowing for easy implementation. Unfortunately, this came at the cost of limited support from outside operating systems and a small maximum file system size.

Ext2 was decided on because it served as a solid middle ground between the FAT file system and the

MINIX file system. It was initially built from the MINIX file system and maintained a similarly simple structure. This simple structure, combined with the fact that it was one of the first major file systems used by the Linux kernel [3], allowed for a much easier implementation than that of the FAT file system.

On top of this, Ext2 is also one of the most widely used file systems today. Both Linux and Mac OSX have native support for it, and many third-party solutions allow for full modification of Ext2 volumes in

Windows; however, Ext2 really shines in its usage throughout the custom operating system community [2].

Most of these custom operating systems that include file system support use Ext2. Ext2 also provided the largest maximum volume size, 4TB, and a respectable maximum file size of 2GB [12]. A summary of the comparison between these three file systems is shown in Table 4.1.

4.2 File System Benchmark Research

After a file system was chosen, a way to properly assess its performance needed to be determined.

Benchmarks are used to provide a general idea of the performance of a certain piece of hardware or software.


In this case a benchmark was needed that specifically focused on file systems. The field of file system benchmarking is currently in somewhat of a state of turmoil [42, 43]. No standardized method exists to properly assess how a file system performs, and there are a variety of benchmarking tools available that provide varying degrees of data [43]. In some cases the current benchmarking tools are even called into question, and a new metric for benchmarking tools is suggested [42]; however, developing a new file system benchmarking tool was outside the scope of this project.

Traeger et al. [43] evaluate the current state of file system benchmarking and make suggestions on how to improve data collection by giving a set of standard guidelines to use when benchmarking a file system.

The first guideline is to choose a question that properly highlights the performance characteristics relevant to the research at hand, and to decide whether or not it is relevant to compare benchmark results with other file systems [43]. This question is then used to determine an appropriate workload and benchmark with which to answer it. Benchmarks are broken down into three main types [43]:

• Macrobenchmarks: Give an overall view of system performance by performing multiple file system

operations; however, the workloads are typically unrealistic.

• Trace-based: Used to assess the performance of the system under a realistic workload by tracing

operation execution through the file system.

• Microbenchmarks: Perform only a few file system operations and are mostly used to better understand a macrobenchmark or trace-based benchmark.

In general, the best benchmarks to choose will highlight both high-level and low-level performance, and using more than one type of benchmark is recommended [43]. For the purposes of testing the LlamaOS file system, only a macrobenchmark was used. The benchmarks looked into were the bonnie benchmark [10], the Andrew benchmark [22], and the IOzone benchmark [33]. Due to the ease of implementation, the bonnie benchmark was chosen as the benchmark to be ported to LlamaOS. No comparisons to other file systems could be properly made, because Ext2 is the only file system for which LlamaOS support was added.

The second guideline is to choose an appropriate benchmarking environment. This has to do with the hardware and operating system that the test will be run on. Any unneeded processes should be killed to allow for the purest results, and if any comparisons are being done between file systems they need to be done on the same hardware [43]. All file system benchmarks for the LlamaOS file system were run inside a virtual machine on an x86_64 Gentoo Linux machine with an 8-core AMD FX-8120 processor @ 3.10GHz, with 1GB of memory allocated to the virtual machine. The test disk was a 64KB empty disk image with a 512-byte block size formatted as Ext2. The chosen operating system was LlamaOS.

The third guideline involves running the chosen benchmarks. It gives four criteria for doing this properly [43].

• All benchmark runs should be identical to one another.

• Each benchmark should be run several times to ensure accuracy and to properly calculate confidence

levels and standard deviations.

• Benchmarks should run for a length of time sufficient for the system to reach a stable state.

• The benchmarking process should be automated to minimize human error.

The bonnie benchmark was chosen to determine the performance and functionality of the LlamaOS File

System. Due to the nature of the bonnie benchmark, it cannot be configured to run for a specific length of time.

It performs a standard set of file system operations and times how long each one takes. There is no stable state to be reached. A short script was written to run the bonnie LlamaOS app 10 times in a row without interruption.

The rest of the guidelines were about how to properly present and validate benchmark results with respect to benchmark comparisons [43]. Since only one benchmark is being used and the LlamaOS file system is being benchmarked to prove functionality, these guidelines are not pertinent to this research.

Chapter 5

Implementation Details

The following sections discuss the details of the LlamaOS file system. This includes a detailed description of the data structures and algorithms used in the LlamaOS virtual file system (VFS) and Ext2 driver. It gives a breakdown and explanation of the main functions used in each. It then goes on to discuss the modifications that needed to be made to the bonnie benchmark to allow it to run properly as a LlamaApp.

These modifications include the removal of certain portions of the test (such as threaded disk seeks) and minor function modifications.

5.1 File System

As mentioned in Chapter 3, the LlamaOS file system implementation was largely based on the CROCOS [15] implementation. Other Ext2 file system implementations were looked into, such as the Linux-3.12.6 kernel [3] implementation, but the CROCOS file system had the fewest dependencies on its host operating system. This made it the easiest version to attempt to port to LlamaOS. CROCOS provided a strong foundation for the VFS layer and the read-only Ext2 driver; however, significant changes still needed to be made to the code base to allow the existing code to function within LlamaOS.

5.1.1 Virtual File System

The first decision that needed to be made when the code revision began was whether or not to keep the extra

VFS code. The original plan was to simply implement an Ext2 driver that LlamaOS would interact with

directly. It was ultimately decided that keeping the VFS code in the system could prove useful in the future at little to no overhead cost. Currently the VFS simply forwards calls to the Ext2 driver, but future implementations of the MINIX or FAT file systems would be able to function through the use of the VFS.

The current implementation of the virtual file system contains the following functions, which are all called from LlamaOS. All functions are prototyped in a vfs.h header file and then implemented in vfs.c. These functions are the only way for LlamaOS to interact with a file system; a sketch of the corresponding prototypes is shown after the list.

• fs_initialize

• fs_finalize

• fs_open

• fs_close

• fs_lseek

• fs_read

• fs_write

• fs_readdir
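
The following is a hypothetical sketch of what the vfs.h interface described in this section looks like; the exact types and return conventions in LlamaOS may differ, and the declarations are reconstructed from the descriptions below rather than copied from the source.

    #include <stddef.h>
    #include <sys/types.h>

    struct direntry;                        /* filled in by fs_readdir             */

    int     fs_initialize(void);            /* mount the volume, build inode lists */
    int     fs_finalize(void);              /* unmount the volume at shutdown      */
    int     fs_open(const char *path, int flags);
    int     fs_close(int fd);
    off_t   fs_lseek(int fd, off_t offset, int whence);
    ssize_t fs_read(int fd, void *buf, size_t count);
    ssize_t fs_write(int fd, const void *buf, size_t count);
    int     fs_readdir(int fd, struct direntry *entry);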

The fs_initialize function is called at the start-up of LlamaOS. It provides an indirection layer to mount the Ext2 volume on the disk image specified in the Xen config file. Once the volume is mounted, it initializes two lists that are used to track the used and free inodes (used_inodes and free_inodes). These lists allow for quick access to the availability of inodes during runtime.

The opposite function is fs_finalize, which is called as LlamaOS is shutting down. This occurs once the specified LlamaApp has finished its execution. It provides an indirection layer to unmount the Ext2 volume. Both mounting and unmounting of the volume will be discussed in more detail in Section 5.1.2.

The fs_open function is called when a file is to be opened so that operations can be performed on it. It takes two inputs. The first input is a character string containing the path to the file, and the second is an int whose bits encode the file flags that determine how the file is to be treated (read-only, read-write, create, etc.).



Figure 5.1: Flowchart of the fs open function.

It then attempts to use this information to find an inode number for the file. It does this by performing a lookup on the file system; this part is handled by the Ext2 driver. The inode number is then checked against the used_inodes list to determine if the file is already open. If it is not already open, a file descriptor (a number by which the operating system references the file) is allocated and the inode is added to the used_inodes list. Figure 5.1 contains a flowchart depicting the operation of this function.
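
The flow in Figure 5.1 can be summarized in a short C sketch. The helper names (vfs_lookup, inode_is_open, fd_allocate, mark_inode_used) are illustrative stand-ins for the VFS and Ext2 internals rather than the actual LlamaOS identifiers, and the behavior for an already-open file is an assumption since the text does not specify it.

    #include <errno.h>

    /* Hypothetical helpers standing in for the VFS/Ext2 internals. */
    extern int  vfs_lookup(const char *path, int flags);  /* path -> inode number       */
    extern int  inode_is_open(int inode);                 /* on the used_inodes list?   */
    extern int  fd_allocate(int inode, int flags);        /* hand out a descriptor      */
    extern void mark_inode_used(int inode);               /* free_inodes -> used_inodes */

    int fs_open(const char *path, int flags)
    {
        int inode = vfs_lookup(path, flags);   /* Ext2 driver resolves the path */
        if (inode < 0)
            return inode;                      /* lookup failed: propagate error code */

        if (inode_is_open(inode))
            return -EBUSY;                     /* assumed handling of an already-open file */

        int fd = fd_allocate(inode, flags);
        if (fd < 0)
            return fd;

        mark_inode_used(inode);                /* track the inode as in use */
        return fd;                             /* success: return the file descriptor */
    }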

The fs_close function performs the opposite operation. It moves the file's inode from the used_inodes list to the free_inodes list and then frees the file descriptor so it can be assigned to another file. Its only input is the file descriptor.

The fs_lseek function is used to move a file's offset to a new position. It takes the file descriptor, the new offset, and the type of seek to perform as inputs. After error checking occurs and the file descriptor is confirmed to be open, the function performs one of three operations based on the seek type specified in the inputs.

• SEEK_SET: Sets the file offset to the new offset specified when the function was called.

• SEEK_CUR: Adds the offset value specified in the function call to the current file offset, and stores the resulting offset as the file's new offset.

• SEEK_END: Adds the offset value specified in the function call to the file's size, and stores the resulting offset as the file's new offset.


Upon successful completion the function returns the new offset of the file.
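
A minimal sketch of the offset arithmetic for the three seek types is shown below. The open-file structure and error handling are simplified and illustrative; only the SEEK_SET/SEEK_CUR/SEEK_END calculation mirrors the description above.

    #include <sys/types.h>
    #include <unistd.h>        /* SEEK_SET, SEEK_CUR, SEEK_END */

    struct open_file { off_t offset; off_t size; };

    /* Returns the new offset on success, or -1 on an invalid request. */
    off_t fs_lseek_offset(struct open_file *f, off_t offset, int whence)
    {
        off_t new_offset;

        switch (whence) {
        case SEEK_SET: new_offset = offset;             break; /* absolute position     */
        case SEEK_CUR: new_offset = f->offset + offset; break; /* relative to current   */
        case SEEK_END: new_offset = f->size + offset;   break; /* relative to file size */
        default:       return -1;
        }

        if (new_offset < 0)
            return -1;              /* reject seeks before the start of the file */

        f->offset = new_offset;
        return new_offset;
    }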

The fs_read function takes a file descriptor, a reference to a buffer in which to store the retrieved data, and the number of bytes to read as inputs. After checking that the file is in fact valid and that the current user has permission to read it, the inode and offset information from the file descriptor are passed on to the Ext2 driver to handle the actual data retrieval. All data retrieved is placed in the buffer provided to the function, and the number of bytes read is returned.

The fs_write function is similar to the fs_read function. It takes a file descriptor, an input buffer, and an amount to write as inputs. File permissions are checked to make sure that the user is allowed to write to the file. Then the inode, offset, input buffer, and amount to write are passed to the Ext2 driver to handle the file operations. Upon successful completion, the function returns the number of bytes that were written.
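
The check-then-delegate pattern shared by fs_read and fs_write can be sketched as follows (fs_write is symmetric). The descriptor table, permission test, ext2_read signature, and the decision to advance the offset after a read are all assumptions made for illustration.

    #include <errno.h>
    #include <stddef.h>
    #include <sys/types.h>

    struct open_file { int inode; off_t offset; int flags; };

    extern struct open_file *fd_lookup(int fd);              /* hypothetical fd table */
    extern int can_read(const struct open_file *f);          /* permission check      */
    extern ssize_t ext2_read(int inode, off_t offset, void *buf, size_t count);

    ssize_t fs_read(int fd, void *buf, size_t count)
    {
        struct open_file *f = fd_lookup(fd);
        if (f == NULL)
            return -EBADF;                /* not a valid, open descriptor      */
        if (!can_read(f))
            return -EACCES;               /* reject before touching the driver */

        ssize_t n = ext2_read(f->inode, f->offset, buf, count);  /* driver does the I/O */
        if (n > 0)
            f->offset += n;               /* assumed: advance the offset by bytes read */
        return n;                         /* number of bytes read */
    }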

The final function that the VFS contains is the fs_readdir function. This function takes a file descriptor and a reference to a directory structure (or direntry). The mode of the file descriptor is checked to ensure that it is a directory, and then the information is passed to the Ext2 driver to retrieve the directory information. The function returns 1 upon successful completion.

Those functions cover the entirety of the VFS in its current state. Right now the VFS only has the capability to interact with the Ext2 driver, but additional file system driver calls could be added to the source as more file systems are added to the operating system.

5.1.2 Ext2 Driver

Significant improvements have been made to the CROCOS Ext2 implementation. The main change was the addition of file writes and file creation. In CROCOS the file system was only able to read files, with no way to write to the disk.

The current LlamaOS Ext2 driver is comprised of the following files:

• config.h: Contains macros that hold system configurations like the maximum number of i-nodes that

can be opened and the maximum number of block groups allowed in the mounted file system.

• direntry.h: Contains the data structure for a directory.

• disk.cpp/disk.h: Contains functions that allow Ext2 to access the physical disk and perform raw data


reads and writes.

• ext2.c/ext2.h: Contains all functions relevant to operations performed on an Ext2 file system. The

bulk of operations are performed in this file.

• fd.c/fd.h: Contains the file descriptor structure, and basic operations that can be performed on file

descriptors.

• fsdesc.h: Contains an abstraction of basic file system operations such as reads and writes.

• inode.h: Contains the internal i-node structure that is used to easily access information within i-nodes.

• stat.h: Contains a structure that is used as the basis for the i-node structure in inode.h.

Out of these files, ext2.c and ext2.h are the only files that are complex enough to merit an in-depth discussion. The public functions available to the VFS from ext2.h are:

• ext2_mount

• ext2_unmount

• ext2_lookup

• ext2_fill_stat

• ext2_read

• ext2_write

• ext2_readdir

These functions mostly line up with their counterparts in the VFS. To support these public functions, ext2.c also contains private functions that mostly deal with metadata manipulation. These are:

• disk_blk_to_cache

• fill_inode

• update_inode

• ext2_find_free_block

• ext2_find_free_inode

• ext2_block_allocate

• ext2_inode_allocate

• get_file_blk

In addition to these functions, ext2.c also contains data structures for the superblock (superblk_t), block group descriptors (group_desc_t), i-nodes (inode_t), and direntries (direntry_t).

ext2_mount (called from fs_initialize) begins by loading in the disk from the Xen configuration. The hypervisor contains a list of blocks, and each block on the input disk image is loaded into the virtual machine's running memory. This essentially makes the file system a part of RAM. This is perfect for LlamaOS because it significantly decreases the amount of time it takes to perform file system I/O, and the additional constraints it puts on file system size and LlamaApp memory usage are not relevant to the scope of the LlamaApps that currently run on LlamaOS. In the future a system may need to be put in place to allow for the use of a logical file system, but for now the implementation covers all needs of LlamaOS.

After initializing the disk image, the superblock and block groups are loaded into their respective data structures for easy access during the LlamaApp's execution. A sanity check occurs to verify that the volume being mounted is in fact formatted with the Ext2 file system, and the block size and block group counts are checked for consistency. If all of this completes successfully, a value of 1 is returned.
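
The kind of sanity check ext2_mount performs can be illustrated with the sketch below. The magic number 0xEF53 and the block-size encoding (1024 shifted left by s_log_block_size) are standard Ext2 facts; the struct is a partial, illustration-only subset whose field offsets do not match the on-disk layout, and the surrounding helper is not LlamaOS code.

    #include <stdint.h>

    #define EXT2_SUPER_MAGIC 0xEF53

    struct ext2_superblock {
        uint32_t s_blocks_count;      /* total blocks in the volume             */
        uint32_t s_first_data_block;  /* 1 for 1KB blocks, 0 otherwise          */
        uint32_t s_log_block_size;    /* block size = 1024 << s_log_block_size  */
        uint32_t s_blocks_per_group;
        uint16_t s_magic;             /* must be 0xEF53 for a valid Ext2 volume */
        /* ... remaining superblock fields omitted ... */
    };

    /* Returns 1 on success, mirroring ext2_mount, and 0 on a malformed volume. */
    static int check_superblock(const struct ext2_superblock *sb, uint32_t max_groups)
    {
        if (sb->s_magic != EXT2_SUPER_MAGIC)
            return 0;                                   /* not an Ext2 volume         */

        uint32_t block_size = 1024u << sb->s_log_block_size;
        if (block_size < 1024u || block_size > 4096u)   /* supported range here is an */
            return 0;                                   /* assumption of this sketch  */

        uint32_t groups = (sb->s_blocks_count - sb->s_first_data_block +
                           sb->s_blocks_per_group - 1) / sb->s_blocks_per_group;
        return groups <= max_groups;                    /* e.g. the config.h limit    */
    }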

ext2_unmount (called from fs_finalize) performs the opposite function. It writes the superblock and group descriptor data structures back to the disk and then finalizes the unmount by removing the file system from running memory.

ext2_lookup is the first complex function that saw a major overhaul from its roots in CROCOS. ext2_lookup is called when a file is first opened and behaves differently based on the flags provided by the user. Its main goal is to locate the i-node number of the file with the given name. It does this by looping through all directory entries within the parent i-node that it accepts as an input. Once the file is found within its parent directory, the function immediately returns the file's i-node number. If the file isn't found, the function then checks the open flags to determine if it needs to create the file. If the file does need to be created, the function performs the following steps:

• Find and allocate a free i-node to store the data of the file.

• Update the previous direntry in the parent directory to reflect its new record length (this needs to be done to ensure that looping through all blocks in the parent i-node doesn't run infinitely).

• Update the parent i-node with the new direntry.

• Update the superblock and block group.

If all of these steps are completed successfully, the newly allocated i-node number is returned.
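
A sketch of this create path is shown below, with one hypothetical helper per step; the helper names and signatures are illustrative rather than the actual LlamaOS functions.

    #include <stdint.h>

    extern int  ext2_inode_allocate(void);                      /* step 1: reserve an i-node     */
    extern int  shrink_last_direntry(uint32_t parent_inode,     /* step 2: trim the previous     */
                                     uint32_t *free_offset);    /*         entry's record length */
    extern int  append_direntry(uint32_t parent_inode,          /* step 3: add the new entry     */
                                uint32_t offset, uint32_t new_inode, const char *name);
    extern void write_back_superblock_and_group(void);          /* step 4: persist counters      */

    /* Returns the newly allocated i-node number, or a negative error code. */
    static int ext2_create_in_parent(uint32_t parent_inode, const char *name)
    {
        int new_inode = ext2_inode_allocate();
        if (new_inode < 0)
            return new_inode;

        uint32_t offset;
        if (shrink_last_direntry(parent_inode, &offset) < 0)
            return -1;                       /* ensures the directory scan terminates */

        if (append_direntry(parent_inode, offset, (uint32_t)new_inode, name) < 0)
            return -1;

        write_back_superblock_and_group();   /* free-inode counts changed */
        return new_inode;
    }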

ext2_fill_stat takes an i-node number and a stat structure buffer as inputs. The stat buffer is then filled with all elements within the i-node that pertain to the stat structure. This function serves to allow i-node metadata to be read by the VFS. It returns 0 on success.

ext2_read has not gone through significant changes from the CROCOS version. The inputs are an i-node, the block offset, the number of bytes to read, and a buffer to store the read data in. It begins by calculating the first block to read. After determining where to start, it pulls all blocks storing the file using get_file_blk until the requested number of bytes has been read. All of this data is then stored in the buffer to be used later.

ext2_write is a completely new addition to the CROCOS implementation. Similar to ext2_read, the inputs are an i-node, the block offset, the number of bytes to write, and a buffer containing the data to be written. This function determines where in the file to start and then uses get_file_blk to repeatedly find the blocks it needs to write to until all of the data has been written. It may return an error if there is more data to write than space remaining on the disk.
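
The block-at-a-time loop that ext2_write performs is sketched below. The block size, the exact signature of get_file_blk, and the cache handling are simplified assumptions; only the overall loop mirrors the description above.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 512u   /* the test images in this thesis use 512-byte blocks */

    /* Hypothetical helper: returns a pointer to the cached block `blk` of the
     * file, allocating a fresh block when `allocate` is non-zero. */
    extern uint8_t *get_file_blk(uint32_t inode, uint32_t blk, int allocate);

    static size_t ext2_write_sketch(uint32_t inode, uint32_t offset,
                                    const uint8_t *data, size_t count)
    {
        size_t written = 0;
        while (written < count) {
            uint32_t blk = (offset + written) / BLOCK_SIZE;   /* which file block   */
            uint32_t off = (offset + written) % BLOCK_SIZE;   /* position inside it */
            size_t   n   = BLOCK_SIZE - off;                  /* room left in block */
            if (n > count - written)
                n = count - written;

            uint8_t *cache = get_file_blk(inode, blk, 1);     /* allocate if needed */
            if (cache == NULL)
                return written;          /* disk full: report how much was written  */

            memcpy(cache + off, data + written, n);           /* copy into the block */
            written += n;
        }
        return written;
    }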

ext2_readdir takes a directory i-node, a block offset, and a destination directory entry as inputs and performs five simple tasks. First it retrieves the directory i-node. It then uses the directory i-node and the block offset to find the directory block. Next it retrieves the current directory entry by copying it into a small buffer cache. It performs a quick check to see if the entry is corrupt before finally copying the current directory entry to the destination entry. It returns the record length of the directory entry after the current entry has been copied over.


disk_blk_to_cache is a very simple function that copies a block into the block cache using a disk read.

This is only used during read function calls.

fill_inode is an important function that reads an i-node from disk and fills in an i-node data structure so that the information can be easily accessed.

update_inode is the counterpart to fill_inode. Where fill_inode reads an i-node from disk, update_inode writes an i-node data structure back to the disk.

ext2_find_free_block is a function that needed to be added to the CROCOS implementation to enable writing. It is a somewhat complex function that performs a very simple task: finding a free block to allocate the new data to. It starts by checking the current block group's block bitmap to see if any blocks are free there; this way, the data blocks in a file should maintain spatial locality. If there are no free blocks left in the current block group, it uses the block group table to search through block groups until it finds a free block to allocate the data to. Once a free block is found, it updates both the superblock and the block bitmap of the group the block was found in to record that the block is no longer free. It returns the location of the block on success.
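
The bitmap scan described above can be sketched as follows: check the current block group first for spatial locality, then walk the remaining groups. The helpers, the bit layout, and the block numbering are simplified assumptions, not the LlamaOS implementation.

    #include <stdint.h>

    extern uint8_t *load_block_bitmap(uint32_t group);   /* one bit per block in the group */
    extern uint32_t blocks_per_group(void);
    extern uint32_t group_count(void);
    extern void     mark_block_used(uint32_t group, uint32_t index);  /* bitmap + superblock */

    /* Returns the block number of a newly reserved block, or 0 if the volume is
     * full (block numbering is simplified here). */
    static uint32_t find_free_block_sketch(uint32_t preferred_group)
    {
        uint32_t groups = group_count();
        for (uint32_t i = 0; i < groups; i++) {
            uint32_t group  = (preferred_group + i) % groups;  /* current group first */
            uint8_t *bitmap = load_block_bitmap(group);

            for (uint32_t idx = 0; idx < blocks_per_group(); idx++) {
                if (!(bitmap[idx / 8] & (1u << (idx % 8)))) {  /* clear bit = free block   */
                    mark_block_used(group, idx);               /* update bitmap and counts */
                    return group * blocks_per_group() + idx + 1;
                }
            }
        }
        return 0;   /* no free block anywhere */
    }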

ext2_find_free_inode performs a similar function to the ext2_find_free_block function. The only difference is that it is looking for a free i-node. It performs the same steps as ext2_find_free_block, but it uses the i-node bitmap instead of the block bitmap. On success it returns the location of the i-node so that it can be loaded into a data structure.

ext2_block_allocate begins by using ext2_find_free_block to find a free block. It then attempts to allocate this block to the file's i-node by adding it to the i-node's block array. It attempts to do this directly, but if the i-node already contains 12 direct blocks it needs to allocate this block indirectly. If the indirection block has already been allocated, the new block is simply added to its contents. If the indirection block hasn't been added yet, the new block is made the indirection block and another free block is found to store the data.

After all blocks have been allocated to the i-node, the i-node is updated and the function completes.

ext2_inode_allocate first uses ext2_find_free_inode to find a free i-node. It then uses fill_inode to place that i-node into a usable data structure. Next it initializes all necessary values in the i-node before finding a free block to add as the i-node's starting block. Finally it calls update_inode to save all the new information in the i-node before returning the i-node number.


get_file_blk is used for both reading and writing. The main purpose of this function is to find the next block to be used. For inputs it takes an i-node, a block number, an i-node number, and a flag that determines whether or not a free block should be allocated. For reads it performs the relatively simple function of using the i-node and the block number to load that block into the buffer cache. For writes it passes its information through to ext2_block_allocate so that it can find the proper block to write to.

5.2 Benchmark Modifications

In order for the bonnie benchmark to work within LlamaOS, a few modifications needed to be made to the test. The first major modification was to the size limitation of the benchmark. By default the minimum size that the bonnie benchmark can handle is 1MB. The current implementation of the

LlamaOS file system has a maximum size of 64KB, so bonnie needed to be modified to allow for the smaller

file system. This was a simple single-line change that involved removing a multiplication by 1024.

The second major modification was to the random seek timing. This portion of the benchmark attempts to launch several separate processes to force the disk to seek as many times as possible, as fast as possible.

Unfortunately, LlamaOS currently does not have support for multiple processes or threads, so this portion of the test was removed. As I am not comparing my performance results to other file systems, and am simply running the benchmark as a functionality test, this portion was unneeded anyway. Disk seeks occur during both reading and writing, which the benchmark also performs.

The final change was a simple replacement of the timing functions used. The bonnie benchmark uses rusage to determine how long operations take. This facility is not available in LlamaOS, so a time calculation using gettimeofday was used to determine the number of milliseconds each operation takes. For this use case the two are functionally the same.
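
A minimal sketch of the gettimeofday-based timing that replaced rusage is shown below; it is a generic example of the technique, not the code that was added to the port.

    #include <stdio.h>
    #include <sys/time.h>

    /* Elapsed time between two gettimeofday() samples, in milliseconds. */
    static long elapsed_ms(const struct timeval *start, const struct timeval *end)
    {
        return (end->tv_sec  - start->tv_sec)  * 1000L +
               (end->tv_usec - start->tv_usec) / 1000L;
    }

    int main(void)
    {
        struct timeval start, end;

        gettimeofday(&start, NULL);
        /* ... timed file system operation goes here ... */
        gettimeofday(&end, NULL);

        printf("operation took %ld ms\n", elapsed_ms(&start, &end));
        return 0;
    }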

Chapter 6

Performance Results

This chapter examines in depth the results of the various tests run within LlamaOS to verify the functionality of the LlamaOS file system. First it will explain the results of the bonnie benchmark and put them into context. Then it will show the results of running a few different WARPED models to demonstrate a practical use of the LlamaOS file system. All tests were run on a system with the specifications shown in

Table 6.1.

Processor          AMD FX(tm)-8120
Cores              8
Speed              3.1 GHz
Memory             16 GB
Operating System   3.14.14-gentoo

Table 6.1: Test System Specifications

6.1 File System Benchmarks

Before presenting the data gathered from the bonnie benchmark, the format in which bonnie presents its results needs to be explained. The bonnie benchmark breaks its results down into three main categories.

These categories are:

• Sequential Output

• Sequential Input


• Random Seeks

As the Random Seeks portion of the benchmark requires multi-threading support, the results from it were ignored and will not be reported. The other two sections were completed in their entirety.

The Sequential Output section is further broken down into three more sections which are:

• Per Character: measures the bytes written using the putc() function per second.

• Block: measures the amount of file system bytes allocated per second.

• Rewrite: measures the speed at which a file can be read, dirtied, and then re-written, in bytes per second. This portion of the test performs an lseek, which exercises the seek functionality of the file system without the need to measure the speed at which the seeks themselves perform.

The Sequential Input section is broken down similarly, but focuses on reading files instead of writing them. It measures the bytes read per second per character and the block bytes read per second.
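
As a point of reference, the two sequential-output styles that bonnie times can be illustrated with the following C sketch: one byte at a time through putc(), then the same amount of data a block at a time through write(). The sizes and file name are placeholders rather than bonnie's defaults.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define FILE_SIZE 4096    /* e.g. the 4KB test file size used in these tests */
    #define CHUNK     512

    int main(void)
    {
        /* Per-character output: one putc() call per byte written. */
        FILE *f = fopen("bonnie.tmp", "w");
        if (f == NULL)
            return 1;
        for (int i = 0; i < FILE_SIZE; i++)
            putc(i & 0x7f, f);
        fclose(f);

        /* Block output: the same amount of data written a block at a time. */
        char buf[CHUNK];
        memset(buf, 'x', sizeof buf);
        int fd = open("bonnie.tmp", O_WRONLY | O_TRUNC);
        if (fd < 0)
            return 1;
        for (int i = 0; i < FILE_SIZE / CHUNK; i++)
            write(fd, buf, sizeof buf);
        close(fd);
        return 0;
    }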

In order to obtain the following results, the bonnie benchmark was run a total of 30 times. Of those 30 executions, 10 were run with a file size of 4KB, 10 with a file size of 8KB, and 10 with a file size of 12KB. Running each size 10 times was an arbitrary choice made to ensure that the results were stable. The file sizes were chosen to show how the performance of the system scales. This allows the system to display stable numbers for the number of bytes written per second per character while at the same time showing how block allocation scales as the file size increases.

                      Sequential Output                               Sequential Input
FileSize    Per Char          Per Block          Re-write            Per Char          Per Block
KB          B/s      CPU      B/s       CPU      B/s       CPU       B/s      CPU      B/s       CPU
4           304.5    100      23609.6   100      24513.2   100       723.5    100      20339.1   100
8           304.1    100      43918.1   100      46494.1   100       740.2    100      40938.4   100
12          304.2    100      63926.6   100      65947.1   100       738.4    100      61399.6   100

Table 6.2: Averages of bonnie results by file size

                      Sequential Output                               Sequential Input
FileSize    Per Char          Per Block          Re-write            Per Char          Per Block
GB          MB/s     CPU      MB/s      CPU      MB/s      CPU       MB/s     CPU      B/s       CPU
1           61.0     56.2     103.5     15.9     111.7     9.7       76.7     100.0    5278.2    100

Table 6.3: Native bonnie results for test system


Figure 6.1: Bar Chart Demonstrating Increase in Speed with File Size

As Table 6.2 shows, the file system writes characters at roughly 304 bytes per second regardless of the file size. Per-character reading also shows a relatively stable measurement in the low 700s of bytes per second. Where the results differ is in the number of block bytes allocated per second. As the file size increases, the number of block bytes allocated per second also increases. The upper limit I found while running my benchmarks is reflected in the 12KB file size test; beyond that point the increases in speed drop off significantly.

Table 6.3 is shown for comparison. These results were obtained by running the bonnie benchmark natively on the test machine. It can be noted that the native implementation allows for a much larger file size and is able to perform at a much faster speed than the LlamaOS implementation. This is expected, as the

LlamaOS implementation is meant to be minimal to keep the memory footprint of LlamaOS light.

These results show that the file system is able to read, create, write, rewrite, and seek within files properly. The speed of the file system and its implementation still leave significant room for improvement, but the current implementation is functional enough to allow for more complex applications that require

filesystem I/O to run on LlamaOS.


6.2 WARPED Tests

One of the key reasons for adding a file system to LlamaOS was to provide a cleaner and more stable solution for running WARPED models. When LlamaMPI was added, WARPED was chosen as a case study to demonstrate the performance gains that could be achieved with an extremely low-latency communication system between nodes. At the time, the number of models that could be used by LlamaOS was severely limited by the configuration files required by each model.

The most basic simulation models require at the very least either a parallel or a sequential .json file that specifies various configuration options, including but not limited to: the number of worker threads, the method with which threads are synced, how load balancing is handled, what type of scheduler is used, and how states, communication, and output are managed. In the old single-file file system this was the single configuration

file that was loaded into LlamaOS to allow for the model to run. The current implementation of the LlamaOS

File System now allows both the parallel.json and sequential.json files to be stored on a single Ext2-formatted disk image, along with any additional configuration files.

To set up the environment for the tests to run, WARPED, the WARPED models, and LlamaOS were all cloned from their respective repositories and placed in a common parent directory. A custom-vars.mk file was then added to LlamaOS to allow it to find the WARPED and WARPED models source code at compile time. Next, to aid with test execution, various scripts were created that automated the basic functions needed to meet the base requirements of WARPED. These scripts performed tasks such as creating the configuration files for the WARPED models; creating and formatting a disk image and copying the WARPED configuration files onto it; and launching multiple concurrent LlamaApp instances to act as nodes for the various WARPED models.

First, as a simple proof of concept, the most basic model, the ping-pong model, was run within LlamaOS to show that the file system implementation was working and that the external WARPED code could be run within a LlamaOS application. The ping-pong model performs a single round of communication between two nodes on the cluster. A simple script was created to launch two instances of ping-pong LlamaApps and configure their shared memory. The ping-pong test was run a total of 10 times and the completion time for the 10 runs was recorded. The average completion time of a ping-pong execution was 0.722 seconds. This showed that WARPED models were able to run within LlamaOS and that the configuration files could be read

from the file system implementation.

The next WARPED model used was the epidemic model. In Parallel Discrete Event Simulation, an epidemic such as a disease outbreak can be simulated by abstracting the Logical Processes and their states into a model of population subsets and their respective infection levels [34]. The WARPED model performs a similar function. This requires communication between multiple nodes to determine the infection rate of the disease and, more importantly for our purposes, it requires an additional configuration file. This additional configuration file contains various information regarding the speed at which the disease is diffused and other disease-specific parameters. The epidemic model also records various data into a .csv file for later viewing. Both of these features made it a perfect candidate to test the new capabilities of the LlamaOS File

System.

Similarly to the ping-pong model, all configuration files were generated and loaded onto an Ext2-formatted disk image. Three scripts were then written to perform the actual execution of the testing. These scripts were epidemic-2.sh, epidemic-4.sh, and epidemic-8.sh. The number at the end of each file name represents the number of LlamaApp instances that are launched. For this test, the epidemic model was run with 2, 4, and 8 node configurations. Each script launches its requisite number of LlamaApps along with a shared memory configuration to ensure that the nodes of the LlamaNET can communicate. Each test was run a total of 10 times and then averaged to ensure that the data was stable. Table 6.4 shows the average run-time and number of rollbacks for each node variation. It also compares these numbers to the simulation running natively on the same machine with the number of threads equal to the number of nodes.

Multiple configurations of threaded WARPED running in a native environment were attempted to obtain the best possible performance for comparison. This included allowing for load balancing between the worker threads, as well as changing the number of LTSF queues. All data shown for comparison in Table 6.4 was obtained with the number of LTSF queues equal to the number of worker threads, and with load balancing off.

Since LlamaOS has yet to include threading, it makes up for this by being able to run multiple instances and create a virtual cluster. The closest comparison that could be made to this using a single machine was to match the number of worker threads to the number of nodes. In the case of the epidemic model, LlamaOS showed a significant performance improvement when run on the same machine as a threaded WARPED


simulation. Currently the memory cap for LlamaOS forces it to max out at 8 nodes, but this still boasts a significant improvement over threaded WARPED running with 16 and 32 worker threads. After 32 threads the run-time appears to reach its limit.

                 LlamaOS                      Native
Node/Threads     Run-Time (s)   Rollbacks     Run-Time (s)   Rollbacks   Speed-up
2                50.76541       2.2           207.7846       24.6        24.4%
4                34.29771       0             173.0005       1293.2      19.8%
8                19.81948       0             147.2088       150.4       13.5%
16               -              -             131.9195       1064.2      -
32               -              -             128.3758       694.5       -

Table 6.4: Comparison between LlamaOS and Native for the epidemic model

These results show that the epidemic model is functioning properly, and at a significant performance improvement over a native solution. They also show a decrease in execution time as the number of nodes increases. This is expected, as more resources are dedicated to running the simulation model. The machine these tests were run on was fairly limited from a memory standpoint. Tests run on a machine with a larger memory capacity could allow for more than 8 nodes, which would potentially improve performance beyond its current level. The top speed at which LlamaOS can run these simulation models has yet to

Figure 6.2: Comparison of Native and LlamaOS Simulation Times

be found due to hardware limitations.

In addition to the performance gains attained, a .csv file containing data relevant to the simulation was also written to the disk image supplied to LlamaOS, proving that the create and write aspects of the file system are functional. Ultimately this means that more complex simulations, and applications in general, can now be run using LlamaOS.

Chapter 7

Conclusion and Future Work

7.1 Summary

LlamaOS provides a great foundation for exploring various aspects of research within the world of fine-grained parallel discrete event simulation. With the addition of LlamaNET and LlamaMPI, LlamaOS was able to show that it could add significant performance gains to certain simulations by reducing the latency of network communications. The results shown in this document are yet another example of LlamaOS evolving. As verified by the bonnie benchmark and a case study using a couple of WARPED simulation models, LlamaOS now has a functional file system, which will allow it to run more complex simulation models, including those that require multiple configuration options and those that need to store their results to a file. This is a powerful inclusion and hopefully one that will be able to help with future research.

7.2 Suggestions for Future Work

Although LlamaOS can already provide significant performance gains when compared to native solutions, it still has room for improvement. This section will discuss further improvements that I believe could be made to both LlamaOS and my current file system implementation. However, when dealing with an operating system that aims to use the smallest memory footprint possible, each addition can also be a drawback. This needs to be kept in mind when weighing the benefits that the following improvements may provide.


7.2.1 LlamaOS

LlamaOS already excels at providing low-latency network communications and a cheap virtual cluster to perform tests on. It still has some issues when it comes to library compatibility and basic operating system features. The main feature that still needs to be integrated is the ability to fork processes and create multiple threads. LlamaOS currently functions in a single-process, single-thread environment, and this has various limitations when it comes to both program compatibility and program execution. A major compatibility issue that was run into throughout the course of this project was the inclusion of the bonnie benchmark.

As simple as the bonnie benchmark is, it still required a few modifications to be able to run successfully within LlamaOS. A few of these modifications were superficial, such as swapping out one time counter for another whose library is supported by LlamaOS. These modifications do not cause any significant loss of performance, but they were a minor annoyance while debugging compatibility. The big modification that needed to be made had to do with the ability to fork multiple processes: an entire section of the bonnie benchmark was unable to be run on LlamaOS because it required the use of multiple processes.

The bonnie benchmark is not the only application that requires multiple processes to function properly.

Throughout the course of his research with LlamaMPI, John Gideon ran into issues porting WARPED and its models to LlamaOS. He stated that he also needed to make concessions due to LlamaOS’s lack of multi-threading. In his conclusion he mentions that future work could place LlamaMPI into a background thread to improve buffering and scalability [20]. Future work should certainly move in the direction of adding multi-threading to LlamaOS. It is the next logical step in the evolution of the platform.

7.2.2 File System Implementation

There are still significant improvements to be made to the LlamaOS File System. These improvements range from adding additional functionality to the Ext2 driver to adding entirely new file system drivers. The current implementation of the Ext2 driver only provides extremely basic functionality. It is certainly an improvement over the flat single-file file system that LlamaOS was using before, but it still leaves room for improvement. The current implementation is built for only the most basic file I/O. This keeps the driver lightweight while still providing a solid amount of functionality. The operations the Ext2 driver can

perform are read, write, create, and re-write. It also allows for directories and multiple files. This works perfectly for minimalistic applications that don’t need to perform significant file I/O; however, for more complex applications additional functionality should be added. These additional functions can include, but are not limited to: symbolic link creation, file deletion, file copying, and file moving. Adding these additional functions would provide LlamaOS with better compatibility for more complex applications, but it may not be worth implementing them until an application that requires these functions appears.

In addition to the Ext2 driver, more functionality could also be added to the Virtual File System (VFS) layer that the core of LlamaOS communicates with. Although throughout the course of my research I determined that Ext2 would be the best file system to implement first in LlamaOS, there are various other

file systems that I researched that could also be useful for either the minimalistic qualities they provide

(MinixFS) or for additional compatibility with outside data (FAT). The more file system drivers that LlamaOS has access to, the easier compatibility with outside applications will be. The LlamaOS VFS is already built to support multiple file systems, so the addition of new ones should only be as difficult as adding their drivers.

Ultimately, LlamaOS is already in great condition, and with the exception of multi-threading, most future improvements will only aid outside application compatibility. It will be interesting to see which direction

LlamaOS is taken in the future.

Bibliography

[1] NTFS.com. https://www.ntfs.com/. Accessed: 2014-06-16.

[2] OSDev. https://www.osdev.org/. Accessed: 2014-06-16.

[3] The Linux Kernel Archives. https://www.kernel.org/. Accessed: 2014-06-16.

[4] VirtualBox by Oracle. https://www.virtualbox.org/. Accessed: 2014-06-16.

[5] Microsoft FAT specification, Aug. 30 2005. File System Specification.

[6] K. Adams and O. Agesen. A comparison of software and hardware techniques for x86 virtualization. SIGARCH Comput. Archit. News, 34(5):2–13, Oct. 2006.

[7] E. Altieri and P. N. Howe. The ext2 filesystem, June 2002.

[8] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. SIGOPS Oper. Syst. Rev., 37(5):164–177, Oct. 2003.

[9] W. A. Bhat and S. Quadri. Review of FAT data structure of FAT32 file system. Oriental Journal of Computer Science & Technology, 3(1), 2010.

[10] T. Bray. Bonnie benchmark, 1988. Google Code.

[11] E. Bugnion, S. Devine, and M. Rosenblum. System and method for virtualizing computer systems, Dec. 17 2002. US Patent 6,496,847.

[12] R. Card, T. Ts’o, and S. Tweedie. Design and implementation of the Second Extended Filesystem. Proceedings of the First Dutch International Symposium on Linux.

[13] P. Chaganti. Xen Virtualization: A Fast and Practical Guide to Supporting Multiple Operating Systems with the Xen Hypervisor. Packt Publishing, 2007.

[14] D. Chisnall. The Definitive Guide to the Xen Hypervisor. Prentice Hall open source software development series. Prentice Hall, 2008.

[15] G. Duranceau. The CROCOS kernel, Apr. 2009.

[16] D. R. Engler, M. F. Kaashoek, and J. O’Toole, Jr. Exokernel: An operating system architecture for application-level resource management. SIGOPS Oper. Syst. Rev., 29(5):251–266, Dec. 1995.

[17] A. Ferscha and S. K. Tripathi. Parallel and distributed simulation of discrete event systems. Technical report, College Park, MD, USA, 1994.

[18] R. M. Fujimoto. Parallel discrete event simulation. Commun. ACM, 33(10):30–53, Oct. 1990.

[19] D. Giampaolo. Practical File System Design with the Be File System. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 1998.

[20] J. Gideon. The integration of llamaos for fine-grained parallel simulation. Master’s thesis, University of Cincinnati, 2013.

[21] B. Goodheart and J. Cox. The Magic Garden Explained: The Internals of UNIX System V Release 4: an Open Systems Design. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1994.

[22] J. H. Howard, M. L. Kazar, S. G. Menees, D. A. Nichols, M. Satyanarayanan, R. N. Sidebotham, and M. J. West. Scale and performance in a distributed file system. ACM Transactions on Computer Systems (TOCS), 6(1):51–81, 1988.

[23] D. Jefferson, B. Beckman, F. Wieland, L. Blume, and M. Diloreto. Time warp operating system. SIGOPS Oper. Syst. Rev., 21(5):77–93, Nov. 1987.

[24] D. R. Jefferson. Virtual time. ACM Trans. Program. Lang. Syst., 7(3):404–425, July 1985.

[25] G. Kecskemeti, G. Terstyanszky, P. Kacsuk, and Z. Nemth. An approach for virtual appliance distribution for service deployment. Future Generation Computer Systems, 27(3):280–289, 2011.

[26] S. J. Leffler, M. J. Karels, and M. K. McKusick. The Design and Implementation of the 4.3BSD Unix Operating System. Addison-Wesley, Reading, MA, 1989.

[27] J. Liedtke. On micro-kernel construction. SIGOPS Oper. Syst. Rev., 29(5):237–250, Dec. 1995.

[28] J. Liu, J. Wu, and D. K. Panda. High performance RDMA-based MPI implementation over InfiniBand. Int. J. Parallel Program., 32(3):167–198, June 2004.

[29] W. A. Magato and P. A. Wilsey. llamaos: A solution for virtualized high-performance computing clusters. In Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, IPDPSW ’14, pages 1140–1149, Washington, DC, USA, 2014. IEEE Computer Society.

[30] D. E. Martin, P. A. Wilsey, R. J. Hoekstra, E. R. Keiter, S. A. Hutchinson, T. V. Russo, and L. J. Waters. Redesigning the WARPED simulation kernel for analysis and application development. In Proceedings of the 36th Annual Symposium on Simulation, ANSS ’03, pages 216–, Washington, DC, USA, 2003. IEEE Computer Society.

[31] A. Mathur, M. Cao, S. Bhattacharya, A. Dilger, A. Tomas, and L. Vivier. The new ext4 filesystem: current status and future plans. In Proceedings of the Linux Symposium, volume 2, pages 21–33. Citeseer, 2007.

[32] M. F. Mergen, V. Uhlig, O. Krieger, and J. Xenidis. Virtualization for high-performance computing. SIGOPS Oper. Syst. Rev., 40(2):8–11, Apr. 2006.

[33] W. D. Norcott and D. Capps. Iozone filesystem benchmark. URL: www.iozone.org, 2003.

[34] K. S. Perumalla and S. K. Seal. Discrete event modeling and massively parallel execution of epidemic outbreak phenomena. Simulation, page 0037549711413001, 2011.

[35] D. Poirier. The Second Extended File System. Foundation, 2011.

[36] G. J. Popek and R. P. Goldberg. Formal requirements for virtualizable third generation architectures. Commun. ACM, 17(7):412–421, July 1974.

[37] P. Reiher, S. Bellenot, and D. Jefferson. Temporal decomposition of simulations under the time warp operating system. In Proc. SCS Parallel and Distributed Simulation Conference, pages 47–54, 1991.

[38] P. L. Reiher, F. Wieland, and D. Jefferson. Limitation of optimism in the time warp operating system. In Proceedings of the 21st Conference on Winter Simulation, WSC ’89, pages 765–770, New York, NY, USA, 1989. ACM.

[39] T. Sterling, E. Lusk, and W. Gropp, editors. Beowulf Cluster Computing with Linux. MIT Press, Cambridge, MA, USA, 2nd edition, 2003.

[40] J. Sugerman, G. Venkitachalam, and B.-H. Lim. Virtualizing I/O devices on VMware Workstation’s hosted virtual machine monitor. In USENIX Annual Technical Conference, General Track, pages 1–14, 2001.

[41] A. S. Tanenbaum and A. S. Woodhull. Operating Systems Design and Implementation (3rd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2005.

[42] D. Tang and M. Seltzer. Lies, damned lies, and file system benchmarks. VINO: The 1994 Fall Harvest, 1994.

[43] A. Traeger, E. Zadok, N. Joukov, and C. P. Wright. A nine year study of file system and storage benchmarking. Trans. Storage, 4(2):5:1–5:56, May 2008.
