Masaryk University Faculty of Informatics

Filesystem Benchmark Tool

Bachelor’s Thesis

Tomáš Zvoník

Brno, Spring 2019

This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Tomáš Zvoník

Advisor: RNDr. Lukáš Hejtmánek, Ph.D.

Acknowledgements

I would like to thank my thesis advisor Lukáš Hejtmánek for guiding me through the process of creating this work. I would also like to thank my family and friends for being by my side, providing support, laughs and overall a nice environment to be in. Lastly, I would like to thank my parents for supporting me throughout my whole life and giving me the opportunity to study at a university.

Abstract

In this thesis I have created a filesystem benchmark tool that combines the best features of already existing tools. It can measure read/write speeds as well as the speed of metadata operations. It can run on multiple threads and on multiple network-connected nodes. I have then used my benchmark tool to compare the performance of different storage devices and file systems.

Keywords

benchmark, file system, iozone, fio, bonnie++

Contents

Introduction
1 Existing benchmarks
  1.1 IOzone
  1.2 Fio
  1.3 Bonnie++
  1.4 Comparison of features provided by different benchmarks
2 My benchmark
  2.1 Used libraries
      2.1.1 LibSSH
  2.2 How it works
      2.2.1 User input
      2.2.2 Reporting to user
      2.2.3 Multinode communication
      2.2.4 Performing a benchmark
      2.2.5 Benchmarks
3 Using the software
  3.1 Comparison of different benchmarks in read/write measurements
  3.2 Comparison of different file systems using my benchmark
  3.3 Comparison of HDD vs SSD performance
4 Future work
5 Conclusion
Bibliography

Introduction

Everyone who uses a computer also uses storage devices. Whether it be flash drives, hard drives or the cloud, we expect them to work and to be reasonably fast when it comes to read/write speeds. Read speed means how fast a file can be read by the computer and write speed means how fast it can be written to.

The average user can look at the speeds advertised by the manufacturer and expect the storage device to operate more or less at that speed. Unlike these users, system administrators cannot just look at the box their hard drive came in and accept the speeds written there. They might put the drives in RAID arrays1, use a different file system than the manufacturer, or have multiple machines connected to the storage at the same time. All of these practices might have an impact on the storage device's performance. You also cannot entirely trust the manufacturer's stated speeds: performance is usually tested in a laboratory under perfect conditions, so it is almost impossible to achieve the same results in real-world environments. System administrators need to know exactly what performance they can expect from their machines, which is why benchmarks2 exist.

The goal of my thesis is to create a benchmarking tool that combines the best parts of already existing benchmarking solutions. My benchmark can measure read/write speeds as well as the speed of metadata operations3. The software is also able to run on multiple network-connected computers simultaneously, to test the performance of storage space which is connected to multiple computers at once. All parameters (file size, file count, number of threads and number of connected computers to run on) are configurable by the user.

1. RAID is short for redundant array of independent disks. Originally, the term was defined as redundant array of inexpensive disks, but now it usually refers to a redundant array of independent disks. RAID storage uses multiple disks in order to provide fault tolerance, to improve overall performance, and to increase storage capacity in a system. This is in contrast with older storage devices that used only a single disk drive to store data. [1]
2. In computing, a benchmark is the act of running a computer program, a set of programs, or other operations, in order to assess the relative performance of an object, normally by running a number of standard tests and trials against it. [2]
3. The speed of metadata operations such as move, delete and create.

I have divided my thesis into four main chapters. In the first chapter I write about existing benchmarks and how mine is different. In the second chapter I go into more detail about my implementation. In the third chapter I have compiled results from different measurements of storage performance and describe how different strategies of storage management impact usability. In the last chapter I talk about what could be improved upon in my software.

1 Existing benchmarks

There are of course existing solutions for benchmarking storage performance. I will introduce three well-known benchmark tools for Unix-like operating systems1. The most commonly used feature of these benchmarks is measuring read/write performance, which is a functionality all of these solutions provide. Not all of them, however, provide the option to measure the speed of metadata operations, which can also be very useful to know.

1.1 IOzone

IOzone can measure the following tasks: read, write, re-read, re-write, read backwards, strided read, random read/write, and read/write with offset. I will focus only on read and write because my benchmark does not implement the other tests. Read measures the speed of reading data from a file with a specified block size2. Write measures the speed of writing data to a file with a specified block size. IOzone can also run these benchmarks on multiple threads and nodes. It uses only one file per thread and must be run with the number of threads equal to or greater than the number of nodes the benchmark is run on. This benchmark tool is written in ANSI C and uses POSIX3 standards, so it is possible to run it on just about anything. It is considered one of the best open source file system benchmarks.

1. The term Unix-like is widely used to describe operating systems that share many of the characteristics of the original UNIX, which was written in 1969 by Ken Thompson at Bell Labs, and its early successors. [3] These include Linux, BSD and MacOS.
2. Block size means the size of data that can be read from/written to a file at one time.
3. POSIX is an acronym for Portable Operating System Interface for UNIX, a set of IEEE and ISO standards that define an interface between programs and operating systems. [4]


Other useful features include export of results into an Excel-importable file, the possibility of async4 and DIRECT5 I/O operations, the choice of whether to include close() and fsync() in the measurement, and so on. [6] Of course it is also possible to specify the desired file size and block size; however, it is not possible to specify the number of files to be used, which is fixed at one file per thread. IOzone uses RSH to connect to network clients and execute commands by default, but can be configured to use SSH. For network communication to work, connections must not prompt for a password when connecting via RSH/SSH. Via this connection benchmarks are not just started, but synchronized as well, ensuring that every benchmark starts on all nodes at the same time. Multithreaded benchmarking is managed by using the POSIX thread library pthread. Since IOzone uses the POSIX library, the standard open(), write() and read() functions are used to work with files. DIRECT I/O is accomplished by passing the O_DIRECT flag when opening a file descriptor.

DIRECT I/O is a way of accessing a storage device directly, without the use of cache. This can be useful when you want to measure performance when the cache runs out or if you use direct access by default on your server. Direct access can be useful when the storage device is faster than system RAM. This situation can happen when you have thousands of hard drives in a RAID array (provided your RAID controller is fast enough), or more recently even by having only a few NVMe SSDs in RAID 0.

ASYNC I/O is done by calling the functions aio_write and aio_read, which are both defined in the standard header aio.h. These functions create a queue of I/O operations with the possibility of specifying the importance of each operation. This queue is then processed in the background and the main program can find out if/which operations have been completed when it needs to. [7]

4. Async I/O means the I/O operation is run asynchronously, meaning it does not have to wait for the program to tell it what to do. The I/O operation is run “in the background” while the main program continues. The program can then ask if the operation has been completed before continuing. 5. DIRECT I/O operations write directly to the disk without utilizing RAM as cache. [5]
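As an illustration of these two modes, here is a minimal sketch (not taken from the IOzone sources) of how DIRECT and asynchronous I/O are requested through the POSIX interface; file names, sizes and error handling are simplified.

#include <aio.h>       // aio_write, aio_error, aio_return
#include <fcntl.h>     // open, O_DIRECT (g++ on Linux defines _GNU_SOURCE)
#include <unistd.h>    // write, close
#include <cerrno>      // EINPROGRESS
#include <cstdlib>     // posix_memalign, free
#include <cstring>     // memset

int main() {
    // DIRECT I/O: O_DIRECT bypasses the page cache; the buffer, offset and
    // length must be aligned to the device block size (4096 B used here).
    void *buffer = nullptr;
    posix_memalign( &buffer, 4096, 4096 );
    std::memset( buffer, 'a', 4096 );

    int dfd = open( "direct_file", O_WRONLY | O_CREAT | O_DIRECT, 0644 );
    write( dfd, buffer, 4096 );
    close( dfd );

    // ASYNC I/O: aio_write only queues the request and returns immediately,
    // the program can poll aio_error() to see whether it has completed.
    int afd = open( "async_file", O_WRONLY | O_CREAT, 0644 );
    struct aiocb cb;
    std::memset( &cb, 0, sizeof( cb ) );
    cb.aio_fildes = afd;
    cb.aio_buf    = buffer;
    cb.aio_nbytes = 4096;
    cb.aio_offset = 0;
    aio_write( &cb );

    while ( aio_error( &cb ) == EINPROGRESS ) {
        // the main program could do useful work here instead of waiting
    }
    aio_return( &cb );    // collect the result of the finished operation

    close( afd );
    free( buffer );
    return 0;
}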


IOzone is written mainly in C and is distributed under its own open license. This license permits anyone to use and distribute the software, but the author keeps the exclusive right to publish any and all works derived from the original source code.

1.2 Fio

Fio is a very versatile benchmark tool that can be run on multiple systems (including Linux, Solaris, MacOS, BSD, Windows, …). It is configured via job files where nearly anything can be configured, from basic things like file size and name to very advanced options like which function to use for preallocation of space on disk before creating a file. Fio runs read/write benchmarks (both sequential and random6), but does not perform one for metadata. One of the defining features of Fio is its use of I/O engines, which define the set of functions that are to be used during the benchmark. There are many of these engines, so basically any workflow can be emulated. I/O engines range from general I/O functions like POSIX to very specific libraries like GFAPI, which uses libgfapi for direct access to GlusterFS volumes. Fio supports multithreaded benchmarking just like IOzone; it defaults to using fork(), but it is possible to set it up to use POSIX threads7. A multinode benchmark is also possible, but does not provide much control. Worker nodes run a Fio server, to which a client connects and only shares the job file; the servers then work independently, so it is not possible to ensure simultaneous usage.

6. Sequential read/write means that the data file is read/written from the beginning to the end, bit by bit. Random read/write means that bits are read from/written to the file in random order, going back and forth until all parts of the file are read/written. Random benchmarks usually take longer than sequential ones because they measure not only the speed of read/write but also the speed of seeking inside a file. A random read/write benchmark can be useful to simulate real-world usage, where a user opens lots of different files scattered all over the storage device.
7. Fork is a UNIX function that creates a copy of the current process; this copy is its own process with its own memory and state. Threads are like sub-processes: they run in parallel to the process that created them, but share memory. This can be good as the threads can easily communicate, but can also cause problems, because without proper guards multiple threads can be changing the same portion of memory at the same time. [8]

The default Fio I/O engine is psync (except on Windows), which uses the functions pwrite() and pread(). These are the same as write() and read() but take the offset at which the operation should be performed. Using these functions can be beneficial for random reads/writes because the desired offset is passed directly instead of first seeking to it programmatically. There is no difference between pwrite(), pread() and write(), read() when reading/writing from the beginning of the file. On Windows the default I/O engine is windowsaio (native asynchronous I/O). Fio is written in C and is distributed under the GPL v2 license. [9]
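The difference can be illustrated with a short sketch (not from the Fio sources): reading one block from the middle of a file with read() requires a separate lseek(), while pread() takes the offset directly.

#include <fcntl.h>
#include <unistd.h>

int main() {
    char buffer[4096];
    off_t offset = 1024 * 1024;   // read 4 KB starting 1 MB into the file

    // read(): the file position has to be moved first with lseek()
    int fd1 = open( "testfile", O_RDONLY );
    lseek( fd1, offset, SEEK_SET );
    read( fd1, buffer, sizeof( buffer ) );
    close( fd1 );

    // pread(): the offset is passed directly and the file position is left
    // untouched, which is convenient for random access from multiple threads
    int fd2 = open( "testfile", O_RDONLY );
    pread( fd2, buffer, sizeof( buffer ), offset );
    close( fd2 );

    return 0;
}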

1.3 Bonnie++

Similarly to IOzone and Fio, Bonnie++ can also measure read and write speed, but unlike them it can also measure how many files can be created and deleted per second. Compared to IOzone and Fio it is also much less configurable; you cannot, for example, specify which I/O functions to use. Bonnie++ only uses the standard write() and read() functions from the C library. The output of this program can be either in human-readable form, CSV8 or HTML9. Bonnie++ supports multithreading, but only for file seeks; other than that everything runs on a single thread. Since my benchmark does not measure file seeking, I will treat Bonnie++ as if it did not have any multithreading capability whatsoever. Bonnie++ is written in C++ and is available for Unix-like operating systems under the GPL v2 license. [12]

8. A CSV file is a comma separated values file commonly used by spreadsheet programs such as Microsoft Excel or OpenOffice Calc. It contains plain text data sets separated by commas with each new line in the CSV file representing a new database row and each database row consisting of one or more fields separated by a comma. [10] 9. HTML stands for “Hypertext Markup Language”. HTML is the language used to create webpages. “Hypertext” refers to the hyperlinks that an HTML page may contain. “Markup language” refers to the way tags are used to define the page layout and elements within the page. [11]

1.4 Comparison of features provided by different benchmarks

Benchmark   I/O library   Metadata benchmark   Multithreaded   Multinode benchmark
IOzone      POSIX         No                   Yes             Yes
Fio         Multiple      No                   Yes             Yes
Bonnie++    POSIX         Yes                  No               No
IOBench     POSIX         Yes                  Yes             Yes

Table 1.1: Comparison of features provided by different benchmarks

2 My benchmark

I have created my own benchmark called IOBench that is able to measure the speed of writing, reading, moving, deleting and creating files. It is able to run benchmarks on multiple threads as well as on multiple network-connected nodes. In this chapter I will explain how it works.

2.1 Used libraries

Other than the standard C++ and POSIX libraries I use the libSSH library, which is needed for network communication and control between nodes that are running a benchmark. The whole program is written using the C++11 standard, so it can be compiled on all major Linux distributions.

2.1.1 LibSSH

LibSSH provides functions to communicate with and control other machines via SSH. SSH is an acronym for Secure Shell; it is a network protocol that uses encryption to secure the connection between two machines. Via this connection it is possible to control another machine by sending commands to it. Both the client (the machine initiating the connection) and the server (the machine that is being connected to) have two SSH keys, one public and one private. These keys are used to prove their identity and for encryption. After initiating an SSH connection the server sends its public key. The client can at this point verify that they are connecting to the correct machine by comparing the received key with their database of keys. Then the parameters of the connection are negotiated between the machines. At this point the client must verify their identity, either by providing their password or by sending their public key. If this public key is stored in the authorized key database on the server, the client can connect without providing their password. [13]

I have written an object oriented wrapper for the libSSH library which provides the functionality I need for connecting to, communicating with and executing commands on a network-connected node. I have chosen to create this wrapper so that handling resources would be simpler.


C++ has a programming idiom called RAII1. RAII makes sure to release all allocated resources when the object that holds them dies, meaning the program goes out of the scope where the object has been declared. Thanks to this I do not have to manually allocate and deallocate any resources, thus lowering the risk of an accidental memory leak2 significantly. This wrapper has functions to verify the host by its public key, authenticate the user using their public key or by providing a password, start a command, read output blockingly, and send input to the currently executing software. Blocking read means that the program waits until it receives a message; I have implemented this by having the first 1–256 bytes read blockingly and then reading non-blockingly until the message is read fully. A non-blocking read reads a message while it is available and stops when there is nothing to be read, allowing the program to continue.
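Below is a minimal sketch of what such a RAII wrapper can look like. It uses the real libssh C API, but the class name, method set and error handling are illustrative and simplified compared to the wrapper actually used in IOBench.

#include <libssh/libssh.h>
#include <stdexcept>
#include <string>

// Minimal RAII wrapper around an SSH session: the constructor acquires the
// session and connects, the destructor always disconnects and frees it.
class SSHSession {
public:
    SSHSession( const std::string &host, const std::string &user ) {
        session = ssh_new();
        if ( session == nullptr )
            throw std::runtime_error( "could not allocate SSH session" );
        ssh_options_set( session, SSH_OPTIONS_HOST, host.c_str() );
        ssh_options_set( session, SSH_OPTIONS_USER, user.c_str() );
        if ( ssh_connect( session ) != SSH_OK ) {
            ssh_free( session );
            throw std::runtime_error( "could not connect to " + host );
        }
        // public key authentication, so no password prompt is needed
        if ( ssh_userauth_publickey_auto( session, nullptr, nullptr )
             != SSH_AUTH_SUCCESS ) {
            ssh_disconnect( session );
            ssh_free( session );
            throw std::runtime_error( "authentication failed" );
        }
    }

    // copying is disabled so the session cannot be freed twice
    SSHSession( const SSHSession & ) = delete;
    SSHSession &operator=( const SSHSession & ) = delete;

    ~SSHSession() {
        // released automatically when the object goes out of scope
        ssh_disconnect( session );
        ssh_free( session );
    }

    ssh_session handle() { return session; }

private:
    ssh_session session = nullptr;
};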

2.2 How it works

First we need to get the user's input, so we know which benchmarks to run and with what parameters. Then we need to ensure that the provided settings make sense (the number of files is not zero, at least one benchmark has been specified, etc.). At this point benchmarking can start. We determine in which mode to run the benchmark (local, server, client) based on the provided parameters. Then the benchmark is run and the results printed. This continues until all benchmarks are finished. Now that we have the basic functionality covered, we can have a closer look at each step.

1. With RAII, each resource is encapsulated into a class, where the class constructor acquires the resource and establishes all class invariants or throws an exception if that cannot be done, and the destructor releases the resource and never throws exceptions at the end of the object's life cycle. [14]
2. A memory leak is when a program loses the pointer to allocated memory, meaning it has no way of deallocating it anymore. The memory is in use, but cannot be accessed by the program.


2.2.1 User input

It is possible to configure the benchmark parameters via the command line or a configuration file. I am using the GNU3 library getopt to parse command line options. This library provides a simple way of defining all legal options, alerts the user when they provide an illegal option and makes it easy to set variables based on the provided parameters. For parsing of configuration files I use C++'s file streams, with which I go line by line through the file, compare each line with defined keywords and set the appropriate variables. When an unknown keyword is found, the user is alerted and execution is stopped.

For the configuration of network nodes, users can provide a file with the nodes listed. All that needs to be in the file is the address of the node with an optional port number (22 is the fallback value), the user under which to connect via SSH (if this parameter is not provided, the user who started the software is used) and the path to the IOBench executable. This file is also parsed using C++'s file streams and string operations. For an example of a valid configuration see the Example of a configuration file below.
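The following is a simplified sketch (not the actual IOBench parser) of the file-stream approach described above, handling only a small subset of the keywords from the example configuration file:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Go through the configuration file line by line, compare the first word of
// each line with the known keywords and set the matching variable.
bool parseConfig( const std::string &path, uint64_t &files, uint64_t &threads ) {
    std::ifstream config( path );
    std::string line;

    while ( std::getline( config, line ) ) {
        if ( line.empty() )
            continue;
        std::istringstream words( line );
        std::string keyword;
        words >> keyword;

        if ( keyword == "files" ) {
            words >> files;
        } else if ( keyword == "threads" ) {
            words >> threads;
        } else {
            // unknown keyword: alert the user and stop execution
            std::cerr << "Unknown keyword: " << keyword << std::endl;
            return false;
        }
    }
    return true;
}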

2.2.2 Reporting to user

After a benchmark is finished, its results are reported to the user in the console output. The results include the average and minimum speed across all threads/nodes written in a human-readable format. The highest possible unit is used when converting to the readable format, going up to a petabyte; I am using 1024 as the divider between units. The program also creates a log file in the path where the benchmarks are run, where information about when a benchmark was started, what is being done and what went wrong is stored.
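A sketch of this conversion (the reporting code itself is more elaborate): the speed is divided by 1024 until it fits the highest suitable unit, up to petabytes.

#include <iostream>
#include <string>
#include <vector>

// Convert a raw speed in bytes per second into a human readable string,
// dividing by 1024 until the highest suitable unit (up to PB) is reached.
std::string humanReadable( long double bytes_per_second ) {
    const std::vector< std::string > units =
        { "B/s", "KB/s", "MB/s", "GB/s", "TB/s", "PB/s" };
    size_t unit = 0;
    while ( bytes_per_second >= 1024 && unit < units.size() - 1 ) {
        bytes_per_second /= 1024;
        ++unit;
    }
    return std::to_string( bytes_per_second ) + " " + units[ unit ];
}

int main() {
    // 3 * 1024^4 bytes per second prints as roughly "3.000000 TB/s"
    std::cout << humanReadable( 3298534883328.0L ) << std::endl;
    return 0;
}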

3. GNU is an operating system that is free software — that is, it respects users' freedom. The GNU operating system consists of GNU packages (programs specifically released by the GNU Project) as well as free software released by third parties. The development of GNU made it possible to use a computer without software that would trample your freedom. [15]


Example of a configuration file

files 10
size 100M
creation-files 100000
block-size 256K
threads 4
benchmarks write read move delete creation
node localhost /usr/local/bin/iobench
node remotehost:8022 benchmarkuser /usr/bin/iobench
path /storage/benchmark

2.2.3 Multinode communication

As stated in Used libraries, I am using libSSH for all network communication. A multinode benchmark is started by running an SSH command which starts an IOBench instance in node mode. This means that the program will wait for instructions from the server about when to start which benchmark. This method is used, rather than letting each node run all benchmarks on its own, so that concurrent usage is simulated: if each node ran benchmarks without supervision, one node could finish all benchmarks while other nodes were only halfway through. Sending and receiving messages is handled using stdin and stdout4, and reading from/writing to them is handled by the libSSH library. The server waits for all worker nodes to send a message saying they are ready to receive benchmark information. At this point the server sends each node a message telling them what benchmark to run. The server then waits until it gets a benchmark result from each worker node. The nodes perform their given benchmark and then send a message to the server containing the time spent performing the benchmark on each thread. When the server receives all messages, it computes the average speed of all nodes, the minimum speed of a node and the minimum speed of a thread.

4. Under normal circumstances every UNIX program has three streams opened for it when it starts up, one for input, one for output, and one for printing diagnostic or error messages. The input stream is referred to as standard input, the output stream is referred to as standard output and the error stream is referred to as standard error. These terms are abbreviated to form the symbols used to refer to these files, namely stdin, stdout, and stderr. [16]


Then the server tells the workers what benchmark to run next, and this cycle continues until all benchmarks have been run. When all benchmarks are finished, the worker nodes append their log files to the end of the server's log file (this assumes that the benchmark was run on a shared storage, otherwise the program will try to create a file with the log's contents in the directory where the server was started) and delete all of their files and directories, unless the option to keep files is specified. The server waits until all of these tasks are finished before closing the SSH connection and ending the program.
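A simplified sketch of the aggregation on the server side. It assumes that each node reports the time in seconds spent by each of its threads, that every thread processed the same amount of data, and that a node's speed is the sum of its threads' speeds; the real IOBench code may differ in these details.

#include <algorithm>
#include <vector>

// Aggregated results computed by the server from the workers' replies.
struct Aggregate {
    long double average_node_speed;  // average speed of a node
    long double min_node_speed;      // slowest node
    long double min_thread_speed;    // slowest single thread
};

// times[node][thread] = seconds that thread spent on the benchmark,
// bytes_per_thread = amount of data each thread read or wrote.
Aggregate aggregate( const std::vector< std::vector< long double > > &times,
                     long double bytes_per_thread ) {
    std::vector< long double > node_speeds;
    long double min_thread = 0;

    for ( const auto &node : times ) {
        long double node_speed = 0;
        for ( long double thread_time : node ) {
            long double thread_speed = bytes_per_thread / thread_time;
            node_speed += thread_speed;
            if ( min_thread == 0 || thread_speed < min_thread )
                min_thread = thread_speed;
        }
        node_speeds.push_back( node_speed );
    }

    long double sum = 0;
    for ( long double speed : node_speeds )
        sum += speed;

    return Aggregate{ sum / node_speeds.size(),
                      *std::min_element( node_speeds.begin(), node_speeds.end() ),
                      min_thread };
}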

2.2.4 Performing a benchmark

Here I will explain how the benchmark itself is performed. At this point we are sure that the options specified by the user are sane thanks to the parsing function (so at least one benchmark will run). First of all we need to make sure nothing will break as the benchmarks run. The order of benchmarks is decided by the order in which they were written on the command line or in the configuration file. By analyzing the order of benchmarks we can avoid useless creation of files (e.g. the write benchmark is the first one, so there is no need to create files for it) as well as make sure that all benchmarks have everything they need for execution (e.g. the read benchmark has files that it can read). This is done by inserting a pseudo benchmark, preparation, into the list of benchmarks; it does not measure anything, it just creates files with the specified size. The place where to insert preparation is computed based on two facts — are files currently present? and does the benchmark need files to work correctly? The only time we need to insert preparation is when files are not present but need to be. For example, if we had the benchmark order write read move delete, we do not need to insert preparation: the write benchmark will create the files and all other benchmarks will use those. If on the other hand we had this order of benchmarks: move delete read write, we need to add preparation like so: preparation move delete preparation read write. This will ensure that the move benchmark has files to move, the delete benchmark has files to delete and the read benchmark has files to read after they have been deleted by the delete benchmark. A sketch of this insertion rule is shown below.
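A sketch of the insertion rule, simplified with respect to the real implementation (the creation benchmark is treated as not leaving usable files behind, since it only creates empty ones):

#include <string>
#include <vector>

// Does the given benchmark need files of the right size before it can run?
bool needsFiles( const std::string &benchmark ) {
    return benchmark == "read" || benchmark == "move" || benchmark == "delete";
}

// Insert the pseudo benchmark "preparation" in front of every benchmark that
// needs files at a moment when no usable files are present.
std::vector< std::string >
insertPreparation( const std::vector< std::string > &order ) {
    std::vector< std::string > result;
    bool files_present = false;   // no files exist when the program starts

    for ( const auto &benchmark : order ) {
        if ( needsFiles( benchmark ) && !files_present ) {
            result.push_back( "preparation" );
            files_present = true;
        }
        result.push_back( benchmark );

        if ( benchmark == "write" )
            files_present = true;    // write creates files of the right size
        else if ( benchmark == "delete" )
            files_present = false;   // delete removes them again
        // read, move and creation do not change whether usable files exist
    }
    return result;
}

For the order move delete read write this produces preparation move delete preparation read write, matching the example above.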


At this point we are sure that all benchmarks that need files will have them prepared and nothing will break. But what if the user specified the option to keep files in a previous run? That would mean that files are already present and creating them again would be useless (and probably something the user does not want). This is resolved in preparation itself: if the function encounters existing files, it checks that each of them has the size specified in the benchmark options. If a file has the correct size, it is left the way it is; if the size is wrong, the file is deleted and a new one is created. This is done on a per-file basis — we cannot assume the correctness of file sizes from a single file, because after a few runs with the option to keep files enabled the file structure could look something like this (a sketch of the per-file check follows the listing):

file_0 - 10G
file_1 - 10G
file_2 - 10G
file_3 - 10G
file_4 - 100K
file_5 - 100K
file_6 - 10
file_7 - 10
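A sketch of the per-file check using POSIX stat(); the helper createFile, which creates a file of the requested size, is assumed to exist elsewhere and is only declared here.

#include <sys/stat.h>
#include <cstdint>
#include <cstdio>     // std::remove
#include <string>
#include <vector>

// Assumed helper, defined elsewhere: creates a file of the given size.
void createFile( const std::string &path, uint64_t size );

// Check every existing file separately: keep it if its size matches the
// benchmark options, otherwise delete it and create it again.
void prepareFiles( const std::vector< std::string > &files, uint64_t wanted_size ) {
    for ( const auto &file : files ) {
        struct stat info;
        if ( stat( file.c_str(), &info ) != 0 ) {
            // the file does not exist yet, so just create it
            createFile( file, wanted_size );
        } else if ( static_cast< uint64_t >( info.st_size ) != wanted_size ) {
            // wrong size: delete the old file and create a new one
            std::remove( file.c_str() );
            createFile( file, wanted_size );
        }
        // correct size: the file is left the way it is
    }
}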

Now that we are sure all benchmarks will have files of the correct size to work with, we can start measuring. The benchmarking itself is written in a very generic way so as to allow easy expansion. I have written one function that takes in all the parameters (file names, size, a function that measures the desired action, a function that puts files back the way they were before the benchmark, etc.). This means that adding a new benchmark is as easy as adding one function that measures whatever file system action you want. The only requirement for this new function is that it must have the same parameters as all other measuring functions and return the measured time in the required variable type. An example of such a function can be seen in the Example of a generic benchmark function at the end of this chapter. Before each benchmark the order of files to process is randomized to lower the possibility of files staying in RAM (which would cause faster access and skewed results). After this the filenames are changed so they contain the absolute path to the files. This does not apply to the creation benchmark: because it creates files, they cannot be in RAM when the benchmark starts, so in this case filenames are generated during the benchmark.

The measurement of file system operations is done using C++'s std::chrono::steady_clock, which is a monotonic clock, meaning that unlike std::chrono::system_clock it cannot decrease. System_clock can decrease because, as the name suggests, it takes time from the system's real-time wall clock, which can be changed at any time. Steady_clock on the other hand counts ticks with a constant time between them, which makes it the ideal way to measure intervals. [17] This clock has a function called now() which returns the current time; this function is called right before the measured file system operations start and then again after the operations have finished. Then we subtract the starting time from the ending time and we get the time it took to complete the operations. After each benchmark is run, we compute the average speed of the action as well as the minimal speed per thread and write it to the standard output. When all benchmarks are finished it is possible to keep the files/directories as they currently are (except the ones for the creation benchmark, because that would not make sense) by specifying --keep-files in the command line arguments. If this option has not been specified, a cleanup is performed where all benchmark files/directories are deleted (except the log file).

2.2.5 Benchmarks

The read benchmark continuously reads the number of bytes set in block size until it reaches the end of a file. This benchmark uses the function read().

The write benchmark continuously writes the number of bytes set in block size to a file until the file has the specified size. This benchmark uses the function write().

The move benchmark changes files' names by adding the "moved_" prefix to the original file names. This benchmark uses the function rename().

The delete benchmark deletes files with the specified size. This benchmark uses the function remove().

The creation benchmark creates a specified number of empty files. This benchmark uses the function open().
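As an illustration, here is a sketch of the core loop of the write benchmark using the calls named above; buffer contents and error handling are simplified compared to the real implementation.

#include <algorithm>
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <string>
#include <vector>

// Keep writing block_size bytes until the file reaches file_size.
void writeFile( const std::string &path, uint64_t file_size, uint64_t block_size ) {
    std::vector< char > block( block_size, 'a' );
    int fd = open( path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644 );

    uint64_t written = 0;
    while ( written < file_size ) {
        uint64_t to_write = std::min( block_size, file_size - written );
        ssize_t ret = write( fd, block.data(), to_write );
        if ( ret <= 0 )
            break;               // write error, stop writing this file
        written += static_cast< uint64_t >( ret );
    }
    close( fd );
}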


Example of a generic benchmark function

In the example below there are two unused parameters; this is done so that the function satisfies the generic signature. These parameters would otherwise be the file size and the number of files, but this benchmark does not need that information, so the parameters are unused.

long double removeBenchmark( std::vector< std::string > &files,
                             uint64_t /*UNUSED*/,
                             uint64_t /*UNUSED*/,
                             const std::string &path ) {
    scrambleVector( files );

    std::vector< std::string > files_path;
    for ( auto &file : files ) {
        files_path.push_back( path + "/" + file );
    }
    auto result = 0;

    auto start = std::chrono::steady_clock::now();
    for ( auto &file : files_path ) {
        result = FSLib::remove( file );
        if ( result != 0 ) {
            return LDBL_MAX;
        }
    }
    auto end = std::chrono::steady_clock::now();

    long double average =
        std::chrono::duration_cast< std::chrono::nanoseconds >( end - start )
            .count();

    average /= files.size();
    // average in seconds
    average /= 1000000000;
    // files per second
    return 1.0 / average;
}

3 Using the software

I have compared the results of measurements between my benchmark and the benchmarks mentioned in Existing benchmarks. After this comparison I have used my benchmark to measure the differences between a variety of hardware and software solutions for file storage.

3.1 Comparison of different benchmarks in read/write measurements

This benchmark has been run on a computer with 40 physical CPU cores running at 2.40 GHz (80 hardware threads) and 378 GB of RAM. It has been run on a GPFS file system with a total capacity of 3.7 PB. It has been run in multithreaded mode with 16 threads, each thread creating 1 file with a size of 100 GB. I have chosen these values based on the hardware and software installed in this computer so the benchmark would yield useful results. Each benchmark has been run 4 times and the results in the table below are the averages of these runs and their standard deviation.

Benchmark   Read (GB/s)        Write (GB/s)
IOzone      3.071 ± 0.00033    3.068 ± 0.00048
Fio         3.074 ± 0.00097    3.080 ± 0
IOBench     3.071 ± 0.00082    3.1 ± 0.0677

Table 3.1: Comparison of read/write speeds reported by different benchmarks

In this comparison we can see that my benchmark comes very close to the measurements of the other two (only a few MB/s difference). From this we can conclude that it measures write and read speeds correctly. I have run another benchmark on the same machine, this time comparing the creation speeds reported by Bonnie++ and my benchmark. This benchmark created 1000*1024 files; this amount has been chosen so the benchmark would run for at least a few seconds, thus ensuring sane results.


The results in the table below are again averages of 4 runs and their standard deviation.

Benchmark   Creation (files/s)
Bonnie++    8640 ± 69.73
IOBench     7500 ± 129.02

Table 3.2: Comparison of creation speed reported by different benchmarks

Here we can see that my benchmark is about 10% slower than Bonnie++, which is not ideal, but still within a tolerable margin. Finally, I have also run a multinode benchmark on 8 nodes, each node creating one 100 GB file on 2 threads. The hardware of the nodes was the same as in the previous 2 benchmarks. The results in the table below are again averages of 4 runs and their standard deviation.

Benchmark   Read (GB/s)        Write (GB/s)
IOzone      11.72 ± 0.05429    6.31 ± 0.03956
IOBench     13.4 ± 0.55676     6.61 ± 0.03439

Table 3.3: Comparison of read/write speeds on a multi-node cluster

In this benchmark we can see that the multinode benchmark is much faster than the single-node one. This is because the RAID array is connected over a Fibre Channel network and 3 GB/s is the maximum speed of the Fibre Channel card in a single computer, so the single-node results were limited by the connection. In the multinode benchmark we instead see the maximum performance of the RAID array when it is not throttled by one connection.

3.2 Comparison of different file systems using my benchmark

This benchmark has been run in a virtual machine with 2 CPU cores running at 2.3 GHz and 32 GB of RAM. The virtual machine was running Debian 9. Since this is a much different machine than the previous one, I have decided to use different values to yield sane results. The benchmark writes 50 1 GB files on 2 threads (so 100 files in total), then reads them. After that I ran a metadata benchmark with 100 000 files for move, delete and creation. This benchmark has also been run on 2 threads and the files for the move and delete benchmarks had 1 byte. I have compared the performance of 5 different file systems: ext41, NTFS2, exFAT3, XFS4 and Btrfs5. The results of this benchmark are in table 3.4, Benchmark of different file systems.

1. Ext4 is a journaling file system found on most Linux distributions. Many distributions use it as the default file system. Ext4 supports files up to 16 TB in size and partitions up to 1 EB. This file system supports extended attributes for files and directories. [18]
2. NTFS is Microsoft's journaling file system. It is possible to use this file system on Linux, but it is used via FUSE (file system in user space). This means that when using this file system you are not communicating with the kernel directly; instead you communicate with a program that sits between the kernel and the file system. This way of handling things causes performance to suffer. [19]
3. exFAT is a file system that extends the functionality of FAT32. Compared to FAT32 it allows larger files (4 GB vs 16 EB) and larger partitions (8 TB vs 128 PB). This file system does not have journaling. It is an older file system, but is still used today because it is supported by almost every device in existence. [20]
4. XFS is a high-performing, journaling Linux file system. It is designed for high scalability and provides near native I/O performance even when spanning multiple storage devices. XFS is able to support file systems as large as 8 EB. This file system also supports extended attributes. [21]
5. Btrfs is a modern file system which uses the copy-on-write principle (only copy a file when there are changes made to the copy, otherwise just point to the original file). Btrfs has many features, which include snapshots (both writable and read only), data checksums, multiple device support, incremental backup, deduplication and many more. The theoretical maximum size of a Btrfs partition is 16 EB. [22]


File system   Read (MB/s)    Write (MB/s)   Move (files/s)     Delete (files/s)   Creation (files/s)
ext4          270 ± 8.975    310 ± 22.357   60000 ± 15349.06   40000 ± 11329.68   60000 ± 12304.09
NTFS          177 ± 4.242    69 ± 2.803     5100 ± 523.12      11500 ± 761.06     8200 ± 228.52
exFAT         143 ± 0.251    354 ± 1.020    14.9 ± 8.263       160 ± 0.225        60 ± 2.002
XFS           310 ± 15.070   300 ± 34.591   15700 ± 919.05     16000 ± 1337.13    37000 ± 1042.61
Btrfs         273 ± 7.177    320 ± 11.246   22000 ± 1340.43    22600 ± 554.20     37000 ± 2699.78

Table 3.4: Benchmark of different file systems

Here we see that no single file system has the best result in all categories. XFS has the best overall read/write speeds, but is outperformed by ext4 when it comes to metadata operations. We can also see that NTFS really suffers a big performance loss on Linux; considering it is a modern file system it should have speeds similar to ext4, but thanks to the non-optimal driver it is about half as fast when it comes to reads and only about a fifth as fast in writes. In this benchmark we can also see that even though exFAT has pretty good read/write speeds, it has terrible metadata performance. This is caused by it not having a journaling system, so all metadata operations need to be executed immediately instead of being written to a journal and then being executed all at once. We can also see a pretty big standard deviation in the metadata benchmarks; this is caused by the fact that metadata operations are so fast that every little change in the environment has a noticeable impact on performance, and it is almost impossible to ensure the same conditions over multiple runs of the benchmark.

After this benchmark I have decided to see how different block sizes impact performance; this has been run in the same environment as the benchmark above. The HDD was formatted with ext4 and the benchmark wrote and read 50 files, each being 1 GB, on 2 threads (so 100 files in total). The results of this benchmark are in table 3.5, Benchmark of different block sizes.


Block size   Read (MB/s)     Write (MB/s)
64 KB        300 ± 28.504    520 ± 62.915
128 KB       320 ± 26.208    500 ± 121.907
256 KB       290 ± 23.059    550 ± 46.820
512 KB       290 ± 23.021    520 ± 66.267
1024 KB      300 ± 32.190    540 ± 75.790
2048 KB      300 ± 28.973    500 ± 73.899
4096 KB      310 ± 27.827    510 ± 71.160

Table 3.5: Benchmark of different block sizes

From this benchmark we can see that different block sizes do not have that big of an effect on raw performance. Even though there are differences in speed, they are pretty minuscule and do not seem to follow any pattern whatsoever. From this we can conclude that block size does not affect read/write performance when using read() and write() on a file system with caching.

3.3 Comparison of HDD vs SSD performance

This benchmark has been run on a computer with 72 cores running at 2.30 GHz (144 threads) and 755 GB of RAM. It was running Debian 9.9 and the disks on which I was testing were an HDD with a capacity of 3.7 TB and an NVMe SSD with a capacity of 1.8 TB. Both of the disks were formatted with the ext4 file system. I have run two benchmarks for this comparison. The first benchmark was for read/write performance. In it I have set the parameters to write and then read 40 files, each having 5 GB, on 8 threads. This adds up to a total of 1.6 TB of data (I needed this much data so it would not all get stored in RAM, skewing the results). Block size was set to 256 KB. The results of this benchmark are in table 3.6, Benchmark of different storage devices (read/write).


Storage device   Read (MB/s)     Write (MB/s)
HDD              150 ± 18.226    180 ± 23.058
SSD              1540 ± 94.031   2650 ± 35.080

Table 3.6: Benchmark of different storage devices (read/write)

The second benchmark was for metadata operations (move, delete, create). I have configured this one to use 100 000 files. The file size was set to 1 byte for the move and delete benchmarks; the creation benchmark created empty (0 byte) files. This benchmark also ran on 8 threads. The results of this benchmark are in the table below.

Storage device   Move (1000 files/s)   Delete (1000 files/s)   Create (1000 files/s)
HDD              210 ± 12.399          30 ± 29.310             150 ± 65.634
SSD              410 ± 29.391          210 ± 14.831            300 ± 24.915

Table 3.7: Benchmark of different storage devices (metadata)

During this benchmark I have noticed strange behavior. After finishing the benchmark with 100 000 files per thread I decided to start another one, this time with 1 000 000 files per thread. I thought this would result in a benchmark running more or less 10 times as long as the previous one. I was wrong: whereas the previous benchmark ran for only a minute or two, this one was taking over 10 hours before I decided to end it. This was probably caused by the journal of the file system overflowing. So instead of gathering information about which metadata operations it should do and then doing them all at once, it needs to do the operations whenever the journal overflows. This hinders the performance incredibly (some rename system calls6 were taking up to 90 seconds to complete). On the SSD it was possible to finish this benchmark in a reasonable time frame because of just how fast it is. The performance was about 20% of the performance with 100 000 files for the move and delete benchmarks and about 50% for the creation benchmark.

6. System call is the invocation of an operating system routine. Operating systems contain sets of routines for performing various low-level operations. For example, all operating systems have a routine for creating a directory or opening a file. [23]


In the above benchmarks we can see that the NVMe SSD is about 10 times faster when it comes to reading and 15 times faster at writing. At the same time we can see that the SSD is about twice as fast when it comes to IOPS (I/O operations per second), except for delete where it is 5 times as fast as the HDD.

4 Future work

There is a possibility to enhance this software further. In the future I could add more benchmarks (re-write, re-read, copy, …) as well as more configuration options. For example, there could be a choice of which functions should be used to perform file system tasks, like in the Fio benchmark. At the very least I think DIRECT and asynchronous I/O would be useful to add, and perhaps the newly added io_uring I/O1. Another configuration option could be to make it possible to specify how to move files, whether to just rename them or move them to a different directory. The software could also be ported onto different operating systems to make it more useful. I think the software could also benefit from a graphical user interface, so it would be easier to use for people without much computer knowledge, thus broadening the user base. This would also probably require translation of the software into multiple languages to gain mass appeal. Another enhancement could be to create default tests that are configured based on the computer they are running on. This way the user would just choose the write benchmark and things like file size and number of files would be generated by the program. It would not be useful to system administrators, who know what they are doing, but to a layman who just wants to know how fast their hard drive is, it could be very appealing.

1. io_uring is a new way of buffering I/O in the Linux kernel. With this setup it is possible to do asynchronous I/O with a single system call, and future developments will enable polled I/O, which will enable I/O without any system calls whatsoever. Thanks to io_uring, I/O on Linux should be much faster and more efficient. This feature has been added to the Linux kernel in version 5.1. [24]

5 Conclusion

The goal of this thesis was to study existing filesystem benchmarks, create a new benchmark that combines their best features, and then use this new benchmark to compare a variety of hardware and software storage solutions. I have studied the benchmarks IOzone, Fio and Bonnie++ and from those I have decided to combine the following features: measuring read/write performance, measuring metadata performance, running on multiple threads, and running on multiple nodes. I have successfully programmed a benchmarking tool that has all of these features. This software measures the performance of storage devices adequately well: when it comes to read/write performance it reports nearly identically to IOzone and Fio, and in metadata benchmarking it only differs by about 10% compared to Bonnie++. With the completed software I have measured the performance of different hardware and software storage solutions. As is to be expected, SSDs are faster than HDDs, modern file systems are faster than old ones, and native Linux file systems (like ext4 or XFS) are faster on a Linux machine than file systems running in user space (NTFS).

Bibliography

1. RAID - redundant array of independent disks [online] [visited on 2019-03-29]. Available from: https://www.webopedia.com/TERM/R/RAID.html.
2. FLEMING, Philip J.; WALLACE, John J. How Not to Lie with Statistics: The Correct Way to Summarize Benchmark Results. Commun. ACM. 1986, vol. 29, no. 3, pp. 218–221. ISSN 0001-0782. Available from DOI: 10.1145/5666.5673.
3. Unix-like Definition [online] [visited on 2019-03-23]. Available from: http://www.linfo.org/unix-like.html.
4. POSIX [online] [visited on 2019-03-29]. Available from: https://www.webopedia.com/TERM/P/POSIX.html.
5. What is direct I/O anyway? [online] [visited on 2019-03-29]. Available from: http://www.alexonlinux.com/what-is-direct-io-anyway.
6. IOzone Filesystem Benchmark [online] [visited on 2019-03-23]. Available from: http://iozone.org/.
7. aio(7) - Linux man page [online] [visited on 2019-05-01]. Available from: https://linux.die.net/man/7/aio.
8. Overview of Forks, Threads, and Asynchronous I/O [online] [visited on 2019-05-04]. Available from: https://www.remwebdevelopment.com/blog/overview-of-forks-threads-and-asynchronous-io-133.html.
9. FIO - Flexible I/O Tester Synthetic Benchmark [online] [visited on 2019-03-23]. Available from: https://media.readthedocs.org/pdf/fio/latest/fio.pdf.
10. What is a CSV file? [online] [visited on 2019-04-02]. Available from: https://fileinfo.com/extension/csv.
11. HTML [online] [visited on 2019-04-02]. Available from: https://techterms.com/definition/html.
12. Bonnie++ Documentation [online] [visited on 2019-03-23]. Available from: https://www.coker.com.au/bonnie++/readme.html.


13. SSH (Secure Shell) [online] [visited on 2019-05-01]. Available from: https://www.ssh.com/ssh/.
14. RAII [online] [visited on 2019-03-29]. Available from: https://en.cppreference.com/w/cpp/language/raii.
15. What is GNU? [online] [visited on 2019-05-01]. Available from: https://www.gnu.org/.
16. STDIN(3) [online] [visited on 2019-05-02]. Available from: http://man7.org/linux/man-pages/man3/stdin.3.html.
17. std::chrono::steady_clock [online] [visited on 2019-05-04]. Available from: https://en.cppreference.com/w/cpp/chrono/steady_clock.
18. Choosing Linux File System [online] [visited on 2019-05-04]. Available from: https://www.tekyhost.com/choosing-linux-file-system/.
19. libfuse [online] [visited on 2019-05-04]. Available from: https://github.com/libfuse/libfuse.
20. exFAT vs. FAT32 Comparison [online] [visited on 2019-05-04]. Available from: http://www.ntfs.com/exfat-comparison.htm.
21. XFS, High Performance Scalable File System [online] [visited on 2019-05-04]. Available from: https://www.oracle.com/technetwork/server-storage/linux/technologies/xfs-overview-1917772.html.
22. Btrfs Wiki - Main Page [online] [visited on 2019-05-04]. Available from: https://btrfs.wiki.kernel.org/index.php/Main_Page.
23. System Call [online] [visited on 2019-05-01]. Available from: https://www.webopedia.com/TERM/S/system_call.html.
24. Linux Kernel Getting io_uring To Deliver Fast & Efficient I/O [online] [visited on 2019-05-12]. Available from: https://www.phoronix.com/scan.php?page=news_item&px=Linux-io_uring-Fast-Efficient.
