Maximizing the Performance of Scientific Data Transfer by Optimizing the Interface Between Parallel File Systems and Advanced Research Networks

Nicholas Millsa,∗, F. Alex Feltusb, Walter B. Ligon IIIa

aHolcombe Department of Electrical and Computer Engineering, Clemson University, Clemson, SC bDepartment of Genetics and Biochemistry, Clemson University, Clemson, SC

Abstract

The large amount of time spent transferring experimental data in fields such as genomics is hampering the ability of scientists to generate new knowledge. Often, computer hardware is capable of faster transfers, but sub-optimal transfer software and configurations limit performance. This work serves as a guide to identifying the optimal configuration for performing genomics data transfers. A wide variety of tests narrow in on the optimal data transfer parameters for parallel data streaming across Internet2 and between two CloudLab clusters loading real genomics data onto a parallel file system. The best throughput was found to occur with a configuration using GridFTP with at least 5 parallel TCP streams and a 16 MiB TCP socket buffer size to transfer to/from 4–8 BeeGFS parallel file system nodes connected by InfiniBand.

Keywords: Software Defined Networking, High Throughput Computing, DNA sequence, Parallel Data Transfer, Parallel File System, Data Intensive Science

1. Introduction

Solving scientific problems on a high-performance computing (HPC) cluster will happen faster by taking full advantage of specialized infrastructure such as parallel file systems and advanced software-defined networks. The first step in a scientific workflow is the transfer of input data from a source repository onto the compute nodes in the HPC cluster. When the amount of data is small enough it is possible to store datasets within the local institution for quick retrieval. However, for certain domains ranging from genomics to social analytics the amount of data is becoming too large to store locally. In such cases the data must be optimally transferred from an often geographically distant central repository.

In genomics, our primary area of interest, the velocity of data accumulation due to high-throughput DNA sequencing should not be underestimated, as data accumulation is accelerating into the Exascale. As of this writing there are six quadrillion base pairs in the sequence read archive database at NCBI [1], with each A, T, G, C stored as at least two bytes. These data represent publicly mineable DNA datasets across the Tree of Life from viruses to rice to humans to elephants. New powerful genetic datasets such as The Cancer Genome Atlas [2] contain deep sequencing from tumors of over 11,000 patients; sequences of over 2,500 human genomes from 26 populations have been determined (The 1000 Genomes Project [3]); genome sequences have been produced for 3,000 rice varieties from 89 countries (The 3000 Rice Genomes Project [4]). These raw datasets are but a few examples that aggregate into petabytes with no end of growth in sight. In fact, if DNA sequencing technology continues to improve in terms of resolution and cost, one could view DNA sequencing as an Internet of Things application with associated computational challenges across the DNA data life cycle [5].

Domain scientists who are not experts in computing often resort to transferring these large datasets using FTP over the commodity Internet at frustratingly slow speeds. Even if a fast network is available, large transfers can take on the order of several days to complete if the correct transfer techniques are not used. The HPC and Big Data communities have developed several tools for fast data transfers, but these tools are often not utilized. Several factors contribute to the lack of utilization of the proper tools by scientists. First, awareness of the proper tools in the community is lacking to the point where researchers resort to familiar but technically inferior tools such as FTP. Old habits are hard to break. Second, a lack of adequate documentation means that even when researchers are pointed to the right tools they are lost in a maze of configuration options. Manuals give an overview of various parameters but fail to explain what parameter values are suitable for high-performance data transfers.

* Corresponding author. Email addresses: [email protected] (Nicholas Mills), [email protected] (F. Alex Feltus), [email protected] (Walter B. Ligon III)

An end-to-end data transfer involves at least three major components: a high-performance storage system, a high-performance network, and the software to tie it all together.

Preprint submitted to Future Generation Computer Systems, October 25, 2016

The various research communities have all examined optimized performance parameters for their respective components. Unfortunately, our experience has been that independently optimizing a single component leads to slower performance in the system as a whole. This work attempts to fill the gaps by suggesting optimized transfer parameters based on experimentally measured end-to-end transfer performance. Where possible the experimentally determined parameters are supported with appropriate theory.

1.1. Storage component

When considering storage requirements it is important to remember that data transfer is only the first step in an HPC workflow. Presumably there will later be a significant amount of computation performed on the data in parallel. While technologies such as large single-node flash storage arrays may provide the highest raw storage throughput during a transfer, those arrays will quickly become a bottleneck during the computation phase of the workflow as multiple compute nodes compete for access to the centralized storage component. Instead, a parallel file system (PFS) provides a good compromise between raw storage bandwidth and parallel processing scalability.

A PFS is a special case of a clustered file system, where the data files are stored on multiple server nodes. Storing a file on n nodes, where the nodes can be accessed simultaneously, theoretically allows for a speedup of n times over the single-node case. In practice, the achievable speedup is often much less than n and depends on both the parallel file system implementation and the access patterns of the application performing I/O. It is therefore advantageous to evaluate multiple PFSs to find the one most suited to a particular workload.

In this paper some of the more popular PFSs are evaluated. Among the file systems compared are BeeGFS [6], Ceph [7], GlusterFS [8], and OrangeFS [9]. BeeGFS is later used for file transfer tests once it is determined to have the best performance in a benchmark. Lustre is another popular PFS that was not evaluated because it is not supported for the Ubuntu operating system [10].

1.2. Network component

Access to a fast network with plenty of excess capacity is essential for high-performance data transfers. The Science DMZ model [11] is well suited to data transfers because of its support for the long-lived elephant flows typical of large transfers. In particular the lack of firewalls in the model prevents slowdown caused by packet processing overhead. The network infrastructure for our experiments is provided as a part of the CloudLab environment [12, 13, 14].

CloudLab is a platform for exploring cloud architectures and evaluating design choices that exercise hardware and software capabilities. CloudLab is a great benefit to researchers because it allows them to quickly and easily test different modern hardware and software configurations without the usual troubles associated with re-installing the operating system and re-configuring the network [15]. CloudLab allows the allocation of bare metal machines with no virtualization overhead. Thus, CloudLab is a realistic experimental environment for moving real datasets within and between sites before moving algorithms and procedures into production.

A major capability of CloudLab is the ability to connect clusters together over Internet2 using software-defined networking. Our experiments use two of these clusters: the Apt cluster in the University of Utah's Downtown Data Center in Salt Lake City, and the Clemson cluster of Clemson University in Anderson, South Carolina [16].

Communication between the Apt and Clemson CloudLab clusters occurs over the Internet2 network using the Advanced Layer-2 Service (AL2S) [17]. AL2S allows nodes in the two clusters to transparently communicate as if they were connected to the same switch. That is, they appear to be on the same IP subnetwork. The only discernible difference is the relatively high communication latency of 26 ms when compared to a more typical latency of 180 µs.

Configuring a new path over AL2S would normally involve multiple technical and administrative hurdles. The experimenter would need to contact the appropriate network administrator and convince them to set up a new path between sites. The network administrator would in turn need to configure the new path with Internet2 via the GlobalNOC [18]. The delay involved in this process could be significant. For a production network that may run for several years or more such a delay may not matter, but for a temporary experimental network intended to run on the order of weeks or less the initial setup delay is substantial.

A major advantage of CloudLab for multi-site experiments is that the CloudLab software configures new links over AL2S automatically at the beginning of each experiment. In most cases this process is seamless and requires only a couple minutes of waiting. A researcher need only specify the two sites to be connected and the CloudLab infrastructure handles the rest.

1.3. Software component

Two software tools drive the experiments. The first tool, XDD [19], is used to perform benchmarks of parallel file systems. As a benchmark tool XDD makes use of its knowledge of disks to achieve maximum performance by queueing multiple large I/O requests simultaneously. In our experiments XDD uses multiple threads to access different regions of the same file in parallel.

The second tool, GridFTP [20], is a well-known application for performing large data transfers. GridFTP enhances the original FTP protocol with extensions designed to support faster data transfers. It also has a wide variety of networking configuration options. GridFTP uses a client-server model where the server provides access to the host file system, and the client can either push or pull data relative to the server.
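The n-times speedup bound and its caveats described in section 1.1 can be sketched as a toy throughput model. This is an illustrative sketch only; the per-node bandwidth, client link speed, and efficiency factor below are hypothetical values, not measurements from this paper:

```python
def pfs_throughput(n_nodes, node_bw_mbs, client_bw_mbs, efficiency=1.0):
    """Upper bound on parallel file system throughput in MB/s.

    Aggregate server bandwidth grows linearly with the number of
    nodes, but the client's own network link caps what it can see,
    and a real PFS reaches only some fraction (efficiency < 1) of
    the ideal n-times speedup.
    """
    aggregate = n_nodes * node_bw_mbs * efficiency
    return min(aggregate, client_bw_mbs)

# With a hypothetical 150 MB/s per server node and a 1250 MB/s
# (10 Gbit/s) client link, adding nodes helps until the link saturates:
print(pfs_throughput(2, 150, 1250))   # 300.0
print(pfs_throughput(8, 150, 1250))   # 1200.0
print(pfs_throughput(16, 150, 1250))  # capped at 1250.0
```

The model captures why the experiments later vary the node count: beyond some point the client's network link, not the number of servers, limits the transfer.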

2. Materials and methods

2.1. Test dataset

The test dataset used for file transfer experiments is composed of DNA sequencing data in the Sequence Read Archive format. There are 345 files in the dataset that range in size from 643 MB to 11 GB. The median file size is 2.4 GB. Over a quarter of the files are less than 1 GB, and less than 11% are greater than 4.5 GB.

2.2. Testbed configuration

As seen in Figure 1, a data transfer experiment is composed of nodes in the Apt CloudLab cluster connected to nodes in the Clemson CloudLab cluster over Internet2's Advanced Layer-2 Service. The round-trip time between these clusters across AL2S is approximately 52 ms. Depending on the experiment, both clusters have 1–8 parallel file system nodes, and the number of parallel file system nodes is the same in both clusters. Communication with the parallel file systems uses either TCP or InfiniBand protocols. There is one file system client node per cluster using GridFTP to transfer data in parallel with 1–64 TCP streams.

Figure 1: Dataset transfer in the CloudLab environment. Data are read in parallel from 1–8 file system (FS) nodes to the Clemson client. The Clemson client transfers the data over the Internet2 Advanced Layer-2 Service (AL2S) to the Apt client using 1–64 TCP streams. The Apt client writes the data in parallel to 1–8 file system nodes.

The parallel cluster hardware at each cluster is as follows:

Apt: an Intel Xeon E5-2450 8-core CPU at 2.1 GHz with 16 GiB of main memory. Each node has four 500 GB SATA hard disk drives at 7200 RPM. The network interface card is a Mellanox MX354A with one port running 56 Gbit/s InfiniBand for communication with the local parallel file system and the other port running 10 Gbit/s Ethernet for communication with the Clemson cluster over Internet2.

Clemson: two Intel Xeon E5-2660 10-core CPUs at 2.2 GHz with 256 GiB of ECC main memory. Each node has two 1 TB 7200 RPM SATA 3.0 hard disk drives used for the parallel file system. A QLogic QLE7340 40 Gbit/s InfiniBand HCA is used to communicate with the other Clemson nodes, and an Intel 82599ES 10 Gbit/s Ethernet NIC is used to communicate with the Apt cluster over Internet2.

The total amount of storage available at Apt and Clemson is similar, but the Apt nodes have twice the number of disk drives. The Clemson nodes have 2.5 times the number of cores as the Apt nodes and 16 times as much main memory.

In order to realize the configuration of Figure 1 the CloudLab graphical configuration tool, Jacks, was used to configure the two clusters and the connections between them. Jacks produces a Resource Specification (RSpec) file that can be used to create identically-configured instances of the same experiment. The RSpec for our experiments is available at [21].

2.3. Installed software

All the bare metal nodes in the experiment are installed with Ubuntu 14.04.1 LTS running Linux kernel v3.13.0-68-generic. For file transfers, GridFTP server version 9.19 and client version 10.4 were installed from the Ubuntu distribution repository. For file system benchmarks a modified version of XDD is used [22]. The various parallel file systems used are BeeGFS release 2015.03, Ceph v10.2.2-1trusty, GlusterFS v3.7, and OrangeFS v2.9.3.

2.4. Node network configuration

The operating system's network layer is configured based on recommendations from the ESnet Fasterdata Knowledge Base [23], which provides advice for tuning hosts for maximum network performance. All nodes are configured via the sysctl interface according to Table 1.

Table 1: Network tuning parameters.

    Parameter                 Value
    Packet queue size         30,000
    TCP buffer max            32 MiB
    TCP congestion control    H-TCP
    TCP selective ACK         enabled
    TCP low-latency option    disabled
    TCP timestamps            disabled

InfiniBand drivers are installed from the operating system's package repository. CPU frequency scaling is disabled to prevent measurement errors caused by the processor speed changing dynamically. An InfiniBand subnet manager is running on the InfiniBand switch.

The CloudLab infrastructure attempts to reserve bandwidth across the experimental network using traffic shaping in the Linux kernel. The traffic shaper is configured to begin dropping packets when the outbound traffic flow rate exceeds a certain threshold. This traffic shaping was disabled on all nodes in the experiment in order to help achieve the maximum possible throughput.

Table 2: XFS mount options for transfers with BeeGFS.
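One plausible mapping of the Table 1 parameters onto standard Linux sysctl keys is sketched below. The key names are the usual Linux ones, but the paper does not list its exact sysctl file, so the values here are an illustrative reconstruction; the ESnet Fasterdata Knowledge Base [23] remains the authoritative reference for host tuning.

```shell
# Illustrative /etc/sysctl.d/ fragment approximating Table 1 (assumed mapping).

# 32 MiB maximum socket buffers (TCP buffer max)
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432

# H-TCP congestion control
net.ipv4.tcp_congestion_control = htcp

# Packet queue size of 30,000
net.core.netdev_max_backlog = 30000

# Boolean options from Table 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_low_latency = 0
net.ipv4.tcp_timestamps = 0
```

Such a fragment would typically be loaded at boot or applied with the sysctl tool; on a production data transfer node the exact values should be taken from the current ESnet guidance rather than copied verbatim from this sketch.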
    Option       Value
    allocsize    131072k
    logbsize     256k
    logbufs      8
    atime        off
    barrier      off
    diratime     off
    largeio      on

2.5. File system configuration

The parallel file systems used to store data for experiments must in turn store their data on the local file system of each node. XFS [24] is configured as the local file system for each experiment. For file systems other than BeeGFS the default XFS creation and mount options are used. For BeeGFS the default mount options are used in the file system benchmark tests, but in dataset transfer tests the optimized mount options shown in Table 2 are used. These tuning parameters follow the recommendations of the BeeGFS tuning guide [25].

The local file system in turn stores its data on a block device. This device can either be a virtual device created by the logical volume manager or a raw disk device. A virtual disk device is created by using the Linux Logical Volume Manager to stripe data across the available data partitions on all disks. Striping, also known as RAID 0, combines disk storage and in theory increases throughput by allowing disks to be accessed in parallel, transparent to the application program. GlusterFS and OrangeFS always use a virtual disk device. Ceph always uses a raw disk device. BeeGFS uses a striped virtual disk in the file system benchmark test and a raw disk device in all other tests.

Further configuration details depend on the parallel file system being used. The file systems were configured as follows:

1. BeeGFS: BeeGFS was installed with one data server per node. The first node runs the management daemon and metadata server in addition to the data server. A single client mounts the file system. The network transport is switched between InfiniBand and TCP depending on the experiment. A connection filter file is used to limit communication of the BeeGFS servers to the experimental network interface (either Ethernet or InfiniBand). The filter file prevents the client and servers from communicating over the slower management interfaces. In order to achieve the maximum amount of parallelism the BeeGFS administration tool beegfs-ctl is used to tell the file system to stripe every file across all available storage devices on all storage nodes.

2. Ceph: The ceph-deploy tool was used to install and configure Ceph on four nodes using all available data partitions. A Ceph file system was created with a single metadata server running on the first file system node. In order to save space the pool size was changed so that only one copy of every object is stored. The number of placement groups for all pools was set to 360. A single client mounts the file system.

3. GlusterFS: GlusterFS was installed and a striped GlusterFS volume using TCP as a transport was created on four nodes. A single client mounts the file system.

4. OrangeFS: OrangeFS was installed with one data server per node. There is a single metadata server on the first node. The network transport used was TCP. The file system was mounted on one client node by installing the kernel module included in the OrangeFS distribution. The configuration file was generated with the standard pvfs2-genconfig script with default configuration options. This default configuration stripes file data across all storage nodes.

Several experiments using BeeGFS or OrangeFS require changing the number of server nodes over the course of the experiment, an expensive operation whose procedure varies depending on the type of server. The number of storage nodes is typically reduced from 8 down to 1. For BeeGFS, the built-in administration tool is run from the client to migrate data off the node being removed onto the remaining nodes, and the BeeGFS server process is stopped on that node to prevent new data from being placed there. In the case of OrangeFS the file system is created from scratch each time because there is no built-in migration tool. Unfortunately this technique requires all the data to be copied to the new OrangeFS file system.

2.6. Software configuration

File system benchmark tests use XDD to measure the throughput when reading and writing a parallel file system. XDD has a plethora of configuration options, but for these experiments the chief options are the target of the I/O operation and its size, the block size, and the number of parallel I/O threads. In the file system benchmarks the only option that changes is the operation (read or write). The target is always a single file filled with random data, the block size is always 4 MiB, and the number of I/O threads is always 4.

The dataset transfer tests use GridFTP to transfer data. GridFTP has both a client and a server, so there are both client options and server options. The server is always run in anonymous mode (no authentication) with a block size equal to the TCP socket buffer size and the number of threads equal to the number of parallel TCP streams. The client is run in fast mode with a block size equal to the TCP socket buffer size. The number of parallel data connections and the exact value of the buffer sizes depend on the particular experiment.

2.7. Experiments

All experimental measurements are in units of time. The average throughput is calculated by dividing the total number of bytes transferred by the wall time. For XDD the time value is output at the end of the benchmark. For GridFTP the time is calculated using the elapsed run time output by the standard UNIX time tool.

The first experiment in Figure 2 was a benchmark of the BeeGFS, Ceph, GlusterFS, and OrangeFS file systems on the Apt cluster. BeeGFS was configured to use a striped virtual disk for storage and TCP for the network. All file systems used four nodes for the server plus one node as a client. The client node mounted the parallel file system and used XDD to benchmark read and write throughput when operating on a single approximately 864-gigabyte random test file. The read and write performance of each file system was tested 3 times. The test file was deleted between runs of the write test.

Figure 2: Comparing parallel file system read/write benchmarks. A single client performs I/O to a single file on 4 servers using 4 parallel application threads. Benchmark tool is XDD. Error bars show the standard error of the mean. MB/s = millions of bytes per second.

The second experiment in Figure 3 was a comparison between BeeGFS and OrangeFS for transfers of the actual dataset. For this experiment BeeGFS was configured to use an XFS file system on the raw disk devices while OrangeFS was configured the same as in the first experiment. Files in the test dataset were transferred using GridFTP from Clemson to Apt with 4 parallel TCP streams, varying the number of servers from 1 to 8, with a 4 MiB TCP socket buffer. Each test was run 3 times. Test files were deleted from the Apt cluster between runs.

Figure 3: Comparing transfer rates between BeeGFS and OrangeFS parallel file systems. For a dataset transfer with 4 parallel TCP streams using GridFTP. TCP socket buffer sizes set to 4 MiB. Error bars show the standard error of the mean. MB/s = millions of bytes per second.

The third experiment in Figure 4 compares the performance of a GridFTP dataset transfer from Clemson to Apt with BeeGFS when using InfiniBand versus TCP for the local cluster network. The number of servers is varied from 1 to 8 and the number of streams is 1–16, 32, and 64. When the number of servers is varied the number of streams is fixed at 4, and when the number of streams is varied the number of servers is fixed at 5. The Internet2 connection always uses TCP with a socket buffer size of 4 MiB. Each test was run 3 times. Test files were deleted from the Apt cluster between runs.

The fourth experiment in Figure 5 compares 4 MiB and 16 MiB TCP buffers when transferring with 8 BeeGFS servers running InfiniBand and 1–16, 32, 64 streams. Test files were deleted from the Apt cluster between runs.

The final experiment in Figure 6 uses BeeGFS over InfiniBand and a 16 MiB TCP buffer size. The throughput is measured with 1–16, 32, 64 streams and 2–6, 8 servers. Test files were deleted from the Apt cluster between runs.
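As a concrete illustration of the GridFTP options described in section 2.6, a transfer like the ones in these experiments could be launched roughly as follows. The host names, paths, and file name are placeholders; the flags shown are standard globus-url-copy and globus-gridftp-server options, but consult their manuals before relying on this sketch:

```shell
# Server side (Apt client node): anonymous mode on an explicit
# control port (illustrative invocation).
globus-gridftp-server -aa -p 5000 &

# Client side (Clemson client node): -fast reuses data channels,
# -p sets the number of parallel TCP streams, -tcp-bs the TCP
# socket buffer size, and -bs the block size.
globus-url-copy -fast -p 5 -tcp-bs 16M -bs 16M \
    file:///mnt/beegfs/dataset/SRR0000001.sra \
    ftp://apt-client.example.org:5000/mnt/beegfs/dataset/
```

Matching the block size to the socket buffer size and the stream count to the server thread count mirrors the configuration rules stated above for these experiments.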

3. Experimental results

In this study we wanted to identify optimal data transfer parameters for parallel data streaming across AL2S and between two CloudLab sites loading real genomics data onto a parallel file system. The key parameters modified were the file system type, number of file system nodes, network transport protocol, and number of parallel TCP streams. These parameters are of high interest for genomics and other domain scientists across the technical skill spectrum who need to move massive data sets.

3.1. Parallel file system configuration

The first choice that had to be made with respect to parallel file transfers was the parallel file system implementation to use for storage of the dataset. Because actual transfer tests can take a long time to run, a faster and simpler benchmark comparison was performed first. The first test compared the BeeGFS, Ceph, GlusterFS, and OrangeFS parallel and distributed file systems when reading and writing a single large file.

According to the benchmark results in Figure 2, BeeGFS has the fastest read throughput, and its write throughput is faster than Ceph but slower than GlusterFS and OrangeFS. Ceph has both the slowest read throughput and the slowest write throughput. The read throughput of GlusterFS is faster than Ceph but slower than BeeGFS and OrangeFS, and its write throughput is faster than BeeGFS and Ceph but slower than OrangeFS. The read throughput of OrangeFS is faster than Ceph and GlusterFS but slower than BeeGFS. OrangeFS has the fastest write throughput. For all file systems except BeeGFS, write throughput is greater than read throughput. For BeeGFS, read throughput is greater than write throughput.

After examining the results of the file system benchmark it appeared that no file system has both the best read throughput and the best write throughput. BeeGFS has the best read throughput but the third-best write throughput. OrangeFS has the best write throughput but the second-best read throughput. Finding the highest-performing file system for the purpose of parallel file transfers required further testing.

In order to decide between BeeGFS and OrangeFS, another experiment was conducted to compare the throughput of the two file systems during an actual transfer of the test dataset. Figure 3 shows the results of this transfer test. In this test BeeGFS has consistently higher throughput than OrangeFS from 1 to 8 server nodes. BeeGFS reaches 99% of its maximum throughput at 2 nodes while OrangeFS requires 5 nodes. Additionally, OrangeFS shows a slight drop in throughput from 1 to 2 nodes before climbing steadily at 3 and 4 nodes. Based upon the results of this test we concluded that BeeGFS was the best file system for parallel file transfers.

3.2. Network configuration

The preferred parallel file system, BeeGFS, has the ability to communicate over the network using both TCP and InfiniBand. The experimental hardware used for InfiniBand has at least a 4x higher theoretical throughput (40 Gbps) compared to the hardware used for TCP (10 Gbps), but it was desired to see if the difference during an actual dataset transfer was large enough to justify the substantially higher complexity and cost of InfiniBand when compared to 10 Gb Ethernet.

Figure 4a shows the results of comparing the InfiniBand and TCP transports for a dataset file transfer with BeeGFS. In this test the number of Internet2 TCP streams is held constant at 4 while the number of servers changes. The results show that using TCP as a transport outperforms InfiniBand for 1 to 8 servers.

Figure 4b also compares the InfiniBand and TCP transports, but in this experiment the number of servers is held constant at 5 while the number of parallel TCP streams varies. For 1 to 8 streams the throughput with TCP is higher than the throughput with InfiniBand. However, at 9 streams the throughput with InfiniBand overtakes the throughput with TCP and remains higher all the way to 64 streams. Altogether the maximum InfiniBand throughput of 880 MB/s at 16 streams is over 40% higher than the maximum TCP throughput of 601 MB/s at 13 streams. Because of the higher peak throughput with InfiniBand compared to TCP, InfiniBand was used for the remainder of the experiments.

The last network configurations investigated before running a full parameter sweep were the TCP send and receive socket buffer sizes. The size of these buffers influences the TCP window size, which can have a large effect on the performance of long distance transfers [26]. As previous tests all used a buffer size of 4 MiB, a larger size of 16 MiB was chosen because it was believed to be much closer to the bandwidth-delay product of the network.

Figure 5 shows a comparison between transfers with 4 MiB and 16 MiB TCP socket buffer sizes. From 1 to 13 streams the throughput with the 16 MiB buffer size is significantly larger than the throughput with the 4 MiB buffer size. The maximum throughput with the 4 MiB buffer size of 898 MB/s occurs at 16 streams, while the maximum throughput with the 16 MiB buffer size of 934 MB/s occurs at 7 streams. Future experiments used the 16 MiB buffer size because, based on these results, it is possible to reach a higher throughput with fewer streams.

Figure 5: Comparing dataset transfer rates with 4 MiB and 16 MiB TCP socket buffer sizes. For a dataset transfer with GridFTP. Parallel file system is BeeGFS over InfiniBand on 8 server nodes. Error bars show the standard error of the mean. MB/s = millions of bytes per second. 1 MiB = 1,048,576 bytes.

3.3. Parameter sweep of streams and server nodes

Up to this point BeeGFS was chosen for the parallel file system, InfiniBand for the network transport, and 16 MiB for the TCP socket buffer size. The major parameters left to investigate are the number of parallel TCP streams and the number of parallel file system data server nodes.
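The 16 MiB choice from section 3.2 can be sanity-checked against the bandwidth-delay product of the path: a 10 Gbit/s link with the roughly 52 ms round-trip time reported in section 2.2. The arithmetic below is our own illustration, not a calculation from the paper:

```python
link_bps = 10e9   # 10 Gbit/s Ethernet path over Internet2
rtt_s = 0.052     # ~52 ms round-trip time between the clusters

# Bandwidth-delay product: bytes that must be in flight to fill the pipe.
bdp_bytes = link_bps * rtt_s / 8
print(f"BDP: {bdp_bytes / 2**20:.1f} MiB")        # BDP: 62.0 MiB

# Split across 5 parallel streams, each stream must buffer roughly:
per_stream = bdp_bytes / 5
print(f"per stream: {per_stream / 2**20:.1f} MiB")  # per stream: 12.4 MiB
```

Under these assumptions a 16 MiB per-stream socket buffer comfortably covers each stream's share of the pipe once about 5 streams are used, while a 4 MiB buffer does not, which is consistent with the Figure 5 results.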

Figure 4: Comparing InfiniBand and TCP for dataset transfers. Datasets transferred with GridFTP. Parallel file system is BeeGFS. TCP socket buffer sizes set to 4 MiB. Error bars show the standard error of the mean. MB/s = millions of bytes per second. 1 MiB = 1,048,576 bytes. (a) Number of parallel TCP streams held constant at 4. (b) Number of server nodes held constant at 5.

and the number of parallel file system data server nodes. Figure 6 shows the results of sweeping these parameters. The curves for 4–8 nodes are more or less overlapping. The only exceptions may be 4 nodes, 16 streams and 5 nodes, 32 streams; these two points have a large amount of error. At 3 nodes the curve has the same general shape, but the throughput is less than the curve for 4–8 nodes. The throughput at 2 nodes is less than the throughput at 3 nodes. The highest average throughput of 940 MB/s occurs at 5 nodes and 7 streams.

Figure 6: Dataset transfer while varying the number of TCP streams and file system server nodes. Datasets transferred with GridFTP. Parallel file system is BeeGFS over InfiniBand. TCP socket buffer sizes set to 16 MiB. Error bars show the standard error of the mean. MB/s = millions of bytes per second.

4. Discussion

4.1. File system comparison

The goal of the tests in Figure 2 was to predict the throughput of various parallel file systems during a dataset transfer by measuring their performance on a local read/write benchmark. The benchmark leaves out any Internet2 network traffic and focuses only on the local performance of the parallel file system and its client. Because there is no long-distance communication involved in this test, we believe the benchmark approximates the upper bound of the throughput during an actual dataset transfer. We expected the throughput during an actual dataset transfer to be significantly lower than the throughput of the corresponding benchmark: our experience has been that for a long-distance data transfer the performance of the transfer as a whole is lower than the performance of any of its components.

Before discussing the actual results we first explain what we perceive as limitations of the test itself. First, because the full dataset transfer test was not run for every file system, we cannot prove that lower throughput on the benchmark correlates with lower throughput in an actual file transfer. We believe the benchmark is still useful as a guide for choosing the parallel file system for subsequent tests. Second, the benchmark test by design runs only within a local cluster with no long-distance network access. We chose to run the benchmark test on the Apt cluster in Utah. In contrast, an actual file transfer involves both the Apt cluster and the CloudLab cluster at Clemson. As described in section 2.2, these two clusters have different hardware characteristics, especially when it comes to disk storage and main memory. We suspect, however, that the relative performance levels of the parallel file systems would be the same on both clusters. Third, out of necessity the benchmark uses different software from the actual file transfer. However, both the benchmark tool (XDD) and the transfer tool (GridFTP) are optimized for high disk throughput. Last, and perhaps most important, the benchmark uses a single large 863 GB file, whereas an actual transfer uses 345 files ranging in size from 0.6 GB to 11 GB with an aggregate size of 863 GB. Parallel file systems are known to have vastly different performance with large files compared to small files [27], and likewise with a large number of files compared to a small number of files. However, we believe the number of files during an actual transfer to be small enough, and the median file size of 2.4 GB to be large enough, to avoid causing performance issues.

4.1.1. BeeGFS

BeeGFS is notable for having the highest read throughput of the file systems tested. Interestingly, it was also the only file system whose read throughput was higher than its write throughput. The higher read throughput could be a result of read-ahead by the BeeGFS server nodes.

4.1.2. Ceph

Ceph had the lowest throughput of any file system in this benchmark. The read throughput of Ceph was 207 MB/s less than that of the next-slowest file system for read, GlusterFS. The write throughput of Ceph was 509 MB/s less than that of the next-slowest file system for write, BeeGFS. It is possible that changing the default configuration of Ceph by storing only one copy of each object caused a drop in performance. However, the fact remains that there was not enough storage to contain the default of three copies of each object. It is also possible that Ceph, as an object store, may not be well suited to the type of workload tested by our experiment.

4.1.3. GlusterFS

GlusterFS has the second-highest write throughput after OrangeFS, but its read throughput is the second-lowest after Ceph. GlusterFS also has the biggest gap between read and write throughput: the read throughput is only 36% of the write throughput.

4.1.4. OrangeFS

OrangeFS is notable for having the highest write throughput and the second-highest read throughput. The write throughput alone is impressive for reaching close to 90% of the maximum throughput on a 10 gigabit/s network. OrangeFS may be the only file system to hit a network bottleneck instead of a disk bottleneck for writes.

4.2. BeeGFS vs. OrangeFS

The goal of the BeeGFS vs. OrangeFS comparison test of Figure 3 was to determine which of the two file systems supported faster data transfers. This experiment varies the number of parallel file system server nodes; at the time the experiment was designed we believed the number of nodes to be the most important factor in determining the throughput of the file transfer. However, later tests seem to show that the number of nodes is the least important variable.

Both BeeGFS and OrangeFS are layered on top of a local file system on each node. We chose to use the XFS file system [24] in both cases. For the XFS file system used in the OrangeFS tests we used the default mount options. However, for the XFS file system used in the BeeGFS tests we used the optimized mount options in Table 2 that were suggested by the BeeGFS documentation [25]. The different local file system mount options could be responsible for the relative performance of BeeGFS and OrangeFS.

A limitation of the BeeGFS vs. OrangeFS comparison test may be that the disk configuration was different for the two tests, although both file systems were configured for their best performance. BeeGFS does not make use of any RAID configuration, while OrangeFS uses striped RAID. Striped RAID is currently the only option for OrangeFS because it is not possible to configure OrangeFS to use multiple data storage targets. RAID 0 striping was not used for BeeGFS in this test and all further tests because during a benchmark it was discovered to have lower performance than the configuration that used the drives independently.

For unknown reasons the throughput of a transfer with OrangeFS drops from 1 to 2 nodes. This drop could be explained by an increase in network overhead when contacting more than one server.

4.3. Network configuration

After selecting BeeGFS as the parallel file system there were two major choices for the network transport. The first choice was the very popular and well-supported TCP. The second choice was InfiniBand, which has a 4–5.6x higher theoretical maximum throughput but is significantly more expensive than the 10 GigE used for TCP. We wanted to determine whether the higher theoretical throughput of InfiniBand translated to an actual increase in throughput during a data transfer.

Another possible protocol that could have been tested is IP over InfiniBand (IPoIB). As the name implies, IPoIB allows IP (and its associated protocols such as TCP) to be carried over InfiniBand hardware. However, the native IB protocol offers benefits such as lower latency and support for remote direct memory access.
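To put the 4–5.6x figure in rough numbers, here is a back-of-the-envelope sketch of ideal transfer times for the 863 GB dataset. The 40 and 56 Gbit/s link rates are inferred from the 4–5.6x multiple of 10 GigE quoted above, and the helper function is hypothetical; the times are lower bounds that ignore all protocol, disk, and latency overheads.

```python
def ideal_transfer_time_s(dataset_bytes, link_gbit_per_s):
    """Lower-bound transfer time: dataset size over raw link rate,
    ignoring protocol, disk, and latency overheads."""
    return dataset_bytes * 8 / (link_gbit_per_s * 1e9)

DATASET_BYTES = 863e9  # the 863 GB dataset used in the experiments

for label, gbps in [("10 GigE", 10), ("InfiniBand (4x)", 40), ("InfiniBand (5.6x)", 56)]:
    minutes = ideal_transfer_time_s(DATASET_BYTES, gbps) / 60
    print(f"{label:>17}: {minutes:5.1f} min")
```

For comparison, at the best observed rate of 940 MB/s the same dataset takes 863 GB / 940 MB/s, roughly 15 minutes, so the experiments operate between the 10 GigE ideal and the storage-limited regime discussed later.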
As long as the software is capable of supporting the native IB protocol, there is little reason to use the legacy IPoIB.

The results of the test that varied the number of servers, shown in Figure 4a, suggest that InfiniBand in fact has lower throughput than TCP. But the test that varied the number of streams, in Figure 4b, showed that after 8 streams InfiniBand has higher throughput than TCP. It is possible that some overhead causes InfiniBand to be slower than TCP up to 8 streams; after 8 streams, the higher theoretical throughput of InfiniBand allows it to surpass TCP.

Figure 5 shows the results of increasing the socket buffer size to 16 MiB and compares them to the results with a 4 MiB socket buffer size. A larger socket buffer size allows for a larger TCP window, and a larger TCP window potentially allows for a more continuous stream of data flowing through the network. The fact that throughput increases with the socket buffer size suggests that the initial size of 4 MiB was too small for the test network.

4.4. Parameter sweep

The parameter sweep in Figure 6 explores the effects of the number of parallel TCP streams and the number of parallel file system server nodes on throughput when the socket buffer size is 16 MiB. It was expected that when the number of storage nodes decreased, the throughput curve would lower, as the storage system would not have enough available throughput to feed the data transfer. The curves for 4–8 nodes are similar, but as predicted, at 3 nodes the performance drops sharply, by about 345 MB/s at 4 streams. There is another large drop, of about 176 MB/s, from 3 to 2 nodes at 4 streams.

4.5. Maximizing throughput

By examining the results in Figure 5, a pattern emerges with respect to the maximum throughput, the TCP buffer size, and the number of parallel streams. In Figure 5, when the buffer size is 4 MiB the peak throughput occurs at 16 streams. Similarly, when the buffer size is 16 MiB the peak (within error) occurs at 4 streams. In other words, changing the buffer size by a factor of 4 changed the number of streams by a factor of 1/4.

Further insight can be gained with additional background information. The maximum amount of data that can be "in flight" in a network is its bandwidth-delay product, or BDP. As the name suggests, the BDP is calculated by multiplying the link speed of the network by the round-trip delay time (RTT). The experimental network has a bandwidth of 10 gigabits/s and an RTT of approximately 52 ms as measured by the ping tool. The BDP is therefore 10 gigabits/s × 0.052 s = 0.52 gigabits. Converting the units yields a value of 62 MiB.

Therefore, in order to expect the maximum throughput from the experimental network, we must ensure the amount of data available to be transmitted is at least 62 MiB. The amount of data that can be transmitted is the TCP window size, estimated by the TCP buffer size, times the number of streams. This formula matches the experimental data: the maximum throughput occurs at 4 MiB × 16 = 64 MiB and at 16 MiB × 4 = 64 MiB. This relation between the BDP, the buffer size, and the number of streams that gives maximum throughput is stated in Equation 1.

BDP ≤ buffer × streams    (1)

In Equation 1, the BDP is usually a property of the network and should not change. An exception to this rule may be when there are multiple paths connecting nodes in the network; in that case it can be expected that each path might have a different BDP. It is far more useful to manipulate the other two parameters. The TCP socket buffer size is relatively easy to set on Linux systems; a possible constraint might be the maximum buffer size defined by the system administrator. The number of parallel TCP streams can be easy or difficult to change depending on the application. For GridFTP, changing the number of TCP streams is as simple as setting a command line option, but other applications may need to be completely redesigned to incorporate thread-level parallelism. Additionally, the practical number of parallel TCP streams is likely bounded by the number of physical CPU cores.

A final constraint that occurs during a real data transfer is that the storage system must be able to keep up with the network throughput. A network that is capable of 100 gigabits/s is not going to be fully utilized when it is connected to a storage system capable of only 1 gigabit/s. The 2- and 3-node results of Figure 6 show what happens to throughput when there is not enough storage bandwidth available. For a large dataset (such as those seen in genomics) the storage system must be able to sustain a high throughput. This fact is one of the reasons our transfers were run "at scale" with real data: using small test files would have tested not so much the full storage throughput as the throughput of the cache.

5. Conclusion

A wide variety of tests we performed narrowed the optimal data transfer parameters for parallel data streaming across AL2S and between two CloudLab clusters loading real genomics data onto a parallel file system. The best throughput was found to occur with GridFTP using at least 5 parallel TCP streams with a 16 MiB TCP socket buffer size to transfer to/from 4–8 BeeGFS parallel file system nodes connected by InfiniBand. An attempt at generalizing the results was made in Equation 1, where the TCP buffer size and number of parallel streams were related to the bandwidth-delay product of the network.

There are a few experimental thrusts available for further optimization. First, it is possible that several of the network settings outlined in Table 1 contributed negatively to the throughput of the transfer tests. In particular, we are concerned about the TCP timestamps, low-latency, and selective acknowledgement options because their settings appear contrary to common sense for a fast network with a high RTT. Although the settings were based on the ESnet recommendations, the exact values were copied from an HPC cluster that we later discovered was tuned for local low-latency message-passing communication.

Second, an important limitation of Equation 1 may be that it is appropriate only for networks with very little loss. In the case of loss, it is likely that more streams should be favored over a larger buffer size, as a greater number of parallel streams will quicken the recovery of the transfer in the face of loss.

Third, our experiments are designed such that all data flow through the clients, which act as data transfer nodes (DTNs). This configuration is common in Science DMZs, but in CloudLab there is no technical requirement for it, since all nodes are capable of communicating with each other. Using n times as many data transfer nodes has the potential of unlocking n times the performance of the single-node case as long as the parallel file system is capable of supplying enough storage bandwidth. Fortunately, parallel file systems are in fact designed for just such a situation.

5.1. Future work

It is important to note that dataset transfers are only a means to an end and that the actual objective is to run the genomic analysis workflow. As such, future work will focus not only on transfers but also on analysis, possibly both at the same time.

The effects of packet loss on the transfer throughput need to be more closely examined. Future work will use the Linux socket statistics tool, ss, to record loss events such as TCP retransmissions. The loss will be controlled with Linux traffic shaping. We will then attempt to fit this loss into our model.

6. Acknowledgements

This work was supported by the National Science Foundation [grant number 1447771].

References

[1] NCBI. Sequence read archive [online] (Aug. 2016). URL: http://www.ncbi.nlm.nih.gov/Traces/sra/.
[2] The Cancer Genome Atlas Research Network, J. N. Weinstein, E. A. Collisson, G. B. Mills, K. R. M. Shaw, B. A. Ozenberger, K. Ellrott, I. Shmulevich, C. Sander, J. M. Stuart, The cancer genome atlas pan-cancer analysis project, Nature Genetics 45 (10) (2013) 1113–1120. URL: http://dx.doi.org/10.1038/ng.2764.
[3] The 1000 Genomes Project Consortium, A global reference for human genetic variation, Nature 526 (7571) (2015) 68–74. URL: http://dx.doi.org/10.1038/nature15393.
[4] L. Zhi-Kang, G.-Y. Zhang, K. L. McNally, W.-S. Wang, J. Li, N. A. Alexandrov, J. A. et al., The 3,000 rice genomes project, GigaScience 3 (1) (2014) 1–6. URL: http://dx.doi.org/10.1186/2047-217X-3-7.
[5] F. A. Feltus, J. R. B. III, J. Deng, R. S. Izard, C. A. Konger, W. B. L. III, D. Preuss, K.-C. Wang, The widening gulf between genomics data generation and consumption: A practical guide to big data transfer technology, Bioinformatics and Biology Insights 9 (Suppl 1) (2015) 9.
[6] J. Heichler. An introduction to beegfs [online] (Nov. 2014). URL: http://www.beegfs.com/docs/Introduction_to_BeeGFS_by_ThinkParQ.pdf.
[7] S. A. Weil, Ceph: reliable, scalable, and high-performance distributed storage, Ph.D. thesis, University of California Santa Cruz (2007).
[8] A. Davies, A. Orsaria, Scale out with glusterfs, Linux Journal 2013 (235).
[9] P. Carns, W. B. L. III, R. B. Ross, R. Thakur, Pvfs: A parallel file system for linux clusters, in: Proceedings of the 4th Annual Linux Showcase and Conference, 2000, pp. 391–430.
[10] Lustre software release 2.x [online]. URL: http://doc.lustre.org/lustre_manual.xhtml#installinglustre.
[11] E. Dart, B. Tierney, E. Pouyoul, J. Breen, Achieving the science dmz, presentation, Joint Techs, Baton Rouge, LA (Jan. 2012). URL: http://www.internet2.edu/presentations/jt2012winter/ScienceDMZ-Tutorial-Jan2012-1.pdf.
[12] R. Ricci, E. Eide, The CloudLab Team, Introducing CloudLab: Scientific infrastructure for advancing cloud architectures and applications, USENIX ;login: 39 (6). URL: https://www.usenix.org/publications/login/dec14/ricci.
[13] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb, A. Joglekar, An integrated experimental environment for distributed systems and networks, in: Proceedings of the Fifth Symposium on Operating Systems Design and Implementation, USENIX Association, Boston, MA, 2002, pp. 255–270.
[14] Cloudlab [online] (Sep. 2016). URL: http://cloudlab.us/.
[15] Cloudlab overview [online]. URL: http://cloudlab.us/files/cloudlab-overview.pdf.
[16] Cloudlab hardware [online] (Sep. 2016). URL: http://docs.cloudlab.us/hardware.html.
[17] Internet 2 layer 2 services [online] (Sep. 2016). URL: http://www.internet2.edu/products-services/advanced-networking/layer-2-services/.
[18] Globalnoc [online]. URL: http://globalnoc.iu.edu/i2network/index.html.
[19] B. W. Settlemyer, J. D. Dobson, S. W. Hodson, J. A. Kuehn, S. W. Poole, T. M. Ruwart, A technique for moving large data sets over high-performance long distance networks, in: 2011 IEEE 27th Symposium on Mass Storage Systems and Technologies (MSST), 2011, pp. 1–6. doi:10.1109/MSST.2011.5937236.
[20] W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, The globus striped gridftp framework and server, in: Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, 2005, pp. 54–54. doi:10.1109/SC.2005.72.
[21] N. Mills. Cloudlab transfer experiment rspec [online] (Sep. 2016). URL: https://www.cloudlab.us//p/CloudLab/nlmpfs8xfer.
[22] Xdd [online]. URL: https://github.com/nlmills/xdd/commit/cdfb9a5a1ba14ca398f621508f7386c9d756322e.
[23] ESnet. Linux tuning [online] (Aug. 2016). URL: http://fasterdata.es.net/host-tuning/linux/.
[24] C. Hellwig, Xfs: the big storage file system for linux, ;login: the magazine of USENIX & SAGE 34 (5) (2009) 10–18.
[25] Beegfs tuning and advanced configuration [online] (Sep. 2016). URL: http://www.beegfs.com/wiki/TuningAdvancedConfiguration.
[26] N. S. V. Rao, D. Towsley, G. Vardoyan, B. W. Settlemyer, I. T. Foster, R. Kettimuthu, Sustained wide-area tcp memory transfers over dedicated connections, in: High Performance Computing and Communications (HPCC), 2015, pp. 1603–1606. doi:10.1109/HPCC-CSS-ICESS.2015.86.
[27] P. Carns, S. Lang, R. Ross, M. Vilayannur, J. Kunkel, T. Ludwig, Small-file access in parallel file systems, in: Parallel Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, 2009, pp. 1–11. doi:10.1109/IPDPS.2009.5161029.