Maximizing the Performance of Scientific Data Transfer By
Total Page:16
File Type:pdf, Size:1020Kb
Maximizing the Performance of Scientific Data Transfer by Optimizing the Interface Between Parallel File Systems and Advanced Research Networks Nicholas Millsa,∗, F. Alex Feltusb, Walter B. Ligon IIIa aHolcombe Department of Electrical and Computer Engineering, Clemson University, Clemson, SC bDepartment of Genetics and Biochemistry, Clemson University, Clemson, SC Abstract The large amount of time spent transferring experimental data in fields such as genomics is hampering the ability of scientists to generate new knowledge. Often, computer hardware is capable of faster transfers but sub-optimal transfer software and configurations are limiting performance. This work seeks to serve as a guide to identifying the optimal configuration for performing genomics data transfers. A wide variety of tests narrow in on the optimal data transfer parameters for parallel data streaming across Internet2 and between two CloudLab clusters loading real genomics data onto a parallel file system. The best throughput was found to occur with a configuration using GridFTP with at least 5 parallel TCP streams with a 16 MiB TCP socket buffer size to transfer to/from 4{8 BeeGFS parallel file system nodes connected by InfiniBand. Keywords: Software Defined Networking, High Throughput Computing, DNA sequence, Parallel Data Transfer, Parallel File System, Data Intensive Science 1. Introduction of over 2,500 human genomes from 26 populations have been determined (The 1000 Genomes Project [3]); genome Solving scientific problems on a high-performance com- sequences have been produced for 3,000 rice varieties from puting (HPC) cluster will happen faster by taking full ad- 89 countries (The 3000 Rice Genomes Project [4]). These vantage of specialized infrastructure such as parallel file raw datasets are but a few examples that aggregate into systems and advanced software-defined networks. The first petabytes with no end of growth in sight. In fact, if DNA step in a scientific workflow is the transfer of input data sequencing technology continues to improve in terms of from a source repository onto the compute nodes in the resolution and cost, one could view DNA sequencing as an HPC cluster. When the amount of data is small enough Internet of Things application with associated computa- it is possible to store datasets within the local institution tional challenges across the DNA data life cycle [5]. for quick retrieval. However, for certain domains ranging Domain scientists who are not experts in computing from genomics to social analytics the amount of data is often resort to transferring these large datasets using FTP becoming too large to store locally. In such cases the data over the commodity Internet at frustratingly slow speeds. must be optimally transferred from an often geographically Even if a fast network is available, large transfers can take distant central repository. on the order of several days to complete if the correct In genomics, our primary area of interest, the veloc- transfer techniques are not used. The HPC and Big Data ity of data accumulation due to high-throughput DNA se- communities have developed several tools for fast data quencing should not be underestimated, as data accumu- transfers but these tools are often not utilized. Several lation is accelerating into the Exascale. As of this writing factors contribute to the lack of utilization by scientists of there are six quadrillion base pairs in the sequence read the proper tools. First, awareness of the proper tools in archive database at NCBI [1] with each A,T,G,C stored as the community is lacking to the point where researchers re- at least two bytes. These data represent publicly mineable sult to familiar but technically inferior tools such as FTP. DNA datasets across the Tree of Life from viruses to rice Old habits are hard to break. Second, a lack of adequate to humans to elephants. New powerful genetic datasets documentation means even when researchers are pointed such as The Cancer Genome Atlas [2] contain deep se- to the right tools they are lost in a maze of configuration quencing from tumors of over 11,000 patients; sequences options. Manuals give an overview of various parameters but fail to explain what parameter values are suitable for high-performance data transfers. ∗Corresponding author Email addresses: [email protected] (Nicholas Mills), An end-to-end data transfer involves at least three ma- [email protected] (F. Alex Feltus), [email protected] (Walter jor components: a high-performance storage system, a B. Ligon III) high-performance network, and the software to tie it all Preprint submitted to Future Generation Computer Systems October 25, 2016 together. The various research communities have all ex- easily test different modern hardware and software con- amined optimized performance parameters for their re- figurations without the usual troubles associated with re- spective components. Unfortunately, our experience has installing the operating system and re-configuring the net- been that independently optimizing a single component work [15]. CloudLab allows the allocation of bare metal leads to slower performance in the system as a whole. machines with no virtualization overhead. Thus, Cloud- This work attempts to fill the gaps by suggesting opti- Lab is a realistic experimental environment for moving mized transfer parameters based on experimentally mea- real datasets within and between sites before moving al- sured end-to-end transfer performance. Where possible gorithms and procedures into production. the experimentally-determined parameters are supported A major capability of CloudLab is the ability to con- with appropriate theory. nect clusters together over Internet2 using software-defined networking. Our experiments use two of these clusters: the 1.1. Storage component Apt cluster in the University of Utah's Downtown Data When considering storage requirements it is important Center in Salt Lake City, and the Clemson cluster of Clem- to remember that data transfer is only the first step in son University in Anderson, South Carolina [16]. an HPC workflow. Presumably there will later be a sig- Communication between the Apt and Clemson Cloud- nificant amount of computation performed on the data Lab clusters occurs over the Internet2 network using the in parallel. While technologies such as large single-node Advanced Layer-2 Service (AL2S) [17]. AL2S allows nodes flash storage arrays may provide the highest raw storage in the two clusters to transparently communicate as if they throughput during a transfer, those arrays will quickly be- were connected to the same switch. That is, they appear come a bottleneck during the computation phase of the to be on the same IP subnetwork. The only discernible workflow as multiple compute nodes compete for access to difference is the relatively high communication latency of the centralized storage component. Instead, a parallel file 26 ms when compared to a more typical latency of 180 µs. system (PFS) provides a good compromise between raw Configuring a new path over AL2S would normally in- storage bandwidth and parallel processing scalability. volve multiple technical and administrative hurdles. The A PFS is a special case of a clustered file system, where experimenter would need to contact the appropriate net- the data files are stored on multiple server nodes. Storing work administrator and convince them to set up a new a file on n nodes where the nodes can be accessed simulta- path between sites. The network administrator would need neously theoretically allows for a speedup of n times over to configure the new path with Internet2 via the Global- the single-node case. In practice, the achievable speedup NOC [18]. The delay involved in this process could be is often much less than n and depends on both the parallel significant. For a production network that may run for file system implementation and the access patterns of the several years or more such a delay may not matter. But application performing I/O. It is therefore advantageous for a temporary experimental network intended to run on to evaluate multiple parallel PFSs to find the one most the order of weeks or less the large initial setup delay can suited to a particular workload. be significant. In this paper some of the more popular PFSs are eval- A major advantage of CloudLab for multi-site experi- uated. Among the file systems compared are BeeGFS [6], ments is that the CloudLab software configures new links Ceph [7], GlusterFS [8], and OrangeFS [9]. BeeGFS is over AL2S automatically at the beginning of each experi- later used for file transfer tests once it is determined to ment. In most cases this process is seamless and requires have the best performance in a benchmark. Lustre is an- only a couple minutes of waiting. A researcher need only other popular PFS that was not evaluated because it is specify the two sites to be connected and the CloudLab not supported for the Ubuntu operating system [10]. infrastructure handles the rest. 1.2. Network component 1.3. Software component Access to a fast network with plenty of excess capac- Two software tools drive experiments. The first tool, ity is essential for high performance data transfers. The XDD [19], is used to perform benchmarks of parallel file Science DMZ model [11] is well suited to data transfers systems. As a benchmark tool XDD makes use of its because of its support for the long-lived elephant flows knowledge of disks to achieve maximum performance by typical of large transfers. In particular the lack of fire- queueing multiple large I/O requests simultaneously. In walls in the model prevents slowdown caused by packet our experiments XDD uses multiple threads to access dif- processing overhead. The network infrastructure for our ferent regions of the same file in parallel. experiments is provided as a part of the CloudLab envi- The second tool, GridFTP [20], is a well-known appli- ronment [12, 13, 14].