Transparent Optimization of Parallel I/O via Standard System Tool Enhancement

Paul Z. Kolano NASA Advanced Supercomputing Division, NASA Ames Research Center M/S 258-6, Moffett Field, CA 94035 U.S.A. [email protected]

Abstract—Standard system tools employed by users on a daily basis do not take full advantage of parallel file system I/O bandwidth and do not understand associated idiosyncrasies such as Lustre striping. This can lead to non-optimal utilization of both the user's time and system resources. This paper describes a set of modifications made to existing tools that increase parallelism and automatically handle striping. These modifications result in significant performance gains in a transparent manner with maximum speedups of 27x, 15x, and 31x for parallelized cp, tar creation, and tar extraction, respectively.

I. INTRODUCTION

In a typical HPC workflow, users invoke a variety of standard system tools to transfer data onto the system and prepare it for processing before analyzing it on large numbers of CPUs and preparing/retrieving the results. User data is typically stored on a parallel file system that is capable of serving large numbers of clients with high aggregate bandwidth. The standard tools used to manipulate the data, however, are not typically written with parallel file systems in mind. They either do not employ enough concurrency to take full advantage of file system bandwidth and/or leave files in a state that leads to inefficient processing during later access by parallel computations. This results in non-optimal utilization of both the user's time and system resources.

This paper presents a set of modifications to standard file manipulation tools that are optimized for parallel file systems. These tools include highly parallel versions of cp and tar that achieve significant speedups on several file systems as well as Lustre-specific versions of bzip2, gzip, rsync, and tar that have embedded knowledge of Lustre striping to ensure files are striped appropriately for efficient parallel access. These tools can be dropped into place over standard versions to transparently optimize I/O whenever the user would normally invoke them, resulting in greater productivity and resource utilization. These modifications will be discussed as well as detailed performance numbers on various parallel file systems.

This paper is organized as follows. Section II discusses the addition of stripe-awareness into commonly used tools. Section III describes high performance modifications to the cp and tar utilities. Section IV details related work. Finally, Section V presents conclusions and future work.

II. STRIPE-AWARE SYSTEM TOOLS

Parallel file systems such as CXFS [20], GPFS [18], and Lustre [19] utilize parallel striping across large numbers of disks to achieve higher aggregate performance than is possible on a single-disk file system. Unlike CXFS and GPFS, however, striping must be specified explicitly before a file is first written. The stripe count determines how many Object Storage Targets (OSTs) a file will be divided across, which can significantly impact I/O performance. A greater number of OSTs provides more available bandwidth but also produces greater resource contention during metadata operations. Striping cannot be changed after a file is created without copying the file in its entirety, so it should be set carefully.

Figure 1 shows the time to perform the four basic metadata operations on 10,000 files while varying the stripe count using an otherwise idle Lustre file system. As can be seen, the impact of striping can be significant even when file contents are not being read or written. The greatest impact moving between 1 and 64 stripes was on the stat operation with a slowdown of 9.8x while the unlink operation was least impacted with a slowdown of 2.9x. The open and create operations slowed down by 8.4x and 5.9x, respectively.

Fig. 1. Run time of Lustre metadata operations on 10k files

To better understand the read/write effects of striping, I/O performance was measured by running parallel instances of the dd command on varying numbers of 3.0 GHz dual quad-core Xeon Harpertown front-ends while reading or writing disjoint blocks of the same file to /dev/null or from /dev/zero, respectively, via direct I/O. File sizes ranged from 1 GB to 64 GB with stripe counts varying from 1 to 64. Figures 2, 3, 4, and 5 show the times to write files of each size and stripe count on 1, 2, 4, and 8 hosts, respectively, while Figures 6, 7, 8, and 9 show the corresponding read times. As can be seen, write


Proc. of the 2nd Intl. Wkshp. on High Performance Data Intensive Computing, Boston, MA, May 24, 2013

times consistently decrease as the stripe count increases for all file sizes and numbers of hosts with more significant drops as the number of hosts increases. In the read case, however, read times only decrease up to a certain number of stripes after which they increase again, but larger stripe counts become more effective as the number of hosts increases.

Fig. 2. 1-host dd write

Fig. 3. 2-host dd write

If the stripe count of the directory containing a newly created file has been explicitly set, that file will be striped according to the parent directory settings. Otherwise, it will be striped according to the default striping policy. As just shown, however, different file sizes may behave better with different stripe counts. A high default value causes small files to waste space on OSTs and generates an undesirable amount of OST traffic during metadata operations. A low default value results in significantly reduced parallel performance for large files and imbalanced OST utilization. Users can also explicitly specify the stripe count for files and directories. Many users, however, may not know about striping, may not remember to set it, or may not know what the appropriate value should be. Directory striping may be set, but the typical mixtures of large and small files mean that users face the same dilemma as in the default case. Even when specified correctly, carefully striped files may revert to the default during manipulation by common system tools.

Since neither approach is optimal, a new approach was developed that utilizes stripe handling embedded into common system tools to stripe files dynamically according to size as users perform normal activities. This allows the default stripe count to be kept low for more common small files while larger files are transparently striped across more OSTs as they are manipulated. The tools selected for modification were based on typical HPC workflows in which a user (1) remotely transfers data to the file system using scp, sftp, rsync, bbftp, gridftp, etc., (2) prepares data for processing using tar -x, gunzip, bunzip2, unzip, etc., (3) processes data on compute resources using arbitrary code, (4) prepares results for remote transfer using tar -c, gzip, bzip2, zip, etc., and (5) remotely retrieves results from the file system. The key to the approach is that by the third step, the input data will have hopefully been striped appropriately in order for processing to achieve the highest I/O performance. The fifth step can be ignored as data is being written to an external file system outside the local organization. Additional tools occur in other common activities such as administrators copying data between file systems to balance utilization using cp, rsync, etc., users copying data between file systems (e.g. home/backup directory to scratch space) using cp, rsync, etc., or users retrieving data from archive systems using scp, sftp, rsync, bbftp, gridftp, etc.

Fig. 4. 4-host dd write

Fig. 5. 8-host dd write

Adding stripe-awareness to existing tools is a straightforward process. Since striping needs to be specified at file creation, the first step is to find instances of the open() call that use the O_CREAT flag and determine if the target file is on Lustre using statfs(). If so, the projected size of the file is computed and used to determine the desired stripe count. Finally, the open() call is switched to the Lustre API equivalent, llapi_file_open(), with the given count. Determining the projected size may involve greater complexity for applications such as tar where the target size may be the sum of many individual file sizes.
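The retrofit just described can be sketched as follows. This is an illustrative sketch rather than Retools source: the real Lustre call, llapi_file_open(), appears only in a comment (with the signature from lustreapi.h) and is replaced by plain open() plus a diagnostic so the sketch compiles without the Lustre headers, and the one-stripe-per-GB policy is a stand-in for the formulas derived later in this section. Note that statfs() is applied to the parent directory, since the file itself does not exist yet at creation time.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/vfs.h>

#define LL_SUPER_MAGIC 0x0BD00BD0  /* f_type reported for a Lustre mount */

/* Stand-in policy: one stripe per projected GB, clamped to 1..64. */
static int stripes_for(uint64_t projected_size)
{
    uint64_t gb = projected_size >> 30;
    return gb < 1 ? 1 : (gb > 64 ? 64 : (int)gb);
}

/* Replacement for open(path, flags | O_CREAT, mode); dir is the
 * parent directory of path, which must already exist. */
int stripe_aware_create(const char *dir, const char *path, int flags,
                        int mode, uint64_t projected_size)
{
    struct statfs fs;
    if (statfs(dir, &fs) == 0 && (uint32_t)fs.f_type == LL_SUPER_MAGIC) {
        int count = stripes_for(projected_size);
        /* On Lustre, create the file with an explicit stripe count:
         *   return llapi_file_open(path, flags | O_CREAT, mode,
         *                          0, -1, count, 0);                 */
        fprintf(stderr, "%s: would create with %d stripes\n", path, count);
    }
    return open(path, flags | O_CREAT, mode);  /* non-Lustre fallback */
}
```

On a non-Lustre file system the wrapper degenerates to a plain open(), which is what allows the modified tools to be dropped in everywhere.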


projected size may involve greater complexity for applications such as tar where the target size may be the sum of many individual file sizes.

From the figures, an approximate function for computing optimal stripe count can be extrapolated by examining the points of diminishing or negative returns. Based on this limited data, the optimal write stripe count is projected to be size·hosts/8 GB while the optimal read stripe count is projected to be size·hosts/16 GB. Since the number of hosts that users will run on is unknown, a value must be chosen to represent anticipated usage (e.g. from historical job logs). Armed with these formulas, a set of stripe-aware tools has been implemented to cover the basic workflow/other activities of archival/extraction via tar, compression/decompression via bzip2/bunzip2 and gzip/gunzip, local transfer via cp and rsync, and remote transfer via rsync. This set of tools is known as Retools (Restriping Tools for Lustre), which will be available as open source shortly [17].

Fig. 6. 1-host dd read

Fig. 7. 2-host dd read

Figures 10, 11, 12, and 13 show the execution times of invoking the regular and stripe-aware versions of bzip2/bunzip2, gzip/gunzip, rsync, and tar, respectively, on single-striped files of varying size with 1 GB files used for the tar cases. To better approximate the stripe count desired for the local environment, the logs of over 900,000 jobs from 2012 were analyzed to find the average number of nodes used, where nodes roughly correspond to hosts in the previous formulas. The average number of nodes was found to be 12 with a time adjusted average (with nodes scaled by job run time) of 24. Based on these numbers, striping was chosen to optimize reads at a projected job size of 16 nodes, which works out to 1 stripe per GB. The performance for stripe-aware cp and an alternate version of stripe-aware tar will be given in Section III. As can be seen, modest performance gains were achieved in 32 of 49 cases due to the benefit of writing to a higher number of stripes. Only 8 of 17 losses were greater than 5%, all but one of which were from the bunzip2 case. It is not known why bunzip2 performs so poorly with increased striping.

Fig. 8. 4-host dd read

Fig. 9. 8-host dd read

Note that any performance gains by these tools are simply beneficial side effects of the main goal of increasing I/O performance during resource-intensive parallel jobs. To understand the types of gains that may be achieved during such jobs, the mpi-tile-io benchmark [16] was used to simulate an I/O access pattern that may be seen in practice. This benchmark tests the performance of the file system using noncontiguous tiled access, which is found in tiled displays and various numerical applications. Figure 14 shows the results of running mpi-tile-io on varying numbers of CPUs (8 per node) across different numbers of stripes with the tile size fixed at 1 GB per CPU. As can be seen, at higher numbers of CPUs and lower stripe counts, contention for the underlying OSTs becomes a significant issue and each CPU is only able to get its data at a fraction of the possible I/O rate.

The figure also shows the stripe counts predicted to be optimal from the previous dd results and the counts resulting from the local configuration of 1 stripe per GB. With the exception of the anomalous 1 CPU, 2 stripe case, the configured value is an average of 5% away from ideal with a maximum difference of 13%. The predicted value is an average of 15% away from ideal with a maximum difference of 32%.
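The extrapolated heuristics above translate directly into code. The sketch below is illustrative (the function names are not from the Retools source); the clamp to the 1-64 range mirrors the stripe counts measured above, and integer division is assumed adequate since stripe counts are coarse.

```c
#include <stdint.h>

#define GiB (1ULL << 30)

static int clamp64(uint64_t n)
{
    return n < 1 ? 1 : (n > 64 ? 64 : (int)n);
}

/* Optimal write stripe count: size * hosts / 8 GB */
int write_stripes(uint64_t size, int hosts)
{
    return clamp64(size * (uint64_t)hosts / (8 * GiB));
}

/* Optimal read stripe count: size * hosts / 16 GB */
int read_stripes(uint64_t size, int hosts)
{
    return clamp64(size * (uint64_t)hosts / (16 * GiB));
}
```

At the 16-host projected job size chosen above, read_stripes() reduces to one stripe per GB of file size, which is exactly the configured local policy.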


It should be noted that domain experts would likely achieve the highest possible performance using manual striping as they understand exactly how the data will be used. For the majority of users that may not understand all the intricacies of striping, however, the transparent optimization provided by stripe-aware tools will likely provide a significant performance boost.

Fig. 10. Stripe-aware bzip2

Fig. 11. Stripe-aware gzip

Fig. 12. Stripe-aware rsync

Fig. 13. Stripe-aware tar

Fig. 14. Read performance of mpi-tile-io with a 1 GB tile per CPU

III. HIGH PERFORMANCE SYSTEM TOOLS

Stripe-awareness is a good first step towards transparent I/O optimization. The default stripe count can be kept low for more common small files resulting in reductions to OST contention and wasted space. Large files will automatically use more stripes as they are manipulated by these tools allowing user computations to achieve higher performance and OST utilization to be kept in better balance. The tools themselves, however, achieve only modest performance gains with I/O rates still nowhere near the raw speed of the file system. Several issues prevent standard tools from achieving higher performance. They typically use a single thread of execution, which cannot keep single system I/O bandwidth fully utilized. They rely on the buffer cache, which can become a bottleneck with multiple threads. They use sequential reads/writes instead of overlapping them. Finally, they operate on one host, where single system bottlenecks limit maximum I/O. Adding parallelism is significantly more complex than adding stripe-awareness. This section describes high speed variants of cp and tar that have been designed to provide I/O performance more in line with that of parallel file systems.

Performance was measured on CXFS, GPFS, and Lustre file systems, each attached to a different compute system. CXFS tests were run on a shared-memory SGI Altix 4700 consisting of 1.6 GHz dual-core Itanium 2 Montecito processors with 2 GB memory per core. GPFS tests were run on IBM iDataPlex nodes consisting of two 2.8 GHz quad-core Xeon Nehalem processors with 24 GB memory per node connected via DDR Infiniband. Lustre tests were run on SGI ICE nodes consisting of two 3.0 GHz quad-core Xeon Harpertown processors with 8 GB memory per node connected via DDR Infiniband.


CXFS and Lustre tests were run across two different file systems but only a single GPFS file system was available. GPFS scaling results were likely decreased due to double resource consumption. The GPFS instance was a pre-production file system, however, with no other activity on it. All CXFS and Lustre instances were in production, but Lustre was measured at known near-idle periods. The activity on CXFS was at unknown levels during testing but likely to have been fairly idle given the level of compute node activity.

A. High Performance Cp

Cp is the primary data movement application on Unix operating systems and is often involved in large file operations such as administrators balancing out file systems or users copying data between scratch and archive file systems. The standard single-threaded implementation of cp from GNU coreutils [5], however, cannot fully utilize parallel file system bandwidth due to limited concurrency, causing these operations to take much longer than needed. A high performance version of cp allows the speed of these operations to be increased transparently through more effective I/O utilization.

Mcp is a multi-threaded modification of the coreutils cp command using OpenMP [2] and is available as open source [14]. Mcp was previously introduced by the author [9] with results on Lustre but additional results on CXFS and GPFS are shown here to demonstrate its general applicability to different parallel file systems. Mcp is stripe-aware so can also be used as a fast restriping tool as files on Lustre must be copied to be restriped.

In general, copying regular files is an embarrassingly parallel task since files are completely independent from one another. The processing of the directory hierarchy containing the files, however, must ensure that a file's parent directory exists and is writable when the copy begins and must have its original permissions and ACLs when the copy completes. Mcp operates using two types of threads. A single traversal thread behaves like cp except when a regular file is encountered. In cp, regular files are copied immediately whereas in mcp, the traversal thread instead creates a copy task and pushes it onto a semaphore-protected task queue. It then pops an open queue to wait for the file to be opened before setting permissions and ACLs and continuing on to the next file.

The second type of thread is a worker thread with multiple instances. Each worker pops a task from the task queue, opens the file, and pushes a notification onto the open queue allowing the traversal thread to continue. Once the file has been opened, directory permission and ACL changes made by the traversal thread can no longer affect the worker's access to the file. The worker then performs the copy and pops another task when complete.

The original cp uses buffered I/O, which can exhibit poor buffer cache utilization since file data is read once, but never accessed again. This increases CPU workload by the kernel and decreases performance of other I/O as it thrashes the buffer cache. To address this problem, mcp supports both direct I/O, which allows reads and/or writes to skip the buffer cache entirely, and posix_fadvise(), which informs the kernel that data can be released after it is read and/or written.

Figure 15 shows the performance of copying 64 1 GB files on a single node for varying numbers of threads on different file systems. Direct I/O performance was not gathered on GPFS as it was significantly worse than the other two schemes during initial testing. As can be seen, however, direct I/O on CXFS and Lustre resulted in significant gains with more than a 7x improvement in both 8-thread cases. GPFS had the lowest overall performance gain of 16%, but there was less margin for gain as the original cp performed exceptionally well at 4x faster than cp on Lustre and 6x faster than cp on CXFS. Fadvise() provided a small gain/loss on CXFS/GPFS, respectively, but improved Lustre much more where buffering saw a negative impact at higher thread counts.

Fig. 15. Buffer-managed copy performance

Multi-threading increases overall parallelism, but parallelism within each thread can be increased even more via double buffering. Using asynchronous I/O, the read of the next file block can be overlapped with the write of the previous block. This theoretically reduces the time to process each block from time(read) + time(write) to max(time(read), time(write)). Figure 16 shows double buffering results where direct I/O on Lustre had the greatest gains, improving an average of 52% up to 10x cp on 8 threads. Direct I/O on CXFS improved an average of 25% up to 4 threads, but decreased 4% at 8 threads indicating that perhaps I/O limits were being reached on the compute system. Fadvise() on GPFS improved an average of 18% up to 1.3x cp on 8 threads, while on CXFS and Lustre, performance improved slightly up to 4 threads but decreased at 8 threads by 12% and 4%, respectively.

On GPFS and Lustre, tests used a single node of a distributed memory system, hence subject to that node's physical resource limitations. To overcome these limits, mcp supports multi-node operation where the aggregate resources of multiple nodes can be used to carry out the same transfer. Nodes are divided into a single manager node that runs the traversal thread and multiple worker nodes. All nodes run multiple worker threads along with a communication thread to handle the distribution of tasks between the traversal thread on the manager node and worker threads on other nodes via TCP or MPI. The manager TCP/MPI thread waits for messages from the worker TCP/MPI threads, after which it pops a task from the manager task queue and sends it back to the worker node, which pushes it onto the local task queue.
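The double-buffered copy described above can be sketched with POSIX asynchronous I/O. This is an illustrative loop, not mcp source: while the block in one buffer is written out, aio_read() fills the other buffer, so each block ideally costs max(time(read), time(write)). Error handling is minimal and the direct I/O and posix_fadvise() options are omitted; on older glibc the AIO functions additionally require linking with -lrt.

```c
#include <aio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 65536

/* Copy in_fd to out_fd with two buffers: write the current buffer
 * while the next block is read asynchronously into the other one. */
int copy_double_buffered(int in_fd, int out_fd)
{
    static char buf[2][BLOCK];
    struct aiocb cb;
    const struct aiocb *list[1] = { &cb };
    int cur = 0;

    ssize_t n = read(in_fd, buf[cur], BLOCK);  /* prime the first block */
    off_t off = n;
    while (n > 0) {
        /* start the asynchronous read of the next block */
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = in_fd;
        cb.aio_buf = buf[1 - cur];
        cb.aio_nbytes = BLOCK;
        cb.aio_offset = off;
        if (aio_read(&cb) != 0)
            return -1;

        /* overlap: write the current block while the read is in flight */
        if (write(out_fd, buf[cur], (size_t)n) != n)
            return -1;

        while (aio_suspend(list, 1, NULL) != 0) {}  /* wait for the read */
        if (aio_error(&cb) != 0)
            return -1;
        n = aio_return(&cb);   /* bytes read; 0 at end of file */
        off += n;
        cur = 1 - cur;         /* swap buffers */
    }
    return n < 0 ? -1 : 0;
}
```

With direct I/O the same structure applies, but the buffers and offsets must additionally satisfy the file system's alignment requirements.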


Worker threads operate as normal by popping tasks off their local queue and pushing notifications of file opens onto the local open queue, which are sent back to the manager via the TCP/MPI thread.

Fig. 16. Double-buffered copy performance

Figure 17 shows the results of using multiple nodes for the same copy. Very significant performance gains are achieved on Lustre with fadvise() actually overtaking direct I/O on 16 nodes at 27x cp compared to 24x cp for direct I/O. Larger gains are also seen on GPFS at 3.4x cp on 16 nodes.

With fewer files than threads or a few large files, threads can become imbalanced. To evenly distribute workload across threads/nodes, mcp supports split processing of files so multiple threads can operate on different portions of the same file. In this mode, the traversal thread may add multiple tasks for each file corresponding to different portions of the file at different offsets that can be processed in parallel.

Fig. 17. Multi-node copy performance

Figure 18 shows the copy of a single 64 GB file across multiple threads. The profiles of the CXFS and Lustre single node cases are very similar with an initial gain followed by almost no gain as the number of threads increases. On CXFS, the maximum is similar to the multi-file case, but on Lustre, there appears to be a bottleneck accessing the same file with multiple threads on the same node that more than halves multi-file performance with both direct I/O and fadvise(). If 1 thread per node is used instead of 1 thread per cpu, however, the results are significantly different at 19x and 12x cp on 16 nodes/threads for direct I/O and fadvise(), respectively. It is likely that Lustre locking mechanisms are enforcing sequential access to the file between threads on the same node as the maximum achieved in the single file 8-thread cases is almost identical to that of the 1-thread cases in the multi-file scenario. GPFS scaled similarly in both the thread per cpu and thread per node cases up to 3.6x cp on 16 nodes/threads. Note that the GPFS file system for this particular test was different from previous tests. The split file test used a production GPFS file system with an unknown amount of other activity.

Fig. 18. Split-file copy performance

B. High Performance Tar

One of the most commonly used tools in basic user workflows is tar, which is used to consolidate multiple input/output files before transfer to/from the system. Tar was originally intended for inherently sequential tape access, hence is sequential by design. Since it is now more commonly used on standard file systems, however, these design choices leave it with the same types of inefficiencies that plague cp.

While the tar source code could have been optimized in the same manner as the cp code was for mcp, the tar code is quite a bit larger and more complex than cp, which itself took a significant effort to parallelize. Instead, it was realized that the sequential nature of tar files allows them to be easily parallelized by leveraging existing mcp work. Namely, each file within a standard tar archive is stored contiguously surrounded by header and padding blocks. Hence, creations/extractions can be achieved with partial copies to/from a given offset in the tar archive from/to a given file.

To support this functionality, mcp was given the ability to copy any offset and length of the source file to any offset of the target, and on Lustre, the ability to specify the target stripe count as a multiple of the length. Additionally, mcp was given the ability to read a list of the source/target files to copy from stdin instead of having to specify them on the command line. Then a Perl version of tar from the Archive::Tar module was modified such that instead of copying file data itself, it creates a list of partial copies and feeds them to mcp, which is then spawned on as many hosts as specified.
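The partial-copy list fed to mcp follows directly from the archive layout. The sketch below is illustrative rather than the modified Archive::Tar code, and assumes the simplest POSIX ustar layout: one 512-byte header per member with data padded to a 512-byte boundary (extended or long-name headers would shift these offsets).

```c
#include <stdint.h>

#define TAR_BLK 512ULL

/* For members with the given data sizes, record the archive offset of
 * each member's data in offsets[] and return the total archive length.
 * Each (offsets[i], sizes[i]) pair is one partial-copy task for mcp. */
uint64_t plan_partial_copies(const uint64_t *sizes, int n, uint64_t *offsets)
{
    uint64_t off = 0;
    for (int i = 0; i < n; i++) {
        offsets[i] = off + TAR_BLK;  /* data immediately follows header */
        uint64_t padded = (sizes[i] + TAR_BLK - 1) / TAR_BLK * TAR_BLK;
        off += TAR_BLK + padded;     /* 512-byte header + padded data */
    }
    return off + 2 * TAR_BLK;        /* archive ends with two zero blocks */
}
```

During creation these tasks copy file contents into a pre-written, suitably striped skeleton at the given offsets; during extraction the direction of each partial copy is simply reversed.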


During creation, a (suitably striped) tar skeleton is written with appropriate headers and padding. In the extraction case, file locations within the existing archive are determined.

The drawback to this approach is that the resulting program is not completely drop-in compatible with the original (tar) as are all the other modifications discussed thus far. Only basic creation/extraction options and the POSIX ustar format [7] are supported. Any unsupported options or formats cause the default tar program to be invoked instead.

Figure 19 shows the results of creating and extracting a tar file consisting of 64 1 GB files. Note that GPFS and Lustre used front-end systems with the previous configuration instead of dedicated PBS nodes. Also, GPFS tests were run on the shared production file system with unknown activity mentioned in the mcp split file tests. Lustre achieved creations at 15x tar and extractions at 31x tar on 8 hosts. CXFS had the next highest gains with creations at 6.3x tar and extractions at 9.3x tar on 8 cpus. Finally, GPFS achieved creations at 3.2x tar and extractions at 3.6x tar on 8 hosts.

Fig. 19. Multi-host parallel tar performance

IV. RELATED WORK

There are a variety of efforts related to this paper. SGI's cxfscp [21] is a multi-threaded copy tool that supports direct I/O and achieves results similar to mcp on shared-memory systems, but offers minimal benefit on cluster architectures. Streaming parallel distributed cp (spdcp) [12] has similar goals as mcp and achieves very high performance on clustered file systems using MPI to parallelize transfers of files across many nodes. Like mcp, spdcp can utilize multiple nodes to transfer a single file. The spdcp designers made the conscious decision to develop from scratch, however, instead of using GNU coreutils as a base, whereas mcp started with coreutils to support all available cp options and to take advantage of known reliability characteristics. Mcp can also use a TCP model as well as MPI to support a larger class of systems.

There are several related multi-threaded programs for the Windows operating systems. RichCopy [6] supports multi-threading in addition to the ability to turn off the system buffer, which is similar to mcp's direct I/O option. MTCopy [11] operates in a similar manner as mcp with a single file traversal thread and multiple worker threads. MTCopy also has the ability like mcp to split the processing of large files amongst multiple threads.

Ong et al. [15] describe the parallelization of cp and other utilities using MPI. Their cp command, however, was designed to copy one file to many nodes whereas mcp was designed to allow many nodes to copy parts of the same file. Desai et al. [1] use a similar strategy to create a parallel rsync utility that can synchronize files across many nodes at once. HPSS Tar [4] optimizes tar by allowing archives to be created/extracted directly to/from HPSS, bypassing local storage. It uses multi-threading, buffer management, and HPSS network striping capabilities to significantly increase tar performance. The pltar utility [13] uses MPI to parallelize archive creation and extraction and achieves very high performance on Lustre file systems.

There are several studies of Lustre performance. Simms et al. [22] measured the sustained local/remote Lustre I/O performance using varying stripe counts with results focused on optimizing stripe count for different block sizes instead of for different file sizes. Fahey et al. [3] investigate basic Lustre I/O performance characteristics and find minimal benefit to having more stripes than writers, which supports optimum stripe counts based on number of hosts. Yu et al. [23] investigate stripe size performance and stripe count metadata cost, confirming the high metadata overhead of small files with large stripe counts. Finally, Laros et al. [10] examine Lustre performance and observe drops during single file I/O when compared to multi-file I/O as was observed earlier.

V. CONCLUSIONS AND FUTURE WORK

This paper has described modifications made to standard system tools commonly found in user workflows to better support parallel file systems. Various tools were enhanced with knowledge of Lustre striping to improve parallel I/O performance in computational jobs as well as reducing contention, wasted space, and imbalances on OSTs. The cp and tar tools were further modified with greatly enhanced parallelism for significant performance gains on several parallel file systems. By deploying these tools in place of their standard counterparts, users will transparently achieve better I/O utilization and increased performance by simply using tools as normal.

Figure 20 shows a summarized view of the times a typical workflow might take before the availability of the tools described in this paper and after, where data of a given size is transferred in, extracted with tar, and run through some parallel application on an appropriate number of CPUs (only the I/O portion shown), with the resulting output data (assumed to be the same size as the input for simplicity) tarred up and transferred back to the origin. As can be seen in the middle part of the figure, although the optimizations reduce execution time considerably, remote transfer time can still end up dominating the workflow as network speeds are typically much slower than disk speeds. In other work by the author, a tool called Shift [8] was introduced that can automatically parallelize remote transfers as well. The right side of the figure

7

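The transparent, drop-in deployment described above can be sketched as a small shim. This is an illustration only: mcp is the parallel copy tool from mutil [14], while "mtar" is an assumed name for the parallelized tar (the actual binary name may differ). The sketch falls back to the standard tools when the parallel versions are not installed, which is precisely what keeps existing command lines and scripts unchanged.

```shell
# Sketch of transparent tool replacement. Assumptions: "mcp" is the parallel
# cp from mutil; "mtar" is a hypothetical name for the parallelized tar.
# When neither is installed, the standard cp/tar are used, so the same
# command lines work unchanged on any system.
CP=$(command -v mcp >/dev/null 2>&1 && echo mcp || echo cp)
TAR=$(command -v mtar >/dev/null 2>&1 && echo mtar || echo tar)

# The commands below are exactly what a user would type with plain cp/tar.
rm -rf /tmp/pio_demo
mkdir -p /tmp/pio_demo/src /tmp/pio_demo/out
echo "sample data" > /tmp/pio_demo/src/file1

"$CP" -r /tmp/pio_demo/src /tmp/pio_demo/dst         # recursive copy
"$TAR" -cf /tmp/pio_demo/a.tar -C /tmp/pio_demo src  # archive creation
"$TAR" -xf /tmp/pio_demo/a.tar -C /tmp/pio_demo/out  # archive extraction
```

In practice the same effect can be achieved by installing the parallel tools earlier in the PATH under the standard names, so that unmodified user workflows pick them up automatically.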
The right side of the figure shows a fully optimized workflow with transfer times taking advantage of Shift's greatly enhanced performance. As can be seen, when taken together, these tools allow workflows to operate at much higher throughputs than default tools provide, thereby increasing the resource utilization of both users and systems with no extra effort on the users' part.

Fig. 20. Summary of performance gains (time in seconds vs. data size in GB for xfer in, tar -x, app I/O, tar -c, and xfer out; before, after, and after with Shift)

There are a variety of directions for future research. Other standard tools should be made stripe-aware, including zip, pigz, and pbzip2 for archival/compression and scp, sftp, bbftp, and gridftp for file transfer. Additional tools should be investigated to see which can be parallelized and which would benefit the most. Finally, other I/O access patterns should be studied to further validate the benefits of automatic striping.

VI. ACKNOWLEDGMENTS

This work is supported by the NASA Advanced Supercomputing Division under Task Number ARC-013 (Contract NNA07CA29C) with Computer Sciences Corporation.

REFERENCES

[1] N. Desai, R. Bradshaw, A. Lusk, E. Lusk: MPI Cluster System Software. 11th European PVM/MPI Users' Group Meeting, Sept. 2004.
[2] L. Dagum, R. Menon: OpenMP: An Industry-Standard API for Shared-Memory Programming. IEEE Computational Science and Engineering, vol. 5, no. 1, Jan.-Mar. 1998.
[3] M. Fahey, J. Larkin, J. Adams: I/O Performance on a Massively Parallel Cray XT3/XT4. 22nd IEEE Intl. Parallel and Distributed Processing Symp., Apr. 2008.
[4] Gleicher Enterprises: HPSS Tar Man Page. http://www.mgleicher.us/GEL/htar/htar_man_page.html.
[5] GNU Core Utilities. http://www.gnu.org/software/coreutils/manual/coreutils.html.
[6] J. Hoffman: Utility Spotlight: RichCopy. TechNet Magazine, Apr. 2009.
[7] IEEE Computer Society: Standard for Information Technology - Portable Operating System Interface (POSIX) Base Specifications, Issue 7. IEEE Standard 1003.1-2008, Dec. 2008.
[8] P.Z. Kolano: High Performance Reliable File Transfers Using Automatic Many-to-Many Parallelization. 5th Wkshp. on Resiliency in High Performance Computing, Aug. 2012.
[9] P.Z. Kolano, R.B. Ciotti: High Performance Multi-Node File Copies and Checksums for Clustered File Systems. 24th USENIX Large Installation System Administration Conf., Nov. 2010.
[10] J.H. Laros, L. Ward, R. Klundt, S. Kelly, J.L. Tomkins, B.R. Kellogg: Red Storm IO Performance Analysis. 9th IEEE Intl. Conf. on Cluster Computing, Sept. 2007.
[11] Y.S. Li: MTCopy: A Multi-threaded Single/Multi File Copying Tool. CodeProject article, May 2008. http://www.codeproject.com/KB/files/Lys_MTCopy.aspx.
[12] K. Matney, S. Canon, S. Oral: A First Look at Scalable I/O in Linux Commands. 9th LCI Intl. Conf. on High-Performance Clustered Computing, Apr. 2008.
[13] K.D. Matney, G. Shipman: Parallelism in System Tools. 52nd Cray User Group Conf., May 2010.
[14] Multi-Threaded Multi-Node Utilities. http://mutil.sourceforge.net.
[15] E. Ong, E. Lusk, W. Gropp: Scalable Unix Commands for Parallel Processors: A High-Performance Implementation. 8th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, Sept. 2001.
[16] Parallel I/O Benchmarking Consortium. http://www.mcs.anl.gov/research/projects/pio-benchmark.
[17] Restriping Tools for Lustre. http://retools.sourceforge.net.
[18] F. Schmuck, R. Haskin: GPFS: A Shared-Disk File System for Large Computing Clusters. 1st USENIX Conf. on File and Storage Technologies, Jan. 2002.
[19] P. Schwan: Lustre: Building a File System for 1,000-node Clusters. 2003 Linux Symp., Jul. 2003.
[20] L. Shepard, E. Eppe: SGI InfiniteStorage Shared Filesystem CXFS: A High-Performance, Multi-OS Filesystem from SGI. Silicon Graphics, Inc. white paper, 2004.
[21] Silicon Graphics Intl.: Cxfscp Man Page. http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=0650&db=man&fname=/usr/share/catman/a_man/cat1m/cxfscp.z.
[22] S.C. Simms, G.G. Pike, D. Balog: Wide Area Filesystem Performance using Lustre on the TeraGrid. 2nd TeraGrid Conf., Jun. 2007.
[23] W. Yu, J.S. Vetter, H.S. Oral: Performance Characterization and Optimization of Parallel I/O on the Cray XT. 22nd IEEE Intl. Parallel and Distributed Processing Symp., Apr. 2008.


Proc. of the 2nd Intl. Wkshp. on High Performance Data Intensive Computing, Boston, MA, May 24, 2013