A Scaling Analysis of Linux I/O Performance
Yannis Klonatos, Manolis Marazakis, and Angelos Bilas
Foundation for Research and Technology - Hellas (FORTH), Institute of Computer Science (ICS)
100 N. Plastira Av., Vassilika Vouton, Heraklion, GR-70013, Greece
{klonatos, maraz, bilas}@ics.forth.gr

I. INTRODUCTION

The widespread availability of multicore processors in server configurations with aggressive I/O resources (I/O controllers along with multiple storage devices, such as solid-state drives) is very promising for increasing levels of I/O performance. However, in this work, we raise concerns about the actual scalability of I/O performance observed in current server configurations. We provide experimental evidence of scalability issues, using the latest stable Linux kernel release (v2.6.37). We identify scaling limitations based on extensive measurements and analysis of overheads, mostly lock contention, using a synthetic workload that consists of basic filesystem tasks and employs a high degree of I/O concurrency. Starting with a fairly common system configuration as a baseline, we find that in several cases I/O performance does not scale adequately with the number of available hardware threads, for a variety of filesystem operations. Moreover, we observe that performance may actually deteriorate when more CPU cores become available. We attempt to mitigate the overheads identified in our analysis with targeted tuning. Still, significant scaling limitations remain, motivating a critical re-examination of the entire I/O path in the Linux kernel.

II. RELATED WORK

Scalability in multicore environments, including filesystem performance, is nowadays a hot topic. Extensive analysis and experimentation is currently being performed by the Linux kernel developers themselves, as evidenced by numerous mail threads on the Linux Kernel Mailing List. An important result of this ongoing debate has been the introduction of the Read-Copy-Update (RCU) [1] mechanism in modern kernels. The most recent analysis of Linux scalability is, to our knowledge, [2]. However, in that publication the authors do not: 1) examine the scalability of commonly deployed filesystems, such as XFS and EXT4, or 2) consider block-level I/O, as they performed all their storage-related experiments using tmpfs, which operates entirely in the DRAM-based buffer cache. Our work focuses on these two specific aspects, presenting further scalability issues in the common I/O path.

III. EXPERIMENTAL METHODOLOGY

We perform our evaluation on an x86-based Linux server consisting of the following components: a dual-socket Tyan S7025 motherboard with two 4-core Intel Xeon 5520 64-bit processors running at 2.26 GHz, with two hardware threads per core, 12 GB of DDR-III DRAM, and 24 32-GB enterprise-grade Intel X25-E SLC NAND-Flash SSDs, connected to four LSI MegaSAS 9260 storage controllers. The SSDs are organized in a RAID-0 configuration using the default md Linux driver, with a chunk size of 64KB. The Linux distribution installed is CentOS v5.5, with version 2.6.37 of the mainline ("vanilla") kernel. We select this specific kernel because it includes numerous recent scalability-related patches. We use two filesystems, XFS and EXT4, for our experiments.

For our evaluation, we use a modified version of the fsmark [3] benchmark. fsmark is widely used by Linux kernel developers because it issues low-level, filesystem-intensive operations, a property that makes it suitable for studying filesystem scalability. The original version of the benchmark provides support for the open, write, read, unlink and fsync system calls; we have added support for the create, seek and stat system calls. We have also modified the benchmark to report operations per second for each system call, instead of the average execution times the original version reported. For all our experiments we use 128 application threads, each one executing operations on 128 private files. Each file contains 512KB of data. Finally, we vary the number of CPU threads in the range from 1 to 16.
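To make the workload concrete, the following is a minimal sketch of the kind of per-thread loop such a benchmark performs. It is not the modified fsmark used in the paper; the directory layout, file names, and the combined per-file-cycle timing are illustrative assumptions based on the description above (the actual benchmark times and reports each system call separately).

```c
/* Minimal sketch of an fsmark-style worker: each thread creates, writes,
 * reads and unlinks its own private files, then reports a rough ops/sec
 * figure.  Illustrative only -- not the modified fsmark from the paper.
 * Build with:  gcc -O2 -pthread workload.c -o workload                   */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

#define NTHREADS   128           /* application threads, as in the paper */
#define NFILES     128           /* private files per thread             */
#define FILE_SIZE  (512 * 1024)  /* 512KB of data per file               */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void *worker(void *arg)
{
    long id = (long)arg;
    char path[256];
    char *buf = malloc(FILE_SIZE);
    if (!buf)
        return NULL;
    memset(buf, 'x', FILE_SIZE);

    double t0 = now_sec();
    for (int i = 0; i < NFILES; i++) {
        /* private directory per thread (the PRIVATEDIR configurations) */
        snprintf(path, sizeof(path), "dir.%ld/file.%d", id, i);
        int fd = open(path, O_CREAT | O_RDWR, 0644);    /* open/create  */
        if (fd < 0) { perror("open"); continue; }
        if (write(fd, buf, FILE_SIZE) != FILE_SIZE)     /* write        */
            perror("write");
        lseek(fd, 0, SEEK_SET);
        if (read(fd, buf, FILE_SIZE) < 0)               /* read         */
            perror("read");
        close(fd);
        unlink(path);                                   /* unlink       */
    }
    double dt = now_sec() - t0;
    printf("thread %ld: %.0f file cycles/sec\n", id, NFILES / dt);
    free(buf);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    char dir[64];
    for (long i = 0; i < NTHREADS; i++) {
        snprintf(dir, sizeof(dir), "dir.%ld", i);
        mkdir(dir, 0755);                 /* one directory per thread */
        pthread_create(&tid[i], NULL, worker, (void *)i);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```

For the shared-directory configurations discussed below, all threads would instead operate in a single directory; varying the number of online CPU threads from 1 to 16 between runs yields the scaling curves of Figure 1.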
We collect lock contention statistics using the lock_stat [4] mechanism provided by recent Linux kernels. lock_stat reports, for each lock: i) its contention count (how many times threads failed to acquire the lock), and ii) the total wait time associated with it. We use this information to locate contention bottlenecks.

IV. EXPERIMENTAL RESULTS

We structure our experimental analysis in three parts. First, we present scalability issues with the XFS filesystem running on top of a ramdisk. We believe this is a fairly common system configuration to use as a baseline, given the performance gap between DRAM and SSDs. However, since modern server systems cannot always rely on DRAM-based caches to achieve the highest possible performance, we then present our scalability concerns using 24 SSDs, which provide high throughput and IOPS, so that the block devices are never the performance bottleneck. Finally, we show that our observations also hold when using another filesystem, EXT4, instead of XFS.

[Fig. 1. Operations per second (y-axis) for the open, write, read and unlink system calls (panels (a)-(d)), for a varying number of cores (#CPUs, 1-16, on the x-axis). We run four configurations: i) ramdisk with all threads operating on the same directory (RAMDISK-SHAREDDIR-DEFAULTLOG), ii) ramdisk with each thread operating on its own directory (RAMDISK-PRIVATEDIR-DEFAULTLOG), iii) ramdisk with each thread operating on its own directory and a 2GB log instead of the default 128MB (RAMDISK-PRIVATEDIR-LARGELOG), and iv) 24 SSDs.]

To begin with, for the open system call using the ramdisk, we observe the same behavior for all ramdisk configurations. As shown in Figure 1, performance first increases when moving from 1 to 2 cores; however, it then begins to drop as we further increase the available CPU count. The main contention point is the request queue lock, and its wait time increases significantly (up to 800%) when moving to a higher number of available cores. The directory lock (dlock), used in path resolution, exhibits the same behaviour when moving to more than 4 cores.
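The contention counts and wait times quoted in this section are obtained from lock_stat. As a rough sketch of how such statistics can be gathered around a benchmark run (assuming a kernel built with CONFIG_LOCK_STAT and root privileges; the exact column layout of /proc/lock_stat varies across kernel versions, so the sketch only prints the raw lines matching a lock name of interest):

```c
/* Rough sketch: reset lock_stat before a run and dump the entries for a
 * lock of interest afterwards.  Requires CONFIG_LOCK_STAT and root.
 * The lock name passed on the command line is a plain substring match. */
#include <stdio.h>
#include <string.h>

static void write_str(const char *path, const char *s)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return; }
    fputs(s, f);
    fclose(f);
}

int main(int argc, char **argv)
{
    const char *pattern = argc > 1 ? argv[1] : ":";    /* default: all locks */

    write_str("/proc/sys/kernel/lock_stat", "1");      /* enable collection  */
    write_str("/proc/lock_stat", "0");                 /* clear old counters */

    printf("run the benchmark now, then press <enter>\n");
    getchar();

    /* Print the raw lock_stat lines mentioning the requested lock; they
     * contain, among other fields, contention counts and total wait time. */
    FILE *f = fopen("/proc/lock_stat", "r");
    if (!f) { perror("/proc/lock_stat"); return 1; }
    char line[1024];
    while (fgets(line, sizeof(line), f))
        if (strstr(line, pattern))
            fputs(line, stdout);
    fclose(f);

    write_str("/proc/sys/kernel/lock_stat", "0");      /* stop collection */
    return 0;
}
```

Sorting the resulting entries by total wait time is what points at hot locks such as the request queue lock discussed above.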
Furthermore, for the read and write system calls using the ramdisk, we observe a high wait time for the zone lock. This lock is responsible for keeping the buffer-cache radix tree organization consistent when it is accessed by multiple threads. Although the wait time for this lock does not decrease when we operate on different directories instead of a single one (RAMDISK-SHAREDDIR-DEFAULTLOG vs. RAMDISK-PRIVATEDIR-DEFAULTLOG), it is significantly reduced when using the larger log option (RAMDISK-PRIVATEDIR-LARGELOG).

Then, for the unlink system call, we observe high contention (>90% of the threads are continuously waiting) for the superblock lock employed by XFS when using a shared directory. We argue this is due to the fact that the superblock is updated on each unlink, causing the threads to synchronize at this point. When moving to different directories, XFS spreads the directories throughout the SSD address space, in partitions called allocation groups, and uses a single superblock for each of them. Thus, less contention is observed for each superblock, and consequently each thread must wait less time to acquire a lock. However, at this point the XFS log becomes a significant performance bottleneck. If we increase the log size to 2GB, then, although the log lock does not disappear from the lock_stat output, the wait time observed for this lock is significantly reduced, resulting in [...] 8 respectively), while after this point the request queue lock is the main performance bottleneck. This is important, since resolving the request queue contention point will help both the ramdisk and the I/O subsystems commonly found in Linux server environments.

Finally, we wanted to observe what happens when using another filesystem rather than the XFS used so far. For this purpose, we run our benchmark again with the EXT4 filesystem. We notice that, although EXT4 achieves significantly lower performance than XFS for all system calls (especially for writes), the most significant observation is that EXT4 exhibits the same contention behaviour for the same locks as XFS. This is important, since it reveals that if these contention points are resolved or mitigated, both filesystems (and most probably others as well) will benefit significantly.

V. CONCLUSIONS AND FUTURE WORK

Having collected experimental evidence of I/O scaling limitations on modern server configurations, as presented so far, our current work aims to address several layers along the I/O path in the Linux kernel, including caching for filesystem data and metadata, and efficient submission and scheduling of I/O requests. We are also using more complex benchmarks to emulate server I/O workloads operating at high I/O concurrency levels. With the rapidly growing gap between processor and I/O subsystem performance, we expect this research direction to become critical for achieving adequate performance levels for data-intensive application workloads.

ACKNOWLEDGMENTS

We thankfully acknowledge the support of the European Commission under the 6th and 7th Framework Programs through the IOLanes (FP7-STREP-248615), HiPEAC (NoE-004408), and HiPEAC2 (FP7-ICT-217068) projects.

REFERENCES

[1] P.