Automatic I/O Scheduler Selection Through Online Workload Analysis
Ramon Nou¹, Jacobo Giralt¹, Toni Cortes¹,²
Barcelona Supercomputing Center (Spain)¹ - Technical University of Catalonia (UPC)²
[email protected], [email protected], [email protected]

Abstract—I/O performance is a bottleneck for many workloads, and the I/O scheduler plays an important role in it. The scheduler is typically configured once by the administrator, and no single selection suits the system at all times: every I/O scheduler behaves differently depending on the workload and the device. We present a method to select the most suitable I/O scheduler for the ongoing workload automatically. The selection is done online, using a workload analysis method that finds common I/O patterns in small I/O traces. Our dynamic mechanism adapts automatically to one of the best schedulers, sometimes achieving improvements in I/O performance for heterogeneous workloads beyond those of any fixed configuration (up to 5%). The technique works with any application and device type (RAID, HDD, SSD), as long as there is a system parameter to tune. It does not need disk simulations or hardware models, which are normally unavailable. We evaluate it in different setups and with different benchmarks.

Index Terms—automatic; I/O scheduling; pattern matching; optimization

I. INTRODUCTION

The I/O path comprises a highly developed set of components, composed at the very least of an application, a file cache, a file system and an I/O scheduler (if we restrict ourselves to the software part). The objective of this architecture is to arrange I/O operations in a better way for the underlying storage device. Nonetheless, it is well known that workloads behave differently depending on the storage device [1]. No uniformity can be assumed for a brand and model, or even for a single disk, since aging [2] can produce different results over time. I/O parameters need to change dynamically, even if the workload does not, to maintain performance.

However, many of these I/O parameters either come preconfigured with default values or are tuned only once in the lifetime of a system. In many cases, the default values come from analyses that took place on different setups, and it is assumed they will suit the majority.

We studied the I/O scheduler and found that it has a big impact on I/O performance, especially with multiple independent parallel applications, and that there is certainly no single configuration that provides the best performance for every workload and hardware. A mainline Linux OS implements four different schedulers with slightly distinct targets (for example, fairness or low latency), and they all underperform for some workloads. Moreover, the I/O scheduler is assigned statically on a per-disk basis and is normally predefined by the Linux distribution.
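To make the tuning target concrete, the following minimal Python sketch (ours, not part of the original paper) shows the runtime, per-disk selection mechanism that Linux exposes through sysfs; this is the knob a dynamic selector can drive. The device and scheduler names are illustrative, and writing the file requires root privileges.

    from pathlib import Path

    def current_scheduler(device="sda"):
        # sysfs lists every available scheduler on one line and marks the
        # active one with brackets, e.g. "noop anticipatory deadline [cfq]".
        text = Path(f"/sys/block/{device}/queue/scheduler").read_text()
        return text.split("[")[1].split("]")[0]

    def set_scheduler(device, scheduler):
        # Writing a scheduler name switches it immediately for that disk,
        # without rebooting or restarting applications.
        Path(f"/sys/block/{device}/queue/scheduler").write_text(scheduler)

    if __name__ == "__main__":
        print(current_scheduler("sda"))
        set_scheduler("sda", "noop")  # e.g., try NOOP for the next interval

Because the switch is immediate and local to one disk, an online selector can change schedulers between short measurement intervals at low cost.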
On the CPU side, the additional transistors provided by advances in the scale of integration currently go mainly to increasing the number of cores per chip. The performance achieved by CPUs keeps growing, and so does the number of CPUs in a system. A similar progression can be found for disk capacity, but not for I/O performance. This performance gap is widening with multicores, since the available storage bandwidth must be shared among several cores. The effects of the gap are typically reduced by using more storage devices in parallel (as in RAID systems), which increases performance in terms of operations per second (IOPS) and bandwidth.

This paper explores the new opportunity provided by multicores to add smarter components to the I/O path. Under I/O-bound workloads, cores can stand idle for long periods of time, and hence we can get better performance by dedicating them to tasks that alleviate this bottleneck. Therefore, we propose a workload analysis mechanism to tune an important I/O parameter: the I/O scheduler. We can use techniques such as time-series analysis of I/O traces to identify previously seen patterns, link them to a performance metric, and provide an automatic way to select the best-performing I/O scheduler for the current workload. Patterns are created by the interaction of simultaneous workloads, for example multiple processes or several virtual machines. Some of the samples seem random at first sight, but they can still be compared and turn out to behave similarly to others. The technique we propose can link a certain pattern to a performance value and determine whether one pattern matches another.

The proposed mechanism should not be used where fairness or low latency is important; in particular, real-time systems should avoid it. To link patterns and performance, we periodically try all the different schedulers, which implies potential inefficiencies during some intervals. Although the evaluation is done on I/O-bound applications, our method runs in low-priority threads, reducing its impact on CPU-bound applications.

In summary, we present the following contributions:
1) A novel method to compare I/O traces and extract behavior patterns online, using an algorithm based on DTW (Dynamic Time Warping [3], FastDTW variant [4]) on multicores.
2) The proposal and evaluation of an automatic scheduler selector implementation for Linux: IOAnalyzer.
3) The addition to IOAnalyzer of a simple dynamic probability-guided scheduler selector for random workloads, which picks an I/O scheduler when no pattern is detected.

The feasibility of this technique is demonstrated with an evaluation in different setups. We use widely accepted industry benchmarks such as TPC-E [5] and TPC-H in a setup with HDD technology. Additionally, we examine multiple workloads that produce near-random behavior, to demonstrate that we can obtain good performance where no trivial manual setup is possible. Finally, we evaluate our technique on four virtual machines running a real application on a 16-SSD RAID-0.

[Figure 1. I/O scheduler comparison between two benchmarks (rows) on two hardware devices (columns); panels BENCH 1 / DISK 1, BENCH 2 / DISK 1, BENCH 1 / DISK 2, BENCH 2 / DISK 2. The y-axis is normalized execution time (0.0-1.0); each bar corresponds to one I/O scheduler (NOOP, AS, DL, CFQ).]

II. MOTIVATION

With our method, we assume that similar I/O patterns repeat over time. Moreover, a certain pattern is typically followed by another one. The mechanism that manages the dynamic pattern probability chain is explained further in Section III-A; a sketch of the idea follows below.
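Since Section III-A lies outside this excerpt, the following Python sketch is only an assumed illustration of what such a pattern probability chain could look like: a first-order transition table over detected pattern IDs, used to guess which learned pattern (and hence scheduler choice) is likely to come next. The class name, pattern IDs and example sequence are hypothetical.

    from collections import Counter, defaultdict

    class PatternChain:
        def __init__(self):
            # transitions[p] counts how often each pattern followed pattern p
            self.transitions = defaultdict(Counter)
            self.last = None

        def observe(self, pattern_id):
            # Record the transition from the previous pattern to this one.
            if self.last is not None:
                self.transitions[self.last][pattern_id] += 1
            self.last = pattern_id

        def predict_next(self):
            # Most frequent successor of the current pattern, if any was seen.
            followers = self.transitions.get(self.last)
            if not followers:
                return None
            return followers.most_common(1)[0][0]

    # Usage: feed the sequence of detected pattern IDs as intervals elapse;
    # before the next interval starts, predict_next() suggests which learned
    # pattern is likely to apply.
    chain = PatternChain()
    for pid in [1, 2, 1, 2, 3, 1, 2]:
        chain.observe(pid)
    print(chain.predict_next())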
If we can predict this situation, we can try different system parameters (in our case, I/O schedulers) and observe (learn) the effects those parameters have on the same I/O pattern. Once we have a set of learned values, we can start choosing the most beneficial system parameter for the current workload. In the next subsections, we show why clustering I/O patterns is a valid idea and why the I/O scheduler is an important system parameter to tune.

A. I/O Scheduler effects

The I/O scheduler selection affects the performance of the system. Nowadays, Linux distributions implement four different I/O schedulers: Completely Fair Queuing (CFQ), Deadline (DL), Anticipatory (AS) and No-Op (NOOP). The default is usually CFQ or DL. However, we claim that no scheduler performs well for every workload and device. In Figure 1, we show two different workloads (rows) on two different hardware devices (columns), both HDD technology. Each bar is the execution time (normalized to the maximum) obtained with a static I/O scheduler.

Figure 1 exposes three main ideas: i) Looking at the first row (different BENCH, same DISK), we see that scheduler performance differs completely: selecting CFQ for BENCH 1 provides the best performance, but for BENCH 2 the situation changes and the best scheduler is NOOP. For these workloads on DISK 1, a static selection would not be a good choice. ii) Observing the second column (same BENCH, different DISK), we also see that the best-performing scheduler varies.

It is worth mentioning that for SSD disks the NOOP scheduler usually performs best. Eventually, however, an SSD-specific scheduler will be created (for example, [6], [7]), which will change this fact. To summarize, the I/O scheduler should be adaptive, not fixed as it is today.

B. I/O clustering inside known traces

Being able to identify or classify the workload is a difficult task: we cannot use file names, applications or other high-level properties. The same application, with identical parameters and files, can generate different behavior due to its interaction with other applications or files. Even if we isolate the applications, internal fragmentation or disk aging can turn a simple sequential read into something totally different.

Our methodology mainly uses traces obtained from the kernel elevator. These traces contain the I/O requests that will be issued to the disks, captured before they enter the I/O scheduler. The scheduler will order and merge them, but we assume that a similar trace will produce a similar result. For our task, we use Dynamic Time Warping (DTW) [3] (specifically the FastDTW [4] implementation) to compare two time series and obtain the edit distance between them; in our case, the I/O traces are the time series. The edit distance is the minimal number of changes that need to be applied to one trace to convert it into the other.
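As an illustration of this comparison step, the sketch below uses the open-source fastdtw Python package (not the paper's IOAnalyzer code) to align two short traces, encoded here simply as sequences of request offsets, and decides from the resulting distance whether they belong to the same pattern; the threshold value is a placeholder.

    from fastdtw import fastdtw  # pip install fastdtw

    def same_pattern(trace_a, trace_b, threshold=1000.0):
        # fastdtw returns the approximate DTW distance and the warping path;
        # a small distance means one trace converts into the other cheaply.
        distance, _path = fastdtw(trace_a, trace_b, dist=lambda a, b: abs(a - b))
        return distance <= threshold, distance

    # Two mostly sequential traces with a slight shift still match closely,
    # because DTW warps one series onto the other before measuring cost.
    seq1 = [8 * i for i in range(100)]
    seq2 = [8 * i + 16 for i in range(100)]
    match, d = same_pattern(seq1, seq2)
    print(match, d)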