Reducing Tail Latencies with Chopper
Jun He, Duy Nguyen†, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Department of Computer Sciences, †Department of Statistics
University of Wisconsin–Madison

Abstract

We present Chopper, a tool that efficiently explores the vast input space of file system policies to find behaviors that lead to costly performance problems. We focus specifically on block allocation, as unexpected poor layouts can lead to high tail latencies. Our approach utilizes sophisticated statistical methodologies, based on Latin Hypercube Sampling (LHS) and sensitivity analysis, to explore the search space efficiently and diagnose intricate design problems. We apply Chopper to study the overall behavior of two file systems, and to study Linux ext4 in depth. We identify four internal design issues in the block allocator of ext4 which form a large tail in the distribution of layout quality. By removing the underlying problems in the code, we cut the size of the tail by an order of magnitude, producing consistent and satisfactory file layouts that reduce data access latencies.

1 Introduction

As the distributed systems that power the cloud have matured, a new performance focus has come into play: tail latency. As Dean and Barroso describe, long tails can dramatically harm interactive performance and thus limit the applications that can be effectively deployed at scale in modern cloud-based services [17]. As a result, a great deal of recent research effort has attacked tail latency directly [6, 49, 51]; for example, Alizadeh et al. show how to reduce the network latency of the 99th percentile by a factor of ten through a combination of novel techniques [6].

The fundamental reason that reducing such tail latency is challenging is that rare, corner-case behaviors, which have little impact on a single system, can dominate when running a system at scale [17]. Thus, while the well-tested and frequently-exercised portions of a system perform well, the unusual behaviors that are readily ignored on one machine become the common case upon one thousand (or more) machines.

To build the next generation of robust, predictably performing systems, we need an approach that can readily discover corner-case behaviors, thus enabling a developer to find and fix intrinsic tail-latency problems before deployment. Unfortunately, finding unusual behavior is hard: just like exploring an infinite state space for correctness bugs remains an issue for today's model checkers [10, 19], discovering the poorly-performing tail-influencing behaviors presents a significant challenge.

One critical contributor to tail latency is the local file system [8]. Found at the heart of most distributed file systems [20, 47], local file systems such as Linux ext4, XFS, and btrfs serve as the building block for modern scalable storage. Thus, if rare-case performance of the local file system is poor, the performance of the distributed file system built on top of it will suffer.

In this paper, we present Chopper, a tool that enables developers to discover (and subsequently repair) high-latency operations within local file systems. Chopper currently focuses on a critical contributor to unusual behavior in modern systems: block allocation, which can reduce file system performance by one or more orders of magnitude on both hard disk and solid state drives [1, 11, 13, 30, 36]. With Chopper, we show how to find such poor behaviors, and then how to fix them (usually through simple file-system repairs).

The key and most novel aspect of Chopper is its usage of advanced statistical techniques to search and investigate an infinite performance space systematically. Specifically, we use Latin hypercube sampling [29] and sensitivity analysis [40], which have been proven efficient in the investigation of many-factor systems in other applications [24, 31, 39]. We show how to apply such advanced techniques to the domain of file-system performance analysis, and in doing so make finding tail behavior tractable.

We use Chopper to analyze the allocation performance of Linux ext4 and XFS, and then delve into a detailed analysis of ext4 as its behavior is more complex and varied. We find four subtle flaws in ext4, including behaviors that spread sequentially-written files over the entire disk volume, greatly increasing fragmentation and inducing large latency when the data is later accessed. We also show how simple fixes can remedy these problems, resulting in an order-of-magnitude improvement in the tail layout quality of the block allocator. Chopper and the ext4 patches are publicly available at: research.cs.wisc.edu/adsl/Software/chopper

The rest of the paper is organized as follows. Section 2 introduces the experimental methodology and implementation of Chopper. In Section 3, we evaluate ext4 and XFS as black boxes and then go further to explore ext4 as a white box, since ext4 has a much larger tail than XFS. We present detailed analysis and fixes for internal allocator design issues of ext4. Section 4 introduces related work. Section 5 concludes this paper.

2 Diagnosis Methodology

We now describe our methodology for discovering interesting tail behaviors in file system performance, particularly as related to block allocation. The file system input space is vast, and thus cannot be explored exhaustively; we thus treat each file system experiment as a simulation, and apply a sophisticated sampling technique to ensure that the large input space is explored carefully. In this section, we first describe our general experimental approach, the inputs we use, and the output metric of choice. We conclude by presenting our implementation.

2.1 Experimental Framework

The Monte Carlo method is a process of exploring simulation by obtaining numeric results through repeated random sampling of inputs [38, 40, 43]. Here, we treat the file system itself as a simulator, thus placing it into the Monte Carlo framework. Each run of the file system, given a set of inputs, produces a single output, and we use this framework to explore the file system as a black box.
Each input factor Xi (i = 1, 2, ..., K) (described further in Section 2.2) is estimated to follow a distribution. For example, if small files are of particular interest, one can utilize a distribution that skews toward small file sizes. In the experiments of this paper, we use a uniform distribution for fair searching. For each factor Xi, we draw a sample from its distribution and get a vector of values (Xi^1, Xi^2, Xi^3, ..., Xi^N). Collecting samples of all the factors, we obtain a matrix M.

$$
M =
\begin{pmatrix}
X_1^1 & X_2^1 & \cdots & X_K^1 \\
X_1^2 & X_2^2 & \cdots & X_K^2 \\
\vdots & \vdots & \ddots & \vdots \\
X_1^N & X_2^N & \cdots & X_K^N
\end{pmatrix}
\qquad
Y =
\begin{pmatrix}
Y^1 \\ Y^2 \\ \vdots \\ Y^N
\end{pmatrix}
$$

Each row in M, i.e., a treatment, is a vector to be used as input of one run, which produces one row in vector Y. In our experiment, M consists of columns such as the size of the file system and how much of it is currently in use. Y is a vector of the output metric; as described below, we use a metric that captures how much a file is spread out over the disk, called d-span. M and Y are used for exploratory data analysis.

The framework described above allows us to explore file systems over different combinations of values for uncertain inputs. This is valuable for file system studies where the access patterns are uncertain. With the framework, block allocator designers can explore the consequences of design decisions and users can examine the allocator for their workload.

In the experiment framework, M is a set of treatments we would like to test, which is called an experimental plan (or experimental design). With a large input space, it is essential to pick input values of each factor and organize them in a way to efficiently explore the space in a limited number of runs. For example, even with our refined space in Table 1 (introduced in detail later), there are about 8 × 10^9 combinations to explore. With an overly optimistic speed of one treatment per second, it still would take 250 compute-years to finish just one such exploration.

Latin Hypercube Sampling (LHS) is a sampling method that efficiently explores many-factor systems with a large input space and helps discover surprising behaviors [25, 29, 40]. A Latin hypercube is a generalization of a Latin square, which is a square grid with only one sample point in each row and each column, to an arbitrary number of dimensions [12]. LHS is very effective in examining the influence of each factor when the number of runs in the experiment is much larger than the number of factors. It aids visual analysis as it exercises the system over the entire range of each input factor and ensures all levels of it are explored evenly [38]. LHS can effectively discover which factors and which combinations of factors have a large influence on the response. A poor sampling method, such as a completely random one, could have input points clustered in the input space, leaving large unexplored gaps in-between [38]. Our experimental plan, based on LHS, contains 16384 runs, large enough to discover subtle behaviors but not so large as to require an impractical amount of time.
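To make this concrete, the sketch below draws a small Latin-hypercube plan over a handful of discrete factors (levels borrowed from Table 1). It is an illustrative Python sketch of the sampling idea, not Chopper's actual implementation, and the helper name latin_hypercube_plan is ours.

```python
import numpy as np

def latin_hypercube_plan(factors, num_runs, seed=None):
    """Build an experimental plan (matrix M): one column per factor, one
    row per treatment. Each factor's [0,1) range is cut into num_runs
    strata; every stratum is sampled exactly once, in shuffled order."""
    rng = np.random.default_rng(seed)
    names = list(factors)
    plan = np.empty((num_runs, len(names)), dtype=object)
    for col, name in enumerate(names):
        levels = factors[name]
        # One point per stratum, then shuffle the strata for this column.
        strata = (np.arange(num_runs) + rng.random(num_runs)) / num_runs
        rng.shuffle(strata)
        # Map the stratified samples onto the factor's discrete levels.
        idx = np.floor(strata * len(levels)).astype(int)
        plan[:, col] = [levels[i] for i in idx]
    return names, plan

# A few example factors (levels taken from Table 1).
factors = {
    "DiskSize":   [2**i * 2**30 for i in range(7)],      # 1GB .. 64GB
    "FileSize":   [k * 8 * 1024 for k in range(1, 33)],  # 8KB .. 256KB
    "ChunkOrder": ["0123", "0132", "3210", "3012"],      # subset of 4! orders
    "Fsync":      [f"{i:04b}" for i in range(16)],
}
names, M = latin_hypercube_plan(factors, num_runs=16, seed=1)
for row in M[:4]:
    print(dict(zip(names, row)))
```

Each row of the returned plan is one treatment; running it against the file system yields one entry of the response vector Y.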
2.2 Factors to Explore

File systems are complex. It is virtually impossible to study all possible factors influencing performance. For example, the various file system formatting and mounting options alone yield a large number of combinations. In addition, the run-time environment is complex; for example, file system data is often buffered in OS page caches in memory, and differences in memory size can dramatically change file system behavior.

In this study, we choose to focus on a subset of factors that we believe are most relevant to allocation behavior. As we will see, these factors are broad enough to discover interesting performance oddities; they are also not so broad as to make a thorough exploration intractable.

There are three categories of input factors in Chopper. The first category of factors describes the initial state of the file system. The second category includes a relevant OS state. The third category includes factors describing the workload itself. All factors are picked to reveal potentially interesting design issues. In the rest of this paper, a value picked for a factor is called a level. A set of levels, each of which is selected for a factor, is called a treatment. One execution of a treatment is called a run. We picked twelve factors, which are summarized in Table 1 and introduced as follows.

Category | Factor | Description | Presented Space
FS | DiskSize | Size of disk the file system is mounted on. | 1, 2, 4, ..., 64GB
FS | UsedRatio | Ratio of used disk. | 0, 0.2, 0.4, 0.6
FS | FreeSpaceLayout | Small number indicates high fragmentation. | 1, 2, ..., 6
OS | CPUCount | Number of CPUs available. | 1, 2
Workload | FileSize | Size of file. | 8, 16, 24, ..., 256KB
Workload | ChunkCount | Number of chunks each file is evenly divided into. | 4
Workload | InternalDensity | Degree of sparseness or overwriting. | 0.2, 0.4, ..., 2.0
Workload | ChunkOrder | Order of writing the chunks. | permutation(0,1,2,3)
Workload | Fsync | Pattern of fsync(). | ****, *=0 or 1
Workload | Sync | Pattern of close(), sync(), and open(). | ***1, *=0 or 1
Workload | FileCount | Number of files to be written. | 1, 2
Workload | DirectorySpan | Distance of files in the tree. | 1, 2, 3, ..., 12
Table 1: Factors in Experiment.

[Figure 1: LayoutNumber. Degree of fragmentation represented as lognormal distributions of free-extent sizes; the x-axis is the free extent size (log scale, 4KB to 512GB) and the y-axis is the ratio of free space. Layouts 1 to 5 use lognormal (mean, sd) parameters of (0.69, 1), (1.91, 0.78), (2.44, 0.55), (2.79, 0.32), and (3.04, 0.1), respectively.]

We create a virtual disk of DiskSize bytes, because block allocators may have different space management policies for disks of different sizes.

The UsedRatio factor describes the ratio of the disk that has been used. Chopper includes it because block allocators may allocate blocks differently when the availability of free space is different.

The FreeSpaceLayout factor describes the contiguity of free space on disk. Obtaining satisfactory layouts despite a paucity of free space, which often arises when file systems are aged, is an important task for block allocators. Because enumerating all fragmentation states is impossible, we use six numbers to represent degrees from extremely fragmented to generally contiguous. We use the distribution of free extent sizes to describe the degree of fragmentation; the extent sizes follow lognormal distributions. Distributions of layouts 1 to 5 are shown in Figure 1. For example, if the layout is number 2, about 0.1 × DiskSize × (1 − UsedRatio) bytes will consist of 32KB extents, which are placed randomly in the free space. Layout 6 is not manually fragmented, in order to have the most contiguous free extents possible.

The CPUCount factor controls the number of CPUs the OS runs on. It can be used to discover scalability issues of block allocators.

The FileSize factor represents the size of the file to be written, as allocators may behave differently when different sized files are allocated. For simplicity, if there is more than one file in a treatment, all of them have the same size.

A chunk is the data written by a write() call. A file is often not written by only one call, but a series of writes. Thus, it is interesting to see how block allocators act with different numbers of chunks, which the ChunkCount factor captures. In our experiments, a file is divided into multiple chunks of equal sizes. They are named by their positions in the file, e.g., if there are four chunks, chunk-0 is at the head of the file and chunk-3 is at the end.
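As an illustration of the FreeSpaceLayout factor described above, the sketch below synthesizes free-extent sizes from the lognormal parameters listed for Figure 1. The generation procedure (sampling extent sizes in 4KB blocks until the free space is covered) is an assumption made for this sketch; it is not Chopper's FS Manipulator code.

```python
import numpy as np

# Lognormal (mean, sd) parameters for FreeSpaceLayout levels 1-5 (Figure 1);
# level 6 is left untouched, giving the most contiguous free space.
LAYOUT_PARAMS = {1: (0.69, 1.0), 2: (1.91, 0.78), 3: (2.44, 0.55),
                 4: (2.79, 0.32), 5: (3.04, 0.10)}

def free_extent_sizes(layout, disk_size, used_ratio, block=4096, seed=None):
    """Draw free-extent sizes (bytes) until they cover the free space.

    Extent sizes are sampled from the level's lognormal distribution,
    interpreted here in units of 4KB blocks (an assumption of this
    sketch); the extents would then be placed at random free offsets.
    """
    rng = np.random.default_rng(seed)
    mean, sd = LAYOUT_PARAMS[layout]
    free_bytes = int(disk_size * (1 - used_ratio))
    extents, total = [], 0
    while total < free_bytes:
        size = (int(rng.lognormal(mean, sd)) * block) or block
        size = min(size, free_bytes - total)
        extents.append(size)
        total += size
    return extents

sizes = free_extent_sizes(layout=2, disk_size=16 << 30, used_ratio=0.4, seed=1)
print(len(sizes), "free extents; median size:", int(np.median(sizes)), "bytes")
```

Lower layout numbers produce many small free extents (heavy fragmentation), while higher numbers produce fewer, larger ones.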
Sparse files, such as virtual machine images [26], are commonly-used and important. Files written non-sequentially are sparse at some point in their life, although the final state is not. On the other hand, overwriting is also common and can have an effect if any copy-on-write strategy is adopted [34]. The InternalDensity factor describes the degree of coverage (e.g., sparseness or overwriting) of a file. For example, if InternalDensity is 0.2 and the chunk size is 10KB, only the 2KB at the end of each chunk will be written. If InternalDensity is 1.2, there will be two writes for each chunk; the first write of this chunk will be 10KB and the second one will be 2KB at the end of the chunk.

The ChunkOrder factor defines the order in which the chunks are written. It explores sequential and random write patterns, but with more control. For example, if a file has four chunks, ChunkOrder=0123 specifies that the file is written from the beginning to the end; ChunkOrder=3210 specifies that the file is written backwards.

The Fsync factor is defined as a bitmap describing whether Chopper performs an fsync() call after each chunk is written. Applications, such as databases, often use fsync() to force data durability immediately [15, 23]. This factor explores how fsync() may interplay with allocator features (e.g., delayed allocation in Linux ext4 [28]). In the experiment, if ChunkOrder=1230 and Fsync=1100, Chopper will perform an fsync() after chunk-1 and chunk-2 are written, but not otherwise.

The Sync factor defines how we open, close, or sync the file system with each write. For example, if ChunkOrder=1230 and Sync=0011, Chopper will perform the three calls after chunk-3 and perform close() and sync() after chunk-0; open() is not called after the last chunk is written. All Sync bitmaps end with 1, in order to place data on disk before we inquire about layout information. Chopper performs fsync() before sync() if they are both requested for a chunk. (A small sketch of how these bitmaps translate into system calls appears at the end of this subsection.)

The FileCount factor describes the number of files written, which is used to explore how block allocators preserve spatial locality for one file and for multiple files. In the experiment, if there is more than one file, the chunks of each file will be written in an interleaved fashion. The ChunkOrder, Fsync, and Sync for all the files in a single treatment are identical.

Chopper places files in different nodes of a directory tree to study how parent directories can affect the data layouts. The DirectorySpan factor describes the distance between parent directories of the first and last files in a breadth-first traversal of the tree. If FileCount=1, DirectorySpan is the index of the parent directory in the breadth-first traversal sequence. If FileCount=2, the first file will be placed in the first directory, and the second one will be at the DirectorySpan-th position of the traversal sequence.

In summary, the input space of the experiments presented in this paper is described in Table 1. The choice is based on efficiency and simplicity. For example, we study relatively small file sizes because past studies of file systems indicate most files are relatively small [5, 9, 35]. Specifically, Agrawal et al. found that over 90% of the files are below 256KB across a wide range of systems [5]. Our results reveal many interesting behaviors, many of which also apply to larger files. In addition, we study relatively small disk sizes as large ones slow down experiments and prevent broad explorations in limited time. The file system problems we found with small disk sizes are also present with large disks.

Simplicity is also critical. For example, we use at most two files in these experiments. Writing to just two files, we have found, can reveal interesting nuances in block allocation. Exploring more files makes the results more challenging to interpret. We leave further exploration of the file system input space to future work.
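The sketch below (referred to in the Sync discussion above) shows how one treatment's ChunkOrder, Fsync, and Sync bitmaps could be translated into a concrete sequence of system calls. The file path, helper name, and fill pattern are hypothetical; this is a simplified stand-in for Chopper's Workload Generator and Player, not their actual code, and InternalDensity is ignored for brevity.

```python
import os

def play_treatment(path, file_size, chunk_order="1230", fsync_bits="1100",
                   sync_bits="0011", chunk_count=4):
    """Write one file chunk by chunk, honoring the Fsync/Sync bitmaps.

    Bit i of each bitmap applies to the i-th *written* chunk. Fsync=1
    means fsync() after that chunk; Sync=1 means close(), sync(), and
    re-open() after that chunk (fsync, if requested, happens first)."""
    chunk_size = file_size // chunk_count
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    for i, chunk in enumerate(int(c) for c in chunk_order):
        os.lseek(fd, chunk * chunk_size, os.SEEK_SET)
        os.write(fd, b"\xaa" * chunk_size)           # one write() per chunk
        if fsync_bits[i] == "1":
            os.fsync(fd)                             # Fsync bitmap
        if sync_bits[i] == "1":                      # Sync bitmap
            os.close(fd)
            os.sync()
            if i < chunk_count - 1:                  # no open() after last chunk
                fd = os.open(path, os.O_WRONLY)
    # All Sync bitmaps end with 1, so the file is closed by this point.

play_treatment("/mnt/chopper-test/file0", file_size=64 * 1024)
```

With the defaults shown, the calls match the example in the text: fsync() follows chunk-1 and chunk-2, and close()/sync() follow chunk-3 and chunk-0.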
2.3 Layout Diagnosis Response

To diagnose block allocators, which aim to place data compactly to avoid time-consuming seeking on HDDs [7, 36] and garbage collections on SSDs [11, 30], we need an intuitive metric reflecting data layout quality. To this end, we define d-span, the distance in bytes between the first and last physical block of a file. In other words, d-span measures the worst allocation decision the allocator makes in terms of spreading data. As desired, d-span is an indirect performance metric, and, more importantly, an intuitive diagnostic signal that helps us find unexpected file-system behaviors. These behaviors may produce poor layouts that eventually induce long data access latencies. d-span captures subtle problematic behaviors which would be hidden if end-to-end performance metrics were used. Ideally, d-span should be the same size as the file.

d-span is not intended to be a one-size-fits-all metric. Being simple, it has its weaknesses. For example, it cannot distinguish cases that have the same span but different internal layouts. An alternative to d-span that we have investigated is to model data blocks as vertices in a graph and use average path length [18] as the metric. The minimum distance between two vertices in the graph is their corresponding distance on disk. Although this metric is able to distinguish between various internal layouts, we have found that it is often confusing. In contrast, d-span contains less information but is much easier to interpret.

In addition to the metrics above, we have also explored metrics such as the number of data extents, layout score (fraction of contiguous blocks) [42], and normalized versions of each metric (e.g., d-span/ideal d-span). One can even create a metric by plugging in a disk model to measure quality. Our diagnostic framework works with all of these metrics, each of which allows us to view the system from a different angle. However, d-span has the best trade-off between information gain and simplicity.

2.4 Implementation

[Figure 2: Chopper components: Manager, Workload Generator, Workload Player, FS Manipulator, FS Monitor, Analyzer, and FS Utilities.]

The components of Chopper are presented in Figure 2. The Manager builds an experimental plan and conducts the plan using the other components. The FS Manipulator prepares the file system for subsequent workloads. In order to speed up the experiments, the file system is mounted on an in-memory virtual disk, which is implemented as a loop-back device backed by a file in a RAM file system. The initial disk images are re-used whenever needed, thus speeding up experimentation and providing reproducibility. After the image is ready, the Workload Generator produces a workload description, which is then fed into the Workload Player for running. After playing the workload, the Manager informs the FS Monitor, which invokes existing system utilities, such as debugfs and xfs_db, to collect layout information. No kernel changes are needed. Finally, layout information is merged with workload and system information and fed into the Analyzer. The experiment runs can be executed in parallel to significantly reduce time.
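As a concrete illustration, d-span (Section 2.3) can be derived directly from the extent list that the FS Monitor collects: the physical start block and length of each data extent. The extent-list format below is a simplification assumed for this sketch, not Chopper's actual data format.

```python
def d_span(extents, block_size=4096):
    """d-span: distance in bytes between the first and last physical
    block of a file. `extents` is a list of (physical_start_block,
    length_in_blocks) tuples for the file's data extents."""
    if not extents:
        return 0
    first = min(start for start, _ in extents)
    last = max(start + length - 1 for start, length in extents)
    return (last - first + 1) * block_size

# A 256KB file split into two 32-block extents placed far apart:
extents = [(1000, 32), (2621440, 32)]   # second extent ~10GB away on disk
print(d_span(extents))                  # a large d-span reveals a poor layout
```

For a contiguous file, d-span equals the file size, matching the ideal case described above.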


3 The Tale of Tail

We use Chopper to help understand the policies of file system block allocators, to achieve more predictable and consistent data layouts, and to reduce the chances of performance fluctuations. In this paper, we focus on Linux ext4 [28] and XFS [41], which are among the most popular local file systems [2–4, 33].

For each file system, we begin in Section 3.1 by asking whether or not it provides robust file layout in the presence of uncertain workloads. If the file system is robust (i.e., XFS), then we claim success; however, if it is not (i.e., ext4), then we delve further into understanding the workload and environment factors that cause the unpredictable layouts. Once we understand the combination of factors that are problematic, in Section 3.2, we search for the responsible policies in the file system source code and improve those policies.

[Figure 3: d-span CDFs of ext4 and XFS. The 90th%, 95th%, and max d-spans of ext4 are 10GB, 20GB, and 63GB, respectively. The 90th%, 95th%, and max d-spans of XFS are 2MB, 4MB, and 6GB, respectively.]

[Figure 4: Contribution to d-span variance. It shows contributions calculated by factor prioritization of sensitivity analysis.]

3.1 File System as a Black Box

3.1.1 Does a Tail Exist?

The first question we ask is whether or not the file allocation policies in Linux ext4 and XFS are robust to the input space introduced in Table 1. To find out if there are tails in the resulting allocations, we conducted experiments with 16384 runs using Chopper. The experiments were conducted on a cluster of nodes with 16 GB RAM and two Opteron-242 CPUs [21]. The nodes ran Linux v3.12.5. Exploiting Chopper's parallelism and optimizations, one full experiment on each file system took about 30 minutes with 32 nodes.

Figure 3 presents the empirical CDF of the resulting d-spans for each file system over all the runs; in runs with multiple files, the reported d-span is the maximum d-span of the allocated files. A large d-span value indicates a file with poor locality. Note that the file sizes are never larger than 256KB, so d-span with optimal allocation would be only 256KB as well.

The figure shows that the CDF line for XFS is nearly vertical; thus, XFS allocates files with relatively little variation in the d-span metric, even with widely differing workloads and environmental factors. While XFS may not be ideal, this CDF (as well as further experiments not shown due to space constraints) indicates that its block allocation policy is relatively robust.

In contrast, the CDF for ext4 has a significant tail. Specifically, 10% of the runs in ext4 have at least one file spreading over 10GB. This tail indicates instability in the ext4 block allocation policy that could produce poor layouts inducing long access latencies.

3.1.2 Which factors contribute to the tail?

We next investigate which workload and environment factors contribute most to the variation seen in ext4 layout. Understanding these factors is important for two reasons. First, it can help file system users to see which workloads run best on a given file system and to avoid those which do not run well; second, it can help file system developers track down the source of internal policy problems.

The contribution of a factor to variation can be calculated by variance-based factor prioritization, a technique in sensitivity analysis [38]. Specifically, the contribution of factor Xi is calculated by:

$$
S_i = \frac{V_{X_i}\!\left(E_{X_{\sim i}}(Y \mid X_i = x_i^{*})\right)}{V(Y)}
$$

Si is always smaller than 1 and reports the ratio of the contribution by factor Xi to the overall variation. In more detail, if factor Xi is fixed at a particular level xi*, then E_{X~i}(Y | Xi = xi*) is the resulting mean of response values for that level, V_{Xi}(E_{X~i}(Y | Xi = xi*)) is the variance among level means of Xi, and V(Y) is the variance of all response values for an experiment. Intuitively, Si indicates how much changing a factor can affect the response.

Figure 4 presents the contribution of each factor for ext4; again, the metric indicates the contribution of each factor to the variation of d-span in the experiment. The figure shows that the most significant factors are DiskSize, FileSize, Sync, ChunkOrder, and Fsync; that is, changing any one of those factors may significantly affect d-span and layout quality. DiskSize is the most sensitive factor, indicating that ext4 does not have stable layout quality with different disk sizes. It is not surprising that FileSize affects d-span considering that the definition of d-span depends on the size of the file; however, the variance contributed by FileSize (0.14 × V(dspan_real) = 3 × 10^18) is much larger than ideally expected (V(dspan_ideal) = 6 × 10^10, where dspan_ideal = FileSize). The significance of Sync, ChunkOrder, and Fsync implies that certain write patterns are much worse than others for the ext4 allocator.
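The first-order index above can be estimated from the experimental results by grouping runs by a factor's level and comparing the variance of the level means to the overall variance. The sketch below shows that estimator; it assumes a balanced design in which every level appears about equally often (as in our LHS plan), and it is not Chopper's analysis code.

```python
from collections import defaultdict
from statistics import mean, pvariance

def first_order_index(levels, responses):
    """Estimate S_i = Var(level means of Y) / Var(Y) for one factor.

    `levels` holds the factor's level in each run; `responses` holds the
    corresponding d-span values (the vector Y)."""
    groups = defaultdict(list)
    for level, y in zip(levels, responses):
        groups[level].append(y)
    level_means = [mean(ys) for ys in groups.values()]
    return pvariance(level_means) / pvariance(responses)

# Toy example: DiskSize explains most of the variation in this fake data.
disk_size = [1, 1, 16, 16, 64, 64]                 # GB
dspan     = [0.3, 0.4, 10.0, 11.0, 40.0, 42.0]     # GB
print(round(first_order_index(disk_size, dspan), 3))
```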
Factor prioritization gives us an overview of the importance of each factor and guides further exploration. We would also like to know which factors and which levels of a factor are most responsible for the tail. This can be determined with factor mapping [38]; factor mapping uses a threshold value to group responses (i.e., d-span values) into tail and non-tail categories and finds the input space of factors that drive the system into each category. We define the threshold value as the 90th% (10GB in this case) of all d-spans in the experiment. We say that a run is a tail run if its response is in the tail category.
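A minimal version of this factor-mapping step is sketched below: it labels runs above the 90th-percentile d-span as tail runs and reports, for each level of a chosen factor, the fraction of that level's runs that fall in the tail (the quantity visualized in Figure 5). The run/record format here is assumed for illustration only.

```python
from collections import Counter

def tail_fraction_by_level(runs, factor, response="d_span", quantile=0.9):
    """Label runs above the `quantile` threshold as tail runs, then report
    the fraction of tail runs for each level of `factor`."""
    values = sorted(r[response] for r in runs)
    threshold = values[int(quantile * len(values)) - 1]
    total, tails = Counter(), Counter()
    for r in runs:
        total[r[factor]] += 1
        if r[response] >= threshold:
            tails[r[factor]] += 1
    return {level: tails[level] / total[level] for level in sorted(total)}

runs = [
    {"DiskSize": "1GB",  "d_span": 0.25e9},
    {"DiskSize": "16GB", "d_span": 12e9},
    {"DiskSize": "16GB", "d_span": 0.3e9},
    {"DiskSize": "64GB", "d_span": 40e9},
]
print(tail_fraction_by_level(runs, "DiskSize"))
```

Because every level of a factor receives the same number of runs in the balanced design, these per-level fractions are directly comparable.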

20% 20% 20% 20%

10% 10% 10% 10%

0% 0% 0% 0%

(a) DiskSize (b) FileSize (c) Sync (d) ChunkOrder 8KB 0001 0011 0101 0111 1001 1011 1101 1111 0123 0132 0213 0231 0312 0321 1023 1032 1203 1230 1302 1320 2013 2031 2103 2130 2301 2310 3012 3021 3102 3120 3201 3210 1GB 2GB 4GB 8GB 16KB 24KB 32KB 40KB 48KB 56KB 64KB 72KB 80KB 88KB 96KB 16GB 32GB 64GB 104KB 112KB 120KB 128KB 136KB 144KB 152KB 160KB 168KB 176KB 184KB 192KB 200KB 208KB 216KB 224KB 232KB 240KB 248KB 256KB 30% 30% 30% 30% 30% 30% 30%

20% 20% 20% 20% 20% 20% 20%

10% 10% 10% 10% 10% 10% 10%

0% 0% 0% 0% 0% 0% 0%

Fsync Freespacelayout UsedRatio DirectorySpan InternalDensity FileCount CPUCount

(e) (f) 1 2 3 4 5 6 (g) 0 (h) 1 2 3 4 5 6 7 8 9 (i) 1 2 (j) 1 2 (k) 1 2 0.2 0.4 0.6 10 11 12 0.2 0.4 0.6 0.8 1.2 1.4 1.6 1.8 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 Figure 5: Tail Distribution of 11 Factors. In the figure, we can find what levels of each factor have tail runs and percentage of tail runs in each level. Regions with significantly more tail runs are marked bold. Note that the number of total runs of each level is identical for each factor. Therefore, the percentages between levels of a factor are comparable. For example, (a) shows all tail runs in the experiment have disk sizes ≥ 16GB. In addition, when DiskSize=16GB, 17% of runs are in the tail (d-span≥10GB) which is less than DiskSize=32GB. of all d-spans in the experiment. We say that a run is a tail The rest of the factors are less significant, but still reveal run if its response is in the tail category. interesting observations. Figure 5f and Figure 5g show Factor mapping visualization in Figure 5 shows how the that tail runs are always present and not strongly corre- tails are distributed to the levels of each factor. Thanks to lated with free space layout or the amount of free space, the balanced Latin hypercube design with large sample even given the small file sizes in our workloads (below size, the difference between any two levels of a factor is 256KB). Even with layout number 6 (not manually frag- likely to be attributed to the level change of this factor and mented), there are still many tail runs. Similarly, hav- not due to chance. ing more free spaces does not reduce tail cases. These Figure 5a shows that all tail runs lay on disk sizes over facts indicate that many tail runs do not depend on the 8GB because the threshold d-span (10GB) is only possi- disk state and instead it is the ext4 block allocation policy ble when the disk size exceeds that size. This result im- itself causing these tail runs. After we fix the ext4 allo- plies that blocks are spread farther as the capacity of the cation polices in the next section, the DiskUsed and Free- disk increases, possibly due to poor allocation polices in spaceLayout factors will have a much stronger impact. ext4. Figure 5b shows a surprising result: there are sig- Finally, Figure 5h and Figure 5i show that tail runs nificantly more tail runs when the file size is larger than are generally not affected by DirectorySpan and Internal- 64KB. This reveals that ext4 uses very different block al- Density. Figure 5j shows that having more files leads location polices for files below and above 64KB. to 29% more tail cases, indicating potential layout prob- Sync, ChunkOrder, and Fsync also present interesting lems in production systems where multi-file operations behaviors, in which the first written chunk plays an impor- are common. Figure 5k shows that there are 6% more tant role in deciding the tail. Figure 5c shows that closing tail cases when there are two CPUs. and sync-ing after the first written chunk (coded 1***) causes more tail runs than otherwise. Figure 5d shows that 3.1.3 Which factors interact in the tail? writing chunk-0 of a file first (coded 0***), including se- In a complex system such as ext4 block allocator, perfor- quential writes (coded 0123) which are usually preferred, mance may depend on more than one factor. We have leads to more tail runs. 
Figure 5e shows that, on average, inspected all two-factor interactions and select two cases not fsync-ing the first written chunk (coded 0***) leads to in Figure 6 that present clear patterns. The figures show more tail runs than otherwise. how pairwise interactions may lead to tail runs, reveal- ● With tail ● Without tail 1.0 1111 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1110 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.9 1101 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.8 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1100 0.7 1011 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1010 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.6 Final 1001 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.5 1000 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Vanilla 0111 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.4

Fsync 0110 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.3 0101 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Cumulative Density Cumulative 0.2 0100 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0011 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.1 0010 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0001 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 8 16 24 32 40 48 56 64 0000 d−span (GB) Figure 7: d-span CDF of vanilla and final versions of ext4. 0123 0132 0213 0231 0312 0321 1023 1032 1203 1230 1302 1320 2013 2031 2103 2130 2301 2310 3012 3021 3102 3120 3201 3210 ChunkOrder The final version reduces the 80th, 90th, and 99th percentiles by 1.0 × 103, 1.6 × 103, and 24 times, respectively. (a) ChunkOrder and Fsync. 64GB ● ● ● ● 64GB ● ● ● ● ● 16GB ● ● ● ● 16GB ● ● ● With tail Without tail ● ● ● ● ● ● ● ● 4GB ● ● 4GB ● ● 1111 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● 1110 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 1GB ● 1GB ● 1101 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 1100 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 256MB 256MB ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ● 1011 64MB 64MB 1010 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● 99th% ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● 95th% ● 1001 16MB ● 16MB 1000 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 90th% ● ● d−span (log scale) 4MB ● 85th% d−span (log scale) 4MB 0111 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● 80th% ● ● Fsync 0110 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 1MB 1MB 0101 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● !SE !SD !NB !SD 0100 !SG 0011 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0010 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● Vanilla Vanilla

0001 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● !(SD | SE) 0000 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● !(SD | SE SG) !(SD | SE SG NB) 8KB 72KB 80KB 88KB 96KB 16KB 24KB 32KB 40KB 48KB 56KB 64KB 104KB 112KB 120KB 128KB 136KB 144KB 152KB 160KB 168KB 176KB 184KB 192KB 200KB 208KB 216KB 224KB 232KB 240KB 248KB 256KB FileSize (a) Effect of Single Fix (b) Cumulative Effect of Fixes (b) FileSize and Fsync. Figure 8: Effect of fixing issues. Vanilla: Linux v3.12.5. “!” means “without”. SD: Scheduler Dependency; SE: Special End; Figure 6: Tail Runs in the Interactions of Factors Note that SG NB | each interaction data point corresponds to multiple runs with : Shared Goal; : Normalization Bug. !(X Y) means X and other factors varying. A black dot means that there is at least Y are both removed in this version. one tail case in that interaction. Low-danger zones are marked detailed source code tracing given the set of workloads with bold labels. suggested by Chopper. In this manner, we are able to fix ing both dangerous and low-danger zones in the work- a series of problems in the ext4 layout policies and show load space; these zones give us hints about the causes of that each fix reduces the tail cases in ext4 layout. the tail, which will be investigated in Section 3.2. Fig- Figure 7 compares the original version of ext4 and our ure 6a shows that, writing and fsync-ing chunk-3 first sig- final version that has four sources of layout variation re- nificantly reduces tail cases. In Figure 6b, we see that, moved. We can see that the fixes significantly reduce the for files not larger than 64KB, fsync-ing the first writ- size of the tail, providing better and more consistent lay- ten chunk significantly reduces the possibility of produc- out quality. We now connect the symptoms of problems ing tail runs. These two figures do not conflict with each shown by Chopper to their root causes in the code. other; in fact, they indicate a low-danger zone in a three- dimension space. 3.2.1 Randomness → Scheduler Dependency Evaluating ext4 as black box, we have shown that ext4 Our first step is to remove non-determinism for experi- does not consistently provide good layouts given diverse ments with the same treatment. Our previous experiments inputs. Our results show that unstable performance with corresponded to a single run for each treatment; this ap- ext4 is not due to the external state of the disk (e.g., frag- proach was acceptable for summarizing from a large sam- mentation or utilization), but to the internal policies of ple space, but cannot show intra-treatment variation. Af- ext4. To understand and fix the problems with ext4 al- ter we identify and remove this intra-treatment variation, location, we use detailed results from Chopper to guide it will be more straightforward to remove other tail effects. our search through ext4 documentation and source code. We conducted two repeated experiments with the same input space as in Table 1 and found that 6% of the runs 3.2 File System as a White Box have different d-spans for the same treatment; thus, ext4 Our previous analysis uncovered a number of problems can produce different layouts for the same controlled in- with the layout policies of ext4, but it did not pinpoint the put. Figure 9a shows the distribution of the d-span differ- location of those policies within the ext4 source code. We ences for those 6% of runs. 
The graph indicates that the now use the hints provided by our previous data analysis physical data layout can differ by as much as 46GB for to narrow down the sources of problems and to perform the same workload. 1.0 This policy is also the cause of the variation seen by 0.8 0.6 some large files written sequentially: large files written (a) 0.4 sequentially begin as small files and are subject to LG pre- Density Cumulative 0.2 allocation; large files written backwards have large sizes 0.0 0 4 8 12 16 20 24 28 32 36 40 44 from the beginning and never trigger this scheduling de- d−span difference (GB) pendency1. In production systems with heavy loads, more cores, and more files, we expect more unexpected poor 0 30 60 90 120 layouts due to this effect. (b) 2 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● Fix: We remove the problem of random layout by 1 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● choosing the locality group for a small file based on its i- CPUCount

8KB number range instead of the CPU. Using the i-number not 72KB 80KB 88KB 96KB 16KB 24KB 32KB 40KB 48KB 56KB 64KB 104KB 112KB 120KB 128KB 136KB 144KB 152KB 160KB 168KB 176KB 184KB 192KB 200KB 208KB 216KB 224KB 232KB 240KB 248KB 256KB FileSize only removes the dependency on the scheduler, but also 3210 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ensures that small files with close i-numbers are likely to 3201 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 3120 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● be placed close together. We refer to the ext4 version with 3102 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 3021 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● this new policy as !SD, for no Scheduler Dependency. 3012 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 2310 ●●●●●●●●●●●●●●●●●●●0●●●●●5●●●●●10●●● !SD 2301 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● Figure 8a compares vanilla ext4 and . The graph 2130 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 2103 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● shows that the new version slightly reduces the size of 2031 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 2013 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● the tail. Further analysis shows that in total d-span is re- 1320 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 1302 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● duced by 1.4 TB in 7% of the runs but is increased by (c) 1230 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ChunkOrder 1203 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0.8 TB in 3% of runs. These mixed results occur because 1032 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 1023 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● this first fix unmasks other problems which can lead to 0321 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0312 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● larger d-spans. In complex systems such as ext4, perfor- 0231 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0213 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0132 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● mance problems interact in surprising ways; we will pro- 0123 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● gressively work to remove three more problems. 8KB 72KB 80KB 88KB 96KB 16KB 24KB 32KB 40KB 48KB 56KB 64KB

104KB 112KB 120KB 128KB 136KB 144KB 152KB 160KB 168KB 176KB 184KB 192KB 200KB 208KB 216KB 224KB 232KB 240KB 248KB 256KB 3.2.2 Allocating Last Chunk → Special End FileSize We now return to the interesting behaviors originally Figure 9: Symptoms of Randomness. (a): CDF of d-span shown in Figure 6a, which showed that allocating chunk-3 variations between two experiments. The median is 1.9MB. The max is 46GB. (b): Number of runs with changed d-span, first (Fsync=1*** and ChunkOrder=3***) helps to avoid shown as the interaction of FileSize and CPUCount. (c): Num- tail runs. To determine the cause of poor allocations, we ber of runs with changed d-span, shown as the interaction of compared traces from selected workloads in which a tail FileSize and ChunkOrder. Regions with considerable tail runs are marked with bold labels. occurs to similar workloads in which tails do not occur. Examining the full set of factors responsible for this Root Cause: Special End variation, we found interesting interactions between File- Linux ext4 uses a policy to Size, CPUCount, and ChunkOrder. Figure 9b shows the allocate the last extent of a file when the file is no longer count of runs in which d-span changed between identi- open; specifically, the last extent does not trigger preallo- cal treatments as a function of CPUCount and FileSize. cation. The Special End policy is implemented by check- Condition 1 This figure gives us the hint that small files in multiple- ing three conditions - : the extent is at the end Condition 2 Con- CPU systems may suffer from unpredictable layouts. Fig- of the file; : the file system is not busy; dition 3 ure 9c shows the number of runs with changed d-span as : the file is not open. If all conditions are satisfied, this request is marked with the hint “do not preallocate”, a function of ChunkOrder and FileSize. This figure in- 2 dicates that most small files and those large files written which is different from other parts of the file . with more sequential patterns are affected. The motivation is that, since the status of a file is final (i.e., no process can change the file until the next open), Root Cause: With these symptoms as hints we focused there is no need to reserve additional space. While this on the interaction between small files and the CPU sched- motivation is valid, the implementation causes an incon- uler. Linux ext4 has an allocation policy such that files sistent allocation for the last extent of the file compared not larger than 64KB (small files) are allocated from lo- to the rest; the consequence is that blocks can be spread cality group (LG) preallocations; further, the block allo- cator associates each LG preallocation with a CPU, in or- der to avoid contention. Thus, for small files, the layout 1Note that file size in ext4 is calculated by the ending logical block location is based solely on which CPU the flusher thread number of the file, not the sum of physical blocks occupied. 2 is running. Since the flusher thread can be scheduled on In fact, this hint is vague. It means: 1. if there is a preallocation solely for this file (i.e., i-node preallocation), use it; 2. do not use different CPUs, the same small file can use different LG LG preallocations, even they are available 3. do not create any new preallocations spread across the entire disk. preallocations. 
Reduced Unchanged Increased ● With tail ● Without tail ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 256KB ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 2000 500 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 500 1500 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 400 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1500 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 400 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1000 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 300 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1000 128KB ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

Count Count Count ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

300 200 FileSize ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 500 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 500 72KB ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 64KB ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Count 100 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 200 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 0 0 8KB ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1 2

100 0.2 0.4 0.6 0.8 1.2 1.4 1.6 1.8 8KB 0123 0132 0213 0231 0312 0321 1023 1032 1203 1230 1302 1320 2013 2031 2103 2130 2301 2310 3012 3021 3102 3120 3201 3210 0001 0011 0101 0111 1001 1011 1101 1111 64KB 72KB 256KB ChunkOrder (a) Sync0 (b) FileSize (c) InternalDensity Figure 11: Tail Runs in !(SD|SE). The figure shows tail runs Figure 10: Effects of removing problematic policies. The d- in the interaction of ChunkOrder and FileSize, after removing 8KB 64KB spans could be ‘Reduced’,72KB ‘Unchanged’ or ‘Increased’ due to Scheduler Dependency and Special End. the removal. (a): removing Special End; (b) and (c):256KB removing Shared Global. Root Cause: Traces of several representative data points reveal the source of the ‘stair’ symptom, which we far apart. For example, a small file may be inadvertently call File Size Dependency. In ext4, one of the design goals split because non-ending extents are allocated with LG is to place small files (less than 64KB, which is tunable) preallocations while the ending extent is not; thus, these close and big files apart [7]. Blocks for small files are al- conflicting policies drag the extents of the file apart. located from LG preallocations, which are shared by all This policy explains the tail-free zone (Fsync=1*** and small files; blocks in large files are allocated from per- ChunkOrder=3***) in Figure 6a. In these tail-free zones, file inode preallocations (except for the ending extent of the three conditions cannot be simultaneously satisfied a closed file, due to the Special End policy). since fsync-ing chunk-3 causes the last extent to be al- This file-size-dependant policy ignores the activeness located, while the file is still open; thus, the Special End of files, since the dynamically changing size of a file may policy is not triggered. trigger inconsistent allocation policies for the same file. Fix: To reduce the layout variability, we have removed In other words, blocks of a file larger than 64KB can the Special End policy from ext4; in this version named be allocated with two distinct policies as the file grows !SE, the ending extent is treated like all other parts of the from small to large. This changing policy explains why file. Figure 8 shows that !SE reduces the size of the tail. FileSize is the most significant workload factor, as seen Further analysis of the results show that removing Special in Figure 4, and why Figure 5b shows such a dramatic End policy reduces d-spans for 32% of the runs by a to- change at 64KB. tal of 21TB, but increases d-spans for 14% of the runs by Sequential writes are likely to trigger this problem. For a total of 9TB. The increasing of d-span is primarily be- example, the first 36KB extent of a 72KB file will be al- cause removing this policy unmasks inconsistent policies located from the LG preallocation; the next 36KB extent in File Size Dependency, which we will discuss next. will be allocated from a new i-node preallocation (since Figure 10a examines the benefits of the !SE policy com- the file is now classified as large with 72KB > 64KB). pared to vanilla ext4 in more detail; to compare only de- The allocator will try to allocate the second extent next to terministic results, we set CPUCount=1. 
The graph shows the first, but the preferred location is already occupied by that the !SE policy significantly reduces tail runs when the LG preallocation; the next choice is to use the block the workload begins with sync operations (combination group where the last big file in the whole file system was of close(), sync(), and open()); this is because allocated (Shared Global policy, coded SG), which can the Special End policy is more likely to be triggered when be far away. Growing a file often triggers this problem. the file is temporarily closed. File Size Dependency is the reason why runs with Chunk- 3.2.3 File Size Dependency → Shared Global Order=0*** in Figure 5d and Figure 11 have relatively After removing the Scheduler Dependency and Special more tail runs than other orders. Writing Chunk-0 first End policies, ext4 layout still presents a significant tail. makes the file grow from a small size and increases the Experimenting with these two fixes, we observe a new chance of triggering two distinct policies. symptom that occurs due to the interaction of FileSize Fix: Placing extents of large files together with a and ChunkOrder, as shown in Figure 11. The stair shape shared global policy violates the initial design goal of of the tail runs across workloads indicates that this pol- placing big files apart and deteriorates the consequences icy only affects large files and it depends upon the first of File Size Dependency. To mitigate the problem, we written chunk. implemented a new policy (coded !SG) that tries to place ● ● With tail Without tail disk 2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1.8 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1.6 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● file original request existing extent 1.4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1.2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● normalized request (incorrect) 1 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● x 0.8 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.6 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● normalized request (expected) InternalDensity 0.4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Figure 13: Ill Implementation of Request Normalization. In this case, the normalized request overlaps with the existing ex-

0123 0132 0213 0231 0312 0321 1023 1032 1203 1230 1302 1320 2013 2031 2103 2130 2301 2310 3012 3021 3102 3120 3201 3210 tent of the file, making it impossible to fulfill the request at the ChunkOrder preferred location.

2 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● Figure 12: Tail Runs in !(SD|SE|SG). This figure shows tail 1.8 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 40 1.6 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● runs in the interaction of ChunkOrder and InternalDensity on 1.4 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 30 version !(SD|SE|SG). 1.2 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 1 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 20 0.8 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0.6 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 10

InternalDensity 0.4 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● extents of large files close to existing extents of that file. 0.2 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0 Figure 8a shows that !SG significantly reduces the size of 8KB 16KB 24KB 32KB 40KB 48KB 56KB 64KB 72KB 80KB 88KB 96KB the tail. In more detail, !SG reduces d-span in 35% of the 104KB 112KB 120KB 128KB 136KB 144KB 152KB 160KB 168KB 176KB 184KB 192KB 200KB 208KB 216KB 224KB 232KB 240KB 248KB 256KB FileSize runs by a total of 45TB. To demonstrate the effectiveness of the !SG version, we Figure 14: Impact of Normalization Bug. This figure shows the count of runs affected by Normalization Bug in the interac- compare the number of tail cases with it and vanilla ext4 tion of FileSize and InternalDensity. The count is obtained by for deterministic scenarios (CPUCount=1). Figure 10b comparing experimental results ran with and without the bug. shows that the layout of large files (>64KB) is signifi- cantly improved with this fix. Figure 10c shows that the and will perform a desperate search for free space else- layout of sparse files (with InternalDensity < 1) is also where, spreading blocks. The solid files with ChunkOrder improved; the new policy is able to separately allocate of 3012 in Figure 12 avoid this bug because if chunks- each extent while still keeping them near one another. 0,1,2 are written sequentially after chunk-3 exists, then 3.2.4 Sparse Files → Normalization Bug the physical block number of the request does not need to With three problems fixed in version !(SD|SE|SG), we be updated. show an interesting interaction that still remains between Fix: We fix the bug by correctly updating the physi- ChunkOrder and InternalDensity. Figure 12 shows that cal starting block of the request in version !NB. Figure 14 while most of the workloads exhibit tails, several work- shows that large files were particularly susceptible to this loads do not, specifically, all “solid” (InternalDensity≥1) bug, as were sparse files (InternalDensity < 1). Fig- files with ChunkOrder=3012. To identify the root cause, ure 8a shows that fixing this bug reduces the tail cases, as we focus only on workloads with ChunkOrder=3012 and desired. In more detail, !NB reduces d-span for 19% of compare solid and sparse patterns. runs by 8.3 TB in total. Surprisingly, fixing the bug in- Root Cause: Comparing solid and sparse runs with creases d-span for 5% of runs by 1.5 TB in total. Trace ChunkOrder=3012 shows that the source of the tail is analysis reveals that, by pure luck, the mis-implemented a bug in ext4 normalization; normalization enlarges re- normalization sometimes sets the request to nearby space quests so that the extra space can be used for a similar which happened to be free, while the correct request fell extent later. The normalization function should update in space occupied by another file; thus, with the correct the request’s logical starting block number, correspond- request, ext4 sometimes performs a desperate search and ing physical block number, and size; however, with the chooses a more distant location. bug, the physical block number is not updated and the old Figure 8 summarizes the benefits of these four fixes. value is used later for allocation3. Overall, with all four fixes, the 90th-percentile for d-span Figure 13 illustrates how this bug can lead to poor lay- values is dramatically reduced from well over 4GB to out. In this scenario, an ill-normalized request is started close to 4MB. 
Thus, as originally shown in Figure 7, our (incorrectly) at the original physical block number, but final version of ext4 has a much less significant tail than is of a new (correct) larger size; as a result, the request the original ext4. will not fit in the desired gap within this file. Therefore, ext4 may fail to allocate blocks from preferred locations 3.3 Latencies Reduced Chopper uses d-span as a diagnostic signal to find prob- lematic block allocator designs that produce poor data lay- 3This bug is present even in the currently latest version of Linux, Linux v3.17-rc6. It has been confirmed by an ext4 developer and is outs. The poor layouts, which incur costly disk seeks on waiting for further tests. HDDs [36], garbage collections on SSDs [11] and even Version: !SD SD Issue Description 1 file 10 files 300 ● Scheduler Dependency Choice of preallocation group for small files

Read depends on CPU of flushing thread. 200 ● ● Special End The last extent of a closed file may be rejected 100 ● ● to allocate from preallocated spaces. ● ● ● ● File Size Dependency Preferred target locations depend on file size 0 ● ● ● ● ● ● ● ● which may dynamically change. 1500 Normalization Bug Block allocation requests for large files are not

● Write 1000 correctly adjusted, causing the allocator to ex- ● Access Time (ms) amine mis-aligned locations for free space. ● 500 ● ● ● ● ● ● 0 ● ● ● ● Table 2: Linux ext4 Issues. This table summarizes issues we 2 4 8 16 2 4 8 16 have found and fixed. Number of Creating Threads ning on different CPUs may contend for the same preal- Figure 15: Latency Reduction. This figure shows that !SD significantly reduces average data access time comparing with location groups. We believe that the contention degree SD. All experiments were repeated 5 times. Standard errors are is acceptable, since allocation within a preallocation is small and thus hidden for clarity. fast and files are distributed across many preallocations; if contention is found to be a problem, more prealloca- CPU spikes [1], can in turn result in long data access la- tions can be added (the current ext4 creates preallocations tencies. Our repairs based on Chopper’s findings reduce lazily, one for each CPU). Second, removing the Shared latencies caused by the problematic designs. Global policy mitigates but does not eliminate the layout For example, Figure 15 demonstrates how Scheduler problem for files with dynamically changing sizes; choos- Dependency incurs long latencies and how our repaired ing policies based on dynamic properties such as file size version, !SD, reduces latencies on an HDD (Hitachi is complicated and requires more fundamental policy re- HUA723030ALA640: 3.0 TB, 7200 RPM). In the exper- visions. Third, our final version, as shown in Figure 7, iment, files were created by multiple creating threads re- still contains a small tail. This tail is due to the disk state siding on different CPUs; each of the threads wrote a part (DiskUsed and FreespaceLayout); as expected, when the of a 64KB file. We then measured file access time by file system is run on a disk that is more heavily used and reading and over-writing with one thread, which avoids is more fragmented, the layout for new files suffers. resource contentions and maximizes performance. To ob- The symptoms of internal design problems revealed by tain application-disk data transfer performance, OS and Chopper drive us to reason about their causes. In this pro- disk cache effects were circumvented. Figure 15 shows cess, time-consuming tracing is often necessary to pin- that with the SD version, access time increases with more point a particular problematic code line as the code makes creating threads because SD splits each file into more and complex decisions based on environmental factors. For- potentially distant physical data pieces. Our fixed version, tunately, analyzing and visualizing the data sets produced !SD, reduced read and write time by up to 67 and 4 times by Chopper enabled us to focus on several representative proportionally, and by up to 300 and 1400 ms. The reduc- runs. In addition, we can easily reproduce and trace any tions in this experiment, as well as expected greater ones runs in the controlled environmental provided by Chop- with more creating threads and files, are significant – as per, without worrying about confounding noises. a comparison, a round trip between US and Europe for a With Chopper, we have learned several lessons from network packet takes 150 ms and a round trip within the our experience with ext4 that may help build file sys- same data center takes 0.5 ms [22, 32]. 
3.4 Discussion

With the help of exploratory data analysis, we have found and removed four issues in ext4 that can lead to unexpected tail latencies; these issues are summarized in Table 2. We have made the patches for these issues publicly available with Chopper.

While these fixes significantly reduce the tail behaviors, they have several potential limitations. First, without the Scheduler Dependency policy, flusher threads running on different CPUs may contend for the same preallocation groups. We believe the degree of contention is acceptable, since allocation within a preallocation is fast and files are distributed across many preallocations; if contention is found to be a problem, more preallocations can be added (the current ext4 creates preallocations lazily, one for each CPU). Second, removing the Shared Global policy mitigates but does not eliminate the layout problem for files with dynamically changing sizes; choosing policies based on dynamic properties such as file size is complicated and requires more fundamental policy revisions. Third, our final version, as shown in Figure 7, still contains a small tail. This tail is due to the disk state (DiskUsed and FreespaceLayout); as expected, when the file system is run on a disk that is more heavily used and more fragmented, the layout for new files suffers.

The symptoms of internal design problems revealed by Chopper drive us to reason about their causes. In this process, time-consuming tracing is often necessary to pinpoint a particular problematic line of code, as the code makes complex decisions based on environmental factors. Fortunately, analyzing and visualizing the data sets produced by Chopper enabled us to focus on several representative runs. In addition, we can easily reproduce and trace any run in the controlled environment provided by Chopper, without worrying about confounding noise.

With Chopper, we have learned several lessons from our experience with ext4 that may help build file systems that are robust to uncertain workload and environmental factors in the future. First, policies for different circumstances should be harmonious with one another. For example, ext4 tries to optimize allocation for different scenarios and as a result has a different policy for each case (e.g., the ending extent, small files, and large files); when multiple policies are triggered for the same file, the policies conflict and the file is dragged apart. Second, policies should not depend on environmental factors that may change and are outside the control of the file system. In contrast, data layout in ext4 depends on the OS scheduler, which makes layout quality unpredictable. By simplifying the layout policies in ext4 to avoid special cases and to be independent of environmental factors, we have shown that file layout becomes much more compact and predictable.

4 Related Work

Chopper is a comprehensive diagnostic tool that provides techniques to explore file system block allocation designs. It shares similarities with, and has notable differences from, traditional benchmarks and model checkers.

File system benchmarks have been criticized for decades [44–46]. Many file system benchmarks target many aspects of file system performance and thus include many factors that affect the results in unpredictable ways. In contrast, Chopper leverages well-developed statistical techniques [37, 38, 48] to isolate the impact of various factors and avoid noise. With its sole focus on block allocation, Chopper is able to isolate its behavior and reveal problems with data layout quality.

The self-scaling I/O benchmark [14] is similar to Chopper, but the self-scaling benchmark searches a five-dimension workload parameter space by dynamically adjusting one parameter at a time while keeping the rest constant; its goal is to converge all parameters to values that uniformly achieve a specific percentage of maximum performance, called a focal point. This approach was able to find interesting behaviors, but it is limited and has several problems. First, the experiments may never find such a focal point. Second, the approach is not feasible with a large number of parameters. Third, changing one parameter at a time may miss interesting points in the space and interactions between parameters. In contrast, Chopper has been designed to systematically extract the maximum amount of information from a limited number of runs.

Model checking is a verification process that explores a system's state space [16]; it has also been used to diagnose latent performance bugs. For example, MacePC [27] can identify bad performance and pinpoint the causing state. One problem with this approach is that it requires a simulation which may not perfectly match the desired implementation. Implementation-level model checkers, such as FiSC [50], address this problem by checking the actual system. FiSC checks a real Linux kernel in a customized environment to find file system bugs; however, FiSC needs to run the whole OS in the model checker and intercept calls. In contrast, Chopper can run in an unmodified, low-overhead environment. In addition, Chopper explores the input space differently; model checkers consider transitions between states and often use tree search algorithms, which may cluster exploration states and leave gaps unexplored. In Chopper, we precisely define a large number of factors and ensure that the effects and interactions of these factors are evenly explored by statistical experimental design [29, 37, 38, 48].
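To make the contrast with one-parameter-at-a-time search concrete, the sketch below is a minimal Latin hypercube sampler written with numpy. It only illustrates the idea behind the design and is not Chopper's implementation (the references include the R lhs package manual [12]); the factor names and ranges at the end are hypothetical, not Chopper's actual input space.

#!/usr/bin/env python3
"""Minimal Latin hypercube sampling sketch: each factor's range is cut
into n equal strata, one point is drawn per stratum, and the strata are
shuffled independently per factor, so n runs cover every factor evenly."""
import numpy as np

def latin_hypercube(n_runs, n_factors, seed=None):
    rng = np.random.default_rng(seed)
    u = rng.random((n_runs, n_factors))           # position inside each stratum
    sample = np.empty((n_runs, n_factors))
    for j in range(n_factors):
        perm = rng.permutation(n_runs)            # which stratum each run gets
        sample[:, j] = (perm + u[:, j]) / n_runs  # exactly one point per stratum
    return sample                                 # values in [0, 1)

if __name__ == "__main__":
    # Map a 16-run design onto three made-up factors (ranges are hypothetical).
    design = latin_hypercube(16, 3, seed=0)
    file_size_kb = (design[:, 0] * 256).astype(int) + 1   # 1..256 KB
    cpu_count    = (design[:, 1] * 4).astype(int) + 1     # 1..4 creating CPUs
    disk_used    = design[:, 2]                           # fraction of disk in use
    for run in zip(file_size_kb, cpu_count, disk_used):
        print(run)

Every run lands in a distinct stratum of every factor, so even 16 runs spread evenly across all three dimensions, whereas varying one factor at a time would leave most of the space unexplored.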
5 Conclusions

Tail behaviors have high consequences and cause unexpected system fluctuations. Removing tail behaviors will lead to a system with more consistent performance. However, identifying tails and finding their sources are challenging in complex systems because the input space can be infinite and exhaustive search is impossible. To study the tails of block allocation in XFS and ext4, we built Chopper to facilitate carefully designed experiments that effectively explore an input space of more than ten factors. We used Latin hypercube design and sensitivity analysis to uncover unexpected behaviors among many of those factors. Analysis with Chopper helped us pinpoint and remove four layout issues in ext4; our improvements significantly reduce the problematic behaviors causing tail latencies. We have made Chopper and the ext4 patches publicly available.

We believe that the application of established statistical methodologies to system analysis can have a tremendous impact on system design and implementation. We encourage developers and researchers alike to make systems amenable to such experimentation, as experiments are essential in the analysis and construction of robust systems. Rigorous statistics will help to reduce unexpected issues caused by intuitive but unreliable design decisions.

Acknowledgments

We thank the anonymous reviewers and Angela Demke Brown (our shepherd) for their excellent feedback. This research was supported by the United States Department of Defense, NSF grants CCF-1016924, CNS-1421033, CNS-1319405, CNS-1218405, CNS-1042537, and CNS-1042543 (PRObE http://www.nmc-probe.org/), as well as generous donations from Amazon, Cisco, EMC, Facebook, Fusion-io, Google, Huawei, IBM, Los Alamos National Laboratory, Microsoft, NetApp, Samsung, Sony, Symantec, and VMware. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DOD, NSF, or other institutions.

References
[1] Btrfs Issues. https://btrfs.wiki.kernel.org/index.php/Gotchas.
[2] NASA Archival Storage System. http://www.nas.nasa.gov/hecc/resources/storage_systems.html.
[3] Red Hat Enterprise Linux 7 Press Release. http://www.redhat.com/en/about/press-releases/red-hat-unveils-rhel-7.
[4] Ubuntu. http://www.ubuntu.com.
[5] Nitin Agrawal, William J. Bolosky, John R. Douceur, and Jacob R. Lorch. A Five-Year Study of File-System Metadata. In Proceedings of the 5th USENIX Symposium on File and Storage Technologies (FAST '07), San Jose, California, February 2007.
[6] Mohammad Alizadeh, Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, and Masato Yasuda. Less Is More: Trading a Little Bandwidth for Ultra-Low Latency in the Data Center. In Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pages 253–266, San Jose, CA, 2012. USENIX.
[7] K. V. Aneesh Kumar, Mingming Cao, Jose R. Santos, and Andreas Dilger. Ext4 Block and Inode Allocator Improvements. In Proceedings of the Linux Symposium, volume 1, pages 263–274, 2008.
[8] Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau. Operating Systems: Three Easy Pieces. Arpaci-Dusseau Books, 0.8 edition, 2014.
[9] Mary Baker, John Hartman, Martin Kupfer, Ken Shirriff, and John Ousterhout. Measurements of a Distributed File System. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (SOSP '91), pages 198–212, Pacific Grove, California, October 1991.
[10] Al Bessey, Ken Block, Ben Chelf, Andy Chou, Bryan Fulton, Seth Hallem, Charles Henri-Gros, Asya Kamsky, Scott McPeak, and Dawson Engler. A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World. Communications of the ACM, February 2010.
[11] Luc Bouganim, Björn Thór Jónsson, Philippe Bonnet, et al. uFLIP: Understanding Flash IO Patterns. In 4th Biennial Conference on Innovative Data Systems Research (CIDR), pages 1–12, 2009.
[12] Rob Carnell. lhs package manual. http://cran.r-project.org/web/packages/lhs/lhs.pdf.
[13] Feng Chen, David A. Koufaty, and Xiaodong Zhang. Understanding Intrinsic Characteristics and System Implications of Flash Memory Based Solid State Drives. In Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '09, pages 181–192, New York, NY, USA, 2009. ACM.
[14] Peter M. Chen and David A. Patterson. A New Approach to I/O Performance Evaluation–Self-Scaling I/O Benchmarks, Predicted I/O Performance. In Proceedings of the 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '93), pages 1–12, Santa Clara, California, May 1993.
[15] Vijay Chidambaram, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Optimistic Crash Consistency. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 228–243, 2013.
[16] Edmund M. Clarke, Orna Grumberg, and Doron Peled. Model Checking. MIT Press, 1999.
[17] Jeffrey Dean and Luiz André Barroso. The Tail at Scale. Communications of the ACM, 56(2):74–80, 2013.
[18] Nykamp DQ. Mean path length definition. http://mathinsight.org/network_mean_path_length_definition.
[19] Dawson Engler and Madanlal Musuvathi. Static Analysis versus Software Model Checking for Bug Finding. In 5th International Conference on Verification, Model Checking and Abstract Interpretation (VMCAI '04), Venice, Italy, January 2004.
[20] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), pages 29–43, Bolton Landing, New York, October 2003.
[21] Garth Gibson, Gary Grider, Andree Jacobson, and Wyatt Lloyd. PRObE: A Thousand-Node Experimental Cluster for Computer Systems Research. USENIX ;login:, 38(3), June 2013.
[22] Brendan Gregg. Systems Performance: Enterprise and the Cloud, page 20. 2013.
[23] Tyler Harter, Chris Dragga, Michael Vaughn, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. A File Is Not a File: Understanding the I/O Behavior of Apple Desktop Applications. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pages 71–83. ACM, 2011.
[24] Jon C. Helton and Freddie J. Davis. Latin hypercube sampling and the propagation of uncertainty in analyses of complex systems. Reliability Engineering & System Safety, 81(1):23–69, 2003.
[25] Ronald L. Iman, Jon C. Helton, and James E. Campbell. An approach to sensitivity analysis of computer models. Part I: Introduction, Input, Variable Selection and Preliminary Variable Assessment. Journal of Quality Technology, 13:174–183, 1981.
[26] Keren Jin and Ethan L. Miller. The Effectiveness of Deduplication on Virtual Machine Disk Images. In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, SYSTOR '09, pages 7:1–7:12, 2009.
[27] Charles Killian, Karthik Nagaraj, Salman Pervez, Ryan Braud, James W. Anderson, and Ranjit Jhala. Finding Latent Performance Bugs in Systems Implementations. In Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE '10, pages 17–26, New York, NY, USA, 2010. ACM.
[28] Avantika Mathur, Mingming Cao, Suparna Bhattacharya, Andreas Dilger, Alex Tomas, and Laurent Vivier. The New Ext4 Filesystem: Current Status and Future Plans. In Ottawa Linux Symposium (OLS '07), Ottawa, Canada, July 2007.
[29] Michael D. McKay, Richard J. Beckman, and William J. Conover. A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code. Technometrics, 21(2):239–245, 1979.
[30] Changwoo Min, Kangnyeon Kim, Hyunjin Cho, Sang-Won Lee, and Young Ik Eom. SFS: Random Write Considered Harmful in Solid State Drives. In Proceedings of the 10th USENIX Conference on File and Storage Technologies. USENIX Association, 2012.
[31] V. N. Nair, D. A. James, W. K. Ehrlich, and J. Zevallos. A statistical assessment of some software testing strategies and application of experimental design techniques. Statistica Sinica, 8(1):165–184, 1998.
[32] Peter Norvig. Teach Yourself Programming in Ten Years. http://norvig.com/21-days.html.
[33] Ryan Paul. Google upgrading to Ext4. arstechnica.com/information-technology/2010/01/google-upgrading-to-ext4-hires-former-linux-foundation-cto/.
[34] Zachary N. J. Peterson. Data Placement for Copy-on-write Using Virtual Contiguity. Master's thesis, U.C. Santa Cruz, 2002.
[35] Drew Roselli, Jacob R. Lorch, and Thomas E. Anderson. A Comparison of File System Workloads. In Proceedings of the USENIX Annual Technical Conference (USENIX '00), pages 41–54, San Diego, California, June 2000.
[36] Chris Ruemmler and John Wilkes. An Introduction to Disk Drive Modeling. IEEE Computer, 27(3):17–28, March 1994.
[37] Jerome Sacks, William J. Welch, Toby J. Mitchell, and Henry P. Wynn. Design and Analysis of Computer Experiments. Statistical Science, pages 409–423, 1989.
[38] Andrea Saltelli, Marco Ratto, Terry Andres, Francesca Campolongo, Jessica Cariboni, Debora Gatelli, Michaela Saisana, and Stefano Tarantola. Global Sensitivity Analysis: The Primer. John Wiley & Sons, 2008.
[39] Andrea Saltelli, Stefano Tarantola, Francesca Campolongo, and Marco Ratto. Sensitivity Analysis in Practice: A Guide to Assessing Scientific Models. John Wiley & Sons, 2004.
[40] Thomas J. Santner, Brian J. Williams, and William Notz. The Design and Analysis of Computer Experiments. Springer, 2003.
[41] SGI. XFS Filesystem Structure. http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf.
[42] Keith A. Smith and Margo I. Seltzer. File System Aging – Increasing the Relevance of File System Benchmarks. In Proceedings of the 1997 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '97, pages 203–213, New York, NY, USA, 1997. ACM.
[43] I. M. Sobol'. Quasi-Monte Carlo Methods. Progress in Nuclear Energy, 24(1):55–61, 1990.
[44] Diane Tang and Margo Seltzer. Lies, Damned Lies, and File System Benchmarks. VINO: The 1994 Fall Harvest, 1994.
[45] Vasily Tarasov, Saumitra Bhanage, Erez Zadok, and Margo Seltzer. Benchmarking File System Benchmarking: It *IS* Rocket Science. In Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems, HotOS'13, pages 9–9, Berkeley, CA, USA, 2011. USENIX Association.
[46] Avishay Traeger, Erez Zadok, Nikolai Joukov, and Charles P. Wright. A Nine Year Study of File System and Storage Benchmarking. ACM Transactions on Storage (TOS), 4(2), May 2008.
[47] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. Ceph: A Scalable, High-Performance Distributed File System. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI '06), Seattle, Washington, November 2006.
[48] C. F. Jeff Wu and Michael S. Hamada. Experiments: Planning, Analysis, and Optimization, volume 552. John Wiley & Sons, 2011.
[49] Yunjing Xu, Zachary Musgrave, Brian Noble, and Michael Bailey. Bobtail: Avoiding Long Tails in the Cloud. In NSDI, pages 329–341, 2013.
[50] Junfeng Yang, Paul Twohey, Dawson Engler, and Madanlal Musuvathi. Using Model Checking to Find Serious File System Errors. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04), San Francisco, California, December 2004.
[51] David Zats, Tathagata Das, Prashanth Mohan, Dhruba Borthakur, and Randy Katz. DeTail: Reducing the Flow Completion Time Tail in Datacenter Networks. SIGCOMM Comput. Commun. Rev., 42(4):139–150, August 2012.