Understanding and Taming SSD Read Performance Variability: HDFS Case Study
María F. Borge (University of Sydney), Florin Dinu (University of Sydney), Willy Zwaenepoel (EPFL and University of Sydney)

Abstract

In this paper we analyze the influence that lower layers (file system, OS, SSD) have on HDFS' ability to extract maximum performance from SSDs on the read path. We uncover and analyze three surprising performance slowdowns induced by lower layers that result in HDFS read throughput loss. First, intrinsic slowdown affects reads from every new file system extent for a variable amount of time. Second, temporal slowdown appears temporarily and periodically and is workload-agnostic. Third, in permanent slowdown, some files can individually and permanently become slower after a period of time.

We analyze the impact of these slowdowns on HDFS and show significant throughput loss. Individually, each of the slowdowns can cause a read throughput loss of 10-15%. However, their effect is cumulative. When all slowdowns happen concurrently, read throughput drops by as much as 30%. We further analyze mitigation techniques and show that two of the three slowdowns could be addressed via increased IO request parallelism in the lower layers. Unfortunately, HDFS cannot automatically adapt to use such additional parallelism. Our results point to a need for adaptability in storage stacks: an access pattern that maximizes performance in the common case is not necessarily the same one that can mask performance fluctuations.

1 Introduction

Layering is a popular way to design big data storage stacks [24]. Distributed file systems (HDFS [25], GFS [9]) are usually layered on top of a local file system (ext4 [19], Btrfs [23], F2FS [17], XFS [28], ZFS [2]) running in an OS. This approach is desirable because it encourages rapid development by reusing existing functionality. An important goal for such storage stacks is to extract maximum performance from the underlying storage. With respect to this goal, layering is a double-edged sword [22]. On one hand, the OS and the file system can compensate for and mask inefficiencies found in hardware. On the other hand, they can introduce their own performance bottlenecks and sources of variability [6, 21, 26].

In this paper we perform a deep dive into the performance of a critical layer in today's big data storage stacks, namely the Hadoop Distributed File System (HDFS [25]). We particularly focus on the influence that lower layers have on HDFS' ability to extract the maximum throughput that the underlying storage is capable of providing. We focus on the HDFS read path and on SSDs as the storage medium. We use ext4 as the file system due to its popularity. The HDFS read path is important because it can easily become a performance bottleneck for an application's input stage, which accesses slower storage media compared to later stages, which are often optimized to work fully in memory [31].

Central to our exploration is the storage access pattern of a single HDFS read request: single-threaded, sequential access to large files (hundreds of megabytes) using buffered IO. This access pattern is simple but tried and tested, and it has remained unchanged since the beginnings of HDFS. It is increasingly important to understand whether a single HDFS read using this access pattern can consistently extract by itself the maximum throughput that the underlying storage can provide. As many big data processing systems are heavily IO provisioned and the ratio of cores to disks reaches 1 [1], relying on task-level parallelism to generate enough parallel requests to saturate storage is no longer sufficient. Moreover, relying on such parallelism is detrimental to application performance, since each of the concurrent HDFS reads is served slower than it would be in isolation.
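For concreteness, the sketch below reduces this access pattern to its essence: one thread issuing sequential, buffered read requests of 64 KB (HDFS' default read buffer size) against a large block file. It is an illustrative example written for this text, not HDFS source code; the file path and constants are placeholders.

```c
/* Sketch (not HDFS code): the access pattern studied in this paper,
 * reduced to its core. One thread reads a large file sequentially with
 * buffered IO, in 64 KB requests. The path is illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define REQ_SIZE (64 * 1024)

int main(void)
{
    const char *path = "/hdfs/data/blk_0001";   /* e.g. a 256 MB HDFS block file */
    int fd = open(path, O_RDONLY);              /* buffered IO: no O_DIRECT */
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(REQ_SIZE);
    long long total = 0;
    ssize_t n;
    /* Sequential: each read() continues where the previous one ended. */
    while ((n = read(fd, buf, REQ_SIZE)) > 0)
        total += n;
    if (n < 0) perror("read");

    printf("read %lld bytes\n", total);
    free(buf);
    close(fd);
    return 0;
}
```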
With proper parameter configuration, the HDFS read access pattern is sufficient to extract maximum performance out of our SSDs. However, we uncover and analyze three surprising performance slowdowns that can affect the HDFS read path at different timescales (short, medium and long) and result in throughput loss. All three slowdowns are caused by lower layers (file system, SSDs). Furthermore, our analysis shows that the three slowdowns can affect not only HDFS but any application that uses buffered reads. As opposed to related work that shows worn-out SSDs could cause various performance problems [3, 10], our slowdowns occur on SSDs with very low usage throughout their lifetime.

The first slowdown, which we call intrinsic slowdown, affects HDFS reads at short time scales (seconds). HDFS read throughput drops at the start of every new ext4 extent read. A variable number of IO requests from the start of each extent are served with increased latency, regardless of the extent size. The time necessary for throughput to recover is also variable, as it depends on the number of requests affected.

The second slowdown, which we call temporal slowdown, affects HDFS reads at medium time scales (tens of minutes to hours). Tail latencies inside the drive increase periodically and temporarily and cause HDFS read throughput loss. While this slowdown may be confused with write-triggered SSD garbage collection [8], we find, surprisingly, that it appears in a workload-agnostic manner.

The third slowdown, which we call permanent slowdown, affects HDFS reads at long time scales (days to weeks). After a period of time, HDFS read throughput from a file permanently drops and never recovers for that specific file. Importantly, this is not caused by a single drive-wide malfunction event; rather, it is an issue that affects files individually and at different points in time. The throughput loss is caused by latency increases inside the drive, but, compared to temporal slowdown, all requests are affected, not just the tail.

Interestingly, we find that two of the three slowdowns can be completely masked by increased parallelism in the lower layers, yet HDFS cannot trigger this increased parallelism and performance suffers. Our results point to a need for adaptability in storage stacks. An access pattern that maximizes performance in the common case is not necessarily the one that can mask unavoidable hardware performance fluctuations.

With experiments on 3 SSD-based systems, we show that each of the slowdowns we identified can individually introduce at least 10-15% HDFS read throughput degradation. The effect is cumulative. In the worst case, all slowdowns can overlap, leading to a 30% throughput loss.

The contributions of the paper are as follows:

• We identify and analyze the intrinsic slowdown affecting reads at the start of every new ext4 extent for a variable amount of time.
• We identify and analyze the temporal slowdown that affects reads temporarily and periodically while being workload-agnostic.
• We identify and analyze the permanent slowdown affecting reads from an individual file over time.
• We analyze the impact of these slowdowns on HDFS performance and show significant throughput loss.
• We analyze mitigation techniques and show that two of the three slowdowns can be addressed via increased parallelism.

The rest of the paper continues as follows. §2 presents the HDFS architecture and the configuration that we use. §3 presents our experimental setup and the metrics used. §4, §5 and §6 introduce and analyze intrinsic, temporal and permanent slowdown, respectively. §7 discusses the findings, §8 presents related work and §9 concludes.

2 Background

2.1 HDFS Architecture

HDFS is a user-level distributed file system that runs layered on top of a local file system (e.g. ext4, ZFS, Btrfs). An HDFS file is composed of several blocks, and each block is stored as one separate file in the local file system. Blocks are large files; a common size is 256 MB.

The HDFS design follows a server-client architecture. The server, called the NameNode, is a logically centralized metadata manager that also decides file placement and performs load balancing. The clients, called DataNodes, work at the level of HDFS blocks and provide block data to requests coming from compute tasks.

To perform an HDFS read request, a compute task (e.g. a Hadoop mapper) first contacts the NameNode to find all the DataNodes holding a copy of the desired data block. The task then chooses a DataNode and asks for the entire block of data. The DataNode reads data from the drive and sends it to the task. The size of the DataNode's reads is controlled by the parameter io.file.buffer.size (default 64 KB). The DataNode sends as much data as allowed by the OS socket buffer sizes. A DataNode normally uses the sendfile system call to read data from a drive and send it to the task. This approach is general and handles both local tasks (i.e. on the same node as the DataNode) and remote tasks. As an optimization, a task can bypass the DataNode for local reads (short-circuit reads) and read the data directly using standard read system calls.
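To make the sendfile-based path concrete, the following sketch shows a DataNode-style transfer loop at the system-call level. It is an illustration written for this text rather than actual HDFS code: the function name, the already-connected client socket, and the chunk size standing in for io.file.buffer.size are assumptions.

```c
/* Sketch (not HDFS source): a DataNode-style loop that reads a block file
 * from the drive and sends it to a connected client socket with Linux
 * sendfile(2), in chunks whose size stands in for io.file.buffer.size. */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK (64 * 1024)   /* stand-in for io.file.buffer.size (default 64 KB) */

/* client_fd is an already-connected socket; returns 0 on success, -1 on error. */
int send_block(int client_fd, const char *block_path)
{
    int fd = open(block_path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return -1; }

    off_t offset = 0;
    while (offset < st.st_size) {
        /* sendfile() copies data in-kernel (page cache -> socket buffer);
         * the OS issues the buffered, sequential reads on our behalf. */
        ssize_t sent = sendfile(client_fd, fd, &offset, CHUNK);
        if (sent < 0) { perror("sendfile"); close(fd); return -1; }
        if (sent == 0) break;   /* reached end of file */
    }
    close(fd);
    return 0;
}
```

A short-circuit read replaces this kernel-side copy with plain read system calls issued directly by the task, as in the read-loop sketch shown in the introduction.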
2.2 HDFS Access Pattern

We now summarize the main characteristics of the HDFS read access pattern, since this pattern is central to our work.

• Single-threaded. One request from a compute task is handled by a single worker thread in the DataNode.
• Large files. HDFS blocks are large files (e.g. 128 MB, 256 MB). Each HDFS block is a separate file in the local file system.
• Sequential access. HDFS reads access data sequentially for performance reasons.
• Buffered IO. HDFS uses buffered IO in the DataNode (via sendfile or read system calls).

3.1 Experimental Setup